auto-corpus: automated and consistent outputs from research publications

yan hu ,a, shujian sun ,a, thomas rowlands , tim beck , ,b, and joram m. posma , ,b

section of bioinformatics, division of systems medicine, department of metabolism, digestion and reproduction, imperial college london, sw az, united kingdom
department of genetics and genome biology, university of leicester, le rh, united kingdom
health data research (hdr) uk, united kingdom
a these authors contributed equally.
b these authors contributed equally.

abstract
motivation: the availability of improved natural language processing (nlp) algorithms and models enables researchers to analyse larger corpora using open source tools. text mining of biomedical literature is one area in which nlp has been used in recent years and which still has large untapped potential. however, corpora need to be standardized before they can be analyzed using machine learning nlp algorithms. summarizing data from the literature for storage in databases typically requires manual curation, especially for extracting data from result tables.
results: we present here an automated pipeline that cleans html files from biomedical literature. the output is a single json file that contains the text for each section, table data in machine-readable format and lists of phenotypes and abbreviations found in the article. we analyzed a total of , open access articles from pubmed central, from both genome-wide and metabolome-wide association studies, and developed a model to standardize the section headers based on the information artifact ontology. extraction of table data was developed on pubmed articles and fine-tuned using the equivalent publisher versions.
availability: the auto-corpus package is freely available with detailed instructions from github at https://github.com/jmp /autocorpus/.
information artefact ontology | natural language processing | text standardization
correspondence: timbeck [at] leicester.ac.uk and jmp [at] ic.ac.uk

introduction
natural language processing (nlp) is a branch of artificial intelligence that uses computers to process, understand and use human language. nlp is applied in many different fields including language modelling, speech recognition, text mining and translation systems. in the biomedical realm, nlp has been applied to extract, for example, medication data from electronic health records and patient clinical history from clinical notes, significantly speeding up processes that would otherwise be carried out manually by experts ( ). biomedical publications, unlike structured electronic health records, are semi-structured, which makes it difficult to extract and integrate the relevant information ( ). the format of research articles differs between publishers, and sections describing the same entity, for example statistical methods, can be found in different locations in different publications. both unstructured text and semi-structured document elements, such as headings, main texts and tables, can contain important information that can be extracted using text mining ( ).

the development of the genome-wide association study (gwas) has been driven by the ongoing revolution in high-throughput genomic screening and a deeper understanding of the relationship between genetic variations and diseases/traits ( ). in a typical gwas, researchers collect data from study participants, use single nucleotide polymorphism (snp) arrays to detect the common variants among participants, and conduct statistical tests to determine if the association between the variants and traits is significant.
the results are mostly represented in publication tables, but can also be found in the main text, and there are multiple community efforts to store these reported associations in queryable, online databases ( , ). these efforts involve time-intensive and costly manual data curation to transcribe results from the publications, and supplementary information, into databases. summary-level gwas results are generally reported in the literature according to community norms (e.g. a snp associated with a phenotype with a probability value), hence nlp algorithms can be trained to recognize the formats in which data are reported to facilitate faster and scalable information extraction that is less prone to human error.

development of effective automatic text mining algorithms for gwas literature can also potentially benefit other fields in biomedical research as the body of biomedical literature grows every day. yet previous attempts at mining scientific literature focused mainly on information extraction from abstracts and some on the main text, while for the most part ignoring tables. to facilitate the process of preparing a corpus for nlp tasks such as named-entity recognition (ner), text classification or relationship extraction, we have developed an automated pipeline for consistent outputs from research publications (auto-corpus) as a python package. the main aims of auto-corpus are:
• to provide clean text outputs for each publication section with standardized section names

hu and sun, et al. | biorχiv | january , | –
biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license.
http://creativecommons.org/licenses/by-nc-nd/ . /
• to represent each publication’s tables in a javascript object notation (json) format to facilitate data import into databases
• to use the text outputs to find abbreviations used in the text

we exemplify the package on a corpus of , open access gwas publications whose data have been manually added to the gwas central database to list phenotypes, snps and p-values found in the cleaned text (figure ). in addition, we also include data on , + metabolome-wide association studies (mwas) to ensure the methods are not biased towards one domain. mwas focus on small molecules, some of which are end-products of cellular regulatory processes, that are the response of the human body to genetic or environmental variations ( ).

materials and methods
data. hypertext markup language (html) files for , open access gwas publications whose data exist in the gwas central database ( ) were downloaded from pubmed central (pmc) in march . a further , open access publications of mwas on cancer, gastrointestinal diseases, metabolic syndrome, sepsis and neurodegenerative, psychiatric, and brain illnesses were also downloaded in the same format. publisher versions of ca. % of these publications were downloaded in july to test the algorithms on publications with different html formats. the gwas dataset was randomly divided into training publications to develop algorithms, and a test set of the remaining publications.

processing. html files were loaded using the beautifulsoup html parser package (v . . ). beautifulsoup was used to convert html files to tree-like structures with each branch representing an html section and each leaf an html element. after html files were loaded, all superscripts, subscripts, and italics were converted to plain text. using the default configuration, auto-corpus extracts h , h and h tags for titles and headings, and p tags for paragraph texts. the headings and paragraphs are saved in a structured json file for each html file. tables are extracted from the document using a different set of configuration files (separate configurations can be defined and used for different table structures) and saved in a new json model. this model ensures that tables of all formats and origins, not only those from gwas publications, can be described in the same structured way, so that they can be used as input to rule-based or deep learning algorithms for data extraction. the data cells are stored in the “result” key, and their corresponding section name and header names are stored in the “section_name” and “columns” keys respectively. extracting relationships between cells therefore only requires simple rules.

fig. . workflow of the auto-corpus package.

ontologies for entity recognition. the information artifact ontology (iao) was created to serve as a domain-neutral resource for the representation of types of information content entities such as documents, databases, and digital images ( ). we used the v - - model ( ) in which different terms exist that describe headers typically found in biomedical literature. the extracted headers in the json file were first mapped to the iao terms using the lexical owl ontology matcher ( ). we then used fuzzy matching with the fuzzywuzzy package (v . . ) to map headers to the preferred section header terms and synonyms, with a similarity threshold of . .
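a minimal sketch of this fuzzy matching step is given below. it assumes python's standard-library difflib as a stand-in for the fuzzywuzzy package used by the pipeline, so the scores are not guaranteed to be identical. the synonym lists, the term labels and the default threshold of 0.85 are illustrative assumptions, not values taken from auto-corpus or the iao.

```python
from difflib import SequenceMatcher

# Hypothetical excerpt of section terms and synonyms; the real lists come from the IAO.
SECTION_SYNONYMS = {
    "methods section": ["methods", "materials and methods", "experimental section"],
    "introduction section": ["introduction", "background"],
    "results section": ["results"],
}


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] between two header strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_header(header, threshold=0.85):
    """Return the best-matching section term, or None if no synonym is close enough.

    The threshold is an assumed value chosen for this sketch only.
    """
    best_term, best_score = None, 0.0
    for term, synonyms in SECTION_SYNONYMS.items():
        for synonym in synonyms:
            score = similarity(header, synonym)
            if score > best_score:
                best_term, best_score = term, score
    return best_term if best_score >= threshold else None


# A misspelled header still maps to the methods term; an unrelated header does not map.
print(map_header("experemintal section"))
print(map_header("associated data"))
```

headers that return no match at this stage are the ones left for a later, structure-based mapping step.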
this threshold was evaluated by two independent researchers, who confirmed that all matches were accurate.

after the direct iao mapping and fuzzy matching, unmapped headers still exist. to map these headers, we developed a new method that represents the corpus as a directed graph (digraph), since headers are not repeated within a document, are sequential and have a set order that can be exploited. digraphs consist of nodes (entities, headers) and edges (links between nodes), and the weights of the nodes and edges are proportional to the number of publications in which these are found. while digraphs from individual publications are acyclic, the combined graph can contain cycles, hence digraphs rather than directed acyclic graphs are used. unmapped headers are assigned a section based on the digraph and the headers in the publication that could be mapped (anchor points). for example, at this point in this article the main headers are ‘abstract’ followed by ‘introduction’ and ‘materials and methods’, which could make up a digraph. another article with headers ‘abstract’, ‘background’ and ‘materials and methods’ has two anchor points that match the digraph, and the unmapped header (‘background’) can be inferred from its position between the anchor points (‘abstract’, ‘materials and methods’) in the digraph: ‘introduction’. we use this process to evaluate new potential synonyms for existing terms and identify new potential terms for sections found in biomedical literature.

we used the human phenotype ontology (hpo) to identify disease traits in the full texts. the hpo was developed with the goal of covering all common phenotypic abnormalities in human monogenic diseases ( ).

use cases: regular expression algorithms. abbreviations in the full text are found using an adaptation of a previously published methodology ( ) based on regular expressions using the abbreviations package (v . . ). in brief, the principle is to find all brackets within a corpus.
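returning briefly to the digraph-based header inference described earlier in this section, the idea can be sketched as follows. this is an illustration under stated assumptions and not the auto-corpus implementation: the function names, the toy corpus and the ranking by the weaker of the two edge weights are all hypothetical choices. edge weights are counts of publications in which one section directly follows another, and an unmapped header is assigned every section that sits between its two anchor points.

```python
from collections import defaultdict


def build_digraph(publications):
    """Count how often one mapped section directly follows another.

    publications: list of header sequences, one list per publication.
    Returns a dict mapping (source, target) edges to their weight.
    """
    edges = defaultdict(int)
    for headers in publications:
        for source, target in zip(headers, headers[1:]):
            edges[(source, target)] += 1
    return edges


def infer_section(edges, previous_anchor, next_anchor):
    """Return candidate sections found between two anchor points.

    A candidate must be reached from previous_anchor and lead to next_anchor.
    Candidates are ranked by the weaker of the two edge weights (an assumption).
    """
    candidates = {}
    for (source, middle), weight_in in edges.items():
        if source != previous_anchor:
            continue
        weight_out = edges.get((middle, next_anchor), 0)
        if weight_out:
            candidates[middle] = min(weight_in, weight_out)
    return sorted(candidates, key=lambda section: (-candidates[section], section))


# Toy corpus of already-mapped header sequences (made up for this sketch).
corpus = [
    ["abstract", "introduction", "materials and methods"],
    ["abstract", "introduction", "materials and methods"],
    ["abstract", "introduction", "results"],
]
edges = build_digraph(corpus)

# An unmapped header (e.g. 'background') between these two anchors is inferred here.
print(infer_section(edges, "abstract", "materials and methods"))
```

returning a list rather than a single label leaves room for a header to be assigned to more than one possible section when the anchors do not pin it down.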
if the number of words in a bracket is < , the algorithm considers whether it could be an abbreviation. it searches for the characters within the brackets, one by one, in the text on either side of the brackets. the first character of one of these words must match the first character within the bracket, and the other characters within the bracket must be contained in the words that follow that word. we combine the output of the package with abbreviations defined in the abbreviations section (if found) from the iao/digraph model.

for phenotype entity recognition, any abbreviations in paragraphs extracted from the full text are first replaced by their definitions. this text is then tokenized using the spacy package (v . ) (model en_core_web_sm) and compared against phenotypes and their synonyms defined by hpo for disease trait matching.

p-values and snps were identified in the full text and tables based on regular expressions as they have a standard form. pairs of p-value-snp associations are found in the text using dependency parse trees ( ).

use cases: deep learning-based named-entity recognition. the first example of a use case is to recognize the assay with which the data were acquired; however, no models exist for this purpose. we fine-tuned a pre-existing model trained for biomedical ner, the biomedical bidirectional encoder representations from transformers (biobert) ( ), using part of our corpus where only mwas assays were tagged. we applied our fine-tuned model only to the paragraphs in the materials and methods sections to recognize the assays used. a second biobert-based model was fine-tuned on phenotypes, which already exist in the data, and enriched in phenotypes associated with the mwas publications. this model was applied only to the abstract and paragraphs from the results section.
the third example was applied only to paragraphs from the results and discussion sections using an existing model specifically trained to recognize chemical entities, chemlistem (v . . ) ( ).

use cases: paragraph classification. unmapped headers may be mapped to multiple sections if the anchor points are far apart. to test the applicability of a machine learning model for classifying paragraphs, we trained a random forest classifier on a dataset consisting of , abstract paragraphs and non-abstract paragraphs. % of the data was used for training and the remainder as the test set.

results
the order of sections in biomedical literature. a total of , headers were extracted from the , publications, mapped to iao (v - - ) terms and visualized by means of a digraph with unique nodes and directed edges (figure a). the major unmapped node is ‘associated data’, which is a header specific to pmc articles that appears at the beginning of each article before the abstract. the main structure of the biomedical articles that were analyzed is: abstract → introduction → materials → results → discussion → conclusion → acknowledgements → footnotes section → references. iao has separate definitions for ‘materials’ (iao: ), ‘methods’ (iao: ) and ‘statistical methods’ (iao: ) sections, hence they are separate nodes in the graph, and introduction is also often followed by headers that reflect the methods section (and synonyms). there is also a major directed edge from introduction directly to results, with materials and methods placed after the discussion and/or conclusion sections.

all unmapped headers were investigated to evaluate whether some could be used as synonyms for existing categories. the digraph was also inspected by visualizing individual ego-networks, which show the edges around a specific node mapped to an existing iao term. figure b shows the ego-network for abstract, in which four main categories and one potential new synonym (precis, in red) were identified. the majority of unmapped headers (in purple) that follow the abstract relate to a document that is written as one coherent whole, with specific headers for each section or a general header for the full/main text. an additional four unmapped headers relate to ‘materials and methods’ in their broader sense: data, data description, participants and sample. the remaining two categories of unmapped headers to/from abstract can be classified as the new sections ‘graphical abstract’ and ‘highlights’. these headers were found alongside, and appear to be distinct from, the (textual) abstract.

based on the digraph, we then assigned data and data description to be synonyms of the materials section, and participants and sample to a new category termed ‘participants’, which is related to, but deemed distinct from, the existing patients section (iao: ). the same process was applied to ego-networks from other nodes linked to existing iao terms to add additional synonyms to simplify the digraph. figure c shows the resulting digraph with only existing and newly proposed section terms.

new proposed elements for the iao. each existing iao term contains one or more synonyms, and extracted headers were first mapped directly to these terms. any headers that could not be mapped directly are mapped in the second step using fuzzy matching (e.g. the typographical error ‘experemintal section’ in pmc is correctly mapped to the methods section).
the last step involves mapping the remaining unmapped headers to existing terms based on the digraph and using the structure (anchor headers) of the publication. headers that can be mapped to existing terms in the second and third steps are included as synonyms in the model. the existing categories for which new potential synonyms were identified are listed in table a and b with their existing synonyms and newly identified synonyms.

from the analysis of ego-networks four new potential categories were identified: disclosure, graphical abstract, highlights and participants. table details the proposed definition and synonyms for these categories. in the digraph in figure c the disclosure section is located towards the end of a publication and in some instances is followed by the conflict of interest section.

table data extraction with different configurations. pmc articles are standardized, which makes data extraction more straightforward; however, some publications are not deposited into pmc or other repositories and can only be found via publisher websites. while the package has been developed using a large set of pmc articles, we compared the auto-corpus output for pmc articles with the output for the equivalent articles made available by the publishers. we found no differences in how headers were extracted and paragraphs were classified based on the digraph. however, the representation of tables does differ substantially between publishers, hence a model developed on pmc articles alone will fail to extract the data. we circumvent this issue by defining configuration files for different table formats, and we compare the accuracy of the data represented in the json format (figure ) between pmc and publisher versions of the same papers.

using the default (pmc) configuration on non-pmc articles, none of the tables are represented accurately in the json.
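for orientation, the table json being compared here can be pictured with the minimal sketch below. only the key names “result”, “section_name” and “columns” come from the description in the methods; the outer "section" key, the function names and the example snp identifiers and p-values are made up for illustration and are not taken from auto-corpus or from any publication.

```python
import json


def table_to_json(columns, sections):
    """Build a table record from column headers and rows grouped by table section.

    sections: list of (section_name, rows); each row is a list of cell strings.
    The outer "section" key is a hypothetical name used only in this sketch.
    """
    return {
        "columns": list(columns),
        "section": [
            {"section_name": name, "result": [list(row) for row in rows]}
            for name, rows in sections
        ],
    }


def cell_pairs(table, key_column, value_column):
    """Pair the cells of two columns row by row, keeping the section name."""
    key_index = table["columns"].index(key_column)
    value_index = table["columns"].index(value_column)
    pairs = []
    for section in table["section"]:
        for row in section["result"]:
            pairs.append((section["section_name"], row[key_index], row[value_index]))
    return pairs


# Made-up example content, for illustration only.
table = table_to_json(
    ["snp", "p-value"],
    [
        ("discovery", [["rs0000001", "1e-8"]]),
        ("replication", [["rs0000002", "3e-9"]]),
    ],
)
print(json.dumps(table, indent=2))
print(cell_pairs(table, "snp", "p-value"))
```

because every table ends up in the same shape regardless of the source html, a simple rule such as cell_pairs is enough to relate cells, and a configuration file only has to get the table into this shape.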
auto-corpus allows a variety of configuration files (a single file, or all as a batch) to be used to extract data from tables. one configuration file, different from the default, correctly represented the data in json format for % ( ) of tables. the remaining tables could be represented correctly using different configuration files. when the right configuration file is used for non-pmc articles, all tables ( %) are represented identically to the json output from the matching pmc version.

use cases. the extracted paragraphs were classified into one (or more) categories based on the digraph. this is the purpose of the auto-corpus package: to prepare a corpus for analysis so that different sections can be used for specific purposes. we detail how these standardized texts can be used for entity recognition.

paragraph classification. while many headers can be mapped using fuzzy matching plus the digraph structure, some headers remain unmapped (e.g. the headers in purple in figure b: full text, main text, etc.) while others can be assigned to multiple (possible) sections. assigning multiple categories to unmapped headers based on the digraph is a deliberate choice, to ensure the algorithm does not wrongly assign a header to only one category (e.g. ‘materials’ over ‘methods’). the next step is to perform the paragraph classification using nlp algorithms that learn from word usage and context. we show that random forests can be used to this end by training one to distinguish between abstracts and other paragraphs. paragraphs from the test set were predicted using a random forest trained on , paragraphs. for the test set, we obtained an f -score of . for classifying abstracts (precision = . , recall = . ) and . for classifying non-abstracts (precision = . , recall = . ).

abbreviation identification. the abbreviation detection algorithm searches through each paragraph using a rule-based approach to find all abbreviations used.
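a simplified sketch of such a rule-based detector is shown below. it is not the abbreviations package used by auto-corpus: it only looks at the words before a bracket (the method described in the materials and methods searches both sides), the max_words limit is an assumed value, and the function names are hypothetical.

```python
import re

# Text inside a single pair of round brackets.
BRACKET = re.compile(r"\(([^()]+)\)")


def _matches(abbreviation, words):
    """Check a candidate definition against an abbreviation.

    The first word must start with the first character of the abbreviation, and
    the remaining characters must occur, in order, in the joined words.
    """
    if not words or words[0][0].lower() != abbreviation[0].lower():
        return False
    text = " ".join(words).lower()
    position = 0
    for char in abbreviation.lower():
        position = text.find(char, position)
        if position == -1:
            return False
        position += 1
    return True


def find_abbreviations(text, max_words=2):
    """Return {abbreviation: definition} for bracketed abbreviations in text."""
    found = {}
    for match in BRACKET.finditer(text):
        candidate = match.group(1).strip()
        if not candidate or not candidate[0].isalpha():
            continue
        if len(candidate.split()) > max_words:
            continue
        preceding = re.findall(r"[\w-]+", text[:match.start()])
        # Try the shortest definition first: start next to the bracket and move left.
        for size in range(1, min(len(preceding), 2 * len(candidate)) + 1):
            words = preceding[-size:]
            if _matches(candidate, words):
                found[candidate] = " ".join(words)
                break
    return found


print(find_abbreviations("a genome-wide association study (gwas) was run"))
```

the dictionary returned here would then be merged with any abbreviations listed by the authors in a dedicated abbreviations section.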
auto-corpus then checks whether a paragraph is mapped to the abbreviations category and, if one is found, combines the two lists of abbreviations found in the publication. for example, when applied to an mwas publication ( ) which contains a header titled “abbreviations”, the algorithm combines the abbreviations listed by the authors with a further identified from the text (figure ), including an abbreviation used with two spellings in the text.

fig. . digraph generated from analyzing section headers from , open access publications from pubmed central. (a) digraph of the v - - iao model consists of unique nodes, of which could be directly mapped to section terms (in orange) and the remainder are unmapped headers (in grey), and directed edges. relative node sizes and edge widths are directly proportional to the number of publications with these (subsequent) headers. blue edges indicate the edge with the highest weight from the source node, edges that exist in fewer than % of publications are shown in light grey and the remainder in black. (b) unmapped nodes connected to ‘abstract’ as ego node, excluding corpus specific nodes, grouped into different categories. unlabeled nodes are titles of paragraphs in the main text. (c) final digraph model used in auto-corpus to classify paragraphs after fuzzy matching. this model includes new (proposed) section terms and each section contains new synonyms identified in this analysis.
‘associated data’ is included as this is a pmc-specific header found before abstracts and can be used to indicate the start of most articles.

rule-based extraction of gwas summary-level data. gwas central relies on curated data extracted manually from publications or other databases. we investigated whether a rule-based approach to recognizing phenotypes, snps and p-values can correctly identify data from publications contained within the database. a rule-based approach, applying the hpo to the gwas publications from the test set, identified a total of , unique disease traits (major and minor) in these publications. traits are recorded for these publications in gwas central and the rule-based approach found with a perfect match. for % of the publications all traits were correctly identified. snps have standardized formats, hence rule-based approaches are well suited for their identification. likewise, p-values in gwas publications are typically represented using scientific notation and can also be identified using rule-based methods. a total of , snp/p-value pairs were found across the main text and tables of the publications. for . % of publications all associations recorded in the gwas central database are also found using this approach. while . % of these publications present results (snp/p-value pairs) only in tables, and . % of pairs are found in tables, associations were identified from the main text that are not represented in tables. , pairs match those recorded in the database (total of , pairs for these publications); however, many associations in the database are not represented in the main text/tables but in supplementary materials. auto-corpus includes a separate function to convert csv/tsv data to the table json format (figure ), as summary-level results are often saved in these file formats as part of the supplementary information.

named-entity recognition.
three different deep learning models were used for ner on specific paragraphs of publications. a pre-trained biomedical entity recognition algorithm ( ) was fine-tuned using the results from the rule-based approach applied to gwas data. example sentences that contain hpo terms were used to fine-tune the transformer model, which was then applied to mwas publications from four broad and distinct phenotypes (cancer, gastrointestinal diseases, metabolic syndrome, and neurodegenerative, psychiatric and brain illnesses). the fine-tuned deep learning algorithm obtained accuracies between . and . , averaging around . % (table ).

we then fine-tuned the same base model for recognizing assays in text by training on sentences identified from the text that contain assays routinely used in mwas. the first pass consisted of a rule-based approach, with fuzzy matching, to find sentences with terms and these were then used to fine-tune the deep learning model. figure shows the resulting output in json format for one mwas publication ( ).

category (iao identifier) | existing synonyms (iao v - - ) | new synonyms identified a
abstract (iao: ) | abstract | precis
acknowledgements (iao: ) | acknowledgements, acknowledgments | acknowledgement, acknowledgment, acknowledgments and disclaimer
author contributions (iao: ) | author contributions, contributions by the authors | authors’ contribution, authors’ contributions, authors’ roles, contributorship, main authors by consortium and author contributions
discussion (iao: ) | discussion, discussion section | discussions
footnote (iao: ) | endnote, footnote | footnotes
introduction (iao: ) | background, introduction | introductory paragraph
methods (iao: ) | experimental, experimental procedures, experimental section, materials and methods, methods | analytical methods, concise methods, experimental methods, method, method validation, methodology, methods and design, methods and procedures, methods and tools, methods/design, online methods, star methods, study design, study design and methods
references (iao: ) | bibliography, literature cited, references | literature cited, reference, references, reference list, selected references, web site references
supplementary material (iao: ) | additional information, appendix, supplemental information, supplementary material, supporting information | additional file, additional files, additional information and declarations, additional points, electronic supplementary material, electronic supplementary materials, online content, supplemental data, supplemental material, supplementary data, supplementary figures and tables, supplementary files, supplementary information, supplementary materials, supplementary materials figures, supplementary materials figures and tables, supplementary materials table, supplementary materials tables

table a. newly identified synonyms for existing iao terms ( xx) from the digraph mapping of , publications.
elements in italics have previously been submitted by us for inclusion into iao and added in the latest release (v - - ). lastly, we applied a domain specific algorithm for recogniz- ing chemical entities in the text and tables ( ) to identify metabolites in the same publication (figure ). discussion the analysis of our corpus of , open access publica- tions has resulted in identifying well over new synonyms for existing terms used in biomedical literature to indicate what a paragraph is about. in addition, we identified four new potential categories not previously included in the iao. we previously submitted a subset of synonyms reported here and one of the new categories for inclusion in the iao. these have been accepted by the iao and are included in the lat- est release (v - - ), hence we presented our analyses using the previous version of iao that does not include part of our work. in the latest release, the ‘graphical abstract’ section has been added (iao: ) based on our contri- bution. also, a new ‘research participants’ (iao: ) section has been added as contribution by others in the same release; therefore synonyms found here for the new category ‘participants’ section will be proposed in future as synonyms for the ‘research participants’ section. while the disclosure section appears to be distinct from the conflict of interest sec- tion due to a directed edge in the digraph, its synonyms could also be proposed to be part of the existing conflict of interest section in iao. standardization of text for nlp is an important step in preparing a corpus. auto-corpus outputs a json file of cleaned text, with standardized headers as well as all data presented in tables in json format. standardizing headers is important because some sections are more important than | biorχiv hu and sun, et al. | auto-corpus .cc-by-nc-nd . 
category (iao identifier) | existing synonyms (iao v - - ) | new synonyms identified
abbreviations (iao: ) | abbreviations, abbreviations list, abbreviations used, list of abbreviations, list of abbreviations used | abbreviation and acronyms, abbreviation list, abbreviations and acronyms, abbreviations used in this paper, definitions for abbreviations, glossary, key abbreviations, non-standard abbreviations, nonstandard abbreviations, nonstandard abbreviations and acronyms
author information (iao: ) | author information, authors’ information | biographies, contributor information
availability (iao: ) | availability, availability and requirements | availability of data, availability of data and materials, data archiving, data availability, data availability statement, data sharing statement
conclusion (iao: ) | concluding remarks, conclusion, conclusions, findings, summary | conclusion and perspectives, summary and conclusion
conflict of interest (iao: ) | competing interests, conflict of interest, conflict of interest statement, declaration of competing interests, disclosure of potential conflicts of interest | authors’ disclosures of potential conflicts of interest, competing financial interests, conflict of interests, conflicts of interest, declaration of competing interest, declaration of interest, declaration of interests, disclosure of conflict of interest, duality of interest, statement of interest
consent (iao: ) | consent | informed consent
ethical approval (iao: ) | ethical approval | ethics approval and consent to participate, ethical requirements, ethics, ethics statement
funding source declaration (iao: ) | funding, funding information, funding sources, funding statement, funding/support, source of funding, sources of funding | financial support, grants, role of the funding source, study funding
future directions (iao: ) | future challenges, future considerations, future developments, future directions, future outlook, future perspectives, future plans, future prospects, future research, future research directions, future studies, future work | outlook
materials (iao: ) | materials | data, data description
statistical analysis (iao: ) | statistical analysis | statistical methods, statistical methods and analysis, statistics
study limitations (iao: ) | limitations, study limitations | strengths and limitations, study strengths and limitations

table b. newly identified synonyms for existing iao terms ( xx) from the digraph mapping of , publications. elements in italics have previously been submitted by us for inclusion into iao and added in the latest release (v - - ).
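A synonym table of this kind can be applied as a simple lookup from a raw section header to a standard category. The sketch below is only an illustration under stated assumptions: it uses exact matching on a small subset of the synonyms listed above, uses category names as plain strings because the iao identifiers are not legible here, and the names `SYNONYMS`, `normalise` and `map_header` are hypothetical. It is not the digraph-based method used by auto-corpus.

```python
import re
from typing import Optional

# Hypothetical subset of the synonym table; keys are category names only,
# because the IAO identifiers are not legible in the extracted text.
SYNONYMS = {
    "abbreviations": ["abbreviations", "list of abbreviations", "glossary", "key abbreviations"],
    "availability": ["availability", "data availability", "data availability statement"],
    "conclusion": ["conclusion", "conclusions", "concluding remarks", "summary and conclusion"],
    "statistical analysis": ["statistical analysis", "statistical methods", "statistics"],
    "study limitations": ["limitations", "study limitations", "strengths and limitations"],
}

# Invert the table once so each synonym points at its category.
LOOKUP = {syn: cat for cat, syns in SYNONYMS.items() for syn in syns}


def normalise(header: str) -> str:
    """Lower-case a header, drop leading section numbering and trailing punctuation."""
    header = header.lower().replace("\u2019", "'")
    header = re.sub(r"^\s*\d+(\.\d+)*[.)]?\s+", "", header)  # e.g. "2.1 " or "4. "
    return " ".join(header.split()).strip(" :.")


def map_header(header: str) -> Optional[str]:
    """Return the category for a section header, or None if no synonym matches."""
    return LOOKUP.get(normalise(header))


print(map_header("4. Conclusions"))
print(map_header("Data Availability Statement:"))
print(map_header("Results"))
```

Exact matching keeps the example short; headers that match no synonym return `None` and would need a fallback such as the document structure or a trained classifier.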
proposed category | proposed definition | proposed synonyms
disclosure | “a part of a document used to disclose any associations by authors that might be perceived as to potentially interfere with or prevent them from reporting research with complete objectivity.” | author disclosure statement, declarations, disclosure, disclosure statement, disclosures
graphical abstract | “an abstract that is a pictorial summary of the main findings described in a document.” | central illustration, graphical abstract, toc image, visual abstract
highlights | “a short collection of key messages that describe the core findings and essence of the article in concise form. it is distinct and separate from the abstract and only conveys the results and concept of a study. it is devoid of jargon, acronyms and abbreviations and targeted at a broader, non-technical audience.” | author summary, editors’ summary, highlights, key points, overview, research in context, significance, toc
participants | “a section describing the recruitment of subjects into a research study. this section is distinct from the ‘patients’ section and mostly focusses on healthy volunteers.” | participants, sample

table . newly proposed categories of entities found in , publications in the biomedical literature that could not be mapped to existing terms in iao. elements in italics have previously been submitted by us for inclusion into iao and added in the latest release (v - - ).

known phenotype papers accuracy
cancer .
gastrointestinal diseases .
metabolic syndrome .
neurodegenerative, psychiatric, brain illnesses .

table . summary of results for named-entity recognition (ner) of phenotypes in mwas papers.

others for specific tasks.
for example, an introduction contains no new findings but is well suited to discovering the main phenotypes under study; details on how these phenotypes are studied, and with what technologies, are found only in materials/methods; and findings are found only in results (and discussion) sections. hence it is important to classify these paragraphs, and auto-corpus does this using the structure of the publication and the digraph. we showed that the assignment can be further improved by training machine learning models that distinguish, with good accuracy, between different types of text in cases where there may be ambiguity; this could be improved further by using a multi-class classifier trained on all paragraphs. these data are then available for downstream analyses using dedicated algorithms for entity recognition or other methods.

auto-corpus is able to process all html-formatted tables from both the gwas and mwas corpora, as opposed to previous methods, which could only operate on % of , tables ( ). auto-corpus takes on average . seconds to process all tables within a publication, compared to several minutes if this is done manually. moreover, auto-corpus supports parallel computing, which further reduces the time needed to process publications because these can be run in batch. the structured json output is machine readable and can be used to support data import into databases.

here we used the json output of auto-corpus in several examples to demonstrate potential use cases. we demonstrated that existing algorithms trained on biomedical data can be fine-tuned to recognize new entities such as assays and phenotypes. this opens up the possibility of using these data to train new deep learning algorithms for recognizing entities such as metabolites (as opposed to chemical entities), snps and p-values, as well as for identifying the relationships between them from text.
ner algorithms have difficulty recognizing abbreviated terms; the list of abbreviations found by auto-corpus can therefore be used to replace all abbreviations in the text with their definitions.

conclusion

the auto-corpus package is freely available and can be deployed on local machines as well as on high-performance computing systems to process publications in batch. a step-by-step guide detailing how to use auto-corpus is supplied with the package. the key features of auto-corpus are that it:
1. outputs all text and table data in a standardized json format,
2. classifies each paragraph into separate categories of text, and
3. is implemented in pure python code and does not have non-python dependencies.

fig. . example of json format for table data from this work (shown for table ). the auto-corpus output for tables consists of ‘status’, ‘error message’ and ‘tables’ as top level fields, ‘tables’ has fields ‘identifier’, ‘title’, ‘columns’, ‘section’ and ‘footer’, and ‘section’ contains ‘section name’ and ‘results’.

fig. . example of json output of abbreviation detection using a rule-based approach on an mwas publication ( ).

fig. . example of json output of named-entity recognition (ner) on an mwas publication ( ) using a fine-tuned transformer-based deep learning model for assays and bidirectional long-short term memory network for chemical entity recognition.
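The table fields named in the figure caption above suggest how a downstream script might walk the json output. The following is a minimal sketch, not the package's own reader: the key spellings are taken literally from the caption and may differ in real output, ‘section’ is assumed to be a list of objects, each row in ‘results’ is assumed to be a list aligned with ‘columns’, and the example document and the function name `iter_table_rows` are made up for illustration.

```python
import json

# Made-up example document; key names follow the figure caption literally
# and the placeholder values carry no meaning.
EXAMPLE = """
{
  "status": "success",
  "error message": "",
  "tables": [
    {
      "identifier": "t1",
      "title": "placeholder table",
      "columns": ["col a", "col b"],
      "section": [
        {"section name": "s1", "results": [["x1", "y1"], ["x2", "y2"]]}
      ],
      "footer": ""
    }
  ]
}
"""


def iter_table_rows(doc):
    """Yield (table identifier, section name, row as a column->value dict)."""
    for table in doc.get("tables", []):
        columns = table.get("columns", [])
        for section in table.get("section", []):
            for row in section.get("results", []):
                yield table.get("identifier"), section.get("section name"), dict(zip(columns, row))


doc = json.loads(EXAMPLE)
rows = list(iter_table_rows(doc))
for identifier, section_name, row in rows:
    print(identifier, section_name, row)
```

Flattening each row to a dictionary keyed by column name is one convenient shape for a database import; a different target schema may call for a different shape.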
acknowledgements
we thank mohamed ibrahim (university of leicester) for identifying different configurations of tables for different html formats, and joy li and filip makraduli (imperial college london) for testing the package and providing feedback.

author contributions
tb and jmp designed and supervised the research. ss and yh developed the pipeline and analyzed data. ss developed the initial table extraction algorithm and implemented the phenotype recognition algorithm. yh developed the section header standardization algorithm and implemented the abbreviation recognition algorithm. ss fine-tuned the table extraction algorithm for use on non-pmc texts. tr refined standardization of full texts and contributed algorithms for utf- and utf- conversions of non-ascii characters to unicode. ss, yh, tb and jmp wrote the manuscript.

funding
this work has been supported by health data research (hdr) uk and the medical research council via a ukri innovation fellowship to tb (mr/s / ) and a rutherford fund fellowship to jmp (mr/s / ).

footnote
orcid: - - - (jmp).

bibliography
seyedmostafa sheikhalishahi, riccardo miotto, joel t dudley, alberto lavelli, fabio rinaldi, and venet osmani. natural language processing of clinical notes on chronic diseases: systematic review. jmir med inform, ( ):e , . issn - . doi: . / .
ramón a-a. erhardt, reinhard schneider, and christian blaschke. status of text-mining techniques applied to biomedical text. drug discovery today, ( ): – , . issn - . doi: https://doi.org/ . /j.drudis. . . .
nikola milosevic, cassie gregson, robert hernandez, and goran nenadic. a framework for information extraction from tables in biomedical literature. international journal on document analysis and recognition (ijdar), ( ): – , . doi: . / s - - - .
peter m. visscher, naomi r. wray, qian zhang, pamela sklar, mark i. mccarthy, matthew a. brown, and jian yang. years of gwas discovery: biology, function, and translation. the american journal of human genetics, ( ): – , . issn - . doi: https://doi.org/ . /j.ajhg. . . .
tim beck, tom shorter, and anthony j brookes. gwas central: a comprehensive resource for the discovery and comparison of genotype and phenotype data from genome-wide association studies. nucleic acids research, (d ):d –d , . issn - . doi: . /nar/gkz .
annalisa buniello, jacqueline a l macarthur, maria cerezo, laura w harris, james hayhurst, cinzia malangone, aoife mcmahon, joannella morales, edward mountjoy, elliot sollis, daniel suveges, olga vrousgou, patricia l whetzel, ridwan amode, jose a guillen, harpreet s riat, stephen j trevanion, peggy hall, heather junkins, paul flicek, tony burdett, lucia a hindorff, fiona cunningham, and helen parkinson. the nhgri-ebi gwas catalog of published genome-wide association studies, targeted arrays and summary statistics . nucleic acids research, (d ):d –d , . issn - . doi: . /nar/gky .
jeremy k. nicholson, elaine holmes, and paul elliott. the metabolome-wide association study: a new look at human disease risk factors. journal of proteome research, ( ): – , . doi: . /pr . pmid: .
werner ceusters. an information artifact ontology perspective on data collections and associated representational artifacts. studies in health technology and informatics, : – , . issn - .
alan ruttenberg, adam goldstein, albert goldfain, barry smith, bjoern peters, carlo torniai, chris mungall, chris stoeckert, christian a. boelling, darren natale, david osumi-sutherland, gwen frishkoff, holger stenzhorn, james a. overton, james malone, jennifer fostel, jie zheng, jonathan rees, larisa soldatova, lawrence hunter, mathias brochhausen, matt brush, melanie courtot, michel dumontier, paolo ciccarese, pat hayes, philippe rocca-serra, randy dipert, ron rudnicki, satya sahoo, sivaram arabandi, werner ceusters, william duncan, william hogan, and yongqun (oliver) he. information artefact ontology (v - - ). https://raw.githubusercontent.com/information-artifact-ontology/iao/v - - /iao.owl, . accessed: - - .
a. ghazvinian, n. f. noy, and m. a. musen. creating mappings for ontologies in biomedicine: simple methods work. amia annu symp proc, : – , .
peter n. robinson, sebastian köhler, sebastian bauer, dominik seelow, denise horn, and stefan mundlos. the human phenotype ontology: a tool for annotating and analyzing human hereditary disease. the american journal of human genetics, ( ): – , . issn - . doi: https://doi.org/ . /j.ajhg. . . .
ariel schwartz and marti hearst. a simple algorithm for identifying abbreviation definitions in biomedical text. pacific symposium on biocomputing. pacific symposium on biocomputing, : – , . doi: . / _ .
katrin fundel, robert küffner, and ralf zimmer. relex—relation extraction using dependency parse trees. bioinformatics, ( ): – , . issn - . doi: . /bioinformatics/btl .
jinhyuk lee, wonjin yoon, sungdong kim, donghyeon kim, sunkyu kim, chan ho so, and jaewoo kang. biobert: a pre-trained biomedical language representation model for biomedical text mining. bioinformatics, ( ): – , . issn - . doi: . /bioinformatics/btz .
peter corbett and john boyle. chemlistem: chemical named entity recognition using recurrent neural networks. journal of cheminformatics, ( ), . doi: . / s - - - .
charles r. evans, alla karnovsky, melissa a. kovach, theodore j. standiford, charles f. burant, and kathleen a. stringer. untargeted lc–ms metabolomics of bronchoalveolar lavage fluid differentiates acute respiratory distress syndrome from health. journal of proteome research, ( ): – , . doi: . /pr .
nikola milosevic, cassie gregson, robert hernandez, and goran nenadic. disentangling the structure of tables in scientific literature. in elisabeth métais, farid meziane, mohamad saraee, vijayan sugumaran, and sunil vadera, editors, natural language processing and information systems, pages – . springer international publishing, . isbn - - - - . doi: https://doi.org/ . / - - - - _ .

apobec mediated c-to-u rna editing: target sequence and trans-acting factor contribution to rna editing events in murine transcripts in-vivo.
saeed soleymanjahi , valerie blanc and nicholas o. davidson ,
division of gastroenterology, department of medicine, washington university school of medicine, st. louis, mo
to whom communication should be addressed: email: nod@wustl.edu
running title: apobec mediated c to u rna editing
keywords: rna folding; a cf; rbm ;
january ,
(which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . .
doi: biorxiv preprint https://doi.org/ . / . . .

abstract ( words)
mammalian c-to-u rna editing was described more than years ago as a single nucleotide modification in apob rna in small intestine, later shown to be mediated by the rna-specific cytidine deaminase apobec . reports of other examples of c-to-u rna editing, coupled with the advent of genome-wide transcriptome sequencing, identified an expanded range of apobec targets. here we analyze the cis-acting regulatory components of verified murine c-to-u rna editing targets, including nearest neighbor as well as flanking sequence requirements and folding predictions. we summarize findings demonstrating the relative importance of trans-acting factors (a cf, rbm ) acting in concert with apobec . using this information, we developed a multivariable linear regression model to predict apobec dependent c-to-u rna editing efficiency, incorporating factors independently associated with editing frequencies based on sanger-confirmed editing sites, which accounted for % of the observed variance. co-factor dominance was associated with editing frequency, with rnas targeted by both rbm and a cf observed to be edited at a lower frequency than rbm dominant targets. the model also predicted a composite score for available human c-to-u rna targets, which again correlated with editing frequency.

introduction
mammalian c-to-u rna editing was identified as the molecular basis for human intestinal apob production more than three decades ago (chen et al. ; hospattankar et al. ; powell et al. ).
a site-specific enzymatic deamination of c to u of apob mrna was originally considered the sole example of mammalian c-to-u rna editing, occurring at a single nucleotide in a kilobase transcript and mediated by an rna specific cytidine deaminase (apobec ) (teng et al. ). with the advent of massively parallel rna sequencing technology we now appreciate that apobec mediated rna editing targets hundreds of sites (rosenberg et al. ; blanc et al. ) mostly within ’ untranslated regions of mrna transcripts. this expanded range of targets of c-to-u rna editing prompted us to reexamine key functional attributes in the regulatory motifs (both cis-acting elements and trans-acting factors) that impact editing frequency, focusing primarily on data emerging from studies of mouse cell and tissue-specific c-to-u rna editing.

earlier studies identified rna motifs (davies et al. ) contained within a -nucleotide segment flanking the edited cytidine base in vivo (in cell lines) or within nucleotides using s extracts from rat hepatoma cells (bostrom et al. ; driscoll et al. ). those, and other studies, established that apob rna editing reflects both the tissue/cell of origin as well as rna elements remote and adjacent to the edited base (bostrom et al. ; davies et al. ). a granular examination of the regions flanking the edited base in apob rna demonstrated a critical ’ sequence - , downstream of c , in which mutations reduced or abolished editing activity (shah et al. ). this ’ site, termed a “mooring sequence”, was associated with a s- “editosome” complex (smith et al. ), which was both necessary and sufficient for site-specific apob rna editing and editosome assembly (backus and smith ). other cis-acting elements include a nucleotide spacer region between the edited cytidine and the mooring sequence, and also sequences ’ of the editing site that regulate editing efficiency (backus and smith ; backus et al. ) along with au-rich regions both ’ and ’ of the edited cytidine that together function in concert with the mooring sequence (hersberger and innerarity ).

advances in our understanding of physiological apob rna editing emerged in parallel from both the delineation of key rna regions (summarized above) and the identification of components of the apob rna editosome (sowden et al. ). apobec , the catalytic deaminase (teng et al. ), is necessary for physiological c-to-u rna editing in vivo (hirano et al. ) and in vitro (giannoni et al. ). using the mooring sequence of apob rna as bait, two groups identified apobec complementation factor (a cf), an rna-binding protein sufficient in vitro to support efficient editing in the presence of apobec and apob mrna (lellek et al. ; mehta et al. ). those findings reinforced the importance of both the mooring sequence and an rna binding component of the editosome in promoting apob rna editing. however, while a cf and apobec are sufficient to support in vitro apob rna editing, neither heterozygous (blanc et al. ) nor homozygous genetic deletion of a cf impaired apob rna editing in vivo in mouse tissues (snyder et al. ), suggesting that an alternate complementation factor was likely involved. other work identified a homologous rna binding protein, rbm , that functioned to promote apob rna editing both in vivo and in vitro (fossat et al. ), and more recent studies utilizing conditional, tissue-specific deletion of a cf and rbm indicate that both factors play distinctive roles in apobec -mediated c-to-u rna editing, including apob as well as a range of other apobec targets (blanc et al. ).
these findings together establish important regulatory roles for both cis-acting elements and trans-acting factors in c-to-u mrna editing. however, the majority of studies delineating cis-acting elements reflect earlier, in vitro experiments using apob mrna and relatively little is known regarding the role of cis-acting elements in tissue-specific c-to-u rna editing of other transcripts, in vivo. here we use statistical modeling to investigate the independent roles of candidate regulatory factors in mouse c-to-u mrna editing using data from in vivo studies from over editing sites in transcripts (meier et al. ; rosenberg et al. ; gu et al. ; blanc et al. ; rayon-estrada et al. ; snyder et al. ; blanc et al. ; kanata et al. ). we also examined these regulatory factors in known human mrna targets (chen et al. ; powell et al. ; skuse et al. ; mukhopadhyay et al. ; grohmann et al. ; schaefermeier and heinze ).

results

descriptive data
c-to-u rna editing sites were identified based on eight studies that met inclusion and exclusion criteria (meier et al. ; rosenberg et al. ; gu et al. ; blanc et al. ; rayon-estrada et al. ; snyder et al. ; blanc et al. ; kanata et al. ), representing distinct rna editing targets. % ( / ) of rna targets were edited at one chromosomal location (figure c) and % ( / ) of mrna targets were edited at both a single chromosomal location and also within a single tissue (figure d).
the majority of editing sites occur in the ’ untranslated region ( / ; %), with exonic editing sites the next most abundant subgroup ( / ; %, figure e). chromosome x harbors the highest number of editing sites ( / ; %), followed by chromosomes and ( / ; . % for both, supplemental figure ). / editing sites were confirmed by sanger sequencing, with a mean editing frequency of ± %.

base content of sequences flanking edited and mutated cytidines
au content was enriched (~ %) in nucleotides both immediately upstream and downstream of the edited cytidine across mouse rna editing targets (figure a and c). the average au content across the region nucleotides upstream to nucleotides downstream of the edited cytidine was ~ % ( - %). because apobec has been shown to be a dna mutator (harris et al. ; wolfe et al. ; wolfe et al. ), we determined the au content of the mutated deoxycytidine region flanking human dna targets (nik-zainal et al. ) to be ~ % at a site one nucleotide downstream of the edited base (figure b, c). the average au content in the sequence nucleotides upstream and nucleotides downstream of mutated deoxycytidines is % ( - . %). the average au content was % and % in nucleotides immediately upstream and downstream, respectively, of the targeted deoxycytidine in a subgroup of over dna editing events of the c to t type (nik-zainal et al. ), which is closer to the distribution found in c to u rna editing targets. these features suggest that au enrichment is an important component to editing function of apobec on both rna and dna targets, especially for the c/dc to u/dt change.
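The au content statistic discussed above can be computed from a sequence and the position of the edited base. The sketch below is one plausible reading, offered under assumptions: the window sizes are parameters because the values used in the study are not legible here, the edited base itself is excluded from the window, dna input is handled by treating t as u, and the function name `au_content` and the example sequence are hypothetical.

```python
def au_content(seq: str, site: int, upstream: int, downstream: int) -> float:
    """Percent A/U in the bases flanking position `site` (0-based).

    Assumption: the edited base is excluded and only the flanks are counted.
    DNA input is accepted by converting T to U before counting.
    """
    seq = seq.upper().replace("T", "U")
    if seq[site] != "C":
        raise ValueError("expected a cytidine at the edited position")
    left = seq[max(0, site - upstream):site]        # bases 5' of the edited C
    right = seq[site + 1:site + 1 + downstream]     # bases 3' of the edited C
    flank = left + right
    if not flank:
        raise ValueError("empty flanking window")
    return 100.0 * sum(base in "AU" for base in flank) / len(flank)


# Made-up sequence with the edited cytidine at index 4.
example = "GAUACAUUG"
print(au_content(example, site=4, upstream=2, downstream=2))
print(au_content(example, site=4, upstream=4, downstream=4))
```

Windows that run past the end of the sequence are truncated, not padded, so the percentage is always taken over the bases that are actually present.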
factors influencing editing frequency

regulatory-spacer-mooring cassette: we observed no significant associations between editing frequency and mismatches in motif a (r=- . , p=. ) or motif b (r=- . , p=. ) (supplemental figure ), while mismatches in motifs c and d were negatively associated with editing frequency (motif c: r=- . , p=. ; motif d: r=- . , p=. ; figure b). au content of motif b showed a trend towards a negative association with editing frequency (r=- . , p=. , figure c), but au contents of motifs a (r= . , p=. ), c (r=- . , p=. ), and d (r=- . , p=. ) were not associated with editing frequency (supplemental figure ). the abundance of g in motif c (r= . , p=. ), abundance of c in motif b (r= . , p=. ), and g/c fraction in motif c (r= . , p=. ) showed either a significant association or a trend towards an association with editing frequency. the spacer sequence averaged ± nucleotides, ranging from to , with a trend towards an association between length and editing frequency (r=- . , p=. ). the mean spacer sequence au content was ± %, with no association between editing frequency and au content (r=- . , p=. , supplemental figure ). however, g abundance (r=- . , p=. ) and g/c fraction (r=- . , p=. ) of the spacer showed significant associations with editing frequency in sanger-confirmed targets. the mean number of mismatches in the first nucleotides of the spacer sequence was . ± , with a higher number of mismatches exerting a significant negative impact on editing frequency (r=- . , p=. ) (figure d). the mean number of mismatches in the mooring sequence was . ± . , ranging from to nucleotides. the number of mismatches showed a significant negative association with editing frequency (r=- . , p=. , figure e). the base content of individual nucleotides surrounding the edited cytidine showed significant associations with editing frequency, which were more pronounced in nucleotides closer to the edited cytidine (figure f, supplemental table ). furthermore, overall au content of downstream sequence + to + had a positive impact on editing frequency (r= . , p=. ) (supplemental figure ). however, g abundance in downstream nucleotides (r=- . , p=. ) and g/c fraction in downstream nucleotides (r=- . , p=. ) showed a significant negative association, or a trend towards one, with editing frequency in sanger-confirmed targets.

secondary structure: we generated a predicted secondary structure for editing sites, with four subgroups based on overall structure and location of the edited cytidine: loop (cloop), stem (cstem), tail (ctail), and non-canonical structure (nc). the majority of editing sites were in the cloop subgroup ( %), followed by cstem ( %), ctail ( %), and nc ( %) subgroups (figure a). editing sites in the ctail subgroup exhibited lower editing frequencies compared to editing sites in cloop ( ± vs ± %, p=. ) or cstem ( ± %, p=. ) subgroups. no significant differences were detected in other comparisons (figure b). the edited cytidine was located in loop, stem, and tail of the secondary structure in ( %), ( %), and ( %) of the edited rnas, respectively. editing sites with the edited cytidine within the loop exhibited significantly higher editing frequency compared to those with the edited cytidine in the tail ( ± % vs ± %, p=. ). other subgroups exhibited comparable editing frequencies (supplemental figure ). the majority ( %) of editing sites contained a mooring sequence located in the main stem-loop structure (figure c), with the remainder located in the tail or secondary loop. average editing efficiency was significantly higher in targets where the mooring sequence was located in the main stem-loop (figure d).
we also calculated the proportion of total nucleotides that constitute the main stem-loop in the secondary structure. the average ratio was . ± . , ranging from . to (supplemental table ), with higher ratios associated with higher editing frequency of the corresponding editing site (r= . , p=. ) (figure e). finally, we considered the orientation of free tails in the secondary structure in terms of length and symmetry. symmetric free tails were observed in % of editing sites (supplemental figure ). the length of the ’ free tail showed a negative association with editing frequency (r=- . , p=. , figure f), while no significant associations were detected between either the length of the ’ tail or symmetry of tails and editing frequency (supplemental figure ).

trans-acting factors and tissue specificity: data for relative dominance of cofactors in apobec -dependent rna editing were available for editing sites for targets in small intestine or liver (blanc et al. ). rbm was identified as the dominant factor in / ( %) sites; a cf was the dominant factor in / ( %) editing sites, with the remaining sites ( / ; %) exhibiting equal co-dominance (figure a). the average editing frequencies at editing sites revealed differences across the groups, with ± % in rbm -dominant targets, ± % in a cf-dominant, and ± % in the co-dominant group (p=. ) (figure b). the majority of rna editing targets were edited in one tissue ( / ; %, figure c), while the maximum number of tissues in which an editing target is edited (at the same site) is (cd ). the small intestine harbors the highest number of verified editing sites ( / ; %), followed by liver ( / ; %), and adipose tissue ( / ; %, figure d).
sites edited in brain tissue showed the highest average editing frequency ( ± %, n= ), followed by bone marrow myeloid cells ( ± %, n= ), and kidney ( ± %, n= figure e). we then developed a multivariable linear regression model to predict apobec dependent c- to-u rna editing efficiency, incorporating factors independently associated with editing frequencies (table ). this model, based on sanger-confirmed editing sites with available data for all of the parameters mentioned, accounted for % of variance in editing frequency of editing sites included (r = . , p<. table ). the final multivariable model revealed several factors independently associated with editing frequency, specifically the number of mismatches in mooring sequence; regulatory sequence motif d; au content of regulatory sequence motif b; overall secondary structure for group ctail vs group cloop; location of mooring sequence in (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . secondary structure; “base content score” parameter that represents base content of the sequences flanking edited cytidine (table ). removing “base content score” from the model reduced the power from r = . to r = . . next, we added a co-factor dominance variable and fit the model using the editing sites with available data for cofactor dominance. along with other factors mentioned above, co-factor dominance showed significant association with editing frequency (table ) with rnas targeted by both rbm and a cf observed to be edited at a lower frequency than rbm dominant targets. factors associated with co-factor dominance (figure , supplemental table , supplemental figure ), included tissue-specificity, with higher frequency of rbm -dominant sites in small intestine compared to liver ( vs %, p=. 
) and a cf-dominant and co-dominant editing sites more prevalent in liver. the number of mooring sequence mismatches also varied among the three subgroups: . ± . in the rbm -dominant subgroup; . ± . in the a cf-dominant subgroup; and . ± . in the co-dominant subgroup (p=. ). this was also the case regarding mismatches in the spacer: . ± . in the rbm -dominant subgroup; . ± . in the a cf-dominant subgroup; . ± . in the co-dominant subgroup (p=. ). au content (%) of downstream sequence + to + was higher in the rbm -dominant subgroup (p=. ). finally, the location of the edited cytidine in the secondary structure of the mrna strand was different across the three subgroups (p=. , figure ). we used pairwise multinomial logistic regression to determine factors independently associated with co-factor dominance (figure c, supplemental table ). ctail editing sites, those with more mismatches in mooring and regulatory motif c, lower au content in downstream sequence, and higher au content in regulatory motif d were more likely co-dominant. editing sites from small intestine and those with higher au content of downstream sequence were more likely rbm -dominant. editing sites from liver and those with higher mismatches in regulatory motif b were more likely a cf-dominant (figure c).

human mrna targets

finally, we turned to an analysis of human c-to-u rna editing targets for which this same panel of parameters was available (table ). aside from apob rna, which is known to be edited in the small intestine (chen et al. ; powell et al. ), other targets have been identified in central or peripheral nervous tissue (skuse et al. ; mukhopadhyay et al. ; meier et al. ; schaefermeier and heinze ). 
the human targets were categorized into low editing (nf , glyrα , glyrα ) and high editing (apob, tph b exon , tph b exon ) subgroups using % as cut-off. a composite score (maximum= ) was generated based on six parameters introduced in the mouse model with notable variance between the two subgroups including mismatches in mooring sequence, spacer length, location of the edited cytidine, and relative abundance of stem-loop bases (table ). high editing targets exhibited a significantly higher composite score ( . vs , p=. ) compared to low editing targets and the composite score significantly correlated with editing frequency in individual targets (r= . , p=. ). the canonical editing target apob (chen et al. ; powell et al. ) achieved a score of (out of ), reflecting the observation that one of the six parameters (au% of regulatory motifs) in human apob is non-preferential compared to the editing-promoting features identified in the mouse multivariable model.

discussion

the current study reflects our analysis of c-to-u rna editing sites from target mrnas, with the majority residing within the ’ untranslated region. our multivariable model identified several key factors influencing editing frequency, including host tissue, base content of nucleotides surrounding the edited cytidine, number of mismatches in regulatory and mooring sequences, au content of the regulatory sequence, overall secondary structure, location of the mooring sequence, and co-factor dominance. these factors, each exerting independent effects, together accounted for % of the variance in editing frequency. 
our findings also showed that mismatches in the mooring and regulatory sequences, au content of regulatory and downstream sequences, host tissue and secondary structure of target mrna were associated with the pattern of co-factor dominance. several aspects of these primary conclusions merit further discussion.

previous studies investigating the key factors that regulate c-to-u mrna editing were confined to in vitro studies and predicated on a single mrna target (apob) (backus and smith ; shah et al. ; smith et al. ; backus and smith ; hersberger and innerarity ). with the expanded range of verified c-to-u rna editing targets now available for interrogation, we revisited the original assumptions to understand more globally the determinants of c-to-u mrna editing efficiency. in undertaking this analysis, we were reminded that the requirements for c-to-u mrna editing in vitro often appear more stringent than in vivo (backus and smith ; shah et al. ), which further emphasizes the importance of our findings. in addition, our approach included both cis-acting sequence- and folding-related predictions along with the role of trans-acting factors and took advantage of statistical modeling to adjust for confounding or modifier effects between these factors to identify their role in editing frequency.

we began with the assumptions established for apob rna editing which identified a nucleotide segment encompassing the edited base, spacer, mooring sequence, and part of regulatory sequence as the minimal sequence competent for physiological editing in vitro and in vivo (davies et al. ; shah et al. ; backus and smith ). 
those studies identified an -nucleotide mooring sequence as essential and sufficient for editosome assembly and site-specific c-to-u editing (backus and smith ; shah et al. ; backus and smith ) and established optimal positioning of the mooring sequence relative to the edited base in apob rna (backus and smith ). the current work supports the key conclusions of this original mooring sequence model as applied to the entire range of c-to-u rna editing targets. we observed that mismatches in either the mooring or regulatory sequences were independent factors governing editing frequency. by contrast, while mismatches in the spacer sequence also showed negative association with editing frequency, the impact of spacer mismatches was not retained in the final model, nor was the length of the spacer associated with editing frequency. furthermore, we found mismatches in the regulatory sequence motif c to be more important than mismatches in motif b. these inconsistencies might conceivably reflect the context in which an rna segment is studied (backus and smith ). for example, our analysis reflects physiological conditions in which naturally occurring mrna targets are edited, while the aforementioned study used in vitro data based on varying lengths of apob mrna embedded within different mrna contexts (apoe rna) (backus and smith ).

in addition to the components of the mooring sequence model, we examined variations in the base content in different segments/motifs as well as among individual nucleotides surrounding the edited cytidine. as expected, we found that sequences flanking the edited cytidine exhibited high au content. we further observed a similarly high au content in the flanking sequences of a range of proposed apobec-mediated dna mutation targets in human cancer tissues and cell lines (alexandrov et al. ; petljak et al. ), especially in targets with dc/dt change (nik-zainal et al. ). this observation implies that apobec-mediated dna and rna editing frequency may each be functionally modified by au enrichment in the flanking sequences surrounding modifiable bases.

the base content in individual nucleotides surrounding the edited cytidine also exerted significant impact on editing frequency, particularly in a -nucleotide segment spanning the edited cytidine (supplemental table ), accounting for % of the variance in editing frequency independent of the mooring sequence model. our findings regarding individual nucleotides surrounding the edited cytidine are consistent with findings for both dna and rna editing targets, particularly in the setting of cancers (backus and smith ; conticello ; roberts et al. ; saraconi et al. ; gao et al. ; arbab et al. ). recent work examining the sequence-editing relationship of a large in vitro library of dna targets edited by different synthetic cytidine base editors (cbes) (arbab et al. ) showed that the base content of a -nucleotide window spanning the edited cytidine explained - % of the editing variance, in particular one or two nucleotides immediately ’ of the edited nucleotide. that study also demonstrated that occurrence of t and c nucleotides at the position - increased, while a g nucleotide at that position decreased editing frequency (arbab et al. ). however, in contrast to our findings, the presence of a at position - had either a negative or null effect on dna editing activity (arbab et al. ). this latter finding is consistent with the lower au content observed in nucleotides adjacent to the edited cytidine in apobec- dna targets compared to the au content in rna targets. 
our findings assign a greater importance of adjacent nucleotides in rna editing frequency, similar to earlier reports that the five bases immediately ’ of the edited cytidine in apob mrna exert a greater impact on editing activity compared to nucleotides further upstream of this segment (backus and smith ; shah et al. ; backus and smith ). the g/c fraction of a -nucleotide window spanning the edited cytidine in dna targets is associated with editing activity of the synthetic cbes (arbab et al. ). although we found significant associations of rna editing with g/c fraction in segments surrounding the edited cytidine in univariate analyses, these associations were not retained in the final model. in contrast, the au content of regulatory sequence motif b remained as an independent factor determining editing frequency in the final model.

the conserved -nucleotide sequence around the edited c forms a stem-loop secondary structure, where the editing site is in an octa-loop (richardson et al. ) as predicted for the -nucleotide sequence of apob mrna (shah et al. ). this stem-loop structure is predicted to play an important role in recognition of the editing site by the editing factors (bostrom et al. ; davies et al. ; driscoll et al. ; chen et al. ). mutations resulting in loss of base pairing in peripheral parts of the stem did not impact the editing frequency (shah et al. ). editing sites with the cytidine located in central parts (e.g. loop) exhibited higher editing frequencies than those with the edited cytidine located in peripheral parts (e.g. tail) and it is worth noting that the computer-based stem-loop structure was independently confirmed by nmr studies of a -nucleotide human apob mrna (maris et al. ). 
those studies demonstrated that the location of the mooring sequence in the apob mrna secondary structure plays a critical role in the rna recognition by a cf (maris et al. ). in line with those findings, the current findings emphasize that the location of the mooring sequence in secondary structure of the target mrna exerts significant independent impact on editing frequency. these predictions were confirmed in crystal structure studies of the carboxyl-terminal domain of apobec- and its interaction with cofactors and substrate rna (wolfe et al. ). our conclusions regarding murine c-to-u editing frequency, such as mooring sequence, base content, and secondary structure, appear consistent with a similar regulatory role among the smaller number of verified human targets. that being said, further study and expanded understanding of the range of c-to-u editing targets in human tissues will be needed as recently suggested (destefanis et al. ), analogous to that for a-to-i editing (bahn et al. ; bazak et al. ).

we recognize that other factors likely contribute to the variance in rna editing frequency not covered by our model. we did not consider the role of naturally occurring variants in apobec , for example, which may be a relevant consideration since mutations in apobec family genes were shown to modify the editing activity of related hybrid dna cytosine base editors (arbab et al. ). furthermore, genetic variants of apobec in humans were associated with altered frequency of glyr editing (kankowski et al. ). other factors not included in our approach included entropy-related features, tertiary structure of the mrna target and other regulatory co-factors. 
another limitation of the tissue-specific designation used to categorize editing frequency is that cell-specific features of editing frequency may have been overlooked. for example, small intestinal and liver preparations are likely a blend of cell types (macparland et al. ; elmentaite et al. ) and tumor tissues are highly heterogeneous in cellular composition (barker et al. ). the current findings provide a platform for future approaches to resolve these questions.

materials and methods

search strategy

a comprehensive literature review was conducted from (when apob rna editing was first reported (chen et al. ; powell et al. )) to november , using studies published in english reporting c-to-u mrna editing frequencies of individual or transcriptome-wide target genes. databases searched included medline, scopus, web of science, google scholar, and proquest (for theses). the references of full texts retrieved were also scrutinized for additional papers not indexed in the initial search.

study selection

primary records (n= ) were screened for relevance, and in vivo studies reporting editing frequencies of individual or transcriptome-wide apobec -dependent c-to-u mrna targets were selected, using a threshold of % editing frequency. for analyses based on rna sequence information, only targets with available sequence information or chromosomal location for the edited cytidine were included. exclusion criteria included: studies that reported c-to-u mrna editing frequencies of target genes in other species, studies reporting editing frequencies of target genes in animal models overexpressing apobec , exclusively in vitro studies, and conference abstracts. 
human targets

we included studies reporting human c-to-u mrna targets (chen et al. ; powell et al. ; skuse et al. ; mukhopadhyay et al. ; grohmann et al. ; schaefermeier and heinze ). we also included work describing apobec -mediated mutagenesis in human breast cancer (nik-zainal et al. ).

data extraction

two reviewers (ss and vb) conducted the extraction process independently and discrepancies were resolved by consensus and input from a third reviewer (nod). the parameters were categorized as follows:

general parameters: gene name (rna target), chromosomal and strand location of the edited cytidine, tissue site, and editing frequency determined by rna-seq or sanger sequencing as illustrated for apob (figure a). editing frequencies determined by the two approaches were highly correlated (r= . p< . ), and where both methodologies were available we used rna-seq. we also defined relative dominance of editing co-factors (a cf-dominant, rbm -dominant, or co-dominant), relative mrna expression (edited gene vs unedited gene) by rna-seq or quantitative rt-pcr, and abundance of corresponding protein (edited gene vs unedited gene) by western blotting or proteomic comparison. co-factor dominance was determined based on the relative contribution of each co-factor to editing frequency. at each editing site, editing frequencies in mouse tissues deficient in a cf or rbm were compared to that of wild-type mice. the relative contribution of each co-factor was calculated by subtracting the editing frequency for each target in a cf or rbm knockout tissue from the total editing frequency in wild-type control. editing sites with < % difference between contributions of rbm and a cf were considered co-dominant. 
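The contribution and dominance rule described in this paragraph can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the argument names and the generic co-factor labels are hypothetical, and the percentage threshold is left as a required argument because its value is not recoverable from the text here.

```python
def classify_dominance(wild_type, knockout_a, knockout_b, threshold,
                       names=("cofactor_a", "cofactor_b")):
    """Classify an editing site by co-factor dominance.

    wild_type, knockout_a, knockout_b: editing frequencies (%) at one site in
    wild-type tissue and in tissue lacking co-factor a or b.
    threshold: difference (%) separating co-dominant from dominant sites;
    hypothetical parameter, supplied by the caller.
    """
    # Contribution of a co-factor = wild-type frequency minus the frequency
    # remaining when that co-factor is knocked out.
    contribution_a = wild_type - knockout_a
    contribution_b = wild_type - knockout_b

    if abs(contribution_a - contribution_b) < threshold:
        return "co-dominant"
    # Otherwise the co-factor with the higher contribution is dominant.
    dominant = names[0] if contribution_a > contribution_b else names[1]
    return f"{dominant}-dominant"
```

For example, with an illustrative threshold of 20, a site edited at 80% in wild-type, 20% without co-factor a and 70% without co-factor b has contributions of 60 and 10, so it would be labelled as dominated by co-factor a.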
sites with ≥ % difference were considered either rbm - or a cf-dominant, depending on the co-factor with higher contribution (blanc et al. ).

sequence-related parameters: a sequence spanning nucleotides upstream and nucleotides downstream of the edited cytidine was extracted for each c-to-u mrna editing site. these sequences were extracted either directly from the full text or using the online ucsc genome browser on mouse (ncbi /mm ) and human (grch /hg ) (https://genome.ucsc.edu/cgi-bin/hggateway). using the mooring sequence model (backus and smith ), three cis-acting elements were considered for each site. these elements included 1) a -nucleotide segment immediately upstream of the edited cytidine as “regulatory sequence”; 2) a -nucleotide segment downstream of the edited cytidine with complete or partial consensus with the canonical “mooring sequence” of apob mrna; 3) the sequence between the edited cytidine and the ’ end of the mooring sequence, referred to as “spacer”. we used an unbiased approach to identify potential mooring sequences by taking the nearest segment to the edited cytidine with the lowest number of mismatch(es) compared to the canonical mooring sequence of apob rna. for each of the three segments, we investigated the number of mismatches compared to the corresponding segment of the apob gene (blanc et al. ), as well as length of spacer, the abundance of a and u nucleotides (au content) and the g to c abundance ratio (g/c fraction (arbab et al. )). we also calculated relative abundance of a, g, c, and u individually across a region nucleotides upstream and nucleotides downstream of the edited cytidine across all editing sites. 
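As a rough sketch of the sequence-related parameters above, the snippet below counts mismatches against a reference segment, scans the downstream sequence for the nearest window with the fewest mismatches to a canonical mooring sequence, and computes AU content. The function names are hypothetical, the canonical sequence is passed in by the caller rather than hard-coded, and the scan assumes the downstream string starts immediately after the edited cytidine, so the window offset equals the spacer length.

```python
def count_mismatches(segment, reference):
    """Number of positions at which two equal-length segments differ."""
    if len(segment) != len(reference):
        raise ValueError("segments must have the same length")
    return sum(a != b for a, b in zip(segment, reference))


def find_mooring(downstream, canonical):
    """Return (spacer_length, mismatches) for the best mooring candidate.

    Windows are scanned from nearest to farthest from the edited cytidine;
    the strict '<' comparison keeps the nearest window among ties.
    Returns None if the downstream sequence is shorter than the canonical one.
    """
    best = None
    width = len(canonical)
    for start in range(len(downstream) - width + 1):
        mm = count_mismatches(downstream[start:start + width], canonical)
        if best is None or mm < best[1]:
            best = (start, mm)
    return best


def au_content(sequence):
    """Percentage of A and U nucleotides in an RNA sequence."""
    seq = sequence.upper()
    return 100.0 * (seq.count("A") + seq.count("U")) / len(seq)
```

With a toy canonical sequence `"UGAU"` (chosen only for illustration) and downstream sequence `"ACUGAUGGUGAU"`, `find_mooring` selects the first perfect match, two nucleotides from the edited base, rather than the later one.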
for comparison, we examined the base content of a sequence spanning nucleotides upstream and downstream of the mutated deoxycytidine for over proposed c to x (t, a, and g) dna mutation targets of the apobec family in human breast cancer (nik-zainal et al. ) along with relative deoxynucleotide distribution in proximity to the edited site.

secondary structure parameters: we used rna-structure (reuter and mathews ) and mfold (zuker ) to determine the secondary structure of an rna cassette consisting of regulatory sequence, edited cytidine, spacer, and mooring sequence. secondary structures similar to that of the cassette for apob chr : , consisting of one loop and stem (with or without unassigned nucleotides with ≤ unpaired bases inside the stem) as the main stem-loop with or without free tail(s) in one or both ends of the stem, were considered as canonical. two other types of secondary structure were considered as non-canonical structures (figure b), with ≥ loops located either at ends of the stem or inside the stem. loops inside the stem were circular open structures with ≥ unpaired bases. editing sites with canonical structure were further categorized into three subgroups based on location of the edited cytidine: loop (cloop), stem (cstem), or tail (ctail). in addition to overall secondary structure, we considered location of the edited cytidine, location of mooring sequence, symmetry of the free tails, and proportion of the nucleotides in the target cassette that constitute the main stem-loop. this proportion is . in the case of apob chr : where all the bases are part of the main stem-loop structure. symmetry was defined based on existence of free tails in both ends of the rna strand. 
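The tail and stem-loop measures defined above can be illustrated with a small sketch. It assumes the predicted structure of a canonical cassette (a single main stem-loop) is available as a dot-bracket string, which is a common export format but is not stated in the text; the function names are hypothetical. Free tails are taken to be the unpaired nucleotides outside the outermost base pair.

```python
def free_tail_lengths(dot_bracket):
    """Lengths of the unpaired tails at the start and end of the strand."""
    first_pair = dot_bracket.find("(")
    last_pair = dot_bracket.rfind(")")
    if first_pair == -1 or last_pair == -1:
        raise ValueError("structure contains no base pairs")
    start_tail = first_pair
    end_tail = len(dot_bracket) - 1 - last_pair
    return start_tail, end_tail


def stem_loop_proportion(dot_bracket):
    """Fraction of cassette nucleotides that lie within the main stem-loop."""
    start_tail, end_tail = free_tail_lengths(dot_bracket)
    return (len(dot_bracket) - start_tail - end_tail) / len(dot_bracket)


def tails_symmetric(dot_bracket):
    """True when free tails exist at both ends of the strand."""
    start_tail, end_tail = free_tail_lengths(dot_bracket)
    return start_tail > 0 and end_tail > 0
```

For a structure with no free tails, such as `"((((....))))"`, every base belongs to the main stem-loop, which matches the situation described for the apob cassette.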
statistical methodology

continuous variables are reported as means ± sd with relative proportions for binary and categorical variables. t-test and anova tests were used to compare continuous parameters of interest between two or more than two groups, respectively. chi-squared testing was used to compare binary or categorical variables among different groups. pearson r testing was used to investigate correlation of two continuous variables. we used linear regression analyses to develop the final model of independent factors that correlate with editing frequency. we used the hosmer and lemeshow approach for model building (hosmer jr et al. ) to fit the multivariable regression model. in brief, we first used bivariate and/or simple regression analyses with p value of . as the cut-off point to screen the variables and detect primary candidates for the multivariable model. subsequently, we fitted the primary multivariable model using candidate variables from the screening phase. a backward elimination method was employed to reach the final multivariable model. parameters with p values < . or those that added to the model fitness were retained. next, the eliminated parameters were added back individually to the final model to determine their impact. plausible interaction terms between final determinants were also checked. the final model was screened for collinearity. we used the same approach to develop a multinomial logistic regression model to identify factors that were independently associated with co-factor dominance in rna editing sites. squared r and pseudo squared r were used to estimate the proportion of variance in responder parameter that could be explained by multivariable linear regression and multinomial logistic regression models, respectively. the same screening and retaining methods were used to investigate association of base content in a sequence nucleotides upstream and nucleotides downstream of the edited cytidine, with editing frequency. however, after determining the nucleotides that were retained in final regression model, a proxy parameter named “base content score” was calculated for each editing site based on the β coefficient values retrieved for individual nucleotides in the model. this parameter was used in the final model as representative variable for base content of the aforementioned sequence in each editing site.

acknowledgments

this work was supported by grants from the national institutes of health grants dk- , dk- , washington university digestive diseases research core center p dk- (to nod).

references

ucsc genome browser on mouse (ncbi /mm ; ) and human (grch /hg ; ) assemblies.
alexandrov lb, nik-zainal s, wedge dc, aparicio sa, behjati s, biankin av, bignell gr, bolli n, borg a, borresen-dale al et al. . signatures of mutational processes in human cancer. nature : - .
arbab m, shen mw, mok b, wilson c, matuszek z, cassa ca, liu dr. . determinants of base editing outcomes from target library analysis and machine learning. cell : - e .
backus jw, schock d, smith hc. . only cytidines ' of the apolipoprotein b mrna mooring sequence are edited. biochim biophys acta : - .
backus jw, smith hc. . 
apolipoprotein b mrna sequences ' of the editing site are necessary and sufficient for editing and editosome assembly. nucleic acids res : - .
backus jw, smith hc. . three distinct rna sequence elements are required for efficient apolipoprotein b (apob) rna editing in vitro. nucleic acids res : - .
bahn jh, lee jh, li g, greer c, peng g, xiao x. . accurate identification of a-to-i rna editing in human by transcriptome sequencing. genome res : - .
barker n, ridgway ra, van es jh, van de wetering m, begthel h, van den born m, danenberg e, clarke ar, sansom oj, clevers h. . crypt stem cells as the cells-of-origin of intestinal cancer. nature : - .
bazak l, haviv a, barak m, jacob-hirsch j, deng p, zhang r, isaacs fj, rechavi g, li jb, eisenberg e et al. . a-to-i rna editing occurs at over a hundred million genomic sites, located in a majority of human genes. genome res : - .
blanc v, henderson jo, newberry ep, kennedy s, luo j, davidson no. . targeted deletion of the murine apobec- complementation factor (acf) gene results in embryonic lethality. molecular and cellular biology : - .
blanc v, park e, schaefer s, miller m, lin y, kennedy s, billing am, ben hamidane h, graumann j, mortazavi a et al. . genome-wide identification and functional analysis of apobec- -mediated c-to-u rna editing in mouse small intestine and liver. genome biol : r .
blanc v, xie y, kennedy s, riordan jd, rubin dc, madison bb, mills jc, nadeau jh, davidson no. . apobec complementation factor (a cf) and rbm interact in tissue-specific regulation of c to u rna editing in mouse intestine and liver. rna : - .
bostrom k, lauer sj, poksay ks, garcia z, taylor jm, innerarity tl. . apolipoprotein b rna editing in chimeric apolipoprotein eb mrna. 
j biol chem : - .
chen sh, habib g, yang cy, gu zw, lee br, weng sa, silberman sr, cai sj, deslypere jp, rosseneu m et al. . apolipoprotein b- is the product of a messenger rna with an organ-specific in-frame stop codon. science : - .
chen sh, li xx, liao ws, wu jh, chan l. . rna editing of apolipoprotein b mrna. sequence specificity determined by in vitro coupled transcription editing. j biol chem : - .
conticello sg. . creative deaminases, self-inflicted damage, and genome evolution. annals of the new york academy of sciences : - .
davies ms, wallis sc, driscoll dm, wynne jk, williams gw, powell lm, scott j. . sequence requirements for apolipoprotein b rna editing in transfected rat hepatoma cells. j biol chem : - .
destefanis e, avsar g, groza p, romitelli a, torrini s, pir p, conticello sg, aguilo f, dassi e. . a mark of disease: how mrna modifications shape genetic and acquired pathologies. rna.
driscoll dm, wynne jk, wallis sc, scott j. . an in vitro system for the editing of apolipoprotein b mrna. cell : - .
elmentaite r, ross adb, roberts k, james kr, ortmann d, gomes t, nayak k, tuck l, pritchard s, bayraktar oa et al. . single-cell sequencing of developing human gut reveals transcriptional links to childhood crohn's disease. dev cell.
fossat n, tourle k, radziewic t, barratt k, liebhold d, studdert jb, power m, jones v, loebel da, tam pp. . c to u rna editing mediated by apobec requires rna-binding protein rbm . embo rep : - .
gao j, choudhry h, cao w. . apolipoprotein b mrna editing enzyme catalytic polypeptide-like family genes activation and regulation during tumorigenesis. cancer science : - .
giannoni f, bonen dk, funahashi t, hadjiagapiou c, burant cf, davidson no. . 
complementation of apolipoprotein b mrna editing by human liver accompanied by secretion of apolipoprotein b . j biol chem : - .
grohmann m, hammer p, walther m, paulmann n, buttner a, eisenmenger w, baghai tc, schule c, rupprecht r, bader m et al. . alternative splicing and extensive rna editing of human tph transcripts. plos one : e .
gu t, buaas fw, simons ak, ackert-bicknell cl, braun re, hibbs ma. . canonical a-to-i and c-to-u rna editing is enriched at 'utrs and microrna target sites in multiple mouse tissues. plos one : e .
harris rs, bishop kn, sheehy am, craig hm, petersen-mahrt sk, watt in, neuberger ms, malim mh. . dna deamination mediates innate immunity to retroviral infection. cell : - .
hersberger m, innerarity tl. . two efficiency elements flanking the editing site of cytidine in the apolipoprotein b mrna support mooring-dependent editing. j biol chem : - .
hirano k, young sg, farese rv, jr., ng j, sande e, warburton c, powell-braxton lm, davidson no. . targeted disruption of the mouse apobec- gene abolishes apolipoprotein b mrna editing and eliminates apolipoprotein b . j biol chem : - .
hosmer jr dw, lemeshow s, sturdivant rx. . applied logistic regression. john wiley & sons.
hospattankar av, higuchi k, law sw, meglin n, brewer hb, jr. . identification of a novel in-frame translational stop codon in human intestine apob mrna. biochem biophys res commun : - .
kanata e, llorens f, dafou d, dimitriadis a, thune k, xanthopoulos k, bekas n, espinosa jc, schmitz m, marin-moreno a et al. . rna editing alterations define manifestation of prion diseases. proc natl acad sci u s a : - .
kankowski s, forstera b, winkelmann a, knauff p, wanker ee, you xa, semtner m, hetsch f, meier jc. . 
a novel rna editing sensor tool and a specific agonist determine neuronal protein expression of rna-edited glycine receptors and identify a genomic apobec dimorphism as a new genetic risk factor of epilepsy. front mol neurosci : .
lellek h, kirsten r, diehl i, apostel f, buck f, greeve j. . purification and molecular cloning of a novel essential component of the apolipoprotein b mrna editing enzyme- complex. j biol chem : - .
macparland sa, liu jc, ma xz, innes bt, bartczak am, gage bk, manuel j, khuu n, echeverri j, linares i et al. . single cell rna sequencing of human liver reveals distinct intrahepatic macrophage populations. nat commun : .
maris c, masse j, chester a, navaratnam n, allain fh. . nmr structure of the apob mrna stem-loop and its interaction with the c to u editing apobec complementary factor. rna : - .
mehta a, kinter mt, sherman ne, driscoll dm. . molecular cloning of apobec- complementation factor, a novel rna-binding protein involved in the editing of apolipoprotein b mrna. mol cell biol : - .
meier jc, henneberger c, melnick i, racca c, harvey rj, heinemann u, schmieden v, grantyn r. . rna editing produces glycine receptor alpha (p l), resulting in high agonist potency. nat neurosci : - .
mukhopadhyay d, anant s, lee rm, kennedy s, viskochil d, davidson no. . c-->u editing of neurofibromatosis mrna occurs in tumors that express both the type ii transcript and apobec- , the catalytic subunit of the apolipoprotein b mrna-editing enzyme. am j hum genet : - .
nik-zainal s, alexandrov lb, wedge dc, van loo p, greenman cd, raine k, jones d, hinton j, marshall j, stebbings la et al. . mutational processes molding the genomes of breast cancers. cell : - . 
petljak m, alexandrov lb, brammeld js, price s, wedge dc, grossmann s, dawson kj, ju ys, iorio f, tubio jmc et al. . characterizing mutational signatures in human cancer cell lines reveals episodic apobec mutagenesis. cell : - e . powell lm, wallis sc, pease rj, edwards yh, knott tj, scott j. . a novel form of tissue- specific rna processing produces apolipoprotein-b in intestine. cell : - . rayon-estrada v, harjanto d, hamilton ce, berchiche ya, gantman ec, sakmar tp, bulloch k, gagnidze k, harroch s, mcewen bs et al. . epitranscriptomic profiling across cell types reveals associations between apobec -mediated rna editing, gene expression outcomes, and cellular function. proc natl acad sci u s a : - . (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . reuter js, mathews dh. . rnastructure: software for rna secondary structure prediction and analysis. bmc bioinformatics : . richardson n, navaratnam n, scott j. . secondary structure for the apolipoprotein b mrna editing site. au-binding proteins interact with a stem loop. j biol chem : - . roberts sa, lawrence ms, klimczak lj, grimm sa, fargo d, stojanov p, kiezun a, kryukov gv, carter sl, saksena g et al. . an apobec cytidine deaminase mutagenesis pattern is widespread in human cancers. nat genet : - . rosenberg br, hamilton ce, mwangi mm, dewell s, papavasiliou fn. . transcriptome- wide sequencing reveals numerous apobec mrna-editing targets in transcript ' utrs. nat struct mol biol : - . saraconi g, severi f, sala c, mattiuz g, conticello sg. . the rna editing enzyme apobec induces somatic mutations and a compatible mutational signature is present in esophageal adenocarcinomas. genome biol : . schaefermeier p, heinze s. . 
hippocampal characteristics and invariant sequence elements distribution of glra and glra c-to-u editing. mol syndromol : - . shah rr, knott tj, legros je, navaratnam n, greeve jc, scott j. . sequence requirements for the editing of apolipoprotein b mrna. j biol chem : - . skuse gr, cappione aj, sowden m, metheny lj, smith hc. . the neurofibromatosis type i messenger rna undergoes base-modification rna editing. nucleic acids res : - . smith hc, kuo sr, backus jw, harris sg, sparks ce, sparks jd. . in vitro apolipoprotein b mrna editing: identification of a s editing complex. proc natl acad sci u s a : - . (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . snyder em, mccarty c, mehalow a, svenson kl, murray sa, korstanje r, braun re. . apobec complementation factor (a cf) is dispensable for c-to-u rna editing in vivo. rna : - . sowden m, hamm jk, spinelli s, smith hc. . determinants involved in regulating the proportion of edited apolipoprotein b rnas. rna : - . teng b, burant cf, davidson no. . molecular cloning of an apolipoprotein b messenger rna editing protein. science : - . wolfe ad, arnold db, chen xs. . comparison of rna editing activity of apobec -a cf and apobec -rbm complexes reconstituted in hek t cells. j mol biol : - . wolfe ad, li s, goedderz c, chen xs. . the structure of apobec and insights into its rna and dna substrate selectivity. nar cancer : zcaa . zuker m. . mfold web server for nucleic acid folding and hybridization prediction. nucleic acids res : - . (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . table . 
multivariable linear regression model for determinant factors of editing frequency in mouse apobec -dependent c-to-u mrna editing sites.

determinant of editing frequency | subgroup | ß ( % ci) | p value

model without co-factor group (n= ; r = . ; p<. )
base content score | per unit increments | . [ . , . ] | < .
count of mismatches in mooring sequence | per unit increments | - . [- . , - . ] | <.
count of mismatches in regulatory sequence motif d (whole sequence) | per unit increments | - . [- . , - . ] | .
au content of regulatory sequence motif b | per % increments | - . [- . , - . ] | .
overall secondary structure | c loop | reference |
overall secondary structure | c stem | . [- . , . ] | .
overall secondary structure | c tail | - . [- . , - . ] | .
overall secondary structure | non-canonical | - . [- . , - . ] | .
location of mooring sequence | stem-loop | reference |
location of mooring sequence | other | - . [- . , - . ] | <.

after adding co-factor group to the model (n= ; r = . ; p<. )
co-factor group | rbm dominant | reference |
co-factor group | co-dominant | - . [- . , - . ] | .
co-factor group | a cf dominant | . [ . , . ] | .

ß: represents average change (%) in the editing frequency compared to the reference group
ci: confidence interval
table : characteristics of human c-to-u mrna editing targets

column groups: low editing, high editing
parameter | nf | glycra | glycra | tph b | tph b | apob
editing location | c | c | c | c (exon ) | c (exon ) | c
tissue | neural sheath / cns tumor | hippocampus | hippocampus | amygdala | amygdala | small intestine
editing frequency (%) | >
mismatches in regulatory motif a
mismatches in regulatory motif b
mismatches in regulatory motif c
mismatches in regulatory motif d
au content (%) in regulatory motif a
au content (%) in regulatory motif b
au content (%) in regulatory motif c*
au content (%) in regulatory motif d
spacer length*
spacer au content (%)
mismatches in spacer
mismatches in mooring*
au content (%) of downstream bases*
au content (%) of downstream bases
overall secondary structure | canonical | canonical | canonical | canonical | canonical | canonical
location of edited c* | loop | tail | tail | stem | loop | loop
location of mooring sequence | stem-loop | stem-loop | stem-loop | stem-loop | stem-loop | stem-loop
ratio of stem-loop bases* | . | . | . | . | . | .
free tail orientation | symmetric | symmetric | asymmetric | symmetric | asymmetric | asymmetric
composite score

cns: central nervous system
* these items were used to calculate the composite score (total score = ) as follows:
au content (%) in regulatory motif c: < %: , ≥ %:
spacer length: ≤ : , > :
mismatches in mooring: < : , ≥ :
au content (%) of downstream bases: > %: , ≤ %:
location of edited c in secondary structure: stem-loop: , tail:
ratio of stem-loop bases: > %: , ≤ %:

figure legends

figure . characteristics of murine apobec -mediated c-to-u mrna editing sites. a: schematic presentation of mrna target, chromosomal editing location, and editing sites considered.
each mrna target could be edited at one or more chromosomal location(s) (blue boxes). each editing location could be edited in one or more tissues giving rise to one or more editing site(s) per location (green boxes). editing site(s) of each mrna target are the sum of editing sites from all editing locations reported for that target. b: examples of canonical (apob chr : , top) and two types of non-canonical (kctd chr : and dcn chr : ) secondary structures. c: distribution of number of chromosomal editing location(s), or targeted cytidine(s), per mrna target. d: distribution of number of total editing sites per mrna target considering all chromosomal location(s) edited at different tissue(s). e: distribution of location of editing sites within gene structure.

figure . base content of sequences flanking modified cytidine in rna editing and dna mutation targets. a: base content of nucleotides upstream and nucleotides downstream of edited cytidine in mouse apobec -mediated c-to-u mrna editing targets. b: base content of nucleotides upstream and nucleotides downstream of mutated cytidine in proposed human apobec-mediated dna mutation targets in patients with breast cancer. c: comparison of au base content (%) of nucleotides flanking modified cytidine in rna editing targets and dna mutation targets in mouse and human breast cancer patients, respectively.

figure . characteristics of regulatory-spacer-mooring cassette and base content of individual nucleotides flanking edited cytidine in association with editing frequency. a: schematic illustration of regulatory-spacer-mooring cassette. four motifs were defined for
regulatory sequence: motif a for nucleotides - to - ; motif b for nucleotides - to - ; motif c for nucleotides - to - ; motif d representative of the whole sequence. b: association of the mismatches in motif d of regulatory sequence with editing frequency. c: association between the au content (%) of regulatory sequence (motif b) and editing frequency. d: association of the mismatches in spacer (nucleotides + to + downstream of the edited cytidine) with editing frequency. e: association of the mismatches in mooring sequence with editing frequency. f: heatmap plot illustrating the association between base content of nucleotides flanking the edited cytidine and editing frequency. red color density in each cell represents the beta coefficient value of the corresponding base in the multivariable linear regression model fit including that nucleotide. the asterisks refer to the nucleotides that were retained in the final model. mismatches in regulatory, spacer, and mooring sequences were determined in comparison to the corresponding sequences in apob mrna (as reference). r: pearson correlation coefficient.

figure . secondary structure-related features in association with editing frequency. a: distribution of different types of overall secondary structure in editing sites. c loop, c stem, c tail are three subtypes of canonical secondary structure based on the location of the edited cytidine. b: association between type of secondary structure and editing frequency. c: distribution of the mooring sequence location in editing sites. “other” refers to mooring sequences located in tail or stem/loop and not part of the main stem-loop structure. d: association of mooring sequence location with editing frequency. e: association between ratio of main stem-loop bases to total bases count and editing frequency. f: association of the ’ free tail length with editing frequency. * p<. ; ** p<. . r: pearson correlation coefficient.
figure . dominance and tissue-specific cofactor patterns among editing sites. a: distribution of dominant co-factor in editosomes of editing sites. b: association of dominant co-factor with editing frequency. c: distribution of number of editing tissue(s) per mrna target. d: tissue distribution of editing sites. e: average editing frequency of editing sites edited at different tissues. si, small intestine.

figure . co-factor pattern and tissue-specific role in murine c-to-u mrna editing sites. a: distribution of editing tissue across subgroups of editing sites with different dominant co-factor patterns. b: location of edited cytidine in secondary structure of editing sites with different dominant co-factor patterns. c: schematic presentation of factors that correlate with dominant co-factor pattern in editing sites. this graph is based on the findings derived from pairwise multinomial logistic regression models.

supplemental figure legends

supplemental figure . chromosomal distribution of murine apobec -mediated c-to-u mrna editing sites. the black curve corresponds to the left y-axis and represents average editing frequencies of editing sites related to each chromosome. the blue curve corresponds to the right y-axis and represents number of editing sites related to each chromosome.

supplemental figure . association of editing frequency with characteristics of regulatory sequence in murine apobec -mediated c-to-u mrna editing sites. a-c.
association of editing frequency with number of mismatches and au content (%). d-f. association of editing frequency with different regulatory sequence motifs. mismatches were determined in comparison to the same regulatory sequence motif in apob mrna (as reference).

supplemental figure . association of editing frequency with characteristics of downstream sequence in murine apobec -mediated c-to-u mrna editing sites. a. association of editing frequency with spacer length. b. association of editing frequency with spacer au content (%). c-f. association of editing frequency with and au content of successive segments downstream of the edited cytidine.

supplemental figure . association of editing frequency with secondary structure-related characteristics in c-to-u mrna editing sites. a: distribution of edited cytidine location in secondary structure regardless of the overall secondary structure. b: association of editing frequency with edited cytidine location in secondary structure. c: distribution of free tail orientation in editing sites. d: association of editing frequency with free tail orientation in editing sites. e: association of editing frequency with ’ free tail length. * p<. ; *** p<. . r: pearson correlation coefficient.

supplemental figure . association of secondary structure-related characteristics with dominant co-factor pattern in apobec -mediated c-to-u mrna editing sites. a. distribution of mooring sequence location presented in the context of different dominant co-factor patterns. b. distribution of free tail orientation in secondary structure among editing sites, presented in the context of different dominant co-factor patterns.
supplemental table . multivariable linear regression model for individual nucleotides surrounding edited cytosine (- to + ) in mouse apobec -dependent c-to-u mrna editing sites.

location of nucleotide relative to edited c | base preference | ß ( % ci) | p value
nucleotide - | gu | . [ . , . ] | .
nucleotide - | c | . [ . , . ] | .
nucleotide - | g | . [ . , . ] | .
nucleotide - | u | . [ . , . ] | .
nucleotide - | auc | . [ . , . ] | < .
nucleotide - | au | . [ . , . ] | .
nucleotide + | agu | . [ . , . ] | < .
nucleotide + | g | . [ . , . ] | < .
nucleotide + | g | . [ . , . ] | < .
nucleotide + | c | . [ . , . ] | .
nucleotide + | g | . [ . , . ] | .
nucleotide + | auc | . [ . , . ] | .
nucleotide + | ac | . [ . , . ] | .
nucleotide + | au | . [ . , . ] | .
nucleotide + | au | . [ . , . ] | .
nucleotide + | ac | . [ . , . ] | .

ß: represents average change (%) in the editing frequency compared to the reference group (non-preferred group)
ci: confidence interval

supplemental table . descriptive data of regulatory-spacer-mooring cassette in mouse apobec -dependent c-to-u mrna editing sites.

parameter | n | mean | sd | min | max

sequence-related features
mismatches in regulatory (motif a) . .
mismatches in regulatory (motif b) . .
mismatches in regulatory (motif c) . .
mismatches in regulatory (motif d) . .
au content (%) of regulatory (motif a) . .
au content (%) of regulatory (motif b) . .
au content (%) of regulatory (motif c) . .
au content (%) of regulatory (motif d) . .
spacer length . .
mismatches in spacer . .
au content (%) of spacer . .
mismatches in mooring . .
au content (%) of downstream sequence + to + . .
au content (%) of downstream sequence + to + . .
au content (%) of downstream sequence + to + . .
au content (%) of downstream sequence + to + . .

secondary structure-related features
proportion of the bases that constitute main stem-loop . . .
length of ’ free tail . .
length of ’ free tail . .

sd: standard deviation

supplemental table . comparing three subgroups of mouse apobec -dependent c-to-u mrna editing sites based on co-factor dominance.

parameter | rbm -dominant (n, mean, sd) | a cf-dominant (n, mean, sd) | co-dominant (n, mean, sd) | p value
mismatches in regulatory (motif a) . . . . . . .
mismatches in regulatory (motif b) . . . . . . .
mismatches in regulatory (motif c) . . . . . . .
mismatches in regulatory (motif d) . . . . . . .
au content (%) of regulatory (motif a) . . . . . . .
au content (%) of regulatory (motif b) . . . . . . .
au content (%) of regulatory (motif c) . . . . . . .
au content (%) of regulatory (motif d) . . . . . . .
spacer length . . . . . . .
mismatches in spacer (in -base cassette) . . . . . . .
mismatches in spacer (relative abundance (%)) . . . . . . .
au content (%) of spacer . . . . . . .
mismatches in mooring . . . . . . .
au content (%) of downstream sequence + to + . . . . . . .
au content (%) of downstream sequence + to + . . . . . . .
au content (%) of downstream sequence + to + . . . . . . .
au content (%) of downstream sequence + to + . . . . . . .
proportion of the bases that constitute main stem-loop . . . . . . .
length of ’ free tail . . . . . . .
length of ’ free tail . . . . . . .

sd: standard deviation
supplemental table . multinomial logistic regression model for determinant factors of co-factor dominancy in mouse apobec -dependent c-to-u mrna editing sites.

determinant of co-factor dominancy | subgroup | coefficient ( % ci) | p value

a cf-dominant vs rbm -dominant
tissue | small intestine | reference |
tissue | liver | . [ . , . ] | .
location of edited cytosine | loop | reference |
location of edited cytosine | stem | - . [- . , . ] | .
location of edited cytosine | tail | - . [- . , - . ] | < .
mismatches in mooring sequence | per unit increments | . [- . , . ] | .
mismatches in regulatory sequence motif b | per unit increments | . [ . , . ] | .
mismatches in regulatory sequence motif c | per unit increments | . [- . , . ] | .
au content (%) of regulatory sequence motif d | per unit increments | . [- . , . ] | .
au content (%) of downstream sequence + to + | per unit increments | - . [- . , . ] | .
au content (%) of downstream sequence + to + | per unit increments | - . [- . , - . ] | .
au content (%) of downstream sequence + to + | per unit increments | - . [- . , . ] | .

co-dominant vs rbm -dominant
tissue | small intestine | reference |
tissue | liver | - . [- . , . ] | .
location of edited cytosine in secondary structure | c loop | reference |
location of edited cytosine in secondary structure | c stem | . [- . , . ] | .
location of edited cytosine in secondary structure | c tail | . [ . , . ] | .
mismatches in mooring sequence | per unit increments | . [ . , . ] | .
mismatches in regulatory sequence motif b | per unit increments | - . [- . , - . ] | .
mismatches in regulatory sequence motif c | per unit increments | . [ . , . ] | .
au content (%) of regulatory sequence motif d | per unit increments | . [ . , . ] | .
au content (%) of downstream sequence + to + | per unit increments | - . [- . , - . ] | .
au content (%) of downstream sequence + to + | per unit increments | - . [- . , . ] | .
au content (%) of downstream sequence + to + | per unit increments | - . [- . , - . ] | .

co-dominant vs a cf -dominant
tissue | small intestine | reference |
tissue | liver | - . [- . , - . ] | .
location of edited cytosine in secondary structure | c loop | reference |
location of edited cytosine in secondary structure | c stem | . [ . , . ] | .
location of edited cytosine in secondary structure | c tail | . [ . , . ] | < .
mismatches in mooring sequence | per unit increments | . [- . , . ] | .
mismatches in regulatory sequence motif b | per unit increments | - . [- . , - . ] | .
mismatches in regulatory sequence motif c | per unit increments | . [ . , . ] | .
au content (%) of regulatory sequence motif d | per unit increments | - . [- . , . ] | .
au content (%) of downstream sequence + to + | per unit increments | - . [- . , . ] | .
au content (%) of downstream sequence + to + | per unit increments | - . [- . , . ] | .
au content (%) of downstream sequence + to + | per unit increments | - . [- . , . ] | .

model parameters: n= ; pseudo r = . ; p<.
ci: confidence interval
evaluating the transcriptional fidelity of cancer models

da peng*, rachel gleyzer*, wen-hsin tai, pavithra kumar, qin bian, bradley issacs, edroaldo lummertz da rocha, stephanie cai, kathleen dinapoli, franklin w huang, patrick cahan

department of biomedical engineering, johns hopkins university school of medicine, baltimore md usa
institute for cell engineering, johns hopkins university school of medicine, baltimore md usa
department of microbiology, immunology and parasitology, federal university of santa catarina, florianópolis sc, brazil
department of cell biology, johns hopkins university school of medicine, baltimore, md usa
department of electrical and computer engineering, johns hopkins university, baltimore md usa
division of hematology/oncology, department of medicine; helen diller family cancer center; bakar computational health sciences institute; institute for human genetics; university of california, san francisco, san francisco, ca
department of molecular biology and genetics, johns hopkins university school of medicine, baltimore md usa

* these authors made equal contributions.
correspondence to: patrick.cahan@jhmi.edu
article type: research
website: http://www.cahanlab.org/resources/cancercellnet_web
code: https://github.com/pcahan /cancercellnet

the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license (http://creativecommons.org/licenses/by-nc-nd/ . /). this version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint

abstract

background: cancer researchers use cell lines, patient derived xenografts, engineered mice, and tumoroids as models to investigate tumor biology and to identify therapies. the generalizability and power of a model derive from the fidelity with which it represents the tumor type under investigation; however, the extent to which this is true is often unclear. the preponderance of models and the ability to readily generate new ones has created a demand for tools that can measure the extent and ways in which cancer models resemble or diverge from native tumors.

methods: we developed a machine learning based computational tool, cancercellnet, that measures the similarity of cancer models to naturally occurring tumor types and subtypes, in a platform and species agnostic manner. we applied this tool to cancer cell lines, patient derived xenografts, distinct genetically engineered mouse models, and tumoroids. we validated cancercellnet by application to independent data, and we tested several predictions with immunofluorescence.

results: we have documented the cancer models with the greatest transcriptional fidelity to natural tumors, we have identified cancers underserved by adequate models, and we have found models with annotations that do not match their classification. by comparing models across modalities, we report that, on average, genetically engineered mice and tumoroids have higher transcriptional fidelity than patient derived xenografts and cell lines in four out of five tumor types.
however, several patient derived xenografts and tumoroids have classification scores that are on par with native tumors, highlighting both their potential as faithful model classes and their heterogeneity.

conclusions: cancercellnet enables the rapid assessment of transcriptional fidelity of tumor models. we have made cancercellnet available as freely downloadable software and as a web application that can be applied to new cancer models and that allows for direct comparison to the cancer models evaluated here.

introduction

models are widely used to investigate cancer biology and to identify potential therapeutics. popular modeling modalities are cancer cell lines (ccls), genetically engineered mouse models (gemms), patient derived xenografts (pdxs), and tumoroids. these classes of models differ in the types of questions that they are designed to address. ccls are often used to address cell intrinsic mechanistic questions, gemms to chart progression of molecularly defined disease, and pdxs to explore patient-specific response to therapy in a physiologically relevant context. more recently, tumoroids have emerged as relatively inexpensive, physiological, in vitro d models of tumor epithelium with applications ranging from measuring drug responsiveness to exploring tumor dependence on cancer stem cells. models also differ in the extent to which they represent specific aspects of a cancer type.
even with this intra- and inter-class model variation, all models should represent the tumor type or subtype under investigation, and not another type of tumor or a non-cancerous tissue. therefore, cancer models should be selected not only based on the specific biological question but also based on the similarity of the model to the cancer type under investigation. various methods have been proposed to determine the similarity of cancer models to their intended subjects. domcke et al devised a 'suitability score' as a metric of the molecular similarity of ccls to high grade serous ovarian carcinoma based on a heuristic weighting of copy number alterations, mutation status of several genes that distinguish ovarian cancer subtypes, and hypermutation status. other studies have taken analogous approaches by focusing on either transcriptomic or ensemble molecular profiles (e.g. transcriptomic and copy number alterations) to quantify the similarity of cell lines to tumors. these studies were tumor-type specific, focusing on ccls that model, for example, hepatocellular carcinoma or breast cancer. notably, yu et al compared the transcriptomes of ccls to the cancer genome atlas (tcga) by correlation analysis, resulting in a panel of ccls recommended as most representative of tumor types. most recently, najgebauer et al and salvadores et al have developed methods to assess ccls using molecular traits such as copy number alterations (cna), somatic mutations, dna methylation and transcriptomics.
while all of these studies have provided valuable information, they leave two major challenges unmet. the first challenge is to determine the fidelity of gemms, pdxs, and tumoroids, and whether there are stark differences between these classes of models and ccls. the other major unmet challenge is to enable the rapid assessment of new, emerging cancer models. this challenge is especially relevant now because technical barriers to generating models have been substantially lowered, and because new models such as pdxs and tumoroids can be derived on a patient-specific basis, so each should be considered a distinct entity requiring individual validation.

to address these challenges, we developed cancercellnet (ccn), a computational tool that uses transcriptomic data to quantitatively assess the similarity between cancer models and naturally occurring tumor types and subtypes in a platform- and species-agnostic manner. here, we describe ccn’s performance, and the results of applying it to assess ccls, pdxs, gemms, and tumoroids. this has allowed us to identify the most faithful models currently available, to document cancers underserved by adequate models, and to find models with inaccurate tumor type annotation. moreover, because ccn is open-source and easy to use, it can be readily applied to newly generated cancer models as a means to assess their fidelity.

results

cancercellnet classifies samples accurately across species and technologies

previously, we had developed a computational tool using the random forest classification method to measure the similarity of engineered cell populations to their in vivo counterparts based on transcriptional profiles. more recently, we elaborated on this approach to allow for classification of single cell rna-seq data in a manner that allows for cross-platform and cross-species analysis. here, we used an analogous approach to build a
platform that would allow us to quantitatively compare cancer models to naturally occurring patient tumors (fig a). in brief, we used tcga rna-seq expression data from solid tumor types to train a top-pair multi-class random forest classifier (fig b). we combined training data from rectal adenocarcinoma (read) and colon adenocarcinoma (coad) into one coad_read category because read and coad are considered to be virtually indistinguishable at a molecular level. we included an ‘unknown’ category trained using randomly shuffled gene-pair profiles generated from the training data of tumor types to identify query samples that are not reflective of any of the training data.

to estimate the performance of ccn and how it is impacted by parameter variation, we performed a parameter sweep with a -fold / cross-validation strategy (i.e. / of the data sampled across each cancer type was used to train, / was used to validate) (fig c). the performance of ccn, as measured by the mean area under the precision recall curve (auprc), did not fall below . and remained relatively stable across parameter sets (supp fig a). the optimal parameters resulted in , features. the mean auprcs exceeded . in most tumor types with this optimal parameter set (fig d, supp fig b). the auprcs of ccn applied to independent rna-seq data from tumors across five tumor types from the international cancer genome consortium (icgc) ranged from . to . , supporting the notion that the platform is able to accurately classify tumor samples from diverse sources (fig e).
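the gene-pair features and the ‘unknown’ category described above can be illustrated with a minimal sketch. it assumes one common formulation of a top-pair transform, in which each feature records whether one gene is expressed more highly than another within the same sample. the function names, the toy data, and the within-profile shuffling scheme are assumptions made for illustration; the text does not specify how the shuffling was done, and this is not the ccn implementation.

```python
import numpy as np

def top_pair_transform(expr, pairs):
    """Binarize expression into gene-pair features.

    expr  : (n_samples, n_genes) array of expression values
    pairs : sequence of (i, j) column-index pairs
    Returns an (n_samples, n_pairs) int8 array: 1 where gene i > gene j.
    """
    expr = np.asarray(expr, dtype=float)
    pairs = np.asarray(pairs, dtype=int)
    return (expr[:, pairs[:, 0]] > expr[:, pairs[:, 1]]).astype(np.int8)

def shuffled_unknown_profiles(pair_features, rng):
    """Build 'unknown' profiles by permuting the entries of each profile.

    ASSUMPTION: within-profile permutation is only one possible scheme.
    """
    pair_features = np.asarray(pair_features)
    return np.stack([rng.permutation(row) for row in pair_features])

# Toy, made-up data: 2 samples x 3 genes, and 2 hypothetical gene pairs.
expr = np.array([[1.0, 2.0, 3.0],
                 [3.0, 2.0, 1.0]])
pairs = [(0, 1), (2, 0)]

features = top_pair_transform(expr, pairs)
# The comparison is made within a sample, so a strictly increasing
# rescaling of the values (here log2(x + 1)) leaves the features unchanged.
features_log = top_pair_transform(np.log2(expr + 1.0), pairs)
unknown = shuffled_unknown_profiles(features, np.random.default_rng(0))
```

because each feature depends only on the ordering of two genes within one sample, any strictly increasing rescaling of a sample leaves the features unchanged, which is one plausible reason such features transfer between profiling platforms.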
as one of the central aims of our study is to compare distinct cancer models, including gemms, our method needed to be able to classify mouse and human samples equivalently. we used the top-pair transform to achieve this and we tested the feasibility of this approach by assessing the performance of a normal (i.e. non-tumor) cell and tissue classifier trained on human data as applied to mouse samples. consistent with prior applications, we found that the cross-species classifier performed well, achieving a mean auprc of . when applied to mouse data (supp fig c).

to evaluate cancer models at a finer resolution, we also developed an approach to perform tumor subtype classifications (supp fig d). we constructed different cancer subtype classifiers based on the availability of expression or histological subtype information. we also included non-cancerous, normal tissues as categories for several subtype classifiers when sufficient data was available: breast invasive carcinoma (brca), coad_read, head and neck squamous cell carcinoma (hnsc), kidney renal clear cell carcinoma (kirc) and uterine corpus endometrial carcinoma (ucec). the subtype classifiers all achieved high overall average auprcs ranging from . to . (supp fig e).

fidelity of cancer cell lines

having validated the performance of ccn, we then used it to determine the fidelity of ccls.
we mined rna-seq expression data of different cell lines across cancer types from the cancer cell line encyclopedia (ccle) and applied ccn to them, finding a wide classification range for cell lines of each tumor type (fig a, supp tab ). to verify the classification results, we applied ccn to expression profiles from ccle generated through microarray expression profiling. to ensure that ccn would function on microarray data, we first tested it by applying a ccn classifier created to test microarray data to expression profiles of tumor types. the cross-platform ccn classifier performed well, based on the comparison to study-provided annotation, achieving a mean auprc of . (supp fig a). next, we applied this cross-platform classifier to microarray expression profiles from ccle (supp fig b). from the classification results of cell lines that have both rna-seq and microarray expression profiles, we found a strong overall positive association between the classification scores from rna-seq and those from microarray (supp fig c). this comparison supports the notion that the classification scores for each cell line are not artifacts of profiling methodology. moreover, this comparison shows that the scores are consistent between the times that the cell lines were first assayed by microarray expression profiling in and by rna-seq in . we also observed a high level of correlation between our analysis and the analysis done by yu et al. (supp fig d), further validating the robustness of the ccn results.
next, we assessed the extent to which ccn classifications agreed with their nominal tumor type of origin, which entailed translating quantitative ccn scores to classification labels. to achieve this, we selected a decision threshold that maximized the macro f measure, the harmonic mean of precision and recall, across cross validations. then, we annotated cell lines based on their ccn score profile as follows. cell lines with ccn scores > threshold for the tumor type of origin were annotated as 'correct'. cell lines with ccn scores > threshold in the tumor type of origin and at least one other tumor type were annotated as 'mixed'. cell lines with ccn scores > threshold for tumor types other than that of the cell line's origin were annotated as 'other'. cell lines that did not receive a ccn score > threshold for any tumor type were annotated as 'none' (fig b). we found that the majority of cell lines originally annotated as breast invasive carcinoma (brca), cervical squamous cell carcinoma and endocervical adenocarcinoma (cesc), skin cutaneous melanoma (skcm), colorectal cancer (coad_read) and sarcoma (sarc) fell into the 'correct' category (fig b). on the other hand, no esophageal carcinoma (esca), pancreatic adenocarcinoma (paad) or brain lower grade glioma (lgg) cell lines were classified as 'correct', demonstrating the need for more transcriptionally faithful cell lines that model those general cancer types.

there are several possible explanations for cell lines not receiving a 'correct' classification. one possibility is that the sample was incorrectly labeled in the study from which we harvested the expression data. consistent with this explanation, we found that colorectal cancer line nci-h , a cell line labelled as liver hepatocellular carcinoma (lihc) by ccle, was classified strongly as coad_read (supp tab ). another possibility to explain a low ccn score is that cell lines were derived from subtypes of tumors that are not well-represented in tcga.
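the thresholding and annotation rules described earlier in this section can be sketched as follows. this is a minimal illustration under stated assumptions: the macro f measure is computed one-vs-rest at a single shared threshold, 'correct' is read as the origin type being the only type above threshold, and all names, scores, and the candidate threshold grid are made up for the example.

```python
import numpy as np

def macro_f(y_true, scores, classes, threshold):
    """Mean over classes of the harmonic mean of precision and recall,
    calling a class 'predicted' when its score exceeds the threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    per_class = []
    for k, cls in enumerate(classes):
        pred = scores[:, k] > threshold
        truth = y_true == cls
        tp = int(np.sum(pred & truth))
        fp = int(np.sum(pred & ~truth))
        fn = int(np.sum(~pred & truth))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(per_class))

def pick_threshold(y_true, scores, classes, grid):
    """Return the candidate threshold with the highest macro f measure."""
    return max(grid, key=lambda t: macro_f(y_true, scores, classes, t))

def annotate(score_by_type, origin, threshold):
    """Map a score profile to 'correct', 'mixed', 'other' or 'none'."""
    above = {t for t, s in score_by_type.items() if s > threshold}
    if not above:
        return "none"
    if origin in above:
        return "correct" if above == {origin} else "mixed"
    return "other"

# Made-up validation scores for two hypothetical classes "A" and "B".
y_true = np.array(["A", "A", "B", "B"])
scores = np.array([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.2, 0.8]])
best = pick_threshold(y_true, scores, ["A", "B"], grid=[0.1, 0.5, 0.95])
label = annotate({"A": 0.7, "B": 0.6}, origin="A", threshold=best)
```

in this toy example the middle candidate separates the two classes perfectly, so it is selected, and a profile that exceeds it for both its origin type and another type is labelled 'mixed'.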
to explore this hypothesis, we first performed tumor subtype classification on ccls from tumor types for which we had trained subtype classifiers (supp tab ). we reasoned that if a cell line was a good model for a rarer subtype, then it would receive a poor general classification but a high classification for the subtype that it models well. therefore, we counted the number of lines that fit this pattern. we found that of the lines with no general classification, ( %) were classified as a specific subtype, suggesting that derivation from rare subtypes is not the major contributor to the poor overall fidelity of ccls.

another potential contributor to low scoring cell lines is intra-tumor stromal and immune cell impurity in the training data. if impurity were a confounder of ccn scoring, then we would expect a strong positive correlation between mean purity and mean ccn classification scores of ccls per general tumor type. however, the pearson correlation coefficient between the mean purity of general tumor type and mean ccn classification scores of ccls in the corresponding general tumor type was low ( . ), suggesting that tumor purity is not a major contributor to the low ccn scores across ccls (supp fig e).

comparison of skcm and gbm ccls to scrna-seq

to more directly assess the impact of intra-tumor heterogeneity in the training data on evaluating cell lines, we constructed a classifier using cell types found in human melanoma and glioblastoma scrna-seq data. previously, we have demonstrated the feasibility of using our classification approach on scrna-seq data.
our scrna-seq classifier achieved a high average auprc ( . ) when applied to held-out data and a high mean auprc ( . ) when applied to a few purified bulk testing samples (supp fig a-b). comparing the ccn scores from the bulk rna-seq general classifier and the scrna-seq classifier, we observed a high level of correlation (pearson correlation of . ) between the skcm ccn classification scores and scrna-seq skcm malignant ccn classification scores for skcm cell lines (fig c, supp fig c). of the skcm cell lines that were classified as skcm by the bulk classifier, were also classified as skcm malignant cells by the scrna-seq classifier. interestingly, we also observed a high correlation between the sarc ccn classification score and scrna-seq cancer associated fibroblast (caf) ccn classification scores (pearson correlation of . ). six of the seven skcm cell lines that had been classified as exclusively sarc by ccn were classified as caf by the scrna-seq classifier (fig d, supp fig c), which suggests the possibility that these cell lines were derived from caf or other mesenchymal populations, or that they have acquired a mesenchymal character through their derivation. the high level of agreement between scrna-seq and bulk rna-seq classification results shows that heterogeneity in the training data of the general ccn classifier has little impact on the classification of skcm cell lines. in contrast, we observed a weaker correlation between gbm ccn classification scores and scrna-seq gbm neoplastic ccn classification scores (pearson correlation of . ) for gbm cell lines (fig e, supp fig d).
of the gbm lines that were not classified as gbm with ccn, were classified as gbm neoplastic cells with the scrna-seq classifier. among the gbm lines that were classified as sarc with ccn, cell lines were classified as caf (fig f), which were classified as both gbm neoplastic and caf in the scrna-seq classifier. similar to the situation with skcm lines that classify as caf, this result is consistent with the possibility that some gbm lines classified as sarc by ccn could be derived from mesenchymal subtypes exhibiting both strong mesenchymal signatures and glioblastoma signatures, or that they have acquired a mesenchymal character through their derivation. the lower level of agreement between scrna-seq and bulk rna-seq classification results for gbm models suggests that the heterogeneity of glioblastomas can impact the classification of gbm cell lines, and that the use of an scrna-seq classifier can resolve this deficiency.

immunofluorescence confirmation of ccn predictions

to experimentally explore some of our computational analyses, we performed immunofluorescence on three cell lines that were not classified as their labelled categories: the ovarian cancer line sk-ov- had a high ucec ccn score ( . ), the ovarian cancer line a had a high testicular germ cell tumors (tgct) ccn score ( . ), and the prostate cancer line pc- had a high bladder cancer (blca) score ( . ) (supp tab ). we reasoned that if sk-ov- , a and pc- were classified most strongly as ucec, tgct and blca, respectively, then they would express proteins that are indicative of these cancer types.
first, we measured the expression of the uterine-associated transcription factor hoxb , and the ucec serous ovarian tumor biomarker wt in sk-ov- , in the ov cell line caov- , and in the ucec cell line hec- . we chose caov- as our positive control for ov biomarker expression because it was determined by our analysis and others to be a good model of ov. likewise, we chose hec- to be a positive control for ucec. we found that sk-ov- has a small percentage ( %) of cells that expressed the uterine marker hoxb and a large proportion ( %) of cells that expressed wt (fig a). in contrast, no caov- cells expressed hoxb , whereas % of cells expressed wt . this suggests that sk-ov- exhibits biomarkers of both ovarian tumor and uterine tissue. from our computational analysis and experimental validation, sk-ov- is most likely an endometrioid subtype of ovarian cancer. this result is also consistent with prior classification of sk-ov- , and with the fact that sk-ov- lacks p mutations, which are prevalent in high-grade serous ovarian cancer, and harbors an endometrioid-associated mutation in arid a .

next, we measured the expression of markers of ov and germ cell cancers (lin a ) in the ov-annotated cell line a , which received a high tgct ccn score. we found that % of a cells expressed lin a whereas it was not detected in caov- (fig b). the ov marker wt was also expressed in fewer a cells as compared to caov- ( % vs %), which suggests that a could be a germ cell derived ovarian tumor. taken together, our results suggest that sk-ov- and a could represent ov subtypes that are not well represented in tcga training data, which resulted in a low ov score and a higher ccn score in other categories.

lastly, we examined pc- , annotated as a prad cell line but classified to be most similar to blca. we found that % of the pc- cells expressed pparg, a contributor to urothelial differentiation that is not detected in the prad vcap cell line but is highly expressed
in the blca rt cell line (fig c). pc- cells also expressed the prad biomarker folh , suggesting that pc- has a prad origin and gained urothelial or luminal characteristics through the derivation process. in short, our limited experimental data support the ccn classification results.

subtype classification of cancer cell lines

next, we explored the subtype classification of ccls from three general tumor types in more depth. we focused our subtype visualization (fig a-c) on ccl models with general ccn score above . in their nominal cancer type as this allowed us to analyze those models that fell below the general threshold but were classified as a specific sub-type (supp tab - ). focusing first on ucec, the histologically defined subtypes of ucec, endometrioid and serous, differ in prevalence, molecular properties, prognosis, and treatment. for instance, the endometrioid subtype, which accounts for approximately % of uterine cancers, retains estrogen receptor and progesterone receptor status and is responsive towards progestin therapy. serous, a more aggressive subtype, is characterized by the loss of estrogen and progesterone receptor and is not responsive to progestin therapy. ccn classified the majority of the ucec cell lines as serous except for jhuem- , which is classified as mixed, with similarities to both endometrioid and serous (fig a). the preponderance of ccle lines of serous versus endometrioid character may be due to properties of serous cancer cells that promote their in vitro propagation, such as upregulation of cell adhesion transcriptional programs.
some of our subtype classification results are consistent with prior observations. for example, hec- a, hec- b, and kle were previously characterized as type ii endometrial cancer, which includes a serous histological subtype. on the other hand, our subtype classification results contradict prior observations in at least one case. for instance, the ishikawa cell line was derived from type i endometrial cancer (endometrioid histological subtype); however, ccn classified a derivative of this line, ishikawa er-, as serous. the high serous ccn score could result from a shift in phenotype of the line concomitant with its loss of estrogen receptor (er), as this is a distinguishing feature of type ii endometrial cancer (serous histological subtype). taken together, these results indicate a need for more endometrioid-like ccls.

next, we examined the subtype classification of lung squamous cell carcinoma (lusc) and lung adenocarcinoma (luad) cell lines (fig b-c). all the lusc lines with at least one subtype classification had an underlying primitive subtype classification. this is consistent either with the ease of deriving lines from tumors with a primitive character, or with a process by which cell line derivation promotes similarity to a more primitive subtype, which is marked by increased cellular proliferation. some of our results are consistent with prior reports that have investigated the resemblance of some lines to lusc subtypes. for example, hcc- , previously characterized as classical, had a maximum ccn score in the classical subtype ( . ).
similarly, ludlu- and eplc- h, previously reported as classical and basal respectively, had maximal tumor subtype ccn scores for these sub-types ( . and . ) (fig b, supp tab ) despite being classified as unknown. lastly, the luad cell lines that were classified as a subtype were either classified as proximal inflammation or proximal proliferation (fig c). rerf-lc-ad had the highest general classification score and the highest proximal inflammation subtype classification score. taken together, these subtype classification results have revealed an absence of cell line models for basal and secretory lusc, and for the terminal respiratory unit (tru) luad subtype.

cancer cell lines’ popularity and transcriptional fidelity

finally, we sought to measure the extent to which cell line transcriptional fidelity related to model prevalence. we used the number of papers in which a model was mentioned, normalized by the number of years since the cell line was documented, as a rough approximation of model prevalence. to explore this relationship, we plotted the normalized citation count versus general classification score, labeling the highest cited and highest classified cell lines from each general tumor type (fig d). for most of the general tumor types, the highest cited cell line is not the highest classified cell line except for hep g , ags and ml- , representing liver hepatocellular carcinoma (lihc), stomach adenocarcinoma (stad), and thyroid carcinoma (thca), respectively.
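the prevalence measure described above (papers mentioning a model divided by years since the line was documented) can be sketched as follows. the record fields, the reference year, and every value in the toy data are hypothetical and chosen only to show the calculation.

```python
from dataclasses import dataclass

@dataclass
class CellLine:
    name: str
    tumor_type: str
    n_papers: int          # papers mentioning the line (made-up values below)
    year_documented: int   # year the line was first documented (made-up)
    ccn_score: float       # general classification score (made-up)

def normalized_citations(line, current_year):
    """Papers per year since documentation; at least one year is assumed."""
    years = max(current_year - line.year_documented, 1)
    return line.n_papers / years

def most_cited_vs_best(lines, current_year):
    """Per tumor type: (most cited line, highest scoring line, same line?)."""
    by_type = {}
    for line in lines:
        by_type.setdefault(line.tumor_type, []).append(line)
    result = {}
    for tumor_type, group in by_type.items():
        cited = max(group, key=lambda l: normalized_citations(l, current_year))
        best = max(group, key=lambda l: l.ccn_score)
        result[tumor_type] = (cited.name, best.name, cited.name == best.name)
    return result

# Hypothetical lines; names and numbers are placeholders, not real data.
lines = [
    CellLine("line_a", "TYPE1", 100, 2000, 0.2),
    CellLine("line_b", "TYPE1", 30, 2010, 0.9),
    CellLine("line_c", "TYPE2", 50, 2015, 0.8),
]
summary = most_cited_vs_best(lines, current_year=2020)
```

in the toy data the most cited line of the first type is not its highest scoring line, which mirrors the kind of mismatch the comparison is meant to surface.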
on the other hand, the general scores of the highest cited cell lines representing blca (t ), brca (mda-mb- ), and prad (pc- ) fall below the classification threshold of . . notably, each of these tumor types has other lines with scores exceeding . , which should be considered as more faithful transcriptional models when selecting lines for a study (supp tab and http://www.cahanlab.org/resources/cancercellnet_results/).

evaluation of patient derived xenografts

next, we sought to evaluate a more recent class of cancer models: pdx. to do so, we subjected the rna-seq expression profiles of pdx models from different cancer types generated previously to ccn. similar to the results of ccls, the pdxs exhibited a wide range of classification scores (fig a, supp tab ). by categorizing the ccn scores of pdxs based on the proportion of samples associated with each tumor type that were correctly classified, we found that sarc, skcm, coad_read and brca have a higher proportion of correctly classified pdxs than other cancer categories (fig b). in contrast to ccls, we found a higher proportion of correctly classified pdxs in stad, paad and kirc (fig b). however, similar to ccls, no esca pdxs were classified as such. this held true when we performed subtype classification on pdx samples: none of the pdxs in esca were classified as any of the esca subtypes (supp tab ). ucec pdxs included endometrioid, serous, and mixed subtypes, which provided a broader representation than ccls (fig c). several lusc pdxs that were classified as a subtype were also classified as head and neck squamous cell carcinoma (hnsc) or as a mix of hnsc and lusc (fig d). this could be due to the similarity in expression profiles of basal and classical subtypes of hnsc and lusc, which is
consistent with the observation that these pdxs were also subtyped as classical. no lusc pdxs were classified as the secretory subtype. in contrast to luad ccls, four of the five luad pdxs with a discernible sub-type were classified as proximal inflammatory (fig e). on the other hand, similar to the ccls, there were no tru subtypes in the luad pdx cohort. in summary, we found that while individual pdxs can reach extremely high transcriptional fidelity to both general tumor types and subtypes, many pdxs were not classified as the general tumor type from which they originated.

evaluation of gemms

next, we used ccn to evaluate gemms of six general tumor types from nine studies for which expression data was publicly available. as was true for ccls and pdxs, gemms also had a wide range of ccn scores (fig a, supp tab ). we next categorized the ccn scores based on the proportion of samples associated with each tumor type that were correctly classified (fig b). in contrast to lgg ccls, lgg gemms, generated by nf mutations expressed in different neural progenitors in combination with pten deletion, consistently were classified as lgg (fig a-b). the gemm dataset included multiple replicates per model, which allowed us to examine intra-gemm variability. both at the level of ccn score and at the level of categorization, gemms were invariant. for example, replicates of ucec gemms driven by prg(cre/+)pten(lox/lox) received almost identical general ccn scores (fig c, supp tab ). gemms sharing genotypes across studies, such as luad gemms driven by kras mutation and loss of p , also received similar general and subtype classification scores (fig a,b,e).

next, we explored the extent to which genotype impacted subtype classification in ucec, lusc, and luad.
prg(cre/+)pten(lox/lox) gemms had a mixed subtype classification of both serous and endometrioid, consistent with the fact that pten loss occurs in both subtypes (albeit more frequently in endometrioid). we also analyzed prg(cre/+)pten(lox/lox)csf r-/- gemms. polymorphonuclear neutrophils (pmns), which play anti-tumor roles in endometrioid cancer progression, are depleted in these animals. interestingly, prg(cre/+)pten(lox/lox)csf r-/- gemms had a serous subtype classification, which could be explained by differences in pmn involvement in endometrioid versus serous uterine tumor development that are reflected in the respective transcriptomes of the tcga ucec training data. we note that the tumor cells were sorted prior to rna-seq and thus the shift in subtype classification is not due to contamination of gemms with non-tumor components. in short, this analysis supports the argument that tumor-cell extrinsic factors, in this case a reduction in anti-tumor pmns, can shift the transcriptome of a gemm so that it more closely resembles a serous rather than endometrioid subtype.

the lusc gemms that we analyzed were lkb fl/fl and they either overexpressed sox (via two distinct mechanisms) or were also ptenfl/fl. we note that the eight lenti-sox -cre-infected;lkb fl/fl and rosa lsl-sox -ires-gfp;lkb fl/fl samples that classified as 'unknown' had lusc ccn scores only modestly lower than the decision threshold (fig d) (mean ccn score = . ). thirteen of the sox gemms classified as the secretory subtype of lusc.
the consistency is not surprising given that both models overexpress sox and lose lkb . on the other hand, the lkb fl/fl;ptenfl/fl gemms had substantially lower general lusc ccn scores and our subtype classification indicated that this gemm was mostly classified as 'unknown', in contrast to prior reports suggesting that it is most similar to a basal subtype. none of the three lusc gemms have strong classical ccn scores. most of the luad gemms, which were generated using various combinations of activating kras mutation, loss of trp , and loss of smarca l , were correctly classified (fig e). those that were not classified had modestly lower ccn scores than the decision threshold (mean ccn score = . ). there were no substantial differences in general or subtype classification across driver genotypes. although the sub-type of all luad gemms was 'unknown', the subtypes tended to have a mixture of high ccn proximal proliferation, proximal inflammation and tru scores. taken together, this analysis suggests that there is a degree of similarity, and perhaps plasticity, between the primitive and secretory (but not basal or classical) subtypes of lusc. on the other hand, while the luad gemms classify strongly as luad, they do not have a strong particular subtype classification, a result that does not vary by genotype.

evaluation of tumoroids

lastly, we used ccn to assess a relatively novel cancer model: tumoroids.
we downloaded and assessed distinct tumoroid expression profiles spanning cancer categories from the nci patient-derived models repository (pdmr) and from three individual studies (fig a, supp tab ). we note that several categories have three or fewer samples (brca, cesc, kirp, ov, lihc, and blca from pdmr). among the cancer categories represented by more than three samples, only lusc and paad have fewer than % classified as their annotated label (fig b). in contrast to gbm ccls, all three induced pluripotent stem cell-derived gbm tumoroids were classified as gbm with high ccn scores (mean = . ). to further characterize the tumoroids, we performed subtype classification on them (supp tab ). ucec tumoroids from pdmr contain a wide range of subtypes with two endometrioid, two serous and one mixed type (fig c). on the other hand, lusc tumoroids appear to be predominantly of classical subtypes with one tumoroid classified as a mix between classical and primitive (fig d). lastly, similar to the ccl and pdx counterparts, luad tumoroids are classified as proximal inflammatory and proximal proliferation with no tumoroids classified as tru subtype (fig e).

comparison of ccls, pdxs, gemms and tumoroids

finally, we sought to estimate the comparative transcriptional fidelity of the four cancer model modalities. we compared the general ccn scores of each model on a per tumor type basis (fig ). in the case of gemms, we used the mean classification score of all samples with shared genotypes. we also used the mean classification of technical replicates found in lihc tumoroids. we evaluated models based on both the maximum ccn score, as this represents
the potential for a model class, and the median ccn score, as this indicates the current overall transcriptional fidelity of a model class. pdxs achieved the highest ccn scores in three (ucec, paad, luad) out of the five cancer categories in which all four modalities were available (fig ), despite having low median ccn scores. notably, pdxs have a median ccn score above the . threshold in paad while none of the other three modalities have any samples above the threshold. in lihc, the highest ccn score for pdx ( . ) is only slightly lower than the highest ccn score for tumoroid ( . ). this suggests that certain individual pdxs most closely mimic the transcriptional state of native patient tumors despite a portion of the pdxs having low ccn scores. similarly, while the majority of the ccls have low ccn scores, several lines achieve high transcriptional fidelity in lusc, luad and lihc (fig ). collectively, gemms and tumoroids had the highest median ccn scores in four of the five model classes (lusc and luad for gemms and ucec and lihc for tumoroids). notably, both of the lihc tumoroids achieved ccn scores on par with patient tumors (fig ). in brief, this analysis indicates that pdxs and ccls are heterogeneous in terms of transcriptional fidelity, with a portion of the models highly mimicking native tumors and the majority of the models having low transcriptional fidelity (with the exception of paad for pdxs). on the other hand, gemms and tumoroids displayed a consistently high fidelity across different models.

because the ccn score is based on a moderate number of gene features (i.e. , gene pairs consisting of , unique genes) relative to the total number of protein-coding genes in the genome, it is possible that a cancer model with a high ccn score might not have a high global similarity to a naturally occurring tumor.
therefore, we also calculated the grn status, a metric of the extent to which a tumor-type specific gene regulatory network is established , for all models (supp fig ). we observed a high level of correlation between the two similarity metrics, which suggests that although ccn classifies on a selected set of genes, its scores are highly correlated with a global assessment of transcriptional similarity.

we also sought to compare model modalities in terms of the diversity of subtypes that they represent (supp fig ). as a reference, we also included in this analysis the overall subtype incidence, as approximated by incidence in tcga. replicates in gemms and tumoroids were averaged into one classification profile. in models of ucec, there is a notable difference between endometrioid incidence and the proportion of models classified as endometrioid, with only pdxs and tumoroids having any representatives (supp fig ). all of the ccl, gemm, and tumoroid models of paad have an unknown subtype classification and no correct general classification. however, the majority of pdxs are subtyped as either a mixture of basal and classical, or classical alone. luad has proximal inflammatory and proximal proliferation subtypes modelled by ccls and pdxs (supp fig ). likewise, lusc has basal, classical and primitive subtypes modelled by ccls and pdxs, and the secretory subtype modelled by gemms exclusively (supp fig ). taken together, these results demonstrate the need to carefully select different model systems to more suitably model certain cancer subtypes.
discussion

a major goal in the field of cancer biology is to develop models that mimic naturally occurring tumors with enough fidelity to enable therapeutic discoveries. however, methods to measure the extent to which cancer models resemble or diverge from native tumors are lacking. this is especially problematic now because there are many existing models from which to choose, and it has become easier to generate new models. here, we present cancercellnet (ccn), a computational tool that measures the similarity of cancer models to naturally occurring tumor types and subtypes. while the similarity of ccls to patient tumors has already been explored in previous work, our tool introduces the capability to assess the transcriptional fidelity of pdxs, gemms, and tumoroids. because ccn is platform- and species-agnostic, it represents a consistent platform to compare models across modalities including ccls, pdxs, gemms and tumoroids. here, we applied ccn to cancer cell lines, patient derived xenografts, distinct genetically engineered mouse models and tumoroids. several insights emerged from our computational analyses that have implications for the field of cancer biology.

first, pdxs have the greatest potential to achieve transcriptional fidelity in three out of five general tumor types for which data from all modalities were available, as indicated by the high scores of individual pdxs. notably, pdxs are the only modality with samples classified as paad. at the same time, the median ccn scores of pdxs were lower than those of gemms and tumoroids in the other four tumor types.
it is unclear what causes such a wide range of ccn scores within pdxs. we suspect that some pdxs might have undergone selective pressures in the host that distort the progression of genomic alterations away from what is observed in natural tumors . future work to understand this heterogeneity is important so as to yield consistently high fidelity pdxs, and to identify intrinsic and host-specific factors that so powerfully shape the pdx transcriptome.

second, in general gemms and tumoroids have higher median ccn scores than those of pdxs and ccls. this is also consistent with the fact that gemms are typically derived by recapitulating well-defined driver mutations of natural tumors, and thus this observation corroborates the importance of genetics in the etiology of cancer . moreover, in contrast to most pdxs, gemms are typically generated in immune replete hosts. therefore, the higher overall fidelity of gemms may also be a result of the influence of a native immune system on gemm tumors . the high median ccn scores of tumoroids can be attributed to several factors including the increased mechanical stimuli and cell-cell interactions that come from d self-organizing cultures , .

third, we have found that none of the samples that we evaluated here are transcriptionally adequate models of esca. this may be due to an inherent lability of the esca transcriptome that is often preceded by a metaplasia that has obscured determining its cell type(s) of origin . therefore, this tumor type requires further attention to derive new models.

fourth, we found that in several tumor types, gemms tend to reflect mixtures of subtypes rather than conforming strongly to single subtypes. the reasons for this are not clear, but it is possible that in the cases that we examined the histologically defined subtypes have a degree of plasticity that is exacerbated in the murine host environment.

lastly, we recognize that many ccls are not classified as their annotated labels. while we have suggested that the lack of an immune component is not a major confounder, we suspect that the ccls could undergo genetic divergence due to a high number of passages, chemotherapy before biopsy, culture conditions and genetic instability – , which could all be factors that drive ccls away from their labelled tumors.

currently, there are several limitations to our ccn tool, and caveats to our analyses, which indicate areas for future work and improvement. first, ccn is based on transcriptomic data, but other molecular readouts of tumor state, such as profiles of the proteome , epigenome , non-coding rna-ome , and genome , would be equally, if not more, important to mimic in a model system. therefore, it is possible that some models reflect tumor behavior well, and because this behavior is not well predicted by the transcriptome alone, these models have lower ccn scores. to both measure the extent to which such situations exist, and to correct for them, we plan in the future to incorporate other omic data into ccn so as to make more accurate and integrated model evaluation possible. as a first step in this direction, we plan to incorporate dna methylation and genomic sequencing data as additional features for our random forest classifier, as these data are becoming more readily available for both training and cancer models. we expect that this will allow us both to refine our tumor subtype categories and to predict more accurately how models respond to perturbations such as drug treatment.
a second limitation is that in the cross-species analysis, ccn implicitly assumes that homologs are functionally equivalent. the extent to which they are not functionally equivalent determines how confounded the ccn results will be. this possibility seems to be of limited consequence based on the high performance of the normal tissue cross-species classifier and based on the fact that gemms have the highest median ccn scores (in addition to tumoroids). a third caveat to our analysis is that there were far fewer distinct gemms and tumoroids than ccls and pdxs. as more transcriptional profiles for gemms and tumoroids emerge, this comparative analysis should be revisited to assess the generality of our results. finally, the tcga training data is made up of rna-seq from bulk tumor samples, which necessarily includes non-tumor cells, whereas the ccls are by definition cell lines of tumor origin. therefore, ccls theoretically could have artificially low ccn scores due to the presence of non-tumor cells in the training data. this problem appears to be limited, as we found no correlation between tumor purity and ccn score in the ccle samples. however, this problem is related to the question of intra-tumor heterogeneity. we demonstrated the feasibility of using ccn and single cell rna-seq data to refine the evaluation of cancer cell lines, contingent upon the availability of scrna-seq training data. as more single cell rna-seq training data accrue, ccn would be able to evaluate models not only on a per cell type basis, but also based on cellular composition.
we have made the results of our analyses available online so that researchers can easily explore the performance of selected models or identify the best models for any of the general tumor types and the subtypes presented here. to ensure that ccn is widely available we have developed a free web application, which performs ccn analysis on user-uploaded data and allows for direct comparison of their data to the cancer models evaluated here. we have also made the ccn code freely available under an open source license and as an easily installed r package, and we are actively supporting its further development. included in the web application are instructions for training ccn and reproducing our analysis. the documentation describes how to analyze models and compare the results to the panel of models that we evaluated here, thereby allowing researchers to immediately compare their models to the broader field in a comprehensive and standard fashion.

online methods

training general cancercellnet classifier

to generate training data sets, we downloaded , patient tumor rna-seq expression count matrix and their corresponding sample table across different tumor types from tcga using the tcgaworkflowdata, tcgabiolinks and summarizedexperiment packages. we used all the patient tumor samples for training the general ccn classifier. we limited training and analysis of rna-seq data to the , genes in common between the tcga dataset and all the query samples (ccls, pdxs, gemms, and tumoroids). to train the top pair random forest classifier, we used a method similar to our previous method .
ccn first normalized the training counts matrix by down-sampling the counts to , counts per sample. to significantly reduce the execution time and memory of generating gene pairs for all possible genes, ccn then selected n up-regulated genes, n down-regulated genes and n least differentially expressed genes (ccn training parameter ntopgenes = n) for each of the cancer categories using template matching as the genes to generate top scoring gene pairs. in short, for each tumor type, ccn defined a template vector that labelled the training tumor samples in the cancer type of interest as and all other tumor samples as . ccn then calculated the pearson correlation coefficient between the template vector and the gene expression values for all genes. the genes with a strong match to the template, as either upregulated or downregulated, had a large absolute pearson correlation coefficient. ccn chose the upregulated, downregulated and least differentially expressed genes based on the magnitude of the pearson correlation coefficient. after ccn selected the genes for each cancer type, ccn generated gene pairs among those genes. gene pair transformation was a method inspired by the top-scoring pair classifier to allow compatibility of the classifier with query expression profiles that were collected through different platforms (e.g. microarray query data applied to rna-seq training data). in brief, the gene pair transformation compares genes within an expression sample and encodes the “gene _gene ” gene-pair as if the first gene has higher expression than the second gene.
otherwise, gene pair transformation would encode the gene-pair as . using all the gene pair combinations generated through the gene sets per cancer type, ccn then selected the top m discriminative gene pairs (ccn training parameter ntopgenepairs = m) for each category using the template matching described above (with large absolute pearson correlation coefficient). to prevent any single gene from dominating the gene pair list, we allowed each gene to appear at most three times among the gene pairs selected as features per cancer type. after the top discriminative gene pairs were selected for each cancer category, ccn grouped all the gene pairs together and gene pair transformed the training samples into a binary matrix with all the discriminative gene pairs as row names and all the training samples as column names. using the binary gene pair matrix, ccn randomly shuffled the binary values across rows then across columns to generate random profiles that should not resemble training data from any of the cancer categories. ccn then sampled random profiles, annotated them as “unknown” and used them as training data for the “unknown” category. using the gene pair binary training matrix, ccn constructed a multi-class random forest classifier of trees and used stratified sampling of sample size to ensure balance of training data in constructing the decision trees. to identify the best set of genes and gene-pair parameters (n and m), we used a grid-search cross-validation strategy with cross-validations at each parameter set. the specific parameters for the final ccn classifier using the function “broadclass_train” in the package cancercellnet are in supp tab . the gene-pairs are in supp tab .

validating general cancercellnet classifier

two thirds of patient tumor data from each cancer type were randomly sampled as training data to construct a ccn classifier. based on the training data, ccn selected the classification genes and gene-pairs and trained a classifier.
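The gene selection by template matching and the gene pair transformation described in the training section can be illustrated with a minimal Python sketch. This is not the cancercellnet R implementation: the function names (`template_correlation`, `select_genes`, `gene_pair_transform`), the toy data, and the 1/0 coding of the template vector and of the gene pairs are assumptions made for illustration.

```python
import numpy as np

def template_correlation(expr, labels, target):
    """Pearson correlation of each gene (row of expr) with a template vector
    that is 1 for samples of `target` and 0 otherwise (assumed coding)."""
    template = (np.asarray(labels) == target).astype(float)
    x = expr - expr.mean(axis=1, keepdims=True)
    t = template - template.mean()
    denom = np.sqrt((x ** 2).sum(axis=1) * (t ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        r = (x @ t) / denom
    return np.nan_to_num(r)  # genes with constant expression get r = 0

def select_genes(r, genes, n):
    """Pick n most upregulated, n most downregulated and n least
    differentially expressed genes from the correlation values."""
    order = np.argsort(r)
    down = [genes[i] for i in order[:n]]
    up = [genes[i] for i in order[::-1][:n]]
    flat = [genes[i] for i in np.argsort(np.abs(r))[:n]]
    return up, down, flat

def gene_pair_transform(expr, genes, pairs):
    """Encode each (g1, g2) pair as 1 where g1 is expressed higher than g2
    within a sample, else 0 (assumed coding). Returns pairs x samples."""
    index = {g: i for i, g in enumerate(genes)}
    return np.vstack([(expr[index[a]] > expr[index[b]]).astype(int)
                      for a, b in pairs])

# Toy example: 3 genes x 4 samples, two hypothetical tumor types X and Y.
genes = ["A", "B", "C"]
labels = ["X", "X", "Y", "Y"]
expr = np.array([[5, 6, 1, 0],
                 [1, 2, 5, 6],
                 [3, 3, 3, 3]])
r = template_correlation(expr, labels, "X")
up, down, flat = select_genes(r, genes, 1)
binary = gene_pair_transform(expr, genes, [("A", "B")])
```

Because the transform only compares genes within a sample, the resulting binary features do not depend on the absolute scale of the expression values, which is what makes it usable across platforms.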
after the classifier was built, held-out samples from each cancer category were sampled and “unknown” profiles were generated for validation. the process of randomly sampling a training set from / of all patient tumor data, selecting features based on the training set, training the classifier and validating was repeated times to obtain a more comprehensive assessment of the classifier trained with the optimal parameter set. to test the performance of the final ccn on independent testing data, we applied it to profiles from icgc spanning projects that do not overlap with tcga (brca-kr, liri-jp, ov-au, paca-au, paca-ca, prad-fr).

selecting decision thresholds

our strategy for selecting a decision threshold was to find the value that maximizes the average macro f measure for each of the cross-validations that were performed with the optimal parameter set, testing thresholds between and with a . increment. the f measure is defined as:

\[ \text{Macro } F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \]

we selected the most commonly occurring threshold above . that maximized the average macro f measure across the cross-validations as the decision threshold for the final classifier (threshold = . ). the same approach was applied for the subtype classifiers. the thresholds and the corresponding average precision, recall and f measures are recorded in supp tab .
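As a rough illustration of the threshold search, the sketch below scans a small grid of candidate thresholds for one validation round and keeps the one with the highest macro F1. It is a simplified stand-in, not the authors' procedure: the calling rule (top-scoring category if its score exceeds the threshold, otherwise "unknown"), the per-class averaging of F1, the function names and the candidate grid are all assumptions.

```python
import numpy as np

def macro_f1(y_true, y_pred, classes):
    """Average of per-class F1 = 2 * precision * recall / (precision + recall)."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return float(np.mean(scores))

def call_classes(scores, classes, threshold):
    """Assumed rule: call the top-scoring class if its score exceeds the
    threshold, otherwise call the sample 'unknown'."""
    scores = np.asarray(scores, dtype=float)
    top = scores.argmax(axis=1)
    return [classes[j] if scores[i, j] > threshold else "unknown"
            for i, j in enumerate(top)]

def best_threshold(scores, y_true, classes, thresholds):
    """Return (threshold, macro F1) with the highest macro F1 on this round."""
    labels = list(classes) + ["unknown"]
    results = [(macro_f1(y_true, call_classes(scores, classes, t), labels), t)
               for t in thresholds]
    best_f1, best_t = max(results, key=lambda item: item[0])
    return best_t, best_f1

# Toy validation round with two hypothetical categories.
classes = ["A", "B"]
scores = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.3, 0.3]]
y_true = ["A", "A", "B", "unknown"]
best_t, best_f1 = best_threshold(scores, y_true, classes, [0.2, 0.5, 0.7])
```

In the procedure described in the text this search would be repeated for every cross-validation round, and the most frequently selected threshold would be kept.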
classifying query data into general cancer categories

we downloaded the rna-seq cancer cell line expression profiles and sample table from (https://portals.broadinstitute.org/ccle/data), and microarray cancer cell line expression profiles and sample table from barretina et al . we extracted two wt control nccit rna-seq expression profiles from grow et al . we received pdx expression estimates and sample annotations from the authors of gao et al . we gathered gemm expression profiles from nine different studies – . we downloaded tumoroid expression profiles from the nci patient-derived models repository (pdmr) and from three individual studies – . to use the ccn classifier on gemm data, the mouse genes from gemm expression profiles were converted into their human homologs. the query samples were classified using the final ccn classifier. each query classification profile was labelled as one of four classification categories: “correct”, “mixed”, “none” and “other”. if a sample has a ccn score higher than the decision threshold in the labelled cancer category, we assigned it as “correct”. if a sample has a ccn score higher than the decision threshold in the labelled cancer category and in other cancer categories, we assigned it as “mixed”. if a sample has no ccn score higher than the decision threshold in any cancer category or has the highest ccn score in the “unknown” category, then we assigned it as “none”.
if a sample has a ccn score higher than the decision threshold in a cancer category or categories not including the labelled cancer category, we assigned it as “other”. we analyzed and visualized the results using r and the r packages pheatmap and ggplot .

cross-species assessment

to assess the performance of cross-species classification, we downloaded labelled human tissue/cell type and labelled mouse tissue/cell type rna-seq expression profiles from github (https://github.com/pcahan /cellnet). we first converted the mouse genes into human homologous genes. then we found the intersecting genes between mouse tissue/cell expression profiles and human tissue/cell expression profiles. limiting the input of human tissue rna-seq profiles to the intersecting genes, we trained a ccn classifier with all the human tissue/cell expression profiles. the parameters used for the function “broadclass_train” in the package cancercellnet are in supp tab . we randomly sampled samples from each tissue category in mouse tissue/cell data and applied the classifier on those samples to assess performance.

cross-technology assessment

to assess the performance of ccn in applications to microarray data, we gathered , patient tumor microarray profiles across different cancer types from more than different projects (supp tab ). we found the intersecting genes between the microarray profiles and tcga patient rna-seq profiles.
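The four-way labelling of query classification profiles ("correct", "mixed", "none", "other") described in the section on classifying query data can be written as a small decision function. The sketch below is one possible reading: the function name `label_query`, the order in which the rules are checked (the "none" rule first) and the example threshold of 0.25 are assumptions, not values taken from the text.

```python
def label_query(scores, annotated, threshold, unknown="unknown"):
    """Label one query sample from its classification profile.

    scores    : dict mapping category name -> CCN score
    annotated : the cancer category the sample is annotated as
    threshold : decision threshold (hypothetical value in the examples)
    """
    # Cancer categories (excluding 'unknown') that pass the threshold.
    above = {c for c, s in scores.items() if c != unknown and s > threshold}
    top = max(scores, key=scores.get)
    # Assumed precedence: 'none' is checked before the other labels.
    if not above or top == unknown:
        return "none"
    if annotated in above:
        return "correct" if len(above) == 1 else "mixed"
    return "other"

# Hypothetical profiles for a sample annotated as LUAD.
examples = {
    "correct": {"LUAD": 0.70, "LUSC": 0.10, "unknown": 0.20},
    "mixed":   {"LUAD": 0.40, "LUSC": 0.35, "unknown": 0.25},
    "none":    {"LUAD": 0.10, "LUSC": 0.20, "unknown": 0.70},
    "other":   {"LUAD": 0.10, "LUSC": 0.80, "unknown": 0.10},
}
labels = {name: label_query(profile, "LUAD", 0.25)
          for name, profile in examples.items()}
```

Here every example profile receives the label it is named after; a different rule precedence would change the outcome only for profiles where "unknown" scores highest while a cancer category also passes the threshold.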
limiting the input of rna-seq profiles to the intersecting genes, we created a ccn classifier with all the tcga patient profiles using the parameters for the function “broadclass_train” listed in supp tab . after the microarray specific classifier was trained, we randomly sampled microarray patient samples from each cancer category and applied the ccn classifier on them as an assessment of the cross-technology performance in supp fig a. the same ccn classifier was used to assess microarray ccl samples in supp fig b.

training and validating scrna-seq classifier

we extracted labelled human melanoma and glioblastoma scrna-seq expression profiles , , and compiled the two datasets excluding cell types t.cd , t.cd and myeloid due to their low number of cells for training. cells from each of the cell types were sampled for training a scrna-seq classifier. the parameters for training a general scrna-seq classifier using the function “broadclass_train” are in supp tab . cells from each of the cell types from the held-out data were selected to assess the single cell classifier. using maximization of the average macro f measure, we selected the decision threshold of . . the gene-pairs that were selected to construct the classifier are in supp tab . to assess the cross-technology capability of applying the scrna-seq classifier to bulk rna-seq, we downloaded expression profiles spanning purified cell types (b cells, endothelial cells, monocyte/macrophage, fibroblast) from https://github.com/pcahan /cellnet.
training subtype cancercellnet

we found cancer types (brca, coad, esca, hnsc, kirc, lgg, paad, ucec, stad, luad, lusc) which have meaningful subtypes based on either histology or molecular profile and have sufficient samples to train a subtype classifier with high aupr. we also included normal tissue samples from brca, coad, hnsc, kirc, ucec to create a normal tissue category in the construction of their subtype classifiers. training samples were either labelled as a cancer subtype for the cancer of interest or as “unknown” if they belong to other cancer types. similar to general classifier training, ccn performed gene pair transformation and selected the most discriminative gene pairs for each cancer subtype. in addition to the gene pairs selected to discriminate cancer subtypes, ccn also performed general classification of all training data and appended the classification profiles of the training data to the gene pair binary matrix as additional features. the reason behind using the general classification profile as additional features is that many general cancer types may share similar subtypes, and the general classification profile could provide important features to discriminate the general cancer type of interest from other cancer types before performing finer subtype classification. the specific parameters used to train individual subtype classifiers using the “subclass_train” function of the cancercellnet package can be found in supp tab and the gene pairs are in supp tab .

validating subtype cancercellnet

similar to validating the general class classifier, we randomly sampled / of all samples in each cancer subtype as training data and sampled an equal amount across subtypes in the / held-out data for assessing subtype classifiers. we repeated the process times for a more comprehensive assessment of subtype classifiers.

classifying query data into subtypes

we assigned a subtype to a query sample if the query sample has a ccn score higher than the decision threshold. the decision thresholds for the subtype classifiers are in supp tab . if no ccn scores exceed the decision threshold in any subtype or if the highest ccn score is in the “unknown” category, then we assigned that sample as “unknown”. analysis was performed in r and visualizations were generated with the complexheatmap package .

cells culture, immunohistochemistry and histomorphometry

caov- (atcc® htb- ™), sk-ov- (atcc® htb- ™), rt (atcc® htb- ™), and nccit (atcc® crl- ™) cell lines were purchased from atcc. hec- (c ) and a ( - vl) were obtained from addexbio technologies and sigma-aldrich. vcap and pc- . sk-ov- , vcap, and rt were cultured in dulbecco's modified eagle medium (dmem, high glucose, , gibco) with % penicillin-streptomycin-glutamine ( , life technologies); caov- , pc- , nccit, and a were cultured using rpmi- medium ( , gibco) while hec- was in iscove's modified dulbecco's medium (imdm, , gibco). both media were supplemented with % penicillin-streptomycin ( , gibco). all media included % fetal bovine serum (fbs). cells cultured in -well plates were washed twice with pbs and fixed in % buffered formalin for hrs at °c. immunostaining was performed using a standard protocol.
cells were incubated with primary antibodies to goat hoxb ( µg/ml, pa - , invitrogen), mouse wt ( µg/ml, ma - , invitrogen), rabbit pparg ( : , abn , millipore), mouse folh ( µg/ml, um , origene), and rabbit lin a ( : , # , cell signaling) in antibody diluent (s - , dako), at °c overnight followed by three min washes in tbst. the slides were then incubated with fluorescence-conjugated secondary antibodies at room temperature for h while avoiding light, followed by three min washes in tbst, and nuclei were stained with mounting medium containing dapi. images were captured by nikon eclipse ti-s, ds-u and ds-qi . histomorphometry was performed using imagej (version . . -rc- / . i). % n.positive cells was calculated as the number of positively stained cells divided by the number of dapi-positive nuclei within three randomly chosen areas, expressed as a percentage. the data were expressed as means ± sd.

tumor purity analysis

we used the r package estimate to calculate the estimate scores from the tcga tumor expression profiles that we used as training data for the ccn classifier. to calculate tumor purity we used the equation described in yoshihara et al., :

tumour purity = cos ( . + . × estimate score)

extracting citation counts

we used the r package rismed to extract the number of citations for each cell line through a query search of “cell line name[text word] and cancer[text word]” on pubmed. the citation counts were normalized by dividing the citation counts by the number of years since first documented.
\[ \text{Normalized citation counts} = \frac{\text{citation counts}}{\#\,\text{years since first documented}} \]

grn construction and grn status

grn construction was extended from our previous method . samples per cancer type were randomly sampled and normalized through down-sampling as training data for the clr grn construction algorithm. cancer type specific grns were identified by determining the differentially expressed genes per cancer type and extracting the subnetwork using those genes.

to extend the original grn status algorithm across different platforms and species, we devised a rank-based grn status algorithm. like the original grn status, the rank-based grn status is a metric for assessing the similarity of the cancer type specific grn between training data in the cancer type of interest and query samples. hence, a high grn status represents a high level of establishment or similarity of the cancer specific grn in the query sample compared to that of the training data. the expression profiles of training data and query data were transformed into rank expression profiles by replacing the expression values with the rank of the expression values within a sample (the highest expressed gene would have the highest rank and the lowest expressed gene would have a rank of ). the cancer type specific mean and standard deviation of every gene’s rank expression were learned from training data.
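The rank transformation of expression profiles can be sketched in a few lines of Python. This is an illustrative reading rather than the package code: the function names `rank_transform` and `rank_stats` are hypothetical, and the choices that the lowest expressed gene receives rank 1, that ties are broken by gene order, and that the sample standard deviation (ddof = 1) is used are assumptions.

```python
import numpy as np

def rank_transform(expr):
    """Replace expression values by their within-sample rank.

    expr is a genes x samples array; ranks are computed per column so that
    the highest expressed gene gets the highest rank and the lowest expressed
    gene gets rank 1 (assumed). Ties are broken by gene order.
    """
    order = expr.argsort(axis=0, kind="stable")
    return order.argsort(axis=0, kind="stable") + 1

def rank_stats(ranks):
    """Per-gene mean and sample standard deviation of the rank expression
    across the training samples of one cancer type."""
    return ranks.mean(axis=1), ranks.std(axis=1, ddof=1)

# Toy training set: 3 genes x 2 samples.
expr = np.array([[10.0, 1.0],
                 [5.0, 7.0],
                 [0.0, 3.0]])
ranks = rank_transform(expr)
mean_rank, sd_rank = rank_stats(ranks)
```

Because only the ordering of genes within a sample is used, the same reference means and standard deviations can be compared against query samples measured on a different platform or in a different species after homolog mapping.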
the modified z-score values for genes within the cancer type specific grn were calculated for the query sample’s rank expression profile to quantify how dissimilar the expression values of genes in the query sample’s cancer type specific grn are compared to those of the reference training data:

\[ Z^{\mathrm{mod}}(\text{gene}_i) = \begin{cases} 0, & \text{if the z-score is positive and the gene is found to be upregulated} \\ 0, & \text{if the z-score is negative and the gene is found to be downregulated} \\ \mathrm{abs}(Z), & \text{otherwise} \end{cases} \]

if a gene in the cancer type specific grn is found to be upregulated in the specific cancer type relative to other cancer types, then we would consider the query sample’s gene to be similar if the ranking of the query sample’s gene is equal to or greater than the mean ranking of the gene in the training samples. as a result of this similarity, we assign that gene a z-score of 0. the same principle applies to cases where the gene is downregulated in the cancer specific subnetwork. the grn status for a query sample is calculated as the weighted mean of (C − Z^mod(gene i)) across the n genes in the cancer type specific grn, where C is an arbitrary large number. a larger dissimilarity between the query’s and the training data’s cancer type specific grn gives high z-scores for the grn genes and a low grn status.

\[ RGS = \sum_{i=1}^{n} \left( C - Z^{\mathrm{mod}}(\text{gene}_i) \right) \times \text{weight}_{\text{gene}_i} \]

\[ \text{GRN status} = \frac{RGS}{\sum_{i=1}^{n} \text{weight}_{\text{gene}_i}} \]

the weight of individual genes in the cancer specific network is determined by the importance of the gene in the random forest classifier. finally, the grn status gets normalized with respect to the grn status of the cancer type of interest and the cancer type with the lowest mean grn status.
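The modified z-score and the weighted GRN status described above can be sketched as follows. This is a hedged illustration, not the authors' code: the function names, the `direction` encoding (+1 for genes upregulated in the cancer type, -1 for downregulated), the toy numbers, and the placeholder default of 1000.0 for the arbitrary large constant (called `big` here) are all assumptions.

```python
import numpy as np

def modified_zscores(query_rank, train_mean, train_sd, direction):
    """Modified z-score per GRN gene.

    A gene whose rank deviates from the training mean in the same direction
    as its regulation in the cancer type (direction +1 = upregulated,
    -1 = downregulated) is treated as similar and scored 0; otherwise the
    absolute z-score is kept as a measure of dissimilarity.
    """
    z = (query_rank - train_mean) / train_sd
    agrees = ((z > 0) & (direction > 0)) | ((z < 0) & (direction < 0))
    return np.where(agrees, 0.0, np.abs(z))

def grn_status(z_mod, weights, big=1000.0):
    """Weighted mean of (big - modified z-score) over the GRN genes.

    `big` stands for the arbitrary large number in the text; its default
    here is a placeholder, not a value taken from the source.
    """
    rgs = np.sum((big - z_mod) * weights)
    return rgs / np.sum(weights)

# Toy GRN with three genes.
query_rank = np.array([10.0, 2.0, 5.0])
train_mean = np.array([8.0, 4.0, 7.0])
train_sd = np.array([1.0, 1.0, 2.0])
direction = np.array([1, -1, 1])      # up, down, up
weights = np.array([1.0, 1.0, 2.0])   # e.g. random forest gene importance

z_mod = modified_zscores(query_rank, train_mean, train_sd, direction)
status = grn_status(z_mod, weights)
```

In this toy case only the third gene disagrees with its expected direction, so it alone lowers the status below the constant.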
$$\mathrm{normalized\ GRN\ status} = \frac{\mathrm{GRN\ status}_{\mathrm{query}} - \mathrm{avg}(\mathrm{GRN\ status}_{\mathrm{min\ cancer}})}{\mathrm{avg}(\mathrm{GRN\ status}_{\mathrm{cancer\ type\ interest}})}$$

where “min cancer” represents the cancer type whose training data have the lowest mean grn status in the cancer type of interest, and $\mathrm{avg}(\mathrm{GRN\ status}_{\mathrm{min\ cancer}})$ represents the lowest average grn status in the cancer type of interest. $\mathrm{avg}(\mathrm{GRN\ status}_{\mathrm{cancer\ type\ interest}})$ represents the average grn status of the cancer type of interest in the training data.

code availability

cancercellnet code and documentation are available at github: https://github.com/pcahan /cancercellnet

acknowledgements

this work was supported by the national institutes of health nci ovarian cancer spore p ca via a development research program award to pc. fwh was supported by a prostate cancer foundation young investigator award, department of defense w xwh- - pcrp-hd (f.w.h.), the national institutes of health/national cancer institute p ca - (f.w.h.) u ca (f.w.h.). we would like to thank john powers, hao zhu, tian-li wang, charles eberhart, and kaloyan tsanov for comments on the manuscript and helpful discussions. some figures were created in part with biorender.com.

figure legends

fig. cancercellnet (ccn) workflow, training, and performance. (a) schematic of ccn usage. ccn was designed to assess and compare the expression profiles of cancer models such as ccls, pdxs, gemms, and tumoroids with native patient tumors. to use a trained classifier, ccn takes the query samples (e.g. expression profiles from ccls, pdxs, gemms, tumoroids) as input and generates a classification profile for the query samples.
the column names of the classification heatmap represent sample annotation and the row names of the classification heatmap represent different cancer types. each grid is colored from black to yellow representing the lowest classification score (e.g. ) to highest classification score (e.g. ). (b) schematic of ccn training process. ccn uses patient tumor expression profiles of different cancer types from tcga as training data. first, ccn identifies n genes that are upregulated, n that are downregulated, and n that are relatively invariant in each tumor type versus all of the others. then, ccn performs a pair transform on these genes and subsequently selects the most discriminative set of m gene pairs for each cancer type as features (or predictors) for the random forest classifier. lastly, ccn trains a multi-class random forest classifier using gene-pair transformed training data. (c) parameter optimization strategy. cross-validations of each parameter set, in which / of tcga data was used to train and / to validate, were used to search for the values of n and m that maximized performance of the classifier as measured by area under the precision recall curve (auprc). (d) mean and standard deviation of classifiers based on cross-validations with the optimal parameter set. (e) auprc of the final ccn classifier when applied to independent patient tumor data from icgc.

fig. evaluation of cancer cell lines. (a) general classification heatmap of ccls extracted from ccle.
column annotations of the heatmap represent the labelled cancer category of the ccls given by ccle and the row names of the heatmap represent different cancer categories. ccls’ general classification profiles are categorized into categories: correct (red), correct mixed (pink), no classification (light green) and other classification (dark green) based on the decision threshold of . . (b) bar plot represents the proportion of each classification category in ccls across cancer types ordered from the cancer types with the highest proportion of correct and correct mixed ccls to lowest proportion. (c) comparison between skcm general ccn scores from bulk rna-seq classifier and skcm malignant ccn scores from scrna-seq classifier for skcm ccls. (d) comparison between sarc general ccn scores from bulk rna-seq classifier and caf ccn scores from scrna-seq classifier for skcm ccls. (e) comparison between gbm general ccn scores from bulk rna-seq classifier and gbm neoplastic ccn scores from scrna-seq classifier for gbm ccls. (f) comparison between sarc general ccn scores and caf ccn scores from scrna-seq classifier for gbm ccls. the green lines indicate the decision threshold for scrna-seq classifier and general classifier.

fig. immunofluorescence of selected cell lines. (a) classification profiles (left) and if expression (middle) of caov- (ov positive control), hec- (ucec positive control) and sk-ov- for wt (ov biomarker) and hoxb (uterine biomarker). the bar plots quantify the average percentage of positive cells for wt (top-right) and hoxb (bottom-right). (b) classification profiles (left) and if expression (middle) of caov- , nccit (germ cell tumor positive control) and a for wt and lin a (germ cell tumor biomarker). classification of nccit was performed using rna-seq profiles of wt control nccit duplicate from grow et al. the bar plots quantify the average percentage of positive cells for wt (top-right) and lin a (bottom-right).
(c) classification profiles (left) and if expression (middle) of vcap (prad positive control), rt (blca positive control) and pc- for folh (prostate biomarker) and pparg (urothelial biomarker). the bar plots quantify the average percentage of positive cells for folh (top-right) and pparg (bottom-right).

fig. subtype classification of ccls and ccl prevalence. the heatmap visualizations represent subtype classification of (a) ucec ccls, (b) lusc ccls and (c) luad ccls. only samples with ccn scores > . in their nominal tumor type are displayed. (d) comparison of normalized citation counts and general ccn classification scores of ccls. labelled cell lines either have the highest ccn classification score in their labelled cancer category or highest normalized citation count. each citation count was normalized by number of years since first documented on pubmed.

fig. evaluation of patient derived xenografts. (a) general classification heatmap of pdxs. column annotations represent annotated cancer type of the pdxs, and row names represent cancer categories. (b) proportion of classification categories in pdxs across cancer types is visualized in the bar plot and ordered from the cancer type with highest proportion of correct and mixed correct classified pdxs to the lowest. subtype classification heatmaps of (c) ucec pdxs, (d) lusc pdxs and (e) luad pdxs. only samples with ccn scores > . in their nominal tumor type are displayed.

fig. evaluation of genetically engineered mouse models. (a) general classification heatmap of gemms.
column annotations represent annotated cancer type of the gemms, and row names represent cancer categories. (b) proportion of classification categories in gemms across cancer types is visualized in the bar plot and ordered from the cancer type with highest proportion of correct and mixed correct classified gemms to the lowest. subtype classification heatmap of (c) ucec gemms, (d) lusc gemms and (e) luad gemms. only samples with ccn scores > . in their nominal tumor type are displayed.

fig. evaluation of tumoroid models. (a) general classification heatmap of tumoroids. column annotations represent annotated cancer type of the tumoroids, and row names represent cancer categories. (b) proportion of classification categories in tumoroids across cancer types is visualized in the bar plot and ordered from the cancer type with highest proportion of correct and mixed correct classified tumoroids to the lowest. subtype classification heatmap of (c) ucec tumoroids, (d) lusc tumoroids and (e) luad tumoroids. only samples with ccn scores > . in their nominal tumor type are displayed.

fig. comparison of ccls, pdxs, and gemms. box-and-whiskers plot comparing general ccn scores across ccls, gemms, pdxs of five general tumor types (ucec, paad, lusc, luad, lihc).

supplementary information

supplementary figure assessment of ccn general classifier and subtype classifier. (a) mean auprc of repeated grid-search cross-validation for each parameter grid. (b) mean and range of ccn classifier’s pr curves from cross validations based on the optimal feature selection parameters n and m.
(c) auprc of ccn human tissue classifier when applied to mouse tissue data. (d) the schematic of training a subtype classifier in ccn. ccn uses patient tumor expression profiles from the cancer of interest as training data. ccn performs gene-pair transformation and selects the most discriminative gene pairs among the cancer subtypes from training data as features. ccn then applies the general classification on training data and uses the general classification profile as features in addition to gene pairs for training a random forest classifier. the weight of the general classification profiles as features can be tuned to improve auprc. (e) the mean and standard deviation of auprc for subtype classifiers based on iterations of random sampling of training and held-out data, training subtype classifier using training data, classification of held-out data, and calculation of recall and precision.

supplementary figure further validation of ccn and classification results. to validate the cross-platform classification performance of ccn, a new classifier for microarray data was trained using rna-seq data from tcga as training data, restricted to the genes shared between the rna-seq data and the microarray data. (a) auprc of ccn classifier when applied to tumor profiles assayed on microarrays. (b) classification heatmap of ccls using microarray expression data. (c) pearson correlation between ccn scores of ccle lines generated from rna-seq data and microarray data.
(d) comparison between ccls’ ccn scores and the similarity metric from yu et al, median correlations of transcriptional profiles between ccls and tcga tumors from ccls’ labelled cancer category. (e) comparison of mean tumor purity of training data and mean ccn scores of ccls for each cancer category.

supplementary figure single-cell classification of skcm and gbm cell lines. (a) auprc of the single-cell classifier when applied to scrna-seq held-out data. (b) auprc of the scrna-seq classifier when applied to purified bulk rna samples. (c) single-cell classification of skcm ccls. red bar-plot (top) represents general ccn scores in sarc and blue bar-plot (bottom) represents general ccn scores in skcm. (d) single-cell classification of gbm ccls. red bar-plot (top) represents general ccn scores in sarc and yellow bar-plot (bottom) represents general ccn scores in gbm.

supplementary figure correlation between cancer type specific network grn status and general ccn scores.

supplementary figure proportion of cancer subtypes in different cancer models and tcga tumor data across general cancer types.

supplementary table general classification profiles of ccls.
supplementary table subtype classification profiles of ccls.
supplementary table general classification profiles of pdxs.
supplementary table subtype classification profiles of pdxs.
supplementary table general classification profiles of gemms.
supplementary table subtype classification profiles of gemms.
supplementary table general classification profiles of tumoroids.
supplementary table subtype classification profiles of tumoroids.
supplementary table specific parameters used for training of all classifiers.
supplementary table gene-pairs selected for final training of ccn general, subtype classifiers and single-cell classifier.
supplementary table decision thresholds and the corresponding precision and recall for the general classifier and subtype classifier.
supplementary table accessions of tumor microarray data used in validation.

references

sharma, s. v., haber, d. a. & settleman, j. cell line-based platforms to evaluate the therapeutic efficacy of candidate anticancer agents. nat. rev. cancer , – ( ).
kersten, k., de visser, k. e., van miltenburg, m. h. & jonkers, j. genetically engineered mouse models in oncology research and cancer medicine. embo mol. med. , – ( ).
hidalgo, m. et al. patient-derived xenograft models: an emerging platform for translational cancer research. cancer discov. , – ( ).
drost, j. & clevers, h. organoids in cancer research. nat. rev. cancer , – ( ).
klijn, c. et al. a comprehensive transcriptional portrait of human cancer cell lines. nat. biotechnol. , – ( ).
koren, s. et al. pik ca(h r) induces multipotency and multi-lineage mammary tumours. nature , – ( ).
derose, y. s. et al. tumor grafts derived from women with breast cancer authentically reflect tumor pathology, growth, metastasis and disease outcomes. nat. med. , – ( ).
sharpless, n. e. & depinho, r. a. the mighty mouse: genetically engineered mouse models in cancer drug development. nat. rev. drug discov. , – ( ).
mouradov, d.
et al. colorectal cancer cell lines are representative models of the main molecular subtypes of primary cancer. cancer res. , – ( ).
stuckelberger, s. & drapkin, r. precious gemms: emergence of faithful models for ovarian cancer research. j. pathol. , – ( ).
domcke, s., sinha, r., levine, d. a., sander, c. & schultz, n. evaluating cell lines as tumour models by comparison of genomic profiles. nat. commun. , ( ).
jiang, g. et al. comprehensive comparison of molecular portraits between cell lines and tumors in breast cancer. bmc genomics suppl , ( ).
chen, b., sirota, m., fan-minogue, h., hadley, d. & butte, a. j. relating hepatocellular carcinoma tumor samples and cell lines using gene expression data in translational research. bmc med. genomics suppl , s ( ).
vincent, k. m., findlay, s. d. & postovit, l. m. assessing breast cancer cell lines as tumour models by comparison of mrna expression profiles. breast cancer res. , ( ).
yu, k. et al. comprehensive transcriptomic analysis of cell lines as models of primary tumors across tumor types. nat. commun. , ( ).
najgebauer, h. et al. cellector: genomics-guided selection of cancer in vitro models. cell syst. , – .e ( ).
salvadores, m., fuster-tormo, f. & supek, f. matching cell lines with cancer type and subtype of origin via mutational, epigenomic, and transcriptomic patterns. sci. adv. , ( ).
guernet, a. & grumolato, l. crispr/cas editing of the genome for cancer modeling. methods - , – ( ).
gargiulo, g. next-generation in vivo modeling of human cancers. front. oncol. , ( ).
gao, h. et al. high-throughput screening using patient-derived tumor xenografts to predict clinical trial drug response. nat. med. , – ( ).
cahan, p. et al. cellnet: network biology applied to stem cell engineering. cell , – ( ).
radley, a. h. et al. assessment of engineered cells using cellnet and rna-seq. nat. protoc. , – ( ).
tan, y. & cahan, p.
singlecellnet: a computational tool to classify single cell rna-seq data across platforms and across species. cell syst. , – .e ( ).
cancer genome atlas network. comprehensive molecular characterization of human colon and rectal cancer. nature , – ( ).
zhang, j. et al. international cancer genome consortium data portal--a one-stop shop for cancer genomics data. database (oxford) , bar ( ).
cancer genome atlas network. comprehensive molecular portraits of human breast tumours. nature , – ( ).
parker, j. s. et al. supervised risk predictor of breast cancer based on intrinsic subtypes. j. clin. oncol. , – ( ).
wilkerson, m. d. et al. lung squamous cell carcinoma mrna expression subtypes are reproducible, clinically important, and correspond to normal cell types. clin. cancer res. , – ( ).
cancer genome atlas research network. electronic address: andrew_aguirre@dfci.harvard.edu & cancer genome atlas research network. integrated genomic characterization of pancreatic ductal adenocarcinoma. cancer cell , – .e ( ).
cancer genome atlas research network et al. integrated genomic characterization of endometrial carcinoma. nature , – ( ).
cancer genome atlas research network et al. integrated genomic characterization of oesophageal carcinoma. nature , – ( ).
cancer genome atlas network. comprehensive genomic characterization of head and neck squamous cell carcinomas. nature , – ( ).
cancer genome atlas research network. comprehensive molecular characterization of clear cell renal cell carcinoma. nature , – ( ).
verhaak, r. g. w. et al.
integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in pdgfra, idh , egfr, and nf . cancer cell , – ( ).
cancer genome atlas research network. comprehensive molecular profiling of lung adenocarcinoma. nature , – ( ).
hu, b. et al. gastric cancer: classification, histology and application of molecular pathology. j. gastrointest. oncol. , – ( ).
barretina, j. et al. the cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. nature , – ( ).
medico, e. et al. the molecular landscape of colorectal cancer cell lines unveils clinically actionable kinase targets. nat. commun. , ( ).
park, j.-g. et al. characteristics of cell lines established from human colorectal carcinoma. cancer res. ( ).
jerby-arnon, l. et al. a cancer cell program promotes t cell exclusion and resistance to checkpoint blockade. cell , – .e ( ).
darmanis, s. et al. single-cell rna-seq analysis of infiltrating neoplastic cells at the migrating front of human glioblastoma. cell rep. , – ( ).
patel, a. p. et al. single-cell rna-seq highlights intratumoral heterogeneity in primary glioblastoma. science , – ( ).
xu, b. et al. regulation of endometrial receptivity by the highly expressed hoxa , hoxa and hoxd hox-class homeobox genes. hum. reprod. , – ( ).
raines, a. m. et al. recombineering-based dissection of flanking and paralogous hox gene functions in mouse reproductive tracts. development , – ( ).
netinatsunthorn, w., hanprasertpong, j., dechsukhum, c., leetanaporn, r. & geater, a. wt gene expression as a prognostic marker in advanced serous epithelial ovarian carcinoma: an immunohistochemical study. bmc cancer , ( ).
kelly, z. et al. the prognostic significance of specific hox gene expression patterns in ovarian cancer. int. j. cancer , – ( ).
cancer genome atlas research network. integrated genomic analyses of ovarian carcinoma. nature , – ( ).
wiegand, k. c. et al. arid a mutations in endometriosis-associated ovarian carcinomas. n. engl. j. med. , – ( ).
murray, m. j. et al. lin expression in malignant germ cell tumors downregulates let- and increases oncogene levels. cancer res. , – ( ).
biton, a. et al. independent component analysis uncovers the landscape of the bladder tumor transcriptome and reveals insights into luminal and basal subtypes. cell rep. , – ( ).
fair, w. r., israeli, r. s. & heston, w. d. prostate-specific membrane antigen. prostate , – ( ).
black, j. d., english, d. p., roque, d. m. & santin, a. d. targeted therapy in uterine serous carcinoma: an aggressive variant of endometrial cancer. womens health (lond. engl.) , – ( ).
yang, s., thiel, k. w. & leslie, k. k. progesterone: the ultimate endometrial tumor suppressor. trends endocrinol. metab. , – ( ).
huszar, m. et al. up-regulation of l cam is linked to loss of hormone receptors and e-cadherin in aggressive subtypes of endometrial carcinomas. j. pathol. , – ( ).
kozak, j., wdowiak, p., maciejewski, r. & torres, a. a guide for endometrial cancer cell lines functional assays using the measurements of electronic impedance. cytotechnology , – ( ).
korch, c. et al. dna profiling analysis of endometrial and ovarian cell lines reveals misidentification, redundancy and contamination. gynecol. oncol. , – ( ).
wu, d. et al. gene-expression data integration to squamous cell lung cancer subtypes reveals drug sensitivity. br. j. cancer , – ( ).
walter, v. et al. molecular subtypes in head and neck cancer exhibit distinct patterns of chromosomal gain and loss of canonical cancer genes. plos one , e ( ).
adeegbe, d. o. et al. bet bromodomain inhibition cooperates with pd- blockade to facilitate antitumor response in kras-mutant non-small cell lung cancer. cancer immunol res , – ( ).
blaisdell, a. et al. neutrophils oppose uterine epithelial carcinogenesis via debridement of hypoxic tumor cells. cancer cell , – ( ).
fitamant, j. et al. yap inhibition restores hepatocyte differentiation in advanced hcc, leading to tumor regression. cell rep. , – ( ).
jia, d. et al. crebbp loss drives small cell lung cancer and increases sensitivity to hdac inhibition. cancer discov. , – ( ).
kress, t. r. et al. identification of myc-dependent transcriptional programs in oncogene-addicted liver tumors. cancer res. , – ( ).
li, l. et al. gkap acts as a genetic modulator of nmdar signaling to govern invasive tumor growth. cancer cell , – .e ( ).
mollaoglu, g. et al. the lineage-defining transcription factors sox and nkx - determine lung cancer cell fate and shape the tumor immune microenvironment. immunity , – .e ( ).
pan, y. et al. whole tumor rna-sequencing and deconvolution reveal a clinically-prognostic pten/pi k-regulated glioma transcriptional signature. oncotarget , – ( ).
lissanu deribe, y. et al. mutations in the swi/snf complex induce a targetable dependence on oxidative phosphorylation in lung cancer. nat. med. , – ( ).
xu, c. et al.
loss of lkb and pten leads to lung squamous cell carcinoma with elevated pd-l expression. cancer cell , – ( ).
nci-frederick, frederick, md. national laboratory for cancer research. the nci patient-derived models repository (pdmr). ( ). at
broutier, l. et al. human primary liver cancer-derived organoid cultures for disease modeling and drug screening. nat. med. , – ( ).
lee, s. h. et al. tumor evolution and drug response in patient-derived organoid models of bladder cancer. cell , – .e ( ).
ogawa, j., pao, g. m., shokhirev, m. n. & verma, i. m. glioblastoma model using human cerebral organoids. cell rep. , – ( ).
ben-david, u. et al. patient-derived xenografts undergo mouse-specific tumor evolution. nat. genet. , – ( ).
stratton, m. r., campbell, p. j. & futreal, p. a. the cancer genome. nature , – ( ).
balkwill, f. r., capasso, m. & hagemann, t. the tumor microenvironment at a glance. j. cell sci. , – ( ).
lancaster, m. a. & knoblich, j. a. organogenesis in a dish: modeling development and disease using organoid technologies. science , ( ).
bregenzer, m. e. et al. integrated cancer tissue engineering models for precision medicine. plos one , e ( ).
wang, d. h. & souza, r. f. biology of barrett’s esophagus and esophageal adenocarcinoma. gastrointest endosc clin n am , – ( ).
lee, j. et al. tumor stem cells derived from glioblastomas cultured in bfgf and egf more closely mirror the phenotype and genotype of primary tumors than do serum-cultured cell lines. cancer cell , – ( ).
wenger, s. l. et al. comparison of established cell lines at different passages by karyotype and comparative genomic hybridization. biosci. rep. , – ( ).
ben-david, u. et al. genetic and transcriptional evolution alters cancer cell line drug response. nature , – ( ).
cooke, s. l. et al. genomic analysis of genetic heterogeneity and evolution in high-grade serous ovarian carcinoma. oncogene , – ( ).
hristova, v. a. & chan, d. w.
cancer biomarker discovery and translation: proteomics and beyond. expert rev proteomics , – ( ).
dawson, m. a. & kouzarides, t. cancer epigenetics: from mechanism to therapy. cell , – ( ).
silva, t. c. et al. tcga workflow: analyze cancer genomics and epigenomics data using bioconductor packages. [version ; peer review: approved, approved with reservations]. f res. , ( ).
morgan, m., obenchain, v., hester, j. & pagès, h. summarizedexperiment: summarizedexperiment container. ( ).
pavlidis, p. & noble, w. s. analysis of strain and regional variation in gene expression in mouse brain. genome biol. , research ( ).
geman, d., d’avignon, c., naiman, d. q. & winslow, r. l. classifying gene expression profiles from pairwise mrna comparisons. stat appl genet mol biol , article ( ).
krstajic, d., buturovic, l. j., leahy, d. e. & thomas, s. cross-validation pitfalls when selecting and assessing regression and classification models. j. cheminform. , ( ).
lipton, z. c., elkan, c. & naryanaswamy, b. optimal thresholding of classifiers to maximize f measure. mach. learn. knowl. discov. databases , – ( ).
grow, e. j. et al. intrinsic retroviral reactivation in human preimplantation embryos and pluripotent cells. nature , – ( ).
kolde, r. pheatmap: pretty heatmaps. (cran, ).
wickham, h. ggplot - elegant graphics for data analysis. (springer-verlag new york, ). doi: . / - - - -
gu, z., eils, r. & schlesner, m. complex heatmaps reveal patterns and correlations in multidimensional genomic data. bioinformatics , – ( ).
yoshihara, k. et al.
inferring tumour purity and stromal and immune cell admixture from expression data. nat. commun. , ( ).
kovalchik, s. rismed: download content from ncbi databases. (cran.r-project, ).
[Supplemental figure: no labels recoverable.]

A global cancer data integrator reveals principles of synthetic lethality, sex disparity and immunotherapy.

Christopher Yogodzinski#*, Abolfazl Arab, Justin R. Pritchard, Hani Goodarzi, Luke A. Gilbert*

Department of Urology, University of California, San Francisco, San Francisco, CA, USA
Helen Diller Family Comprehensive Cancer Center, San Francisco, San Francisco, CA, USA
Department of Biochemistry and Biophysics, University of California, San Francisco, CA, USA
Department of Biomedical Engineering, Pennsylvania State University, University Park, PA
Department of Cellular & Molecular Pharmacology, University of California, San Francisco, CA, USA
# Current address: University of North Carolina Chapel Hill School of Medicine, Chapel Hill, NC, USA
* Corresponding authors
Correspondence: cyogodzi@unc.edu (C.Y.), luke.gilbert@ucsf.edu (L.A.G.)

bioRxiv preprint, this version posted January. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Abstract

Advances in cancer biology are increasingly dependent on integration of heterogeneous datasets. Large scale efforts have systematically mapped many aspects of cancer cell biology; however, it remains challenging for individual scientists to effectively integrate and understand this data. We have developed a new data retrieval and indexing framework that allows us to integrate publicly available data from different sources and to combine publicly available data with new or bespoke datasets. Beyond a database search, our approach generated testable hypotheses about new synthetic lethal gene pairs, genes associated with sex disparity, and immunotherapy targets in cancer. Our approach is straightforward to implement, well documented and continuously updated, which should enable individual users to take full advantage of efforts to map cancer cell biology.

Introduction

Large scale but often independent efforts have mapped phenotypic characteristics of more than one thousand human cancer cell lines.
Despite this, static lists of univariate data generally cannot identify the underlying molecular mechanisms driving a complex phenotype. We hypothesized that a global cancer data integrator that could incorporate many types of publicly available data, including functional genomics, whole genome sequencing, exome sequencing, RNA expression data, protein mass spectrometry, DNA methylation profiling, ChIP-seq, ATAC-seq and metabolomics data, would enable us to link disease features to gene products. We set out to build a resource that enables cross platform correlation analysis of multi-omic data, as this analysis is in and of itself a high-resolution phenotype. Multi-omic analysis of functional genomics data with genomic, metabolomic or transcriptomic profiling can link cell state or specific signaling pathways to gene function. Lastly, co-essentiality profiling across large panels of cell lines has revealed protein complexes and co-essential modules that can assign function to uncharacterized genes.

Problematically, in many cases publicly available data are poorly integrated when considering information on all genes across different types of data, and the existing data portals are inflexible. For example, lists of genes cannot be queried against groups of cell lines stratified by mutation status or disease subtype. Furthermore, one cannot integrate new data derived from individual labs or other consortia.

We created the Cancer Data Integrator (CanDI), a series of Python modules designed to seamlessly integrate genomic, functional genomic, RNA, protein and metabolomic data into one ecosystem. Our Python framework operates like a relational database without the overhead of running MySQL or Postgres, and enables individual users to easily query this vast dataset and add new data in flexible ways. This was achieved by unifying the indices of these datasets via index tables that are automatically accessed through CanDI's biologically relevant Python classes. We highlight the utility of CanDI through four types of analysis to demonstrate how complex queries can reveal previously unknown molecular mechanisms in synthetic lethality, sex disparity and immunotherapy. These data nominate new small molecule and immunotherapy anti-cancer strategies in KRAS-mutant colon, lung and pancreatic cancers.

Results

CanDI is a global cancer data integrator.

We set out to integrate three types of data by creating programmatic and biologically relevant abstractions that allow for flexible cross referencing across all datasets. Data from the Cancer Cell Line Encyclopedia (CCLE) for RNA expression, DNA mutation, DNA copy number and chromosome fusions across more than cancer cell lines was integrated into our database with the functional genomics data from the Cancer Dependency Map (DepMap) (Fig. a,b and Supplementary Fig. ). We also integrated protein-protein interaction data from the CORUM database along with three additional distinct protein localization databases. CanDI by default will access the most recent release of data from DepMap, although users can also specify both the release and the data type that is accessed.
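The index-table idea described above (datasets unified through shared cell line and gene indices, then queried with conditional logic) can be illustrated with a minimal sketch. This is not CanDI's code or API: the table contents, the identifiers `CL1`, `G1` and so on, and the `query` helper are all hypothetical, and plain dictionaries stand in for the pandas tables the library is built on.

```python
# Hypothetical miniature of "index tables": every dataset is keyed by a
# shared cell line ID and gene ID, so one query can cross-reference them.

# Cell line index table: one entry per cell line, with searchable features.
cell_lines = {
    "CL1": {"tissue": "lung", "sex": "female"},
    "CL2": {"tissue": "lung", "sex": "male"},
    "CL3": {"tissue": "ovary", "sex": "female"},
}

# Two independent datasets, both keyed by (cell line ID, gene ID).
mutations = {("CL1", "G1"): "missense", ("CL3", "G2"): "nonsense"}
essentiality = {
    ("CL1", "G1"): 42.0,
    ("CL2", "G1"): -3.5,
    ("CL3", "G1"): 1.0,
}


def query(gene, tissue=None, mutated_gene=None):
    """Return {cell line ID: essentiality score} for lines passing the filters."""
    hits = {}
    for line_id, features in cell_lines.items():
        if tissue is not None and features["tissue"] != tissue:
            continue  # wrong tissue of origin
        if mutated_gene is not None and (line_id, mutated_gene) not in mutations:
            continue  # no recorded mutation in the requested gene
        score = essentiality.get((line_id, gene))
        if score is not None:
            hits[line_id] = score
    return hits


# Essentiality of G1 in all lung lines, then only in G1-mutant lung lines.
lung_scores = query("G1", tissue="lung")
mutant_scores = query("G1", tissue="lung", mutated_gene="G1")
```

Because each dataset shares the same keys, adding a new dataset in this sketch only requires another dictionary keyed the same way; no existing table has to change.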
The key advantage of this approach is that CanDI enables one to easily input user-defined queries with multi-tiered conditional logic into this large integrated dataset to analyze gene function, gene expression, protein localization and protein-protein interactions.

CanDI identifies genes that are conditionally essential in BRCA-mutant ovarian cancer.

The concept that loss-of-function tumor suppressor gene mutations can render cancer cells critically reliant on the function of a second gene is known as synthetic lethality. Despite the promise of synthetic lethality, it has been challenging to predict or identify genes that are synthetic lethal with commonly mutated tumor suppressor genes. While there are many underlying reasons for this challenge, we reasoned that data integration through CanDI could identify synthetic lethal interactions missed by others.

A paradigmatic example of synthetic lethality emerged from the study of DNA damage repair (DDR). Somatic mutations in the DNA double-strand break (DSB) repair genes BRCA1/2 create an increased dependence on DNA single-strand break (SSB) repair. This dependence can be exploited through small molecule inhibition of PARP-mediated SSB repair. Inhibition of PARP provides significant clinical responses in advanced breast and ovarian cancer patients, but they ultimately progress. Thus, new synthetic lethal associations with BRCA1/2 are a potential path towards therapeutic development for PARP-refractory patients.

To illustrate the flexibility of CanDI to mine context-specific synthetic sick lethal (SSL) genetic relationships, we hypothesized that the genes that modulate response to a PARP inhibitor might be enriched for genes that are selectively essential for proliferation or survival of BRCA1/2-mutant cancer cells. To test this hypothesis, we integrated the results of an existing CRISPR screen that identified genes that modulate response to the PARP inhibitor olaparib. We then tested whether any of these genes are differentially essential for cell proliferation or survival in ovarian cancer and in breast cancer cell models that are either BRCA1/2 proficient or deficient (Fig. c,d). This query revealed that the Fanconi anemia pathway is selectively essential in BRCA1/2-mutated ovarian cancer models but not in BRCA1/2-wildtype ovarian cancer, BRCA1/2-mutated breast cancer or BRCA1/2-wildtype breast cancer models (Fig. e and Supplementary Table ). To our knowledge, an SSL phenotype between FANCM and BRCA1/2 has never been reported, although a recent paper nominated a role for FANCM and BRCA in telomere maintenance. Importantly, FANCM is a helicase/translocase and thus considered to be a druggable target for cancer therapy. Clinical genomics data support this SSL hypothesis, although this remains to be tested in ovarian cancer patient samples. Because the DepMap currently only allows single genes to be queried and does not enable users to easily stratify cell lines by mutation, such analysis would normally take a user several days to complete manually. Our approach enabled this analysis to be completed using a desktop computer in less than two hours, which includes the visualization of data presented here (Fig. e).

Figure. (a) A schematic showing human cell models integrated by CanDI. (b) A schematic illustrating types of data integrated by CanDI. (c) A cartoon of a genome-scale CRISPRi screen to identify genes that modulate response to PARP inhibition by olaparib. (d) A schematic depicting data feature inputs parsed by CanDI. (e) Essentiality of Fanconi anemia genes in ovarian and breast cancer cell lines separated by BRCA mutation status. A Bayes factor score of gene essentiality is displayed by a heat map. n= BRCA1/2-mutant ovarian cancer, n= BRCA-wildtype ovarian cancer, n= BRCA1/2-mutant breast cancer, n= BRCA1/2-wildtype breast cancer.

Conditional genetic essentiality in KRAS- and EGFR-mutant NSCLC cells.

Beyond TSGs, many common driver oncogenes such as KRASG D are currently undruggable, which motivates the search for oncogene-specific conditional genetic dependencies. We reasoned that CanDI enables us to rapidly search functional genomics data for genes that are conditionally essential in lung cancer cells driven by KRAS and EGFR mutations. We stratified non-small cell lung cancer (NSCLC) cell models by EGFR and KRAS mutations and then looked at the average gene essentiality for all genes within each of these subtypes of NSCLC. We observed that KRAS is conditionally self-essential in KRAS-mutant cell models but that no other genes are conditionally essential in KRAS-mutant, EGFR-mutant, KRAS-wildtype or EGFR-wildtype cell models (Fig. a,b and Supplementary Table ). This finding demonstrates that very few, if any, genes are synthetic lethal with KRAS or EGFR in KRAS- and EGFR-mutant lung cancer cell lines.
It may be that these experiments are underpowered, or it may be that when the genetic dependencies of diverse cell lines representing a disease subtype are averaged across a single variable (e.g. a KRAS mutation) very few common synthetic lethal phenotypes are observed. CanDI provides potential solutions for both of these hypotheses.

CanDI enables a global analysis of conditional essentiality in cancer.

It is thought that data aggregation across vast landscapes of unknown covariates does not necessarily increase the statistical power to identify rare associations. Thus, the value of global analyses of aggregated cancer data sometimes lies in systematically subsetting data based on key covariates post aggregation. This has been observed in driver gene identification. Inspired by our analysis of TSG and oncogene conditional essentiality above, we next used CanDI to identify genes that are conditionally essential in the context of several hundred cancer driver mutations. We first grouped driver mutations (e.g. nonsense or missense) for each driver gene. For this analysis, we selected several thousand genes that are in the - th percentile of essentiality within the DepMap data and are therefore conditionally essential, meaning these genes are required for cell growth or survival in a subset of cell lines. Importantly, it is not known why these several thousand genes are conditionally essential. We then tested whether each of these conditionally essential genes has a significant association with individual driver mutations. Our analytic approach does not weight the number of cell models representing each driver mutation, nor does it give information on phenotype effect sizes.
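As a rough illustration of the association test described above, the sketch below runs a Pearson chi-squared test on a 2x2 table of mutation status against essentiality. It rests on assumptions: the choice of an uncorrected Pearson test, the way lines are binned as essential or not, the function name `chi_squared_2x2` and the toy counts are all illustrative and are not taken from the paper. Consistent with the caveat above, the test returns a statistic and a p-value but no effect size.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    table = [[a, b], [c, d]] with rows = mutant / wildtype cell lines and
    columns = gene essential / gene not essential.
    Returns (statistic, p_value). Row and column totals must be non-zero.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With one degree of freedom the chi-squared tail is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Toy counts (not from the paper): the gene is essential in 8 of 10 mutant
# lines and in 5 of 40 wildtype lines.
stat, p = chi_squared_2x2([[8, 2], [5, 35]])
```

In this form every gene and mutation pair contributes one table, so a mutation represented by very few cell lines can still produce a small p-value, which is one reason such hits need follow-up.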
Our analysis nominates a large number of conditionally dependent genetic relationships with both TSGs and oncogenes (Fig. c,d and Supplementary Table ). A number of the conditional genetic dependencies identified in our independent variable analysis above are represented by a limited number of cell models, so further investigation is needed to validate them, but these data further suggest that averaging genetic dependencies across diverse cell lines with un-modeled covariates obscures conditional SSL relationships. To further investigate this hypothesis, we analyzed these same conditional genetic relationships with a second analytic approach that weights the number of cell models representing each driver mutation. We observed a limited number of conditional genetic dependencies, which largely consist of oncogene self-essential dependencies as previously highlighted for KRAS-mutant cell lines (Fig. e-g and Supplementary Table ). Thus, analysis that averages each conditional phenotype across diverse panels of cell lines with unknown covariates masks interesting conditional genetic dependencies.

Figure. (a) Average gene essentiality for KRAS and EGFR in groups of NSCLC cell lines stratified by KRAS mutation status or by both KRAS and EGFR mutation status. n= for KRAS-wildtype shown in blue, n= for KRAS-mutant shown in blue. n= for KRAS-wildtype EGFR-wildtype shown in grey and n= for KRAS-mutant EGFR-wildtype shown in grey. Gene essentiality is an averaged Bayes factor score for each group of cell lines. (b) Average gene essentiality for KRAS and EGFR in groups of NSCLC cell lines stratified by EGFR mutation status or by both EGFR and KRAS mutation status. n= for EGFR-wildtype shown in blue, n= for EGFR-mutant shown in blue. n= for EGFR-wildtype KRAS-wildtype shown in grey and n= for EGFR-mutant KRAS-wildtype shown in grey. Gene essentiality is an averaged Bayes factor score for each group of cell lines. (c) P-values from chi-squared tests of gene essentiality and nonsense mutations. (d) P-values from chi-squared tests of gene essentiality and missense mutations. (e) A scatter plot showing effect size of the change in gene essentiality with select missense mutations and the -log (p-value) of each essentiality/mutation pair. (f) A scatter plot showing effect size of the change in gene essentiality with select nonsense mutations and the -log (p-value) of each essentiality/mutation pair. (g) A scatter plot showing effect size of the change in gene essentiality with all mutations and the -log (p-value) of each essentiality/mutation pair.

CanDI reveals female and male context-specific essential genes in colon, lung and pancreatic cancer.

Cancer functional genomics data is often analyzed without consideration for fundamental biological properties such as the sex of the tumor from which each cell line is derived. It is well established that biological sex influences cancer predisposition, cancer progression and response to therapy. We hypothesized that individual genes may be differentially essential across male and female cell lines. This hypothesis, to our knowledge, has never been tested in an unbiased large-scale manner. To maximize our statistical power to identify such differences, we chose to test this hypothesis in a disease setting with a large number of relatively homogeneous cell lines and fewer unknown covariates. Using CanDI, we stratified all KRAS-mutant NSCLC, pancreatic adenocarcinoma (PDAC) and colorectal cancer (CRC) cell lines by sex and then tested for conditional gene essentiality. This analysis identified a number of genes that are differentially essential in male or female KRAS-mutant NSCLC, PDAC and CRC models (Fig. a-f and Supplementary Table ). The genes that we identify are not common across all three disease types, suggesting, as one might expect, that the biology of the tumor in part also determines gene essentiality. To test whether any association between differentially essential genes could be identified from expression data (e.g. essential genes encoded on the Y chromosome), we first used CanDI to identify genes that are differentially expressed between male and female cell lines within each disease. We then plotted the set of differentially essential genes against the differentially expressed genes in KRAS-mutant NSCLC, PDAC and CRC models (Fig. a,c,e and Supplementary Table ) and found little overlap between these gene lists. A number of genes that are more essential in male cells, such as AHCYL , ENO , GPI and PKM, regulate cellular metabolism. This finding is consistent with previous literature on sex and metabolism. Our analysis demonstrates that stratifying groups of heterogeneous cancer models by three variables, in this case tumor type, KRAS mutation status and sex, reveals differentially essential genes. CanDI enables biologically principled stratification of data in the CCLE and DepMap by any feature associated with a group of cell models. This stratification allows us to identify genes associated with sex, which is not possible when other covariates are mixed into the comparison.
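The three-variable stratification described above can be sketched in a few lines. The Methods state that groups were compared with a Mann-Whitney U-test and that the effect was reported as a difference in mean Bayes factors; the sketch below follows that outline under several assumptions. The records are toy values, the field names (`tumor`, `kras_mutant`, `sex`, `bf`) are hypothetical, the mutual information pre-filter is omitted, and the test is a plain standard-library version using the normal approximation rather than whichever implementation the authors used (in practice a library routine such as `scipy.stats.mannwhitneyu` would be the usual choice).

```python
import math
from statistics import mean


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Ties receive average ranks; no tie or continuity correction is applied,
    so the p-value is approximate for small or heavily tied samples.
    Returns (U statistic for x, p_value).
    """
    pooled = sorted((value, group)
                    for group, values in enumerate((x, y))
                    for value in values)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        members_of_x = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_x += average_rank * members_of_x
        i = j
    n_x, n_y = len(x), len(y)
    u_x = rank_sum_x - n_x * (n_x + 1) / 2.0
    mu = n_x * n_y / 2.0
    sigma = math.sqrt(n_x * n_y * (n_x + n_y + 1) / 12.0)
    z = (u_x - mu) / sigma
    return u_x, math.erfc(abs(z) / math.sqrt(2.0))


# Toy records (hypothetical values): one Bayes factor per cell line, one gene.
records = [
    {"tumor": "NSCLC", "kras_mutant": True, "sex": "male", "bf": 30.0},
    {"tumor": "NSCLC", "kras_mutant": True, "sex": "male", "bf": 25.0},
    {"tumor": "NSCLC", "kras_mutant": True, "sex": "male", "bf": 28.0},
    {"tumor": "NSCLC", "kras_mutant": True, "sex": "male", "bf": 22.0},
    {"tumor": "NSCLC", "kras_mutant": True, "sex": "female", "bf": 5.0},
    {"tumor": "NSCLC", "kras_mutant": True, "sex": "female", "bf": -2.0},
    {"tumor": "NSCLC", "kras_mutant": True, "sex": "female", "bf": 8.0},
    {"tumor": "NSCLC", "kras_mutant": True, "sex": "female", "bf": 1.0},
    {"tumor": "NSCLC", "kras_mutant": False, "sex": "male", "bf": 2.0},
    {"tumor": "CRC", "kras_mutant": True, "sex": "female", "bf": 40.0},
]


def stratum(tumor, sex):
    """Bayes factors for KRAS-mutant lines of one tumor type and one sex."""
    return [r["bf"] for r in records
            if r["tumor"] == tumor and r["kras_mutant"] and r["sex"] == sex]


male = stratum("NSCLC", "male")
female = stratum("NSCLC", "female")
effect = mean(male) - mean(female)   # difference in mean Bayes factors
u, p = mann_whitney_u(male, female)
```

The KRAS-wildtype and CRC records are deliberately excluded by the filter, which is the point of stratifying before testing: only lines that match on tumor type and KRAS status are compared by sex.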
Figure. (a) Differential gene expression and differential gene essentiality in male and female CRC cell lines. n= male cell lines and n= female cell lines. (b) The distribution of Bayes factor gene essentiality scores in male and female CRC cell lines. The top seven and bottom three differentially essential genes are shown in violin plots split by the sex of the cell lines. (c) Differential gene expression and differential gene essentiality in male and female NSCLC cell lines. n= male cell lines and n= female cell lines. (d) The distribution of Bayes factor gene essentiality scores in male and female NSCLC cell lines. The top seven and bottom three differentially essential genes are shown in violin plots split by the sex of the cell lines. (e) Differential gene expression and differential gene essentiality in male and female PDAC cell lines. n= male cell lines and n= female cell lines. (f) The distribution of Bayes factor gene essentiality scores in male and female PDAC cell lines. The top seven and bottom three differentially essential genes are shown in violin plots split by the sex of the cell lines.

CanDI enables rapid integration of external datasets to reveal new immunotherapy targets.

An emerging challenge in cancer biology is how to robustly integrate larger "resource" datasets like CCLE with the vast amount of published data from individual laboratories. For example, a big challenge in antibody discovery is identifying specific surface markers on cancer cells. To approach these big questions, we utilized CanDI's ability to rapidly take new datasets, such as raw RNA-seq counts data from a disparate study of interest, then normalize and integrate this data into the CCLE, DepMap and protein localization databases previously described. Specifically, we rapidly integrated an RNA-seq expression dataset that measured the set of transcribed genes in primary lung bronchial epithelial cells from donors. Classes within CanDI enable rapid application of DESeq to assess the differential expression between outside datasets and the CCLE. We used this feature to identify genes that are differentially expressed between primary lung bronchial epithelial cells and KRAS-mutant NSCLC, EGFR-mutant NSCLC or all NSCLC models in CCLE. We then used CanDI to identify genes that are upregulated in cancer cells over normal lung bronchial epithelial cells with protein products that are localized to the cell membrane. This analysis of KRAS-mutant, EGFR-mutant and pan-NSCLC generated highly similar lists of differentially expressed surface proteins (Fig. a-f and Supplementary Table ). Notably, overexpression of several of these genes, such as CD and CD , has been observed in lung cancer and is associated with poor prognosis. These proteins represent potential new immunotherapy targets in KRAS-driven NSCLC.

Figure. (a) A graph showing genes that are upregulated in KRAS-mutant NSCLC cell lines relative to primary human bronchial epithelial cells. A cell membrane protein localization score is shown for each gene. Higher protein localization scores indicate higher confidence annotations. (b) A scatter plot showing gene expression for genes that encode cell surface proteins in KRAS-mutant NSCLC cell lines and primary human bronchial epithelial cells. n= for KRAS-mutant NSCLC cell lines and n= for primary human bronchial epithelial cells. (c) A graph showing genes that are upregulated in EGFR-mutant NSCLC cell lines relative to primary human bronchial epithelial cells. A cell membrane protein localization score is shown for each gene. Higher protein localization scores indicate higher confidence annotations. (d) A scatter plot showing gene expression for genes that encode cell surface proteins in EGFR-mutant NSCLC cell lines and primary human bronchial epithelial cells. n= for EGFR-mutant NSCLC cell lines and n= for primary human bronchial epithelial cells. (e) A graph showing genes that are upregulated in NSCLC cell lines relative to primary human bronchial epithelial cells. A cell membrane protein localization score is shown for each gene. Higher protein localization scores indicate higher confidence annotations. (f) A scatter plot showing gene expression for genes that encode cell surface proteins in NSCLC cell lines and primary human bronchial epithelial cells. n= for NSCLC cell lines and n= for primary human bronchial epithelial cells.

Discussion

Data integration is a critical requirement in biology research in the era of genomics and functional genomics. Large scale efforts such as the CCLE have revealed genomic features of more than cell line models. This data has not, to our knowledge, previously been integrated with functional genomics data in a manner that lets individual users enter batched queries that are stratified by disease subtype or mutation status.
this is not just a small improvement in functionality, but rather it is an enabling format that makes possible the types of conditional (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . genomics analyses that drive discovery. moreover, it fills a fundamental gap in the cancer research community that integrates large scale projects with investigator initiated studies our data framework enables biologists without specialized expertise in bioinformatics to use the full spectrum of data in the ccle and depmap in a higher throughput and precise manner. using candi, we identified genes that are selectively essential in male versus female kras-mutant nsclc, pdac and crc models. to our knowledge, such analysis has never been performed to begin to query the biologic basis of sex disparity in cancer or cancer therapy. we illustrate another feature of our framework by analyzing a list of hit genes nominated by a bespoke crispr drug screen for gene essentiality in brca / -wild type and brca / - mutated breast and ovarian cancer. in a third application, we analyzed the principle of synthetic lethality for genes in kras-mutant and egfr-mutant nsclc models. we then used candi to globally identify genes that are conditionally essential in the context of common cancer driver mutations. finally, we nominated potential new immunotherapy targets in kras-mutant, egfr-mutant and pan -nsclc models by using candi to identify genes that are differentially expressed in normal bronchial epithelial cells versus nsclc models that are localized at the plasma membrane. our data reveal a wealth of new hypotheses that can be rapidly generated from publicly available cancer data. 
by sharing data flows and use cases with a candi community, we illustrate the ways in which individual research groups can interact with massive cancer genomics projects without reinventing tools or relying upon depmap tool releases. we anticipate that candi will be widely used in cell biology, immunology and cancer research.

methods

candi

the candi data integrator is available at https://github.com/yogiski/candi.

candi module structure

the candi data integrator is a python library built on top of pandas that specializes in integrating publicly available data from the cancer dependency map (depmap release: quarter ), the cancer cell line encyclopedia (ccle release: quarter ), the pooled in-vitro crispr knockout essentiality screens database (pickles library: avana quarter ), the comprehensive resource of mammalian protein complexes (corum) and protein localization data from the cell atlas, the map of the cell, and the in silico surfaceome. data from depmap and ccle used in the following analyses are from the q release. data from pickles are from the quarter release of depmap using the avana library. access to all datasets is controlled via a python class called data. upon import, the data class reads the config file established during installation, defines unique paths to each dataset, and automatically loads the cell line index table and the gene index table. installation of candi, configuration, and data retrieval are handled by a manager class that is accessed indirectly through installation scripts and the data class. interactions with these data are controlled through a parent entity class and several handlers.
the biologically relevant abstraction classes (gene, cellline, cancer, organelle, genecluster, celllinecluster) inherit their methods from entity. entity methods are wrappers for hidden data handler classes that perform specific transformations, such as data indexing and high throughput filtering.

differential expression

in all cases where it is mentioned, differential expression was evaluated using the deseq r package (release . ). significance was defined as an adjusted p-value of less than . .

differential essentiality

essentiality scores are taken from the pickles database (avana q ). to reduce the number of hypotheses posed during this analysis, the mutual information of gene essentiality was calculated using the mutual information metric from the python package scikit-learn (version . . ). genes with mutual information scores greater than one standard deviation above the median were removed from consideration. differential essentiality was evaluated by performing a mann-whitney u-test between two groups on every gene that passed the mutual information filter. significance was defined as a p-value of less than . . the magnitude of differential essentiality of a given gene is reported as the difference in mean bayes factors between two groups of cell lines.

protein localization confidence

protein localization data were assembled from the cell atlas, the map of the cell, and the in silico surfaceome. confidence annotations were taken from the supplemental data of each paper, put on a numeric scale from to , and summed across all three papers to give a total confidence score for each localization annotation of every gene.
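the differential essentiality procedure described above (mutual information filter, mann-whitney u-test, difference in mean bayes factors) can be sketched as follows. this is a minimal illustration and not the candi implementation: the function names and input layout are hypothetical, the mutual information scores are taken as precomputed input because the text does not say which variables they are computed between, the population standard deviation is an assumption, and a hand-written normal approximation of the u-test without tie correction stands in for a library routine so that the block needs only the python standard library.

```python
import math
import statistics


def mi_filter(mi_scores):
    """Return genes whose score is at most median + 1 SD; higher scores are dropped."""
    values = list(mi_scores.values())
    # Assumption: population SD; the source does not say which estimator was used.
    cutoff = statistics.median(values) + statistics.pstdev(values)
    return [gene for gene, score in mi_scores.items() if score <= cutoff]


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values, then give the run its average rank.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - n1 * n2 / 2) / sigma
    return u, math.erfc(abs(z) / math.sqrt(2))


def differential_essentiality(bf_by_gene, group_a, group_b, mi_scores):
    """Map each gene passing the filter to (delta mean BF, p-value), group_a vs group_b."""
    results = {}
    for gene in mi_filter(mi_scores):
        a = [bf_by_gene[gene][line] for line in group_a]
        b = [bf_by_gene[gene][line] for line in group_b]
        _, p_value = mann_whitney_u(a, b)
        results[gene] = (statistics.mean(a) - statistics.mean(b), p_value)
    return results
```

here `bf_by_gene` is assumed to map each gene to a dictionary of bayes factors keyed by cell line, and `group_a` and `group_b` are lists of cell line names. for real analyses an exact or tie-corrected test from a statistics library would be preferable to the approximation used in this sketch.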
the analysis shown in figure represents a gene list that was further manually curated to remove the genes that are localized to the intracellular space at the cell membrane, revealing cell surface protein targets that are highly expressed in nsclc cancer models over normal lung bronchial epithelial cells.

depmap creative commons license

when an individual user runs candi, they download depmap data and thus agree to a cc attribution . license (https://creativecommons.org/licenses/by/ . /).

synthetic lethality of fanconi anemia genes in ovarian and breast cancer models

we made a list of the top gene hits that confer sensitivity to parp inhibition in hela cells. using candi, the essentiality scores of these top hits were visualized across all ovarian cancer cell models in pickles (avana q ). fanca and fance showed selective essentiality in the brca / mutant ovarian cancer cell lines. following this observation, candi was used to gather the gene essentiality for all fanc genes in the fanconi anemia pathway. candi was then used to visualize these data across all ovarian and breast cancer cell lines, sorted by brca / mutation status.

synthetic lethality in kras and egfr mutant cell lines

candi was used to bin nsclc cell lines present in both ccle (release: q ) and pickles (avana q ) into groups: kras mutant and kras wild type cell lines with and without egfr mutants removed, as well as egfr mutant and egfr wild type cell lines with and without kras mutants removed. the mean essentiality score for every gene in the genome was calculated for every group of cell lines.
synthetic lethality score per gene is defined as the change in mean essentiality from the mutant groups to the wild type groups.

pan cancer synthetic lethality analysis

a set of core oncogenes and tumor suppressor driver mutations was chosen for analysis. to test the effect of mutations in these genes on gene essentiality, candi was used to split them into two groups: a nonsense mutation group containing genes annotated as tumor suppressors (n= ) and a missense mutation group containing genes annotated as oncogenes with specific driver protein changes (n= ). candi was then used to collect a core set of genes with highly variable essentiality. to do this, the bayes factors from the pickles database (avana q ) were converted to binary numeric variables: bayes factors over were assigned a =essential and bayes factors under were assigned a =non-essential. genes were then sorted by their variance across cell lines, and genes between the th and th percentile were used for this analysis (n= ). to determine a short list of genes for follow-up, chi-squared tests were applied to the gene pairs in the missense group and the gene pairs in the tumor suppressor group. three new groups were formed for further analysis: the first consisted of the significant gene/mutation pairs from the oncogenic group, the second consisted of the significant gene/mutation pairs from the tumor suppressor group, and the third was a combination of the significant pairs from both groups with no discrimination on the type of mutations considered. these groups were further analyzed for differential essentiality via the mann-whitney method described above, and cohen's d effect sizes were calculated to measure the extent of the phenotype.
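the binarization, association test and effect size steps of the pan cancer analysis can be illustrated with the sketch below. it is not the authors' code: the function names are hypothetical, the bayes factor threshold is left as a parameter because its value is not recoverable from the text, and the test is assumed to be a pearson chi-squared test on a 2x2 table of mutation status against binarized essentiality, which the text does not spell out. the variance percentile filter is omitted for brevity.

```python
import math
import statistics


def binarize(bayes_factors, threshold):
    """1 = essential (BF above threshold), 0 = non-essential."""
    return [1 if bf > threshold else 0 for bf in bayes_factors]


def chi_squared_2x2(mutated, essential):
    """Pearson chi-squared test (1 d.o.f., no continuity correction) on two 0/1 vectors."""
    a = sum(1 for m, e in zip(mutated, essential) if m and e)
    b = sum(1 for m, e in zip(mutated, essential) if m and not e)
    c = sum(1 for m, e in zip(mutated, essential) if not m and e)
    d = sum(1 for m, e in zip(mutated, essential) if not m and not e)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # an empty margin means the table carries no signal
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of chi-squared with 1 d.o.f.: P(Z^2 > x) = erfc(sqrt(x / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def cohens_d(x, y):
    """Cohen's d using the pooled sample standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var)
```

in this sketch `mutated` and `essential` are equal-length 0/1 vectors over cell lines, so each gene/mutation pair yields one 2x2 table. `cohens_d` would then be applied to the raw bayes factors of the mutant and wild type groups for the pairs that pass the chi-squared filter.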
differential expression and essentiality of male and female kras driven cancers

we used candi to gather all cell lines that are present in both pickles (avana q ) and ccle (release q ). candi was then used to put these cell lines into the following tissue groups: kras mutant colon/colorectal, pdac, and nsclc. each tissue group was then split into male and female sub-groups. differential expression was analyzed by applying the methods described above to raw rna-seq counts data from ccle (release: q ). genes with adjusted p-values less than . were considered significantly differentially expressed. differential essentiality was analyzed using the methods described above on the previously described sex sub-groups for each tissue type. genes with p-values less than . were considered significantly differentially essential between male and female cell models. for each tissue type, the distributions of the top significantly differentially essential genes were highlighted in comparison with the bottom as a negative control.

differential expression of benign and malignant cancer cell lines

we downloaded human bronchial epithelial (hbe) rna-seq data from gillen et al via the european nucleotide archive to use as a benign lung tissue model. this data set contains gene expression data for primary hbe cells cultured from three different donors and also nhbe cells (lonza cc- , a mixture of hbe and human tracheal epithelial cells). we then used candi to put nsclc models into three different groups: kras mutant, egfr mutant, and all cell lines. for our benign model, raw counts were quantified via kallisto. raw counts for our malignant cell lines were queried via candi.
deseq was then applied to evaluate the differential expression between our normal lung tissue model and our three malignant lung tissue groups. the results from deseq were then filtered by significance (adjusted p-value < . ). to filter for potential immunotherapy targets, we removed all genes not annotated as being localized to the plasma membrane, as well as genes with localization confidence scores lower than six. genes that were obviously mis-annotated as surface proteins were also manually removed.

supplementary figure/table legends

supplementary figure . an object-oriented schema diagram showing core structure of candi software.
supplementary table . a table containing raw pickles bayes factors displayed in the heat map of fig. e.
supplementary table . a table containing mean pickles bayes factors for each series displayed in fig. a,b.
supplementary table . a table containing the data for all chi-squared tests performed to generate fig. c,d.
supplementary table . a table containing the data for scatter plots shown in fig. e,f,g.
supplementary table . a table containing the data from the differential essentiality analysis for all three tissues in fig. a-f.
supplementary table . a table containing the data from the differential expression analysis for all three tissues in fig. a,c,e.
supplementary table .
a table containing the differential expression analysis data merged with the location data for all three tissues shown in fig. .

acknowledgements

we thank everyone in the gilbert lab for helpful comments and discussion. lag is supported by k /r ca and dp ca as well as the goldberg-benioff endowed professorship in prostate cancer translational biology.

conflicts of interest

none

bibliography

ghandi, m. et al. next-generation characterization of the cancer cell line encyclopedia. nature , – ( ).
li, h. et al. the landscape of cancer cell line metabolism. nat. med. , – ( ).
tsherniak, a. et al. defining a cancer dependency map. cell , - .e ( ).
thul, p. j. et al. a subcellular map of the human proteome. science , ( ).
cancer cell line encyclopedia consortium & genomics of drug sensitivity in cancer consortium. pharmacogenomic agreement between two cancer cell line data sets. nature , – ( ).
barretina, j. et al. the cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. nature , – ( ).
bausch-fluck, d. et al. the in silico human surfaceome. pnas , e –e ( ).
giurgiu, m. et al. corum: the comprehensive resource of mammalian protein complexes- . nucleic acids res. , d –d ( ).
nusinow, d. p. et al. quantitative proteomics of the cancer cell line encyclopedia. cell , - .e ( ).
szklarczyk, d. et al. the string database in : quality-controlled protein-protein association networks, made broadly accessible. nucleic acids res. , d –d ( ).
itzhak, d. n., tyanova, s., cox, j. & borner, g. h. global, quantitative and dynamic mapping of protein subcellular localization. elife , ( ).
meyers, r. m. et al. computational correction of copy number effect improves specificity of crispr-cas essentiality screens in cancer cells. nat. genet. , – ( ).
behan, f. m. et al. prioritization of cancer therapeutic targets using crispr–cas screens. nature , – ( ).
wang, t. et al. identification and characterization of essential genes in the human genome. science , – ( ).
hart, t. et al. high-resolution crispr screens reveal fitness genes and genotype-specific cancer liabilities. cell , – ( ).
wang, t. et al. gene essentiality profiling reveals gene networks and synthetic lethal interactions with oncogenic ras. cell , - .e ( ).
chan, e. m. et al. wrn helicase is a synthetic lethal target in microsatellite unstable cancers. nature , – ( ).
adamson, b. et al. a multiplexed single-cell crispr screening platform enables systematic dissection of the unfolded protein response. cell , - .e ( ).
wainberg, m. et al. a genome-wide almanac of co-essential modules assigns function to uncharacterized genes. http://biorxiv.org/lookup/doi/ . / ( ) doi: . / .
lenoir, w. f., lim, t. l. & hart, t. pickles: the database of pooled in-vitro crispr knockout library essentiality screens. nucleic acids res , d –d ( ).
bausch-fluck, d. et al. a mass spectrometric-derived cell surface protein atlas. plos one , ( ).
o’connor, m. j. targeting the dna damage response in cancer. mol. cell , – ( ).
zimmermann, m. et al. crispr screens identify genomic ribonucleotides as a source of parp-trapping lesions. nature , – ( ).
pan, x. et al. fancm, brca , and blm cooperatively resolve the replication stress at the alt telomeres. pnas , e –e ( ).
lou, k., gilbert, l. a. & shokat, k. m. a bounty of new challenging targets in oncology for chemical discovery. biochemistry , – ( ).
narayan, g. et al. promoter hypermethylation of fancf: disruption of fanconi anemia-brca pathway in cervical cancer. cancer res , – ( ).
ideker, t., dutkowski, j. & hood, l. boosting signal-to-noise in complex biology: prior knowledge is power. cell , – ( ).
chang, m. t. et al. identifying recurrent mutations in cancer reveals widespread lineage diversity and mutational specificity. nat. biotechnol. , – ( ).
lou, k. et al. krasg c inhibition produces a driver-limited state revealing collateral dependencies. sci signal , ( ).
cancer disparities - national cancer institute. https://www.cancer.gov/about-cancer/understanding/disparities ( ).
love, m. i., huber, w. & anders, s. moderated estimation of fold change and dispersion for rna-seq data with deseq . genome biology , ( ).
rubin, j. b. et al. sex differences in cancer mechanisms. biol sex differ , ( ).
gillen, a. e. et al. molecular characterization of gene regulatory networks in primary human tracheal and bronchial epithelial cells. j. cyst. fibros. , – ( ).
mj, k. et al. prognostic significance of cd overexpression in non-small cell lung cancer. lung cancer (amsterdam, netherlands) vol. https://pubmed.ncbi.nlm.nih.gov/ / ( ).
ko, y. h. et al. prognostic significance of cd s expression in resected non-small cell lung cancer. bmc cancer , ( ).
penno, m. b. et al. expression of cd in human lung tumors. cancer res , – ( ).
bailey, m. h. et al. comprehensive characterization of cancer driver genes and mutations. cell , - .e ( ).
bray, n. l., pimentel, h., melsted, p. & pachter, l. near-optimal probabilistic rna-seq quantification. nat biotechnol , – ( ).

[figure: schematic, panels a-e. a genome-scale crispr sgrna library is delivered to hela cells by lentiviral transduction, cells are treated with olaparib or left untreated, and sgrna abundance is counted by deep sequencing to measure gene/drug phenotypes; breast, cervical and ovarian cancer cell lines are shown; candi (cancer data integrator) integrates essentiality, mutation, cellular genomics, functional genomics, transcriptomics and proteomics data.]

[figure: panels a-f, for colon, pancreas and lung. scatter plots of differential expression (log(fc)) against differential essentiality (Δ average bf) between male and female cell lines, with genes marked as non-significant, differentially expressed, differentially essential, or shown in violin plots; violin plots of bayes factor by gene for top hits and negative controls in female and male cell lines, with the essential gene threshold marked.]

[figure: panels a-f, for kras mutant, egfr mutant and all lung cancer groups. graphs of -log(q value) against log(fold change) with points annotated by location confidence; scatter plots of log(tpm + ) by gene for benign bronchial and malignant cell line types.]

[figure: panels a-g. gene essentiality (average bf) in kras mutant versus kras wild type cell lines with egfr mutants included or removed, and in egfr mutant versus egfr wild type cell lines with kras mutants included or removed, with the essential gene threshold marked; essentiality by mutation for missense oncogenes, nonsense tumor suppressor genes and context specific mutations, with effect sizes and p-values for significant hits and non-hits.]
ancestralclust: clustering of divergent nucleotide sequences by ancestral sequence reconstruction using phylogenetic trees

lenore pipes ,∗ and rasmus nielsen , , ∗

department of integrative biology, university of california-berkeley, berkeley, , usa, department of statistics, university of california-berkeley, berkeley, ca , usa, and globe institute, university of copenhagen, københavn k, denmark

∗to whom correspondence should be addressed.

abstract

motivation: clustering is a fundamental task in the analysis of nucleotide sequences. despite the exponential increase in the size of sequence databases of homologous genes, few methods exist to cluster divergent sequences. traditional clustering methods have mostly focused on optimizing high speed clustering of highly similar sequences. we develop a phylogenetic clustering method which infers ancestral sequences for a set of initial clusters and then uses a greedy algorithm to cluster sequences.

results: we describe a clustering program ancestralclust, which is developed for clustering divergent sequences. we compare this method with other state-of-the-art clustering methods using datasets of homologous sequences from different species. we show that, in divergent datasets, ancestralclust has higher accuracy and more even cluster sizes than current popular methods.
availability and implementation: ancestralclust is an open source program available at https://github.com/lpipes/ancestralclust

contact: lpipes@berkeley.edu

supplementary information: supplementary figures and table are available online.

introduction

traditional clustering methods such as uclust (edgar, ), cd-hit (fu et al., ), and dnaclust (ghodsi et al., ) use hierarchical or greedy algorithms that rely on user input of a sequence identity threshold. these methods were developed for high speed clustering of large numbers of highly similar sequences (ghodsi et al., ; li et al., ; edgar, ) and are generally considered unreliable for identity thresholds < % because of either the poor quality of alignments at low identities (zou et al., ) or because the performance of the threshold used to count short words drops dramatically with low identities (huang et al., ). at low identities, these methods produce uneven clusters where the majority of sequences are contained in only a few clusters (chen et al., ), and the high variance in cluster sizes reduces the utility of the clustering step for many practical purposes. clustering of divergent sequences is a fundamental step in genomics analysis because it allows for an early divide-and-conquer strategy that will significantly increase the speed of downstream analyses (zheng et al., ), and it is a frequent request of users of at least one clustering method (huang et al., ). currently, there are no clustering methods that can accurately cluster large taxonomically divergent metabarcoding reference databases such as the barcode of life database (ratnasingham and hebert, ) into relatively even clusters. only a few other methods, such as spclust (matar et al., ) and treecluster (balaban et al., ), exist for clustering potentially divergent sequences.
spclust creates clusters using laplacian eigenmaps and a gaussian mixture model based on a similarity matrix calculated on all input sequences. while this approach is highly accurate, the calculation of an all-to-all similarity matrix is a computationally exhaustive step. treecluster uses user-specified constraints for splitting a phylogenetic tree into clusters. however, treecluster requires an input tree and thus can also be prohibitively slow for large numbers of sequences where a phylogenetic tree is difficult to estimate reliably. with the increasing size of reference databases (schoch et al., ), there is a need for new computationally efficient methods that can cluster divergent sequences. here we present ancestralclust, which was specifically developed for clustering of divergent metabarcoding reference sequences into clusters of relatively even size.

(biorxiv preprint, not certified by peer review; this version posted january , . ; doi: https://doi.org/ . / . . . ; the author/funder has granted biorxiv a license to display the preprint in perpetuity; made available under a cc-by . international license, http://creativecommons.org/licenses/by/ . /)

methods

to cluster divergent sequences, we developed ancestralclust, which is written in c (figure ). first, k random sequences are chosen and aligned pairwise using the wavefront algorithm (marco-sola et al., ). a jukes-cantor distance matrix is constructed from the alignments, and a neighbor-joining phylogenetic tree is constructed from the distance matrix. the jukes-cantor model is chosen for computational speed; more complex models could in principle be used to potentially increase accuracy, at the cost of increased computational time. the c − 1 longest branches in the tree are then cut to yield c clusters. these subtrees comprise the initial starting clusters.
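as an illustration of the first steps above, the sketch below computes a jukes-cantor distance from a pairwise alignment and selects which branches to cut. it is written in python for brevity and is not taken from the ancestralclust c source; the function names and the dictionary of branch lengths are hypothetical, the alignment and neighbor-joining steps are not reproduced, and the number of cuts is assumed to be c − 1, since cutting c − 1 branches of a tree leaves c subtrees.

```python
import math


def jukes_cantor_distance(seq_a, seq_b):
    """JC69 distance between two aligned sequences; gap and ambiguous columns are skipped."""
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites in the alignment")
    p = sum(a != b for a, b in pairs) / len(pairs)  # proportion of mismatched sites
    if p == 0:
        return 0.0
    if p >= 0.75:
        return math.inf  # saturated: the JC69 correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)


def branches_to_cut(branch_lengths, c):
    """Ids of the c - 1 longest branches; cutting them leaves c subtrees."""
    ranked = sorted(branch_lengths, key=branch_lengths.get, reverse=True)
    return ranked[:c - 1]
```

for example, `jukes_cantor_distance("ACGTACGT", "ACGTACGA")` has one mismatch in eight sites (p = 0.125), and the correction makes the distance slightly larger than p to account for unobserved multiple substitutions.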
the sequences in each starting cluster are aligned in a multiple sequence alignment using kalign (lassmann, ). the ancestral sequence at the root of the tree of each cluster is estimated by taking the nucleotide with the maximum posterior probability at each site, using standard programming algorithms from phylogenetics (see e.g., yang, ). the ancestral sequences are used as the representative sequence for each cluster. next, the rest of the sequences are assigned to clusters based on the shortest nucleotide distance, from the wavefront alignment, between the sequence and the c ancestral sequences. if the shortest distance to any of the c ancestral sequences is larger than the average distance between clusters, the sequence is saved for the next iteration. we iterate this process until all sequences are assigned to a cluster. in each iteration after the first, a branch in the phylogenetic tree is cut if it is longer than the average length of the branches cut in the first iteration. in practice, only one or two iterations are needed for most data sets if k is sufficiently large.

we compared ancestralclust to five other state-of-the-art clustering methods: uclust (edgar, ), meshclust (james and girgis, ), dnaclust (ghodsi et al., ), cd-hit (fu et al., ), and spclust (matar et al., ). we used a variety of measurements to assess the accuracy and evenness of the clustering. we calculated two traditional measures of accuracy, purity and normalized mutual information (nmi), as used in bonder et al. ( ). the purity of clusters is calculated as:

$$\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j|$$

where $\Omega = \{\omega_1, \omega_2, \ldots, \omega_K\}$ is the set of clusters, $C = \{c_1, c_2, \ldots, c_J\}$ is the set of taxonomic classes and $N$ is the total number of sequences. nmi is calculated as:

$$\mathrm{NMI}(\Omega, C) = \frac{I(\Omega, C)}{[H(\Omega) + H(C)]/2}$$

where mutual information gain is $I(\Omega, C)$ and $H$ is the entropy function.
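the two accuracy measures can be computed directly from a pair of label vectors, one giving the cluster of each sequence and one giving its taxonomic class. the sketch below follows the standard definitions of purity and nmi with the arithmetic mean of the two entropies in the denominator; the function names are illustrative and this is not code from the ancestralclust repository.

```python
import math
from collections import Counter


def purity(clusters, classes):
    """Fraction of sequences that belong to the majority class of their cluster."""
    by_cluster = {}
    for k, c in zip(clusters, classes):
        by_cluster.setdefault(k, Counter())[c] += 1
    return sum(max(counts.values()) for counts in by_cluster.values()) / len(clusters)


def entropy(labels):
    """Shannon entropy (natural log) of a label vector."""
    n = len(labels)
    return -sum((m / n) * math.log(m / n) for m in Counter(labels).values())


def mutual_information(clusters, classes):
    """Mutual information I(clusters, classes) from the joint label counts."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    n_k, n_c = Counter(clusters), Counter(classes)
    return sum((n_kc / n) * math.log(n * n_kc / (n_k[k] * n_c[c]))
               for (k, c), n_kc in joint.items())


def nmi(clusters, classes):
    """Mutual information divided by the mean of the two entropies."""
    denom = (entropy(clusters) + entropy(classes)) / 2
    return mutual_information(clusters, classes) / denom if denom > 0 else 0.0
```

a clustering that reproduces the classes exactly gives purity and nmi of 1, while a clustering that is independent of the classes gives an nmi of 0. purity alone can be inflated by producing many small clusters, which is one reason the two measures are reported together.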
to measure the evenness of the clusters, we used the coefficient of variation, which is calculated as:

$$\mathrm{CV} = \frac{\sqrt{\sum_{i=1}^{J} (n_i - m)^2 / J}}{m}$$

where $n_i$ is the number of sequences in cluster $i$, $J$ is the total number of clusters, and $m$ is the mean size of the clusters. we also used a taxonomic incompatibility measure to assess the accuracy of the clusters. let a,b be a pair of species found in cluster i. incompatibility at a given taxonomic rank is calculated by first identifying the number of times a and b exist in clusters other than cluster i. the total incompatibility is calculated by summing over all pairs of sequences (a,b) and all i. both nmi and taxonomic incompatibility are very sensitive to the number of clusters and also to unevenness of cluster sizes. to allow fair comparison when numbers of clusters and evenness of cluster sizes vary, we therefore calculate the relative nmi and relative incompatibility. these measures are calculated by scaling them relative to their expected values under random assignments given the number of clusters and the cluster sizes. we estimated relative nmi by dividing the raw nmi score by the average nmi of clusterings in which sequences have been assigned at random with equal probability to clusters, such that the cluster sizes are the same as the cluster sizes produced in the original clustering. the same procedure was used to convert the taxonomic incompatibility measure into relative incompatibility.

results

to first assess performance of clustering methods on divergent nucleotide sequences, we used random samples of , sequences from three metabarcode reference databases ( , s, and cytochrome oxidase i (coi)) from the caledna project (meyer et al., ). we chose to compare our method on this dataset against uclust because it is the most widely used clustering program and it performs better than cd-hit on low identity thresholds (chen et al., ).
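the evenness measure and the relative scaling defined in the methods can be sketched as follows. the coefficient of variation follows the formula given there. for the relative measure, random assignment with fixed cluster sizes is implemented here by shuffling the cluster labels, which is one way to satisfy the stated constraint; the number of replicates, the seed and the function names are assumptions, and the scoring function (for example an nmi or incompatibility implementation) is passed in by the caller.

```python
import math
import random


def coefficient_of_variation(cluster_sizes):
    """Population standard deviation of cluster sizes divided by their mean."""
    j = len(cluster_sizes)
    m = sum(cluster_sizes) / j
    return math.sqrt(sum((n - m) ** 2 for n in cluster_sizes) / j) / m


def relative_score(score_fn, clusters, classes, replicates=100, seed=0):
    """Raw score divided by its mean under size-preserving random assignments."""
    rng = random.Random(seed)
    raw = score_fn(clusters, classes)
    shuffled = list(clusters)
    total = 0.0
    for _ in range(replicates):
        rng.shuffle(shuffled)  # permuting labels keeps every cluster size fixed
        total += score_fn(shuffled, classes)
    expected = total / replicates
    return raw / expected if expected > 0 else math.inf
```

a relative score above 1 means the clustering agrees with the taxonomy better than a random clustering with the same cluster sizes would, which is what makes scores comparable across methods that return different numbers of clusters.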
we first compared ancestralclust against uclust using relative nmi and coefficient of variation (figure ). we used k = random initial sequences, which is % of the total number of sequences in each sample, and c = cuts in the initial phylogenetic tree. notice that the relative nmi tends to be higher with a lower coefficient of variation for ancestralclust across all barcodes. this suggests that, for these divergent edna sequences, ancestralclust provides clusterings that are more even in size and more consistent with conventional taxonomic assignment.

as a second measure of accuracy, we measured relative incompatibility and coefficient of variation for ancestralclust and uclust on the same datasets under the same running conditions. notice in figure that ancestralclust tends to create balanced clusters with lower relative taxonomic incompatibilities compared to uclust at all taxonomic levels. similar results are seen for metabarcode s (fig s ). however, for metabarcode s (fig s ), ancestralclust performs noticeably better than uclust at the species, genus, and family levels, but at the order, class, and phylum levels it performs either the same or worse. also, at the species, genus, and family levels, it is apparent that as the uclust clusters approach a lower coefficient of variation, the relative incompatibility increases dramatically.

next, we analyzed two datasets with different properties: one dataset of diverse species from the same gene and another dataset of homologous genes from species of the same phyla.
in the first dataset, we expect the sequences to cluster according to species. in the second dataset, we expect the sequences to cluster according to different genes. we compared ancestralclust to four commonly used clustering programs (uclust, meshclust , cd-hit, and dnaclust) and one clustering program designed for divergent sequences, spclust. the first dataset contained , sequences from the coi caledna database from divergent species that were from different phyla and different classes, and the second dataset contained sequences from different genes from taxonomically similar species.

first, we compared all methods using , coi sequences from the different species (table ). we expect these sequences to form different clusters, each including all the sequences from one species. we chose identity thresholds to enforce the expected number of clusters for each method. we were unable to form clusters using cd-hit because the program does not allow clustering of sequences with identity thresholds < % at default parameters. for spclust, we used the three precision modes available for the method. in this analysis, ancestralclust achieved a perfect clustering (the purity was and relative incompatibility was ), although it was the second slowest and had the second lowest memory requirements. uclust was one of the fastest methods and used the least amount of memory but had the second lowest purity with the third highest relative nmi values. meshclust had no incompatibilities and the second highest purity and relative nmi values but was the third slowest method. dnaclust had the most uneven clusters and the second lowest relative nmi value with the highest relative incompatibility. spclust only identified one cluster, with a computational time of ~ days. in comparison, ancestralclust took ~ minutes and uclust used < second.

next, we analyzed 'genomic set ' from matar et al.
( ), which consists of sequences from homologous genes (fcer g, s a , s a , s a , s a , and sh bgrl in table ). we expect these sequences to form clusters. we varied the identity thresholds for uclust and meshclust using thresholds . , . , and . . for cd-hit, we used the lowest identity threshold available on default parameters, which is . . we were unable to use dnaclust for this analysis because it cannot handle sequences longer than bp (the average sequence length was , . bp and the longest sequence was , bp). since this dataset contained different genes, we calculated relative nmi using genes as the classes and did not use incompatibility as an accuracy measure. only ancestralclust, uclust, and meshclust produced the expected number of clusters, and among the methods that created the expected number of clusters, ancestralclust had the highest purity value. ancestralclust was the second slowest method and had the highest memory requirements, which is due to the wavefront alignment algorithm, whose memory requirement is O(s^2), where s is the alignment score. since alignments were performed using different genes that were longer than . kb, this resulted in a high value of s. spclust had the highest relative nmi using all precision modes and the same purity as ancestralclust for its moderate and maximum precision modes, but failed to produce the expected number of clusters.

conclusions

we developed a phylogenetic-based clustering method, ancestralclust, specifically to cluster divergent metabarcode sequences. we performed a comparative study between ancestralclust and widely used clustering programs such as uclust, cd-hit, dnaclust, meshclust , and, for divergent sequences, spclust. uclust and dnaclust are substantially faster than ancestralclust and should be the preferred method if computational speed is the main concern.
however, ancestralclust tends to form clusters of more even size with lower taxonomic incompatibility and higher nmi than other methods, for the relatively divergent sequences analyzed here. we recommend the use of ancestralclust when sequences are divergent, especially if a relatively even clustering is also desirable, for example for various divide-and-conquer approaches where computational speed of downstream analyses increases faster than linearly with cluster size.

acknowledgements

this work used the extreme science and engineering discovery environment (xsede) bridges system at the pittsburgh supercomputing center through allocation bio .

references

balaban, m., moshiri, n., mai, u., jia, x., and mirarab, s. ( ). treecluster: clustering biological sequences using phylogenetic trees. plos one, ( ), e .
bonder, m. j., abeln, s., zaura, e., and brandt, b. w. ( ). comparing clustering and pre-processing in taxonomy analysis. bioinformatics, ( ), – .
chen, q., wan, y., zhang, x., lei, y., zobel, j., and verspoor, k. ( ). comparative analysis of sequence clustering methods for deduplication of biological databases. j. data and information quality, ( ).
edgar, r. c. ( ). search and clustering orders of magnitude faster than blast. bioinformatics, ( ), – .
fu, l., niu, b., zhu, z., wu, s., and li, w. ( ). cd-hit: accelerated for clustering the next-generation sequencing data. bioinformatics, ( ), – .
ghodsi, m., liu, b., and pop, m. ( ). dnaclust: accurate and efficient clustering of phylogenetic marker genes. bmc bioinformatics, ( ), – .
huang, y., niu, b., gao, y., fu, l., and li, w. ( ). cd-hit suite: a web server for clustering and comparing biological sequences. bioinformatics, ( ), – .
james, b. t. and girgis, h. z. ( ). meshclust : application of alignment-free identity scores in clustering long dna sequences. biorxiv, page .
lassmann, t. ( ). kalign : multiple sequence alignment of large datasets.
li, w., jaroszewski, l., and godzik, a. ( ). clustering of highly homologous sequences to reduce the size of large protein databases. bioinformatics, ( ), – .
marco-sola, s., moure lópez, j. c., moreto planas, m., and espinosa morales, a. ( ). fast gap-affine pairwise alignment using the wavefront algorithm. bioinformatics, (btaa ), – .
matar, j., khoury, h. e., charr, j.-c., guyeux, c., and chrétien, s. ( ). spclust: towards a fast and reliable clustering for potentially divergent biological sequences. computers in biology and medicine, , .
meyer, r. s., curd, e. e., schweizer, t., gold, z., ramos, d. r., shirazi, s., kandlikar, g., kwan, w.-y., lin, m., freise, a., et al. ( ). the california environmental dna “caledna” program. biorxiv, page .
ratnasingham, s. and hebert, p. d. ( ). bold: the barcode of life data system (http://www.barcodinglife.org). molecular ecology notes, ( ), – .
schoch, c. l., ciufo, s., domrachev, m., hotton, c. l., kannan, s., khovanskaya, r., leipe, d., mcveigh, r., o’neill, k., robbertse, b., et al. ( ). ncbi taxonomy: a comprehensive update on curation, resources and tools. database, .
yang, z. ( ). molecular evolution: a statistical approach. oxford university press.
zheng, w., mao, q., genco, r. j., wactawski-wende, j., buck, m., cai, y., and sun, y. ( ). a parallel computational framework for ultra-large-scale sequence clustering analysis. bioinformatics, ( ), – .
zou, q., lin, g., jiang, x., liu, x., and zeng, x. ( ). sequence clustering in bioinformatics: an empirical study. briefings in bioinformatics, ( ), – .
figure . overview of ancestralclust. in ( ), k random sequences are chosen for the initial clusters. in ( ), a distance matrix is constructed using the k sequences. using the distance matrix, a neighbor-joining tree is constructed and c − cuts are made to create c clusters. in ( ), each cluster is aligned in a multiple sequence alignment and the ancestral sequences are reconstructed in the root node of each tree. the rest of the unassigned sequences are then aligned to the ancestral sequences of each cluster and the shortest distance to each ancestral sequence is calculated. the process is iterated until all sequences are assigned to a cluster.

figure . relative nmi against coefficient of variation for ancestralclust and uclust for samples of , randomly chosen s, s, and coi reference sequences from the caledna project (meyer et al., ). the similarity threshold for uclust was . . for ancestralclust, we used initial random sequences with initial clusters. relative nmi was calculated by dividing nmi by the average of random samples of the same fixed cluster size.

figure .
relative incompatibility against coefficient of variation for ancestralclust and uclust for samples of , randomly chosen coi reference sequences. coi reference sequences are from the caledna project (meyer et al., ). the similarity threshold for uclust was . . for ancestralclust, we used initial random sequences with initial clusters.

table . comparisons of clustering methods using , coi sequences from different species. the list of species can be found in table s . incompatibility was calculated at the taxonomic rank of species. for uclust, meshclust , and dnaclust, the identity thresholds were chosen to force the expected number of clusters. for cd-hit, the lowest possible identity was chosen, which is . . in the case of spclust, coefficient of variation cannot be calculated for cluster. spclust clusters were created with version .
method # of clusters time (sec) mem (mb) purity relative incompat. (species) relative nmi coeff. of var.
ancestralclust . . . .
uclust < . . . . .
meshclust . . . . .
cd-hit . . . . .
dnaclust < . . . . .
spclust (fast) . . -
spclust (moderate) . . -
spclust (maxprecision) . . -

table . comparisons of clustering methods using sequences from homologous genes from matar et al. ( ). 'id' refers to the identity threshold used. we used identity thresholds of . , . , and . for uclust and meshclust . we used precision levels of fast, moderate, and maximum for spclust using version since version only produced cluster for all modes. dnaclust has a maximum sequence length of bp and could not be used on this dataset.
method # of clusters time (sec) memory (mb) purity relative nmi coefficient of variation
ancestralclust . . . . .
uclust (id= . ) . . . .
uclust (id= . ) . . . .
uclust (id= . ) . . . . .
meshclust (id= . ) . . . . .
meshclust (id= . ) . . . . .
meshclust (id= . ) . . . . .
spclust (fast) . . . . .
spclust (moderate) . . . . .
spclust (max precision) . . . . .
cd-hit (id= . ) . . . . .
dnaclust - - - - - -

rdrugtrajectory: an r package for the analysis of drug prescriptions in electronic health care records

jss journal of statistical software mmmmmm yyyy, volume vv, issue ii. doi: . /jss.v .i

anthony nash university of oxford
tingyee e. chang university of oxford
benjamin wan kings college london
m. zameel cader university of oxford

abstract

primary care electronic health care records are rich with patient and clinical information. studying electronic health care records has resulted in marked improvements to national health care processes and patient-care decision making, and is a powerful supplementary source of data for drug discovery efforts. we present the r package rdrugtrajectory, designed to yield demographic and patient-level characteristics of drug prescriptions in the uk clinical practice research datalink dataset.
the package operates over clinical practice research datalink gold clinical, referral and therapy datasets and includes features such as first drug prescriptions analysis, cohort-wide prescription information, cumulative drug prescription events, the longitudinal trajectory of drug prescriptions, and a survival analysis timeline builder to identify risks related to drug prescription switching. the rdrugtrajectory package has been made freely available via the github repository.

keywords: ehr, electronic health care records, cprd, clinical practice research datalink, prescriptions, r, therapeutics, drug discovery, clinical epidemiology.

. introduction

the uk clinical practice research datalink (cprd) service offers high quality longitudinal data on million patients with up to years of follow-up for % of those patients. the service provides drug treatment patterns, feasibility studies and health care resource use studies. patient electronic health care records (ehr) are stored as coded and anonymised data and sourced from over , primary care practices across england. cprd holds information on consultation events, medical diagnoses, symptoms, prescriptions, vaccination history, laboratory tests, and referrals. cprd can provide routine linkage to other health-related patient datasets, for example: small area level data, such as patient and/or practice postcode
.cc-by-nd . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint http://dx.doi.org/ . /jss.v .i https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nd/ . 
linked deprivation measures; data from nhs digital, which includes hospital episode statistics, outpatient and accident and emergency data; and cancer data from public health england.

evidence from ehrs is making an impact on primary care decision-making and best practice (oyinlola et al., ). with nationwide longitudinal datasets more readily available, the evaluation of treatments over long timescales can contribute to clinical decision-making (hepp et al., ). for example, adverse events caused by prescription medication can be studied using retrospective data in situations where randomized clinical trials may prove impractical (ghosh et al., ; bally et al., ).

this publication serves as an introduction to the rdrugtrajectory r package. whilst it is by no means a complete tutorial, we expand on some of the main package features, such as how to: isolate patients by first drug prescriptions at given clinical events; calculate time-invariant prescriptions; construct survival analysis timelines (compatible with cox proportional hazard regression and kaplan meier curves); and visualise patient prescription switching. for a comprehensive list of functions please visit the github repository https://github.com/acnash/rdrugtrajectory. almost all features can be controlled by covariates or stratified by some variable, for example, by gender, age, medical codes or treatment product codes.

the example code, figures and data structures presented here mimic a small fraction of our own research. in the interest of patient confidentiality, the clinical data used in the analysis have been fabricated. we present a brief tour of some of the functions available, starting with a discussion on the cprd data structure and how records must be formatted. a glossary of terms has been provided (table ) to assist the reader.

. rdrugtrajectory package and data structures
. .
rdrugtrajectory availability and installation

rdrugtrajectory is free to download from the github repository https://github.com/acnash/rdrugtrajectory and holds an mit license. fabricated cprd clinical and cprd prescription records, in addition to age, gender and index of multiple deprivation scores, are included for test and tutorial purposes. before installing the package, the following r dependencies are required: plyr, dplyr, foreach, doparallel, data.table, parallel, splus r, rlist, reda, ggplot , ggalluvial, stats, utils and useful. the latest rdrugtrajectory binary is installed using:

install.packages("path/to/tar/file", source = TRUE, repos = NULL)

rdrugtrajectory was developed and tested on r version . . . please consult the github page for release notes, the latest version and up to date installation instructions.

. . cprd product description

several rdrugtrajectory functions use the cprd product.txt file for assigning a text description to a prescription prodcode. the product.txt file (and medical.txt for medcode descriptions) is available in the cprd data dictionary windows software. it is important that the file

term: description
rdrugtrajectory: an r package designed for the management of cprd prescription data.
clinical: the clinicalnnn.txt dataset presented in a rdrugtrajectory dataframe.
referral: the referralnnn.txt dataset presented in a rdrugtrajectory dataframe.
therapy: the therapynnn.txt dataset presented in a rdrugtrajectory dataframe.
additionalnnn.txt: the cprd dataset of additional clinical information, for example, patient smoking status and alcohol consumption. data can be retrieved using cprdlookups.r.
medcode: a cprd identifier that denotes medical conditions, diagnoses and complaints made by a patient. medcodes are recorded in the clinicalnnn.txt and referralnnn.txt files.
prodcode: a cprd identifier that denotes treatment products, including drugs, foods, and medical apparatus. prodcodes are recorded in the therapynnn.txt files.
patid: a unique cprd patient identifier. used to link datasets.
event: any prodcode or medcode in a patient’s ehr.
eventdate: the date of an event recorded by a general practitioner. present in all three datasets and the corresponding rdrugtrajectory dataframes.
imd: index of multiple deprivation score - a uk government socioeconomic measurement based on the postcode of the clinic or a patient’s registered address.
prescription: a general term for any prodcode prescribed for treatment.
medical history: indicates a combination of one or more sets of cprd data, for example, the collection of all clinical and therapy ehr for patients with a medcode for migraine.
product.txt: a plain text file that contains all prodcodes with a description and comes bundled with the cprd data dictionary. the file is used to link a prodcode with a description.
table : table of frequently used terms.

remains in plain text, with columns tab-delimited. the files can be simplified by removing all non-essential products. finally, all eleven columns that make up the product.txt file must be available, with the first column containing all prodcodes and the fourth column containing the product description. a simplified product.txt file, presented below, can be downloaded from the github page.
> library(rdrugtrajectory)
> productdf <- read.csv("../rdrugtrajectory_data/product.txt",
+ sep="\t",
+ header=FALSE)
> head(productdf)
v v v v v
atenolol mg tablets atenolol
atenolol mg tablets atenolol
atenolol mg tablets atenolol
amitriptyline mg tablets amitriptyline hydrochloride
lisinopril mg tablets lisinopril
lisinopril mg tablets lisinopril
v v v v
mg tablet oral
mg tablet oral
mg tablet oral
mg tablet oral / /
mg tablet oral
mg tablet oral
v
beta-adrenoceptor blocking drugs
beta-adrenoceptor blocking drugs
beta-adrenoceptor blocking drugs
tricyclic and related antidepressant drugs/neuropathic pain/prophylaxis of migraine
angiotensin-converting enzyme inhibitors
angiotensin-converting enzyme inhibitors
v v
feb-
feb-
feb-
feb-
feb-
feb-

. . rdrugtrajectory package structure

rdrugtrajectory contains three r files: ( ) all functions related to data curating and searching reside within prddrugtrajectory.r; ( ) analysis tools and timeline construction reside within cprddrugtrajectorystats.r; and ( ) all utilities, including input/output operations, reside within cprddrugtrajectoryutils.r. the package contains several fabricated cprd datasets: testclinicaldf, testtherapydf, agegenderdf, imddf, and druglistdf. a description of each, along with information on data types and structures, is given below.

. . the cprd ehr data structure

the structure of cprd gold data may depend on whether the cprd license holder performs intermediate data management steps before releasing data to the user.
however, cprd gold data typically follows the cprd gold specification https://cprdcw.cprd.com/_docs/cprd_gold_full_data_specification_v . .pdf. currently, rdrugtrajectory supports ehr data from the flat files clinicalnnn.txt, referralnnn.txt, and therapynnn.txt. the additional clinical details files (additionalnnn.txt) are currently supported using our released r script cprdlookups.r https://github.com/acnash/cprd_additional_clinical.

patients are assigned a unique numerical patid value. the operations performed by rdrugtrajectory require the patid to identify patients and subset patient groups. we recommend that patid, medcode and prodcode are kept as character data throughout any preliminary data curating steps. medical events are recorded as codes and stored in the clinicalnnn.txt and referralnnn.txt files under the column header medcode. prescription events, such as drug prescriptions, are also recorded as codes and stored in the therapynnn.txt file under the column header prodcode, and the sequences of repeat prescriptions are under the issueseq column header. dates associated with medical and prescription events, recorded by the general practitioner, are stored under the column header eventdate.

. .
essential data types and data structures

rdrugtrajectory can operate over cprd gold ehr clinical, referral and prescription data provided each dataset is presented as a separate r dataframe or combined into a rdrugtrajectory medical history dataframe. the construction of clinical, referral and prescription dataframes requires, as a minimum, a patid and eventdate column, and either medcode or prodcode (for therapy data, issueseq is also necessary), presented in that order. every record of medcode or prodcode must be accompanied by an eventdate entry (encoded as a date class of the form yyyy-mm-dd). patients can have duplicate events within the same data set and between data sets. medical and prescription codes can be retrieved from the corresponding medical.txt and product.txt files, which come bundled with the cprd data dictionary windows application. rdrugtrajectory comes packaged with fabricated ehr data in the structure of:

> library(rdrugtrajectory)
> #fabricated clinical data (referral data follows the same format)
> names(testclinicaldf)
[ ] "patid" "eventdate" "medcode" "consid"
> #fabricated prescription data
> names(testtherapydf)
[ ] "patid" "eventdate" "prodcode" "consid" "issueseq"

users can check if the structure of an ehr dataframe meets the requirements for this package by calling checkcprdrecord; additional columns such as consultation identification number (consid) are not considered. in the following instance, a prescription dataset with the required columns and the optional consultation identification number is presented.

> library(rdrugtrajectory)
> #check the structure of testtherapydf, specify that it is therapy data
> checkcprdrecord(df=testtherapydf, datatype="therapy")
[ ] "the data.frame is appropriately formatted. returning true."
[ ] TRUE
> #display the rdrugtrajectory ehr therapy dataframe
> str(testtherapydf, strict.width="wrap")
'data.frame': obs. of variables:
$ patid : int ...
$ eventdate: date, format: " - - " " - - " ...
$ prodcode : int ...
$ consid : int ...
$ issueseq : int ...

users can combine any number of patient and ehr data with the rdrugtrajectory ehr dataframes to act as covariates and stratifying variables. typically this can be done using the r cbind operation. for example, bmi and smoking status, both of which can be retrieved from the additionalnnn.txt dataset files using cprdlookups.r, can be linked by searching for and binding with the record patid values.

the rdrugtrajectory package contains several utility functions to retrieve cprd data, including patient year of birth, gender (male or female) and either patient-level or clinical-level index of multiple deprivation score (imd). the patient age can be determined by adding to the value in the yob column in the patient cprd ehr dataset and then subtracting that value (birth year) from the year of the cprd database release. this data requires preliminary treatment before presenting to the rdrugtrajectory package. patient age, gender and imd score must be presented in a dataframe with the linked patient column patid, along with the columns age, gender, and score. provided the patid column is preserved, patient characteristics can be presented in separate dataframes, for example:

> library(rdrugtrajectory)
> #patient age and gender as one dataframe
> str(agegenderdf, strict.width="wrap")
'data.frame': obs. of variables:
$ patid : int ...
$ yob : num ...
$ gender: int ...
> #clinic-level imd score as one dataframe
> str(imddf, strict.width="wrap")
'data.frame': obs. of variables:
$ patid : int ...
$ pracid: int ...
$ score : int ...

the patid patient identifier is fundamental in every operation performed by rdrugtrajectory. the examples presented here and those in the reference manual rely on searching and subsetting ehr data using a list or vector of patient identifiers. the function getuniquepatidlist will retrieve an r list of patient identification numbers from any dataframe with a patid column.

the aforementioned rdrugtrajectory ehr dataframes, clinical, referral and therapy, can be combined into a single dataframe. we refer to this dataset instance as the patient’s medical history, and it can be constructed using constructmedicalhistory. this dataframe expects events to be in chronological order, and will introduce two new columns, code and codetype, to denote each of the combined events. the code (medcode and/or prodcode) can be distinguished by a codetype value of c (clinical events), r (referral events), or t (prescription events). events are returned in chronological order using the eventdate data. the following code demonstrates how to retrieve a list of patient identifiers from a prescription dataframe and from a medical history dataframe, followed by how to subset using base r operations and, finally, the medical history dataframe structure.

> library(rdrugtrajectory)
> #retrieve patids from therapy data.
> idlist <- getuniquepatidlist(testclinicaldf)
> medhistorydf <- constructmedicalhistory(testclinicaldf, null, testtherapydf)
[ ] "using clinical data."
[ ] "using therapy data."
[ ] "building with clinical and therapy data."
> #retrieve patid from medical history.
> medhistoryidlist <- getuniquepatidlist(medhistorydf)
> numofpatients <- length(medhistoryidlist)
> #subset using the first patients.
> smallmedhistorydf <- subset(medhistorydf,
+     medhistorydf$patid %in% medhistoryidlist[ : ])
> #keep only the clinical records of those patients.
> smallclinicalonlydf <- subset(smallmedhistorydf,
+     smallmedhistorydf$codetype == "c")
> #keep only the therapy records of those patients.
> smalltherapyonlydf <- subset(smallmedhistorydf,
+     smallmedhistorydf$codetype == "t")
> #keep only those patient records beyond st jan .
> latermedhistorydf <- subset(medhistorydf,
+     medhistorydf$eventdate > as.date(" - - "))
> #medical history dataframe structure
> str(medhistorydf, strict.width="wrap")
'data.frame': obs. of variables:
 $ patid : int ...
 $ eventdate: date, format: " - - " " - - " ...
 $ code : int ...
 $ codetype : chr "c" "c" "c" "t" ...

the patid data can also be used to retrieve patient characteristics, for example, the gender of the patient using getgenderofpatients:

> library(rdrugtrajectory)
> idlist <- getuniquepatidlist(testtherapydf)
> #only use half of the cohort.
> idlist <- idlist[ :(length(idlist)/ )]
> #get gender data by specific gender.
> malecode <-
> femalecode <-
> malepatientsdf <- getgenderofpatients(idlist, agegenderdf, malecode)
> femalepatientsdf <- getgenderofpatients(idlist, agegenderdf, femalecode)
> #get all gender data
> allpatientsdf <- getgenderofpatients(getuniquepatidlist(testtherapydf),
+     agegenderdf)
> #structure of the patient gender data.
> str(allpatientsdf, strict.width="wrap")
'data.frame': obs. of variables:
 $ patid : int ...
 $ gender: int ...

imd data can be retrieved by combining the getuniquepatidlist and getimdofpatients functions:

> library(rdrugtrajectory)
> idlist <- getuniquepatidlist(testtherapydf)
> #get patients with an imd score of or
> onepatientsdf <- getimdofpatients(idlist, imddf, )
> twopatientsdf <- getimdofpatients(idlist, imddf, )
> #get all imd scores for all patients in testtherapydf
> allpatientsdf <- getimdofpatients(getuniquepatidlist(testtherapydf), imddf)
> #structure of the patient imd data.
> str(allpatientsdf, strict.width="wrap")
'data.frame': obs. of variables:
 $ patid: int ...
 $ score: int ...

the final example of ehr dataframe manipulation presented here demonstrates how to retrieve all prescription records for patients prescribed a specific treatment. for example, such an operation can be used to retrieve all prescription records for any patient prescribed amitriptyline. in addition, it is also possible to return only the prescription records matching specific treatments. importantly, prescription prodcodes can be grouped into lists and used to collect those patients with at least one record that matches an element of that list. this approach is useful if the dose is not relevant to the study or the prescription is dispensed under multiple product names.
> library(rdrugtrajectory)
> #it is easy to retrieve a list of all unique prodcodes in the cohort.
> prodcodesvector <- unique(testtherapydf$prodcode)
> reducedprodcodesvector <- prodcodesvector[ : ]
> #all records are maintained for those patients with a matching prodcode.
> therapyofinterestdf <- getpatientswithprodcode(testtherapydf,
+     reducedprodcodesvector)
> #only those records that match are retained.
> reducedtherapyofinterestdf <- getpatientswithprodcode(testtherapydf,
+     reducedprodcodesvector,
+     removeexcessdrugs=true)

. ehr drug prescription results and discussion

having briefly demonstrated some basic operations for retrieving patient records by matching ehr dataframes against sets of patid values, we move on to showcase several operations available to the user. we begin by presenting examples of cohort prescription summary statistics, followed by methods for curating datasets and stratifying by patient groups. we then present examples of how to search for patients prescribed a first-line treatment, followed by presenting some of these patient groups as sequences of prescriptions. finally, we demonstrate several examples of building time-lines. for further examples, please see the github page and reference manual.

. . cohort summary statistics

geteventdatesummarybypatient

rdrugtrajectory can return summary statistics on patient-level and cohort-level prescription data with geteventdatesummarybypatient and getpopulationdrugsummary, respectively.
for example, the prescription history of a single patient (selected via getuniquepatidlist and [] dataframe subsetting) returns the patient patid, the number of prescription events, the median number of days between events, the fewest days between events, the most days between events (maxtime and longestduration are the same), and the record duration (number of days between the first and last prescription event on record):

> library(rdrugtrajectory)
> idlist <- getuniquepatidlist(testtherapydf)
> resultlist <- geteventdatesummarybypatient(
+     testtherapydf[testtherapydf$patid==idlist[[ ]],])
> str(resultlist, strict.width="wrap")
list of
 $ timeserieslist: num [ : ]
 $ summarydf :'data.frame': obs. of variables:
 ..$ patid : int
 ..$ numberofevents : int
 ..$ mediantime : num
 ..$ mintime : num
 ..$ maxtime : num
 ..$ longestduration: num
 ..$ recordduration : int
 - attr(*, "class")= chr "eventdatesummaryobj"

getpopulationdrugsummary

this approach can be extended across the cohort of patients with getpopulationdrugsummary. the returned populationeventdatesummary s object is a list of three elements. the first element is the summarydf dataframe derived from calling geteventdatesummarybypatient per patient, with the set of statistics retrievable through the accompanying patid. the second element is the timeserieslist, which holds a vector per patient of the number of days between consecutive prescription events.
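as an illustration of the statistics described above, the following base r sketch computes the same kind of per-patient values on a small invented dataset. it is not the package implementation: the helper summarise_gaps and the toy data are assumptions made for this example, and only the column names patid and eventdate are taken from the text.

```r
# Toy prescription records (invented values); patid and eventdate mirror the
# column names used in the therapy dataframe described in the text.
therapy <- data.frame(
  patid     = c(1L, 1L, 1L, 1L, 2L, 2L),
  eventdate = as.Date(c("2001-01-01", "2001-01-11", "2001-02-10", "2001-02-15",
                        "2003-05-01", "2003-05-31"))
)

# Hypothetical helper (not part of rdrugtrajectory): summarise the gaps, in
# days, between consecutive prescription events of one patient.
# Assumes the patient has at least two events.
summarise_gaps <- function(df) {
  dates <- sort(df$eventdate)
  gaps  <- as.numeric(diff(dates))   # days between consecutive events
  list(
    timeseries = gaps,
    summary = data.frame(
      patid          = df$patid[1],
      numberofevents = length(dates),
      mediantime     = median(gaps),
      mintime        = min(gaps),
      maxtime        = max(gaps),
      recordduration = as.integer(max(dates) - min(dates))
    )
  )
}

# Apply the helper per patient; list elements are named by patid.
per_patient <- lapply(split(therapy, therapy$patid), summarise_gaps)
summary_df  <- do.call(rbind, lapply(per_patient, `[[`, "summary"))
gap_list    <- lapply(per_patient, `[[`, "timeseries")

print(summary_df)
print(gap_list[["1"]])
```

splitting by patid keeps the gap vectors in a list named after each patient identifier, which is the same access pattern the text describes for the timeserieslist element.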
vectors can be accessed using the patid element name:

> library(rdrugtrajectory)
> resultlist <- getpopulationdrugsummary(df = testtherapydf,
+     prodcodesvector = null)
> str(resultlist, strict.width="wrap", list.len = )
list of
 $ summarydf :'data.frame': obs. of variables:
 ..$ patid : int [ : ] ...
 ..$ numberofevents : int [ : ] ...
 ..$ mediantime : num [ : ] . ...
 ..$ mintime : num [ : ] ...
 ..$ maxtime : num [ : ] ...
 .. [list output truncated]
 $ timeserieslist:list of
 ..$ : num [ : ]
 ..$ : num [ : ] ...
 ..$ : num
 ..$ : num
 ..$ : num [ : ] ...
 .. [list output truncated]
 - attr(*, "class")= chr "populationeventdatesummary"
> #get all patids for patients younger than .
> ageidlist <- getuniquepatidlist(agegenderdf[agegenderdf$yob < ,])
> timeserieslist <- resultlist[[ ]]
> #get all patids of available data.
> recordpatids <- names(timeserieslist)
> #get time data for the intersect of those patids of patients < and the patids
> #of available data.
> subtimelist <- timeserieslist[intersect(ageidlist, recordpatids)]
> str(subtimelist, strict.width="wrap", list.len = )
list of
 $ : num
 $ : num
 $ : num
 $ : num
 $ : num
 [list output truncated]

. . curating drug prescription records

there is no direct link between a prescription event and a medcode in the cprd data. the relationship between the two can be inferred from the event dates of the prescription and clinical events, in addition to information provided by the consultation id and the prescription issue number.
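a minimal base r sketch of this kind of inference is given below, assuming two toy dataframes that share the patid and eventdate columns. the data values are invented, and joining on the event date alone is a simplification chosen for illustration; it is not presented as the method used inside the package.

```r
# Toy clinical and therapy records (invented values). Column names follow the
# cprd-style names used in the text: patid, eventdate, medcode, prodcode.
clinical <- data.frame(
  patid     = c(1L, 1L, 2L, 3L),
  eventdate = as.Date(c("2001-01-01", "2001-03-01", "2002-06-15", "2004-09-09")),
  medcode   = c(101L, 102L, 101L, 103L)
)
therapy <- data.frame(
  patid     = c(1L, 2L, 2L, 3L),
  eventdate = as.Date(c("2001-01-01", "2002-06-15", "2002-07-15", "2004-10-01")),
  prodcode  = c(501L, 502L, 502L, 503L)
)

# Pair each prescription with the clinical events recorded for the same
# patient on the same date. If a consultation identifier is present in both
# dataframes it could be added to `by` to tighten the match (an assumption).
same_day <- merge(therapy, clinical, by = c("patid", "eventdate"))

print(same_day)
```

in this toy cohort only patients 1 and 2 have a prescription and a clinical event on a shared date, so patient 3 drops out of the joined result.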
matchdrugwithdisease

rdrugtrajectory provides several methods for curating prescription datasets with the aim of establishing a relationship between prescription and clinical events. the matchdrugwithdisease function returns a subset of all prescription events with an established relationship between therapy and clinical event. how stringently patients are matched is controlled with a function argument. there are three scenarios: all patients with a record of a specific prescription event and a specific clinical event, at any point; all patients with a record of a specific prescription event on the same date as a specific clinical event; and all patients with a record of a specific prescription event on the same date as a specific clinical event and free from additional clinical events on that day. one would expect fewer patients as the stringency of the search criteria is increased:

> library(rdrugtrajectory)
> prodcodes <- unique(testtherapydf$prodcode)
> amitriptylinecodes <- prodcodes[ : ]
> propranololcodes <- prodcodes[ : ]
> medcodelist <- unique(testclinicaldf$medcode)
> headachecodes <- medcodelist[ : ]
> amitriptylineresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+     therapydf = testtherapydf,
+     medcodelist = headachecodes,
+     drugcodelist = amitriptylinecodes,
+     severity = )
> amitriptylineresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+     therapydf = testtherapydf,
+     medcodelist = headachecodes,
+     drugcodelist = amitriptylinecodes,
+     severity = )
> amitriptylineresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+     therapydf = testtherapydf,
+     medcodelist = headachecodes,
+     drugcodelist = amitriptylinecodes,
+     severity = )
> propranololresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+     therapydf = testtherapydf,
+     medcodelist = headachecodes,
+     drugcodelist = propranololcodes,
+     severity = )
> propranololresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+     therapydf = testtherapydf,
+     medcodelist = headachecodes,
+     drugcodelist = propranololcodes,
+     severity = )
> propranololresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+     therapydf = testtherapydf,
+     medcodelist = headachecodes,
+     drugcodelist = propranololcodes,
+     severity = )

getgenderofpatients

the example presented demonstrates how to identify patients prescribed amitriptyline and patients prescribed propranolol (there is patient overlap, easily controlled for by subsetting) whilst controlling for clinical overlap with or without consideration for off-topic clinical events.
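to make the three levels of stringency concrete, the sketch below reproduces them with base r on an invented cohort. the code values, the dataframes and the variable names are assumptions made for illustration; the sketch does not claim to mirror how matchdrugwithdisease is implemented or how its severity argument is encoded.

```r
# Invented codes and records, for illustration only.
drug_codes    <- c(501L)   # prescription of interest
disease_codes <- c(101L)   # clinical event of interest

clinical <- data.frame(
  patid     = c(1L, 2L, 2L, 3L, 4L),
  eventdate = as.Date(c("2001-03-01", "2002-06-15", "2002-06-15",
                        "2004-09-09", "2005-01-01")),
  medcode   = c(101L, 101L, 999L, 101L, 999L)
)
therapy <- data.frame(
  patid     = c(1L, 2L, 3L, 4L),
  eventdate = as.Date(c("2001-01-01", "2002-06-15", "2004-09-09", "2005-01-01")),
  prodcode  = c(501L, 501L, 501L, 501L)
)

drug_events    <- therapy[therapy$prodcode %in% drug_codes, ]
disease_events <- clinical[clinical$medcode %in% disease_codes, ]

# Scenario 1: drug and disease are both on record, at any point.
any_time <- intersect(drug_events$patid, disease_events$patid)

# Scenario 2: drug and disease are recorded on the same date.
same_day <- merge(drug_events, disease_events, by = c("patid", "eventdate"))

# Scenario 3: as scenario 2, with no other clinical event on that date.
other_events  <- clinical[!clinical$medcode %in% disease_codes, ]
clashes       <- merge(same_day, other_events, by = c("patid", "eventdate"))
same_day_only <- same_day[!paste(same_day$patid, same_day$eventdate) %in%
                            paste(clashes$patid, clashes$eventdate), ]

print(c(any_time = length(any_time),
        same_day = nrow(same_day),
        same_day_only = nrow(same_day_only)))
```

each scenario is a subset of the previous one, which is why the patient count can only stay the same or shrink as the criteria become stricter.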
with the identified patients, we can, for example, stratify by gender:

> library(rdrugtrajectory)
> library(ggplot )
> ami gender <- getgenderofpatients(amitriptylineresult , agegenderdf)
> ami gender <- getgenderofpatients(amitriptylineresult , agegenderdf)
> ami gender <- getgenderofpatients(amitriptylineresult , agegenderdf)
> prop gender <- getgenderofpatients(propranololresult , agegenderdf)
> prop gender <- getgenderofpatients(propranololresult , agegenderdf)
> prop gender <- getgenderofpatients(propranololresult , agegenderdf)
> amidf <- data.frame(freq=c(nrow(ami gender[ami gender$gender== , ]),
+     nrow(ami gender[ami gender$gender== , ]),
+     nrow(ami gender[ami gender$gender== , ]),
+     nrow(ami gender[ami gender$gender== , ]),
+     nrow(ami gender[ami gender$gender== , ]),
+     nrow(ami gender[ami gender$gender== , ])
+   ),
+   search=c("prescribed","with headache","no comorbidities",
+     "prescribed","with headache","no comorbidities"),
+   drug="amitriptyline",
+   gender=c("male","male","male",
+     "female","female","female")
+ )
> propdf <- data.frame(freq=c(nrow(prop gender[prop gender$gender== , ]),
+     nrow(prop gender[prop gender$gender== , ]),
+     nrow(prop gender[prop gender$gender== , ]),
+     nrow(prop gender[prop gender$gender== , ]),
+     nrow(prop gender[prop gender$gender== , ]),
+     nrow(prop gender[prop gender$gender== , ])
+   ),
+   search=c("at any time","with clinical","clinical & no comorbidities",
+     "at any time","with clinical","clinical & no comorbidities"),
+   drug="propranolol",
+   gender=c("male","male","male",
+     "female","female","female")
+ )
> drugprescriptiondf <- rbind(amidf, propdf)
> ggprescriptionami <- ggplot(drugprescriptiondf[
+     drugprescriptiondf$drug=="amitriptyline",],
+     aes(x=search, y=freq, fill=gender)) +
+   geom_bar(stat="identity", position=position_dodge()) +
+   theme_bw() + xlab("search criteria (severity)") + ylab("patient count") +
+   theme(axis.text.x = element_text(angle= ,hjust= )) +
+   ggtitle("amitriptyline")
> ggprescriptionprop <- ggplot(drugprescriptiondf[
+     drugprescriptiondf$drug=="propranolol",],
+     aes(x=search, y=freq, fill=gender)) +
+   geom_bar(stat="identity", position=position_dodge()) +
+   theme_bw() + xlab("search criteria (severity)") + ylab("patient count") +
+   theme(axis.text.x = element_text(angle= ,hjust= )) +
+   ggtitle("propranolol")

filtering through prescription events can also be controlled by a date range. for example, if one were calculating the number of patients prescribed amitriptyline per year from to and matched to a headache event, one could apply a date range:

> library(rdrugtrajectory)
> library(ggplot )
> prodcodes <- unique(testtherapydf$prodcode)
> amitriptylinecodes <- prodcodes[ : ]
> #clinical events of interest are headaches.
> medcodelist <- unique(testclinicaldf$medcode)
> #medcodes can be refined further.
> headachecodes <- medcodelist[ : ]
> #dataframes defined for binned dates are constructed by providing all the
> #patients to consider and the binned start and stop date.
> date df <- data.frame(patid=unlist(getuniquepatidlist(testtherapydf)),
+     start=as.date(as.character(" - - ")),
+     stop=as.date(as.character(" - - ")))

[figure: bar charts for panels (a) amitriptyline and (b) propranolol; x-axis: search criteria (severity); y-axis: patient count; fill: gender (female, male)]
figure : the number of patients prescribed (a) amitriptyline or (b) propranolol. the criteria used to match against clinical data are indicated: at any time, with a clinical record, and with a clinical record clear of off-topic clinical events.

> date df <- data.frame(patid=unlist(getuniquepatidlist(testtherapydf)),
+     start=as.date(as.character(" - - ")),
+     stop=as.date(as.character(" - - ")))
> date df <- data.frame(patid=unlist(getuniquepatidlist(testtherapydf)),
+     start=as.date(as.character(" - - ")),
+     stop=as.date(as.character(" - - ")))
> date df <- data.frame(patid=unlist(getuniquepatidlist(testtherapydf)),
+     start=as.date(as.character(" - - ")),
+     stop=as.date(as.character(" - - ")))
> date df <- data.frame(patid=unlist(getuniquepatidlist(testtherapydf)),
+     start=as.date(as.character(" - - ")),
+     stop=as.date(as.character(" - - ")))
> #retrieve prescription frequencies per binned range
> amitresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+     therapydf = testtherapydf,
+     medcodelist = headachecodes,
+     drugcodelist = amitriptylinecodes,
+     severity = ,
+     datedf = date df)
> amitresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+     therapydf = testtherapydf,
+     medcodelist = headachecodes,
+     drugcodelist = amitriptylinecodes,
+     severity = ,
+     datedf = date df)
> amitresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+     therapydf = testtherapydf,
+     medcodelist = headachecodes,
+     drugcodelist = amitriptylinecodes,
+     severity = ,
+     datedf = date df)
> amitresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+     therapydf = testtherapydf,
+     medcodelist = headachecodes,
+     drugcodelist = amitriptylinecodes,
+     severity = ,
+     datedf = date df)
> amitresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+     therapydf = testtherapydf,
+     medcodelist = headachecodes,
+     drugcodelist = amitriptylinecodes,
+     severity = ,
+     datedf = date df)
> #the number of patids returned by matchdrugwithdisease is equal to the number
> #of patients with a drug - disease match per year
> datadf <- data.frame(year=c(" "," "," "," "," "),
+     count=c(length(amitresult ),length(amitresult ),
+     length(amitresult ),length(amitresult ),
+     length(amitresult )))
> ggprescriptionyear <- ggplot(datadf, aes(x=year, y=count)) +
+   geom_bar(stat = "identity") + theme_bw()

getpatientswithfirstdrugwithdisease

unlike matchdrugwithdisease which retrieves patients with a prescription event matching clinical criteria at any time within a cprd ehr record, getpatientswithfirstdrugwithdisease identifies patients with a first prescription event that matches a
desired clinical event.

[figure: bar chart; x-axis: year; y-axis: count]
figure : the number of patients prescribed amitriptyline from the start of the year to the end of , stratified in year intervals.

please note, care must be taken when searching for medication with off-label uses. for example, beta-blockers are frequently prescribed to treat hypertension and arrhythmia; however, the beta-blocker propranolol is also prescribed to treat migraine. without in-depth analysis of the patient history, propranolol patients with records for hypertension or arrhythmia in addition to migraine on an eventdate matching the first propranolol prescription could result in a misleading disease-drug association. in cases where a health care professional suggests a change in the patient’s lifestyle choices, that patient may have several clinical events free from prescriptions before the first prescription of interest is prescribed. using basic subsetting one can calculate the number of clinical events before the patient’s first prescription intervention (figure a). furthermore, we can stratify patients into subgroups (figure b):

> library(rdrugtrajectory)
> library(ggplot )
> #a vector of prescriptions of interest.
> druglist <- unique(testtherapydf$prodcode)
> sampledrugs <- druglist[ : ]
> #a vector of clinical events to match prescriptions against.
> medcodes <- unique(testclinicaldf$medcode)
> samplemedcodes <- medcodes[ : ]
> #returns the subset of the first prescription event prescribed on the same
> #eventdate as those clinical events of interest
> firstdf <- getpatientswithfirstdrugwithdisease(clinicaldf = testclinicaldf,
+     therapydf = testtherapydf,
+     medcodesvector = samplemedcodes,
+     drugcodesvector = sampledrugs)
> #keep clinical data only for patients with an assumed first drug-disease match
> firstclinicaldf <- subset(testclinicaldf,
+     testclinicaldf$patid %in% getuniquepatidlist(firstdf))
> #only keep the diseases of interest
> firstclinicaldf <- subset(firstclinicaldf,
+     firstclinicaldf$medcode %in% samplemedcodes)
> #only keep the prescriptions of interest
> firstdf <- subset(firstdf, firstdf$prodcode %in% sampledrugs)
> idlist <- getuniquepatidlist(firstclinicaldf)
> beforeresultdf <- data.frame(patid=unlist(idlist), freq= )
> for(id in idlist) {
+   #retrieve the clinical/therapy data for each patient, one by one.
+   indclinicaldf <- subset(firstclinicaldf, firstclinicaldf$patid == id)
+   indtherapydf <- subset(firstdf, firstdf$patid == id)
+   #get the first event date on record; this will match a clinical date.
+   firsteventdate <- indtherapydf$eventdate[ ]
+   clinicalbeforetherapydf <- subset(indclinicaldf,
+       indclinicaldf$eventdate < firsteventdate)
+   #number of clinical complaints before first prescription.
+   ncomplaints <- nrow(clinicalbeforetherapydf)
+   beforeresultdf[beforeresultdf$patid==id,]$freq <- ncomplaints
+ }
> ggbefore <- ggplot(beforeresultdf, aes(x=freq)) +
+   geom_histogram(binwidth= , color="black", fill="white") +
+   ylab("patients") + xlab("clinical events before prescription") +
+   theme_bw()
> #note: not every patient will have a clinical imd score.
> imdidsdf <- getimdofpatients(idlist = idlist,
+     imddf = imddf)
> #only work with those with an imd score.
> imdresultsdf <- subset(beforeresultdf,
+     beforeresultdf$patid %in% getuniquepatidlist(imdidsdf))
> imdresultsdf <- imdresultsdf[order(imdresultsdf$patid),]
> imdidsdf <- imdidsdf[order(imdidsdf$patid),]
> imdresultsdf <- cbind(imdresultsdf, imd_score=as.factor(imdidsdf$score))
> ggbeforeimd <- ggplot(imdresultsdf,
+     aes(x=freq, fill=imd_score)) +
+   geom_histogram(binwidth= ) + theme_bw() +
+   ylab("patients") + xlab("clinical events before prescription")

[figure: histograms for panels (a) and (b); x-axis: clinical events before prescription; y-axis: patients; fill in panel (b): imd_score]
figure : the number of clinical events before the first treatment across the whole cohort (a), and by imd score (b).

getmultiprescriptionsamedaypatients

the function getmultiprescriptionsamedaypatients returns all prescription events for those patients prescribed more than two prescriptions on the same date. all events of those patients without a prescription prodcode event can be removed.
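the idea of flagging same-date prescriptions can be sketched in base r as follows. this is an illustrative assumption about the underlying logic rather than the package code: the toy data, the choice of "more than one distinct prodcode of interest on a date" as the threshold, and all variable names are invented for the example.

```r
# Toy therapy records (invented values).
therapy <- data.frame(
  patid     = c(1L, 1L, 1L, 2L, 2L, 3L),
  eventdate = as.Date(c("2001-01-01", "2001-01-01", "2001-02-01",
                        "2002-06-15", "2002-07-15", "2004-09-09")),
  prodcode  = c(501L, 502L, 501L, 501L, 502L, 503L)
)
codes_of_interest <- c(501L, 502L)

# Keep only prescriptions of interest; patient 3 has none and drops out.
of_interest <- therapy[therapy$prodcode %in% codes_of_interest, ]

# Count the distinct prodcodes of interest per patient and date.
key        <- paste(of_interest$patid, of_interest$eventdate)
n_codes    <- tapply(of_interest$prodcode, key, function(x) length(unique(x)))
multi_keys <- names(n_codes)[n_codes > 1]

# Patients with at least one date carrying several prescriptions of interest,
# and patients whose prescriptions of interest never share a date.
same_day_ids      <- unique(of_interest$patid[key %in% multi_keys])
different_day_ids <- setdiff(unique(of_interest$patid), same_day_ids)

print(same_day_ids)
print(different_day_ids)
```

keeping both identifier sets makes it easy to subset the therapy dataframe in either direction, depending on whether co-prescription is the pattern of interest or the pattern to exclude.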
combining getmultiprescriptionsamedaypatients with getpatientswithfirstdrugwithdisease or matchdrugwithdisease is useful for filtering patients for specific prescription patterns. for example, to retrieve all patient prescription records if specific prescriptions are (a) never recorded together on the same date and (b) are used as a first line treatment for a given complaint:

> library(rdrugtrajectory)
> prodcodesvector = unique(testtherapydf$prodcode)[ : ]
> #ensure only patients with specific prescriptions are returned providing a
> #patient is prescribed those drugs on different dates, never on the same date.
> uniquetherapydf <- getmultiprescriptionsamedaypatients(df = testtherapydf,
+     prodcodesvector = prodcodesvector,
+     removepatientswithoutdrugs = true)
> #ensure that the patients (patid) in the therapy and clinical dataframes
> #are the same. subsetting might not be enough.
> reducedclinicaldf <- subset(testclinicaldf,
+     testclinicaldf$patid %in% getuniquepatidlist(uniquetherapydf))
> #specific medcodes have not been provided. all medcodes in the clinical
> #dataframe are considered. this is appropriate if one is either not interested
> #in the nature of the clinical complaint or the clinical dataframe has been
> #adjusted to only include clinical complaints of interest.
> firstdf <- getpatientswithfirstdrugwithdisease(clinicaldf = reducedclinicaldf,
+     therapydf = uniquetherapydf,
+     drugcodesvector = sampledrugs)

in the above example, patients with more than one prescription on the same date or without a prescription at all (from the set of desired prescription prodcodes) were removed from the cohort. this reduced the number of patients from patients to . next, only those patients with a first line treatment (first prescription event on the same date as a clinical event) were kept, reducing the sample size to patients.

removepatientsbyduration

longitudinal ehr cohort studies often require careful time-related consideration. currently, rdrugtrajectory presents two functions that identify prescription records of patients that match two time constraints. the first, removepatientsbyduration, removes all patients with prescription events that are no more than n years between consecutive events or removes patients if the duration between the first and last prescription event on record is less than n years.

> library(rdrugtrajectory)
> df <- removepatientsbyduration(minobsyr = ,
+     minbreakyr = ,
+     therapydf = testtherapydf)

getburninpatients

the second time-related function, getburninpatients, identifies all patient prescription records with at least n days free from prescription events before a specific prescription event. this is useful if one requires a period of time free from prescription intervention before a given prescription event:

> library(rdrugtrajectory)
> drugofinterestvector <- c( , , , , , )
> patientlist <- getburninpatients(df = testtherapydf,
+     startcodesvector = drugofinterestvector,
+     perioddaysbefore = )
> burnintherapydf <- subset(testtherapydf,
+     testtherapydf$patid %in% patientlist)

in the above example, from a cohort of patients, patients had a period of up to days free from prescription events before the first prescription prodcode specified via the startcodesvector argument.
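the burn-in idea can be sketched with base r on a toy cohort, as below. this is an assumption-laden illustration and not the package implementation: the helper has_burn_in, the 365-day threshold and the data are invented, and the sketch simply requires an earlier prescription record followed by a sufficiently long gap before the first prescription of interest.

```r
# Toy therapy records (invented values); prodcode 501 is the drug of interest.
therapy <- data.frame(
  patid     = c(1L, 1L, 1L, 2L, 2L, 3L),
  eventdate = as.Date(c("2000-01-01", "2000-02-01", "2001-06-01",
                        "2000-01-01", "2000-01-20", "2000-03-01")),
  prodcode  = c(900L, 900L, 501L, 900L, 501L, 501L)
)
start_codes <- c(501L)
period_days <- 365   # illustrative burn-in length, in days

# Hypothetical helper: TRUE when the first prescription of interest is
# preceded by an earlier prescription and by at least period_days without
# any prescription event.
has_burn_in <- function(df) {
  df  <- df[order(df$eventdate), ]
  idx <- which(df$prodcode %in% start_codes)
  # No drug of interest, or no earlier record to define the burn-in period.
  if (length(idx) == 0 || idx[1] == 1) return(FALSE)
  gap <- as.numeric(df$eventdate[idx[1]] - df$eventdate[idx[1] - 1])
  gap >= period_days
}

flags       <- sapply(split(therapy, therapy$patid), has_burn_in)
burn_in_ids <- as.integer(names(which(flags)))

print(burn_in_ids)
```

patient 3 is excluded in this sketch because the drug of interest is the first event on record, so there is no earlier prescription from which to measure a prescription-free period.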
the functionality relies on the patient having prescription events before the burn-in period (required to define whether the patient had a cprd record early enough before the burn-in period began). for example, this patient had over three years of prescription events before the prescription of interest (from - - to - - ) with over days free from exposure before the prescription event of interest, prodcode :

> head(burnintherapydf[burnintherapydf$patid == ,], n= )
[ ] patid eventdate prodcode consid issueseq
< rows> (or -length row.names)

. . first drug prescriptions

getfirstdrugprescription

a patient’s first prescription event on cprd record can be identified by supplying getfirstdrugprescription with a list of prescription prodcodes. the function returns firstdrugobject, an r s object of type list. only the first prescription event to match any one of the prescription prodcodes provided is identified. the first element of firstdrugobject contains a named list of patid vectors. each vector contains the patids of all those patients that share the same first prescription prodcode. the list element is named after the corresponding prescription prodcode. the second element in firstdrugobject, like the first, is a list of date vectors, each named after the corresponding prescription prodcode. each date vector contains the eventdate of the prescription event for the patient identified by the patid in the identical position of the preceding list.
the third list element contains a table of prescription frequencies for each first prescription prodcode on record. the prodcode is accompanied by a product description, provided that a file of cprd prescription products has been supplied. below we demonstrate how to retrieve information on first-line treatment:

> library(rdrugtrajectory)
> library(ggplot )
> #an adjusted data dictionary file.
> filelocation <- "product.txt"
> #without supplying a vector of prodcodes all prodcodes in the therapy
> #dataset are considered.
> resultfdo <- getfirstdrugprescription(df = testtherapydf,
+     idlist = null,
+     prodcodesvector = null,
+     descriptionfile = filelocation)
> patidlist <- resultfdo[[ ]]
> eventdatelist <- resultfdo[[ ]]
> drugfrequencydf <- resultfdo[[ ]]
> drugfrequencydf <- drugfrequencydf[order(drugfrequencydf$frequency,
+     decreasing = true), ]
> ggfreq <- ggplot(data=drugfrequencydf, aes(x=description, y=frequency)) +
+   geom_bar(stat="identity") + theme_bw() +
+   theme(axis.text.x = element_text(angle= , hjust= )) +
+   xlab("drug product description")
> #the structure of the firstdrugobject.
> str(resultfdo, strict.width="wrap", list.len = )

[figure: bar chart; x-axis: drug product description, covering amitriptyline, atenolol, candesartan, lisinopril, propranolol, topiramate and venlafaxine products; y-axis: frequency]
figure : the frequency of first line treatment prescription.

getagegroupbyevents

in the next example we explore stratifying first-line prescription events by patient characteristics, such as age, gender, imd, and number of medcodes (for instance, by comorbidities) or prodcodes (for instance, to separate those patients by additional prescriptions), or by any additional clinical event retrieved using cprdlookups.r. rdrugtrajectory provides several utility functions to stratify patients (see reference manual for further information). the function getagegroupbyevents calculates the number of first-line prescription events by patient age. by specifying a set of patids and eventdates from the firstdrugobject, we can calculate the number of first-line prescriptions by age-group for patients linked with a specified medical condition:

> library(rdrugtrajectory)
> filelocation <- "product.txt"
/ rdrugtrajectory: analysing drug prescriptions in electronic health care records > resultfdo <- getfirstdrugprescription(df = testtherapydf, + idlist = null, + prodcodesvector = null, + descriptionfile = filelocation) > patidlist <- resultfdo[[ ]] > eventdatelist <- resultfdo[[ ]] > names(agegenderdf) <- c("patid","age","gender") > #the age-groups: [ , ), [ , ), [ , ), ..., [ , +). > agegroupvector <- c( , , , , , , , , ) > #cprd database release year. > ageatyear <- " " > agegrouplist <- getagegroupbyevents(idlist = as.list(patidlist[ : ]), + eventdatelist = eventdatelist[ : ], + agedf = agegenderdf, + agegroupvector = agegroupvector, + ageatyear = ageatyear) > agegrouplist [[ ]] - - - - - - - - + [[ ]] - - - - - - - - + in the above example, the age of each patient (agedf) was provided using year-of-birth calcu- lated against the release year of the cprd gold database (explained above). by providing the database release year (in ageatyear) and the first prescription eventdate (in eventdatelist), the age of each patient is adjusted against the prescription eventdate year. finally, by using a list slice on idlist and eventdatelist, (individual prescriptions can be specified using their prodcode, for example, eventdatelist$‘ ‘), first prescription prescriptions frequencies by age-group are retrievable (figure ). > library(ggplot ) > agegroupdrugdf <- data.frame(age=names(agegrouplist[[ ]]), + count=unlist(agegrouplist[[ ]]), + drug="amitriptyline mg") > ggamitriptyline <- ggplot(agegroupdrugdf, aes(x=age, y=count)) + + geom_bar(stat="identity") + + theme_bw() + ggtitle("amitriptyline mg") + + theme(axis.text.x = element_text(angle= , hjust= )) + + xlab("age-group") + ylab("frequency") .cc-by-nd . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . 
[figure: bar chart titled "amitriptyline mg"; x-axis: age-group, y-axis: frequency]
figure : the distribution of amitriptyline mg as a first-line treatment by age-group.

prescription sequences

mapdrugtrajectory

identifying patient prescription trajectories in longitudinal ehrs remains our biggest motivator behind the development of rdrugtrajectory. therefore, we developed mapdrugtrajectory to identify the chronological sequence of patient prescription events. we restrict the calculation to look only for prescription prodcodes supplied to groupinglist as a named list (named prodcode vectors). the required number of grouped-prescription events is set with mindepth, and the maximum number of those changes to display is set with maxdepth. by keeping mindepth and maxdepth the same, only patients with a valid number of prescription changes are displayed (figure (a) and (c)). patient records with fewer than mindepth changes to prescription sequences are ignored (figure (b)). for further information please refer to the reference manual.

figure : the distribution of grouped prodcodes across three patients. (a) five groups of valid prescription prodcodes, (b) only three groups, (c) five valid groups, in addition to prodcodes and which are ignored.

in the code below, mapdrugtrajectory returns patients with at least five grouped prescriptions. prodcodes that have not been grouped are ignored. duplication of prodcodes (those from the same group) does not count as a change in treatment:

> library(ggplot )
> library(ggalluvial)
> structurelist <- list(amitriptyline = c( , , ),
+ propranolol = c( , , ),
+ topiramate = c( ),
+ venlafaxine = c( , , ),
+ lisinopril = c( , , ),
+ atenolol = c( , , ),
+ candesartan = c( )
+ )
> resultlist <- mapdrugtrajectory(df = testtherapydf,
+ mindepth = ,
+ maxdepth = ,
+ groupinglist = structurelist,
+ removeundefinedcode = true)
> df <- resultlist[[ ]]
> ggswitch <- ggplot(df,
+ aes(y = freq, axis = firstdrug, axis = switch ,
+ axis = switch , axis = switch , axis = switch )) +
+ geom_alluvium(aes(fill = firstdrug), width = / ) +
+ geom_stratum(width = / , fill = "black", color = "grey") +
+ geom_label(stat = "stratum", infer.label = true) +
+ scale_fill_brewer(type = "qual", palette = "set ") +
+ theme_bw() + theme(legend.position = "none") +
+ scale_x_discrete(limits = c("first drug", " st switch", " nd switch",
+ " rd switch"," th switch"),
+ expand = c(. , . )) +
+ ggtitle("migraine preventative switching among patients")

[figure: alluvial plot titled "migraine preventative switching among patients"; x-axis: first drug, st switch, nd switch, rd switch, th switch; y-axis: freq; strata: amitriptyline, atenolol, candesartan, lisinopril, propranolol, topiramate, venlafaxine]
figure : prescription pattern switching of seven different migraine preventatives. a patient required a minimum of five changes in prescriptions (including the initial prescription) and, equally, the display was set to five changes in prescription.

prescription timeline construction

rdrugtrajectory contains several functions that transform patient data into a format compatible with mean cumulative function (mcf) semi-parametric estimates, prescription persistence, prescription incidence, and survival analysis.

generatemcfonegroup

prescription events are binned into weekly units to increase the statistical power at each time point. the user presents one group at a time, for example, all clinical events of male patients with a first-line prescription of amitriptyline for a migraine. the clinical data has already been refined using the steps for first-line prescription, as described above. the function generatemcfonegroup accepts a dataframe of events, the mcf start date (eventdates are adjusted so all patient records in the dataset begin at the same time), and the minimum number of events per patient (by default this is two events). the following example presents the calculation of first prescription events, the assignment of gender and the calculation of
mcf of prescription (therapy dataframe) burden of amitriptyline and propranolol:

> library(rdrugtrajectory)
> filelocation <- "product.txt"
> resultlist <- getfirstdrugprescription(df = testtherapydf,
+ idlist = null,
+ prodcodesvector = null,
+ descriptionfile = filelocation)
> patidlist <- resultlist[[ ]]
> eventdatelist <- resultlist[[ ]]
> drugfrequencydf <- resultlist[[ ]]
> drugfrequencydf <- drugfrequencydf[order(drugfrequencydf$frequency,
+ decreasing = true), ]
> amitriptylinepatid <- patidlist$` `
> propranololpatid <- patidlist$` `
> malecode <-
> malepatidsdf <- getgenderofpatients(idlist = getuniquepatidlist(testtherapydf),
+ genderdf = agegenderdf,
+ gendercodevector = malecode)
> amitriptylinemalepatids <- subset(amitriptylinepatid,
+ amitriptylinepatid %in% malepatidsdf$patid)
> propranololmalepatids <- subset(propranololpatid,
+ propranololpatid %in% malepatidsdf$patid)
> amimaletherapydf <- subset(testtherapydf,
+ testtherapydf$patid %in% amitriptylinemalepatids)
> propmaletherapydf <- subset(testtherapydf,
+ testtherapydf$patid %in% propranololmalepatids)
> amimalemcfdf <- generatemcfonegroup(therapydf = amimaletherapydf,
+ startdatecharvector = " - - ",
+ minrecords = )
> propmalemcfdf <- generatemcfonegroup(therapydf = propmaletherapydf,
+ startdatecharvector = " - - ",
+ minrecords = )
> amimalemcfdf <- cbind(amimalemcfdf, drug = "amitriptyline")
> propmalemcfdf <- cbind(propmalemcfdf, drug = "propranolol")
> drugmcfdf <- rbind(amimalemcfdf, propmalemcfdf)
> resultmcf <- reda::mcf(reda::recur(week, id, no.) ~ drug, data = drugmcfdf)
> mcfplot <- reda::plot(resultmcf, conf.int=true) +
+ ggplot ::xlab("weeks") + ggplot ::theme_bw() + ggplot ::ggtitle("")

[figure: mcf curves for amitriptyline and propranolol; x-axis: weeks, y-axis: mcf estimates, legend: drug]
figure : mcf of drug prescriptions of patients with a first drug prescription for either amitriptyline or propranolol, stratified by gender. the dotted lines indicate a % confidence interval.

getfirstdrugincidencerate

prescription incidence can be calculated with getfirstdrugincidencerate. the following code demonstrates how to use a firstdrugobject to calculate incidence rates for a set of prodcodes. the study observation starts from the enrollmentdate and ends at the studyenddate:

> library(rdrugtrajectory)
> filelocation <- "product.txt"
> druglist <- unique(testtherapydf$prodcode)
> requiredprods <- druglist[ : ]
> firstdrugobject <- getfirstdrugprescription(df = testtherapydf,
+ idlist = null,
+ prodcodesvector = requiredprods,
+ descriptionfile = filelocation)
> medhistorydf <- constructmedicalhistory(testclinicaldf, null, testtherapydf)
> patidlist <- unlist(firstdrugobject$patidlist)
> resultmatrix <- getfirstdrugincidencerate(firstdrugobject = firstdrugobject,
+ medhistorydf = medhistorydf,
+ enrollmentdate = as.date(" - - "),
+ studyenddate = as.date(" - - "))
> incidencedf <- as.data.frame(t(resultmatrix), stringsasfactors = true)

the above example returns an incidence rate of . per person years over a cohort of patients. for a detailed description please see the entry for getfirstdrugincidencerate in the reference manual.

getdrugpersistence

prescription persistence is calculated as the fraction of patients with a prescription for a specific treatment n days after the first prescription event. for example, to calculate the fraction of patients with a prescription -days after their first prescription, with a -day buffer either side, one specifies a duration of -days and a preceding buffer of -days (therefore capturing the range to , -days either side of one calendar year):

> library(rdrugtrajectory)
> patientlist <- getdrugpersistence(therapydf = testtherapydf,
+ idlist = null,
+ prodcodelist = null,
+ duration = ,
+ buffer = ,
+ endofrecorddate = " - - ")

of patient therapy records, patients had a prescription (+/- ) days after the first prescription event on record, resulting in a crude fraction of only . of patients. getdrugpersistence only observes events recorded precisely duration days after the first prescription. the buffer can be used to identify patients who received a prescription shortly after the end of the duration but, more importantly, to ensure that patients actively undergoing treatment (indicated by a prescription shortly before the desired duration days) are included. as the buffer is reduced, the fraction of prescription persistence is reduced, until the algorithm attempts to identify only patients with a prescription exactly duration days after the first prescription.
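the interplay between duration and buffer described above can be illustrated with a short python sketch. it is not the package's implementation: the function name persistence_fraction, the input layout (a dict mapping each patid to its prescription dates) and the window running from first event + duration - buffer to first event + duration are assumptions made for illustration, and the endofrecorddate argument is not modelled.

```python
from datetime import date, timedelta


def persistence_fraction(events, duration, buffer):
    """Fraction of patients with a prescription inside the persistence window.

    events   -- hypothetical layout: dict of patid -> list of prescription dates
    duration -- days after the first prescription at which persistence is checked
    buffer   -- days before `duration` that still count (assumed window shape)
    """
    eligible = 0
    persistent = 0
    for dates in events.values():
        if not dates:
            continue  # patients without any prescription are skipped
        eligible += 1
        first = min(dates)
        window_start = first + timedelta(days=duration - buffer)
        window_end = first + timedelta(days=duration)
        if any(window_start <= d <= window_end for d in dates):
            persistent += 1
    return persistent / eligible if eligible else 0.0


example_events = {
    1: [date(2000, 1, 1), date(2000, 12, 20)],  # refill late in the year
    2: [date(2000, 1, 1), date(2000, 3, 1)],    # stopped after two months
}
print(persistence_fraction(example_events, duration=365, buffer=30))
```

under these assumptions, shrinking the buffer narrows the window and can only lower the fraction, while a buffer equal to duration makes the window start at the first prescription itself, so every patient counts as persistent.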
future software updates will incorporate repeat prescription data to increase the accuracy of the calculation.

[figure: line plot; x-axis: buffer size (n days before ), y-axis: fraction of prescription persistence]
figure : the fraction of prescription persistence adjusted by a buffer number of days before a calendar year. as the buffer approaches the value of duration the fraction approaches .

closing remarks and future work

rdrugtrajectory is an r package which has the potential for exciting applications such as improving clinical decision-making, identifying possible new treatments and analysing outcomes from existing treatments. we have demonstrated several functions, some of which sort and match records whilst others perform fundamental statistical analysis. we used fabricated clinical and prescription dataframes, along with the age, gender and index of multiple deprivation score of each patient, and presented analyses of cohort-wide prescription patterns, first-line treatment distributions, how to stratify by patient characteristics, and some basic tools to assist longitudinal analysis of prescriptions.

the descriptions presented in this publication are not substitutes for the material in the reference manual. we recommend the reader consults the r ? help command or reference manual before running a function. in particular, functions related to the construction of timelines for survival analysis (time dependent/independent cox regression, kaplan meier survival curves and mean cumulative function) or a matrix for drug incidence rate require fine tuning of several parameters.

the latest release of rdrugtrajectory along with source code and reference manual is available for download from https://github.com/acnash/rdrugtrajectory. as active members of the scientific research community, we will continue to add new features to rdrugtrajectory while making necessary improvements to existing ones.

acknowledgements

oxford science innovation, nihr oxford biomedical research centre and nihr oxford health biomedical research centre (informatics and digital health theme, grant brc- - ). thanks to dr michelle hardy for assistance with the article.

affiliation:
nuffield department of clinical neurosciences
medical sciences division
university of oxford
oxford uk ox du
e-mail: anthony.nash@ndcn.ox.ac.uk

journal of statistical software http://www.jstatsoft.org/
published by the foundation for open access statistics http://www.foastat.org/
mmmmmm yyyy, volume vv, issue ii submitted: yyyy-mm-dd
doi: . /jss.v .i accepted: yyyy-mm-dd

partition quantitative assessment (pqa): a quantitative methodology to assess the embedded noise in clustered omics and systems biology data

camacho-hernández, diego a. †, nieto-caballero, victor e. †, león-burguete, josé e., and freyre-gonzález, julio a. *

regulatory systems biology research group, laboratory of systems and synthetic biology and undergraduate program in genomic sciences, center for genomic sciences, universidad nacional autónoma de méxico (unam), morelos, mexico.

† these authors contributed equally to this work.
* corresponding author: jfreyre@ccg.unam.mx

abstract: identifying groups that share common features among datasets through clustering analysis is a typical problem in many fields of science, particularly in post-omics and systems biology research. in this respect, quantifying how well a measure can cluster or organize intrinsic groups is important, since there is currently no statistical evaluation of how ordered the resulting clustered vector is, or of how much noise is embedded in it. much of the literature focuses on how well the clustering algorithm orders the data, using several external and internal statistical measures, but no measure has been developed to statistically quantify the noise in an arranged vector after a clustering algorithm has been applied, i.e., how much of the clustering is due to randomness. here, we present a quantitative methodology, based on autocorrelation, to assess this problem.

keywords: omics data; hierarchical clustering; noise quantification.

introduction

a common task in today's research is the identification of specific markers as predictors of a classification yielded by clustering analysis of the data. for instance, this approach is particularly useful after high-throughput experiments to compare gene expression or methylation profiles among different cell lines [ ]. this task is becoming increasingly useful in the nascent field of single-cell sequencing, where clustering cells is an important step towards further classification or serves as a qualifying metric of the sequencing process [ ]. in the widely used gene expression assays, the vector of profiles for each marker across different cell lines is ordered using hierarchical clustering algorithms. these algorithms yield a dendrogram and a heat map representing the vector of marker profiles, illustrating the arrangement of the clusters.
to assess how well the clustering segregates different cell lines, a class stating the desired partitioning of each cell line is provided a posteriori. then, a simple visual inspection of the vector of classes is used to estimate whether the clustering provides a good partition. this partition vector is colored according to the classification that each item is associated with, and similar items are expected to be contiguous, so the formed groups are assessed qualitatively against the biological background of each item. this procedure should not be confused with "supervised clustering", which provides a vector of classes stating the desired partitioning a priori. that vector is then used to guide the clustering algorithms by allowing them to learn the metric distances that optimize the partitioning [ ]. the procedure may also be confused with the metric assessment of clustering algorithms, especially with external cluster evaluation. for that purpose, various metrics, such as intrinsic and extrinsic measures, have been developed to qualify the clustering algorithm itself. these metrics are used for clustering algorithm validation. extrinsic validation compares the clustering to a goal to say whether it is a good clustering or not. internal validation compares the elements within the cluster and their differences [ ]. pqa involves characteristics of both kinds of validation, using both the crafted goal standard and the yielded signal itself (clustered vector).

[biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by . international license (http://creativecommons.org/licenses/by/ . /).]
however, pqa gathers these elements not to qualify the clustering algorithm itself but to quantify the noise embedded in the cluster; this noise may be due to the intrinsic metric or marker used to order the data set.

a possible caveat of the qualitative assessment discussed above is that humans tend to perceive meaningful patterns within random data, leading to a cognitive bias known as apophenia [ ]. while interpreting the partitions obtained from unsupervised clustering analysis, researchers attempt to visually assess how close the classifications are to each other, finding patterns that are not well supported by the data. this effect arises because the adjacency between items may give a notion of the dissimilarity distance in the dendrogram leaves. unfortunately, as far as we know, there is no method to quantitatively assess the quality of the groups of classifications from the clustering or, at least, there has been no attempt to quantify whether a certain configuration or order of the items may be due to randomness. this is a serious caveat, since the insertion of noise can lead to false conclusions or misleading results. furthermore, purging this noise can lead to more efficient descriptions of markers and their phenomena, accelerating the advance in many fields.

in statistics, serial correlation (sc) is a term used to describe the relationship between observations of the same variable over specific periods. it was originally used in engineering to determine how a signal, for instance, a radio wave, varies with itself over time. later, sc was adapted to econometrics to analyze economic data over time, principally to predict stock prices and, in other fields, to model independent random variables [ ]. we applied sc to propose a way to quantify how well a posterior classification is grouped, using only the results of unsupervised clustering analysis.
thus, we propose a novel relative score, pqa, to remove the subjectivity of visual inspection and to statistically quantify how much noise is embedded in the results of clustering analysis.

methodology

assigning numeric labels to classifications

a vector denoting the putative similarities among the variables in a study is usually obtained after a clustering analysis. each variable is classified to generate a vector of profiles (vp). such a vector of classifications is usually translated into a vector of colors, in which each color represents a classification. it is common to inspect this vector to find groups that make sense according to the analyzed data. for the method presented in this work, the vp may be as simple as a vector of strings or numbers that represent the input.

figure . the pipeline of the pqa methodology.

whatever the representation of the classifications may be, it is necessary to transform the classifications into a vector of numeric labels, in which a number represents a classification, to be able to calculate sc. to accomplish this, we assign the first numeric label (number ) to the first item in the vector, which usually lies at one of the vector's extremes. then, if the classification of the next item is different from the previous one, the next number in the sequence is assigned, and so on. this way of labeling ensures that the changes in the sc values are due to the order of numbers, that is to say, the grouping of the classifications resulting from the clustering, and are not an artifact of the labeling itself (figure ).
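the labeling step can be sketched in a few lines of python. the function name assign_numeric_labels is hypothetical, and the sketch assumes that a classification which reappears later in the vector reuses the number it received at its first appearance; the text only states that a new number is assigned when the classification changes, so this reading is an assumption.

```python
def assign_numeric_labels(classifications):
    """Map each classification to an integer in order of first appearance.

    Assumption of this sketch: a classification seen again later keeps the
    number it was given the first time it appeared in the vector.
    """
    label_of = {}
    numeric = []
    for item in classifications:
        if item not in label_of:
            label_of[item] = len(label_of) + 1  # next number in the sequence
        numeric.append(label_of[item])
    return numeric


vp = ["tumor", "tumor", "normal", "tumor", "cell line"]
print(assign_numeric_labels(vp))
```

because numbers follow the order in which classifications are first met, a perfectly grouped vector always becomes a non-decreasing sequence, whatever names or colors the classifications had.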
pqa score

because the order of the vp could be interpreted as the grouping of the classifications, we measure how well the same classifications are held together in the vp through an sc shifted by one position. this correlation is defined as the pearson product-moment correlation between the vp discarding the first item and the vp discarding the last (equation , where $x_i$ is the item at the i-th position of the ordered vector, $n$ is the length of $x$, and $\rho_x$ is the resulting sc).

$$\rho_x = \frac{\sum_{i=1}^{n-1}\left(x_i - \frac{1}{n-1}\sum_{j=1}^{n-1} x_j\right)\left(x_{i+1} - \frac{1}{n-1}\sum_{j=2}^{n} x_j\right)}{\sqrt{\sum_{i=1}^{n-1}\left(x_i - \frac{1}{n-1}\sum_{j=1}^{n-1} x_j\right)^2}\,\sqrt{\sum_{i=2}^{n}\left(x_i - \frac{1}{n-1}\sum_{j=2}^{n} x_j\right)^2}}$$

we then define the pqa as the sc of the vp after removing background noise, normalized by the sc of the percent grouping partitions (defined as the sorted vector in ascending order). thus, the more similar the vp is to its sorted vector, the higher the score (equation , where $\rho_x$ is the sc of the vp, $\overline{\rho_{Rand_x}}$ is the mean of the sc of one thousand randomizations, and $\rho_{Perfect_x}$ is the sc of the sorted vector in ascending order).

$$PQA_x = \frac{\rho_x - \overline{\rho_{Rand_x}}}{\rho_{Perfect_x}}$$

background-noise correlation factor in the pqa score

to compute the background-noise correlation factor in the pqa score definition, we randomly sample the indexes of the vp and swap the corresponding items. this background correction aims to remove inherent noise in the data, even though the score may still be subject to noise from the chosen clustering algorithm or discrepancies in the posterior classification.
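a minimal python sketch of the score defined in this section follows. the function names, the fixed random seed and the convention of returning 0.0 when a shifted vector is constant are assumptions of the sketch, not part of the published method; the one thousand randomizations follow the text.

```python
import math
import random


def lag1_serial_correlation(x):
    """Pearson correlation between x without its last item and x without its first."""
    a, b = x[:-1], x[1:]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # sketch convention: undefined correlation treated as zero
    return cov / math.sqrt(var_a * var_b)


def pqa_score(vp, n_random=1000, seed=0):
    """(sc of vp - mean sc of shuffled vp) / sc of the ascending-sorted vp."""
    rng = random.Random(seed)
    rho_x = lag1_serial_correlation(vp)
    rho_perfect = lag1_serial_correlation(sorted(vp))
    if rho_perfect == 0:
        raise ValueError("sorted vector has no serial correlation to normalize by")
    total = 0.0
    for _ in range(n_random):
        shuffled = list(vp)
        rng.shuffle(shuffled)  # background noise: random reordering of the vp
        total += lag1_serial_correlation(shuffled)
    rho_rand = total / n_random
    return (rho_x - rho_rand) / rho_perfect


ordered = [1] * 4 + [2] * 4 + [3] * 4
interleaved = [1, 2, 3] * 4
print(pqa_score(ordered), pqa_score(interleaved))
```

in this sketch a perfectly grouped vector scores close to one, while a vector whose classifications alternate at every position scores below zero, because its serial correlation is lower than that of a typical random ordering.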
statistical significance of the pqa score

to quantify the statistical significance of the pqa score, we calculate a z-score (equation ),

$$z_x = \frac{PQA_x - \overline{PQA_{Rand}}}{SD_{PQA_{Rand}}}$$

where $PQA_x$ is the pqa score of the vp, $\overline{PQA_{Rand}}$ is the mean of the pqa scores of one thousand randomizations of the vp, and $SD_{PQA_{Rand}}$ is their standard deviation. these randomizations have the purpose of generating a solid random background to compare with the real signal. the number of randomizations does not depend on the size of the vp. it is worth noting that there are two randomization processes: one generates the input population of random vectors whose pqa scores are used to calculate the z-score, and the other represents the noise in equation .

defining noise proportions

to provide a quantification of the embedded noise in the vp, we calculate the z-scores from the distribution of pqa values of the randomized vectors. the shuffling is done by scrambling the vector. this z-score is then interpolated to retrieve the estimated noise in the vp cluster.

effect of the length and number of partitions of the vector in the z-score distributions

since we want to compare the pqa with the noise, we randomized the vp times. we opted to describe the dynamics of the z-score given different percentages of noise and numbers of partitions. for this, we synthetically crafted vectors varying both in length, ranging from to elements, and in the number of classifications. the z-scores were retrieved from the crafted vectors using the formulas described above.

results and discussion

effects of permuted numeric labels on the partition

we wondered whether our way of assigning numeric labels alters the sc calculations as little as possible, so we analyzed how the sc changes over synthetic partitions with permuted labels. we began by generating synthetic partitions in ascending and descending order, increasing both the number of classifications and the number of items, up to .
it is important to highlight that the number of items belonging to each classification was kept constant. because trying all possible permutations of each vector would be infeasible, we created a subset of permutations of each vector and then calculated the mean sc (figure , see methodology). we observed that the mean sc was high when the number of items in the vp was greater than or equal to times the number of classifications; nevertheless, we obtained the highest sc when the numeric labels were assigned in sequential order, either ascending or descending (figure ).

figure . z-scores of the pqa scores from partitions varying in the number of classifications and the length of the partition.

length of partitions as a proxy of the number of classifications we wondered whether the number of classifications and the length of the vp change the statistical significance of the pqa score, because the fewer the items in the vp, the greater the chance of grouping the items in any given order. we tested this effect by calculating a z-score from ordered synthetic partitions, increasing both the number of classifications and the number of items up to . we also kept the number of classifications constant for the sake of this analysis. we noticed that only the length of the partition has a true effect on the z-score; the number of classifications does not. we observed that every partition shorter than could be considered pure noise; however, we consider a z-score cutoff of greater than (p-value of . ).
we also observed z-score values still greater than with a length of , , and , but lower than with lengths between and (figure ). if we were more flexible, we could have set a length cutoff on those values without losing statistical significance, since a z-score of corresponds roughly to a p-value of . . the results of this analysis were expected by intuition because the probability of an item to occupy a position in the vp increases as the number of items does.

proof of concept: quantifying real noise after a literature review, we noticed that some datasets were assessed only by visual inspection in their respective papers, so we applied our method to quantify the proportion of noise embedded in those datasets and to test whether they may lead to apophenia. we chose two datasets from the literature for two main reasons: first, the data should have a number of items well above our z-score significance threshold (> ); second, we wanted contrasting orderings of the partitions, with one dataset that looks very disordered and another that looks somewhat ordered, to compare the noise proportions. lastly, we assessed the behavior of the metric in highly ordered data. this also matches our threshold mentioned above.

cancer methylation signatures the first dataset consists of methylation profiles of different cancerous and non-cancerous samples [ ] (figure ). the classifications look very sparse, and the groups are torn apart into many subgroups distributed along the data’s vp. we detected . % of noise and a pqa score of . (figure , with a z-score of . and a p-value of . x - ); both numbers imply that even though there may be disorder in the vp, there is neither a very high noise proportion nor a high pqa score. these results suggest that, as with any other statistical test, the larger the number of items in the partition, the more diluted the effect of disorder in the vp; they also lead to greater statistical significance, as shown in the analysis of the number of items and classifications. although the authors concluded that their clustering results made sense given their molecular and biological background, as well as the perspectives about the analyzed profiles, they assessed the grouping only by visual inspection and concluded that it was well done. however, understanding the noise in the cluster can help to pursue better markers, since it could help to narrow the search space in these kinds of studies.

(a) (b) figure . visual representation of clustered data used to assess the method. (a) dataset from jie shen et al. (b) dataset from toyooka et al.

distribution of micrornas in cancer the second dataset consists of expression profiles of micrornas from three classes of samples: invasive breast cancer, ductal carcinoma in situ (dcis), and healthy controls (figure ) [ ]. the authors visually identified three clusters, though selecting the right cutting height threshold is difficult. besides, one of the clusters is a mix of classes in different proportions, leading the authors to conclude, arguably, that the dcis and control sample profiles are not different. on this matter, the pqa score and the proportion of noise are . and . %, respectively (figure , with a z-score of . and a p-value of . x - ), providing a quantitative assay to support the grouping that the authors claimed.
furthermore, in comparison with the methylation profiles discussed above, we can appreciate that a partition that appears even less fuzzy has an even higher noise ratio, supporting the idea that visual inspection can lead to misleading results.

(a) (b) figure . z-score distribution by percentage of randomized items. (a) dataset from jie shen et al. (b) dataset from toyooka et al. the red dots represent the z-score interpolation of the corresponding data sets.

comparison of genetic regulatory networks with theoretical models finally, to assess the pqa methodology using systems biology data, we clustered networks according to their pairwise dissimilarity [ ]. first, curated biological networks were retrieved from abasy atlas (v . ) [ ]. for each biological network, we then constructed four networks, each according to a theoretical model (barabasi-albert, erdos-renyi, scale-free, and hierarchical-modular). we estimated the parameters of each theoretical model from the properties of the corresponding biological network. the models used reproduce one or more intrinsic characteristics of the biological networks, such as a power-law distribution, hubs, scale-free degrees, and hierarchical modular structure [ ]. visual inspection suggested that the classification yielded a highly ordered vp, distinguishing the networks according to their nature (figure ). the pqa score for this vp is . (p-value = . x - , z-score = . ) and the proportion of noise was . % (figure ).
in contrast to the previous examples, here we obtained a highly ordered clustering and a very low proportion of noise, which suggests that although the models recapitulate some of the properties of genetic regulatory networks, none of them alone is sufficient to capture their structural properties.

figure . cluster analysis of distance among gene regulatory networks and theoretical network models. the abbreviations and colors used in the posterior classification are as follows: barabasi-albert (ba, red), erdos-renyi (er, blue), scale-free (sf, green), hierarchical modularity (hm, purple), and biological networks (bi, orange).

figure . z-score distribution by percentage of randomized items of vp from genetic regulatory networks. the red dot represents the z-score interpolation of the actual data set.

conclusions in this work, we presented a novel method to quantify the proportion of noise embedded in the grouping of associated classes of the elements in hierarchical clustering. we proposed a relative score derived from an sc of the vp from the dendrogram of any clustering analysis and calculated z-statistics as well as an extrapolation to deliver an estimation of noise in the vp. we explained how the method is formulated and showed the tests we made to systematically refine it. we additionally provided a proof of concept using clustering data from two works that we think clearly represent overfitting by apophenia. we also added an example from network biology where clustered networks are separated by intrinsic characteristics.
although in this work we focused on examples where hierarchical clustering is performed, this framework can apply to any partition algorithm in which the elements are identified and a vector of the order can be acquired. we concluded that the clustered sets of biological data have a high measure of noise, despite looking well grouped. we showed what minimum number of classifications should be considered in this sort of clustering analysis to achieve a significant reduction of noise. on the other hand, we permuted the labels of the associated classes and concluded that the effect is negligible. we showed that randomness still plays an important role by biasing the results, though it may not be evident through visual inspection. the pqa could be used as a benchmark to test which clustering algorithm is appropriate for the analyzed dataset by minimizing the noise proportion, and to guide omics experimental designs. nevertheless, a word of caution: the pqa score alone can be subject to subjectivity if not used properly, since it depends on the characteristics of the analyzed data. thus, the pqa score is meant to be considered as a quantification of noise in clustered data and should be used with discretion.

author contributions: conceptualization, j.a.f.g.; methodology, j.a.f.g.; software, d.a.c.h., v.e.n.c., and j.a.f.g.; validation, d.a.c.h., v.e.n.c., and j.a.f.g.; formal analysis, d.a.c.h., v.e.n.c., and j.a.f.g.; investigation, d.a.c.h., v.e.n.c., j.r.l.b., and j.a.f.g.; resources, j.a.f.g.; data curation, d.a.c.h., v.e.n.c., and j.e.l.b.; writing - original draft preparation, d.a.c.h., v.e.n.c., j.e.l.b., and j.a.f.g.; writing - review and editing, d.a.c.h., v.e.n.c., and j.a.f.g.; visualization, d.a.c.h., v.e.n.c., j.e.l.b., and j.a.f.g.; supervision, j.a.f.g.; project administration, j.a.f.g.; funding acquisition, j.a.f.g. all authors have read and agreed to the published version of the manuscript.

funding: this work was supported by the programa de apoyo a proyectos de investigación e innovación tecnológica (papiit-unam) [in to j.a.f.g.].

conflicts of interest: the authors declare no conflict of interest.

references . kang, s., kim, b., park, s.-b., et al. . stage-specific methylome screen identifies that nefl is downregulated by promoter hypermethylation in breast cancer. international journal of oncology ( ), pp. – , doi: . /ijo. . . . kiselev, v. y., andrews, t. s., & hemberg, m. ( ). challenges in unsupervised clustering of single-cell rna-seq data. nature reviews genetics, ( ), - , doi: . /s - - - . . al-harbi, s.h. and rayward-smith, v.j. . adapting k-means for supervised clustering. applied intelligence ( ), pp. – , doi: . /s - - - . . hassani, m., & seidl, t. ( ). using internal evaluation measures to validate the quality of diverse stream clustering algorithms. vietnam journal of computer science, ( ), - , doi: . /s - - - . . fyfe, s., williams, c., mason, o.j. and pickup, g.j. . apophenia, theory of mind and schizotypy: perceiving meaning and intentionality in randomness. cortex ( ), pp. – , doi: . /j.cortex. . . . . getmansky, m., lo, a.w. and makarov, i. . an econometric model of serial correlation and illiquidity in hedge fund returns. journal of financial economics ( ), pp. – , doi: . /j.jfineco. . . . . shen, j., hu, q., schrauder, m., et al. . circulating mir- b and mir- a as biomarkers for breast cancer detection. oncotarget ( ), pp. – , doi: . /oncotarget. . . toyooka, s., toyooka, k. o., maruyama, r., virmani, a. k., girard, l., miyajima, k., ... & brambilla, e. ( ).
dna methylation profiles of lung tumors. molecular cancer therapeutics, ( ), - . . schieber, t. a., carpi, l., díaz-guilera, a., pardalos, p. m., masoller, c., & ravetti, m. g. ( ). quantification of network structural dissimilarities. nature communications, ( ), - . . escorcia-rodríguez, j. m., tauch, a., & freyre-gonzález, j. a. ( ). abasy atlas v . : the most comprehensive and up-to-date inventory of meta-curated, historical, bacterial regulatory networks, their completeness and system-level characterization. computational and structural biotechnology journal, doi: . /j.csbj. . . . . barabasi, a. l., & oltvai, z. n. ( ). network biology: understanding the cell's functional organization. nature reviews genetics, ( ), - , doi: . /nrg .

linus: conveniently explore, share, and present large-scale biological trajectory data from a web browser.
authors: johannes waschke , , mario hlawitschka , kerim anlas , vikas trivedi , , ingo roeder , , jan huisken , and nico scherf , * max planck institute for human cognitive and brain sciences, stephanstr. a, leipzig, germany faculty of computer science and media, leipzig university of applied sciences, leipzig, germany embl barcelona, c/ dr. aiguader , barcelona, spain. embl heidelberg, developmental biology unit, heidelberg, germany. national center of tumor diseases (nct), partner site dresden, dresden, germany institute for medical informatics and biometry, carl gustav carus faculty of medicine, school of medicine, tu dresden, dresden, germany morgridge institute for research, madison, wisconsin , usa * correspondence: to nscherf@cbs.mpg.de abstract in biology, we are often confronted with information-rich, large-scale trajectory data, but exploring and communicating patterns in such data is often a cumbersome task. ideally, the data should be wrapped with an interactive visualisation in one concise package that makes it straightforward to create and test hypotheses collaboratively. to address these challenges, we have developed a tool, linus, which makes the process of exploring and sharing d trajectories as easy as browsing a website. we provide a python script that reads trajectory data and enriches them with additional features, such as edge bundling or custom axes and generates an interactive web-based visualisation that can be shared offline and online. the goal of linus is to facilitate the collaborative discovery of patterns in complex trajectory data. .cc-by . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint mailto:nico.scherf@tu-dresden.de https://doi.org/ . / . . . 
http://creativecommons.org/licenses/by/ . / introduction in biology, we often face large-scale trajectory data from dense spatial pathways, such as the brain connectivity obtained from diffusion mri imaging (liu et al., ), or tracking data such as cell trajectories or animal trails (romero-ferrero et al., ) . although this type of data is becoming increasingly prominent in biomedical research (kwok, ; mcdole et al., ; wallingford, ), exploring, sharing, and communicating patterns in such data are often cumbersome tasks requiring a set of different software that are often complex to install, learn and use. recently, new tools have become available for efficiently visualising d volumetric data (pietzsch et al., ; royer et al., ; schmid et al., ), and some of those allow the user to overlay tracking data to cross-check the quality of the results or to visualise simple predefined features (such as speed or time). however, given the more general-purpose design of such software, these are not ideal solutions to efficiently and collaboratively explore and share the visualisations. an interactive, scriptable, and easily shareable visualisation (shneiderman ) would open up novel ways of communicating and discussing experimental results and findings (callaway ). the analysis of complex and large-scale trajectory data and the creation and testing of hypotheses could then be done collaboratively. importantly, since such bioinformatics tools would be right at the interface of computational and life sciences, they need to be accessible and usable for scientists with little or no background in programming. ideally, the data should be bundled with a guided, interactive presentation in one concise visualisation packet that can be passed to a collaborator. to address these challenges, we have developed our visualisation tool linus, making it easier to explore d trajectory data from any device without a local installation of specialised software. 
linus creates interactive visualisation packets that can be explored in a web browser, while keeping data presentation straightforward and shareable, both offline and online (fig a). we began to develop this tool when we struggled to find adequate software to explore cell trajectories during zebrafish gastrulation from large-scale fluorescence microscopy datasets (shah et al., ) . linus allowed us now to interactively visualise and analyse the tracks of around . cells (starting number) as they moved across the zebrafish embryo throughout hrs. more importantly, it enabled us to share and discuss visualisations with collaborators across disciplines. results and discussion linus is a python-based tool that is easy to install and use for scientists at the interface between disciplines. our overall goal when developing linus was to create a versatile and lightweight visualisation tool that runs on a wide range of devices. to this end, we based the visualisation part on web technologies. specifically, we used typescript, which compiles to javascript and webgl. however, a core component of the visualisation process, the data preparation, requires local file access and fast computations, both of which are limited in javascript. for that reason, we also created a python (> v . ) script that handles the computationally demanding parts of data processing and automatically generates the web-based visualisation packages. creating a visualisation package with linus is done in a few simple steps (fig. a): the user imports trajectory data from a generic, plain csv format (see methods) or from a variety of established trajectory formats such as svf (mcdole et al., ), tgmm xml (amat et al., ) , or the community standard biotracks (gonzalez-beltran et al., ), which itself supports import from a wide variety of cell tracking tools such as cellprofiler (mcquin et al., ) or trackmate (tinevez et al., ) . 
during the data conversion, linus can enrich the trajectory data with additional attributes or spatial context. for example, we declutter dense trajectories by highlighting the major “highways” through edge bundling (fig. b). linus can automatically add generic attributes that are useful in a range of applications, such as the local angle of the trajectories or a timestamp. the user can add custom numerical attributes for specific applications by providing these measurements as extra columns in the csv files (see methods). the data attributes form the basis for advanced rendering effects. if users want to give a spatial context, linus can generate axes automatically, or users can define custom axes. for more efficient computing, the preprocessing script uses established and optimised packages from python’s rich ecosystem, like numpy and (py)opencl. in particular, the edge bundling algorithm runs highly parallel on the graphics card and is thus about - times faster than a cpu-based calculation (with opencl-enabled hardware). however, only the creator of a linus-based visualisation package needs to run this preprocessor script. the target audience requires only a web browser to view and explore the data. the result of the preprocessing is a ready-to-use visualisation package that can be opened in a web browser on any device with webgl support. the package is a folder containing html, javascript, and related files.
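as an illustration of the generic attributes mentioned above, the following minimal sketch computes a per-point angle for a single trajectory with numpy. the function name `local_turning_angles` is hypothetical, and the definition of the "local angle" as the angle between consecutive segments is an assumption made for illustration; the definition used by linus itself may differ.

```python
import numpy as np


def local_turning_angles(points):
    """Return one angle (in radians) per point of an (m, 3) trajectory.

    Assumption for illustration: the angle at an interior point is the angle
    between the incoming and the outgoing segment; the two end points get 0.0
    because they have only one neighbouring segment.
    """
    p = np.asarray(points, dtype=float)
    seg = np.diff(p, axis=0)                  # (m-1, 3) segment vectors
    norms = np.linalg.norm(seg, axis=1)
    norms[norms == 0.0] = 1.0                 # repeated points: avoid division by zero
    unit = seg / norms[:, None]
    # Cosine of the angle between consecutive segments, clipped for arccos.
    cos = np.clip(np.sum(unit[:-1] * unit[1:], axis=1), -1.0, 1.0)
    angles = np.zeros(len(p))
    angles[1:-1] = np.arccos(cos)
    return angles
```

the resulting vector has one value per point, so it could be written as an extra column next to the coordinates, in the same way as any other custom numerical attribute.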
interactive visualisation with configurable filters allows in-depth data exploration for a variety of applications across sciences. after configuring and creating the visualisation package with the python toolkit, further adjustments are possible within the web browser. opening the index.html file starts the visualisation and shows the trajectories with baseline render settings (semi-transparent, single-coloured rendering on a grey background). the browser renders an interactive visualisation of the trajectories and an interface for the user to update and adapt the visualisation to their needs (e.g. colour scales, projections, clipping planes) (fig. b). the user interface itself is adapted to each dataset: the preprocessing script generates a separate property and the corresponding slider (filters and colour mapping) for each given data attribute in the user interface. if more than one state is available for the dataset (e.g. an edge bundled copy of the data, or custom projections), the interface automatically offers the functionality to fade between the states (see methods). the user can carve out patterns from the original “hairball” of lines by setting general visualisation parameters like shading and colour maps (fig. a).
to focus on particular parts of the dataset, the user filters the data for the various attributes such as specific time intervals or user-specified numerical properties such as marker expression in cell tracking (fig. b). alternatively, the user can select spatial regions of interest (rois) either with cutting planes or with progressively refinable selections (fig. c). the visual attributes can then be separately defined for the selected in-focus areas and the (non-selected) context regions (fig. c) to create a focused visualization. apart from the purpose of qualitative visualization, the selected trajectories can also be downloaded as csv files for subsequent quantitative analysis (see methods). one important problem with large-scale trajectory data is the sheer density of tracks that often leads to extreme visual clutter. to tackle this problem, one prominent feature of linus is the ability to blend between different data transformations seamlessly. we provide two main sorts of transformations out-of-the-box: the user can smoothly transition between original and bundled state to focus on major “highways” (fig. d, fig. b), or between original ( d cartesian) view and different d projections (e.g. a mercator map) to provide a global, less cluttered perspective on the trajectories (fig. e,f). if other, application-specific transformations are needed, such as a spatial transformation or any form of trajectory clustering, the user can provide such an alternative state during preprocessing and then interactively blend between those states. however, the choice of a web-based visualisation solution brings some drawbacks. the amount of data that can be fluently visualised depends on the underlying hardware (smartphones: > , trajectories, notebooks, and desktop computers: > , trajectories). 
another limitation is the reduced feature set which common web browsers offer regarding graphics card access: compared to the api of opengl, the browser-based webgl api offers fewer shader features. these restrictions lead to some limitations for the rendering process. a drawback of our rendering approach is that it creates artifacts related to the rendering order when we rotate the camera. thus, we have to order the line fragments offline (i.e. not on the graphics card, but in javascript), which is a time-consuming process. to maintain high framerates, we only sort line fragments within a second after a user interaction has finished, leading to artifacts during camera motions (see methods). furthermore, we cannot provide correct render order when rendering two datasets in the same view, and thus linus works best when only rendering one dataset at once. data and visualisations are easily shareable with collaborators via interactive visualisation packets. as a straightforward solution to share the results, the user directly exports the visualisations from the webview as static images and videos (e.g. such as supplementary video ). but sharing the visualisation of the data can go a step beyond image or video data. the user can conveniently record all these visualisation properties directly in the web-interface of linus to create information-rich, interactive tours. the user adjusts these tours on a detailed level using a timeline-based editor (supplementary fig. ). an icon represents each action that can be moved along the time axis to develop a visual storyline. smooth transitions and textual markers that can be precisely timed, facilitate understanding and storytelling. to communicate and distribute new findings, these tours can easily be shared online or offline with the community (colleagues, readers of a manuscript, audience of a real or virtual presentation). 
the tours are copied into the source code of the visualisation package or, if they consist of a limited number of actions (see methods for details), they can be shared by a dynamically created url or a qr code. fig. shows examples of visualisations that have been created with linus, ranging from dynamic trajectories in d (fig. a) or on surfaces (fig. b) to static (fig. c) or dynamic d (fig. d) tracks, across applications from ethology, neuroscience, and developmental biology. an interactive version of each example can be found online by scanning the respective qr codes in the figure.

we tested linus visualisation packages across various devices and found that performance is the aspect of the user experience that varies most between devices. desktop computers with mid-range graphics cards (e.g. the graphics processors that are built into current cpus) can easily handle more than , trajectories at smooth framerates. mid-range smartphones handle the same data with low framerates (ca. fps), which is still usable but does not feel as smooth. for virtual reality applications, we also tested linus on the oculus go vr goggles. here, a high frame rate is essential, as the user experience would otherwise be quite uncomfortable, and we recommend reducing the number of trajectories further to about , in this use case. due to the differences in performance and user experience, we recommend creating dedicated visualisation packages (or tours) for the intended type of output device.
in the future, we would like to support further advanced preprocessing options such as trajectory clustering, more generic transforms, or feature extraction. we would also like to extend the visualisation part of linus so that the user can interactively annotate the data. here, we envision that the user can easily label subsets of trajectories and then use this information for downstream analysis (such as building a trajectory classifier). our experience with linus shows that sharing relatively complex data visualisations in this interactive way makes it much more efficient to collaboratively find patterns in data and to create and discuss figures or videos for presentations and manuscripts. more generally, interactive data sharing is helpful when collaborations, presentations, or teaching occur remotely, as has been common during the current pandemic. at the same time, during an in-person event such as a talk or poster session at a conference, the target audience can explore the data instantly on their computers, tablets, or smartphones. in any case, touch screens or even virtual reality goggles increase the immersion with more natural controls and true d-rendering, helping to grasp the trajectories’ spatial relation. with these features, we are convinced that approaches like linus will considerably improve how we collectively explore, communicate, and teach the spatio-temporal patterns from information-rich, multi-dimensional, experimental data.

methods our software conceptually consists of two parts: a python-based preprocessing tool and a web-based visualisation tool. we aimed to move all static and computationally expensive adjustments to the preprocessor, whereas dynamic adjustments to tweak the visualisations are all performed directly in the web browser later. after running the preprocessor, a folder containing html, css, and javascript files is created (called a visualisation packet). these files are opened directly or uploaded to a web server.
types of input data
we currently support different trajectory file types directly: tgmm (amat et al., ), biotracks (gonzalez-beltran et al., ), svf (mcdole et al., ), and custom csv. most formats are designed to store d coordinates plus a timestamp primarily, but no other custom data. however, linus supports additional numerical attributes that can then be used to filter or colour the trajectories accordingly. we, therefore, offer a generic csv format which can be supplemented with custom numerical data: each csv file contains the data for a single trajectory, the first three columns represent the coordinates (x, y, z) and any further column is interpreted as another attribute. the columns are delimited by semicolons, and the number of columns must be identical for all csv files. linus reads the first line of a csv file by default as the header and uses this information to automatically name the respective properties in the user interface. the data converter script then expects a folder that exclusively contains csv files as input.

implementation of data preprocessing
the trajectory data are then converted to a custom json format by our python-based preprocessor. python has the advantage of being executable on a wide range of operating systems and hardware. the preprocessor is used with a command-line interface or by calling the respective commands directly. the command-line interface is easier to use, and it covers the most common cases (e.g. visualising a dataset with custom attributes, and automatically adding an edge-bundled version). for more complex cases, e.g. visualising two datasets at once, or using multiple custom states of the data (e.g. custom projections), users can write their own python script. we provide detailed and up-to-date documentation in our repository at https://gitlab.com/imb-dev/linus.
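as an illustration of the generic csv format described above, the following minimal python sketch reads a folder of semicolon-delimited trajectory files and checks that all files have the same number of columns. it is not part of linus: the function names `read_trajectory` and `read_folder` and the returned dictionary layout are assumptions made for this example, and the sketch presumes that every file starts with a header line.

```python
import csv
from pathlib import Path


def read_trajectory(path):
    """Read one semicolon-delimited trajectory file (hypothetical helper).

    Returns (header, rows): header holds the column names from the first
    line, rows holds one list of floats per point (x, y, z, attributes...).
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=";")
        header = next(reader)
        rows = [[float(value) for value in row] for row in reader if row]
    return header, rows


def read_folder(folder):
    """Read every csv file in a folder and check that column counts agree."""
    trajectories = []
    n_columns = None
    for path in sorted(Path(folder).glob("*.csv")):
        header, rows = read_trajectory(path)
        if len(header) < 3:
            raise ValueError(f"{path.name}: expected at least x, y, z columns")
        if n_columns is None:
            n_columns = len(header)
        elif len(header) != n_columns:
            raise ValueError(f"{path.name}: column count differs between files")
        if any(len(row) != n_columns for row in rows):
            raise ValueError(f"{path.name}: a row does not match the header")
        trajectories.append(
            {
                "name": path.stem,
                # every column after x, y, z is treated as a custom attribute
                "attributes": header[3:],
                "points": rows,
            }
        )
    return trajectories
```

calling `read_folder` on a folder that contains only csv files returns one entry per trajectory, with the attribute names taken from the header columns that follow x, y and z.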
time-consuming operations are implemented using numpy, and the most demanding process (edge bundling) is handled by an opencl script, which increases calculation speed by - fold. all trajectories are resampled to equal length during the preprocessing step, enabling us to use numpy’s fast matrix-based algorithms (we use n × m matrices, storing n trajectories with m points in each trajectory). the resulting json file then contains a list of datasets. each dataset holds a set of trajectories that optionally can be further organised into several states, for example, the original data and a projected version. at this point, all data are organized in the same structure that webgl requires (supplementary fig. ), which allows faster loading of the data in the next step.

implementation of the web-based tool
the visualisation part runs in web languages (html, javascript, css, webgl). the json file containing the preprocessed data is directly loaded as an object by javascript. this part of the software copies the numeric arrays from the json file into webgl's data buffers like the position buffer, index buffers, and attribute buffers. if a dataset contains more than one state (e.g. an original state and a projected state), these states are stored in additional attribute buffers. depending on the provided data, we also adjust the shader source code dynamically.
for example, we inject variables and specific statements into the shader source code before it is compiled by webgl. with the dynamic creation of buffers as well as code statements and variables, we pre-build a shader program that is directly tailored to the properties of the respective data. as a result, rendering the data allows quick changes of the visualisation (e.g. color mapping or projections) without the need for updating the datasets on the graphics card, which results in higher frame rates and smooth transitions compared to approaches where data is transformed offline. in principle, linus supports an arbitrary number of attributes and states. however, practically this number is limited by the particular device’s abilities (i.e. its graphics card) and webgl in general. typically, we have eight attribute arrays on smartphones and sixteen or more on desktop computers. our software requires four such attribute arrays for internal purposes, plus one more array for each state or attribute. thus, for a dataset containing original data, bundled data and two custom attributes (that are shared between the states) we would need eight attribute buffers in total (four internal, two for the states, and two for the attributes), which can still be managed by a smartphone. visualising additional states or attributes requires devices with more capabilities, like a desktop computer.

the graphical user interface (gui)
the user interface (see fig. and supplementary fig. ) consists of a general part that includes options to change the size of the gui, the background colour, and camera controls. furthermore, the user can choose how often the render order should be restored (see section "current technical limitations"). additionally, several data-specific settings are shown, and this section is further divided into:
● filters for each attribute to only show data within a defined range; if window is a positive value, it will be used to automatically display a range [min, min+window] (while max is ignored).
● render settings, including colour mapping, shading, transparency, which can be independently set for selected and unselected trajectories.
● mercator projection plus rotations that are applied to the d positions before the d transformation, and mapping the "free" z component to attributes for d + feature plots (e.g. space-time trajectories).
● cutting planes can be used to generate a generic d projection. here, the projection plane can be defined by selecting a centre point and a normal direction. everything above the projection plane is then mapped onto the plane.
● the last part of the gui offers options to export selected trajectories and also shows a list of available tours. this list is used to start or to load a tour into the tour editor.

sharing visualisations and tours
as explained above, the user receives a self-contained package. this package can be opened with any web browser that supports webgl and can be distributed in multiple ways: it can be locally shared (e.g. sent by email or copied using a usb stick) or made easily accessible to a broad audience by uploading it to a web server (as done e.g. on our companion website for this manuscript https://imb-dev.gitlab.io/linus-manuscript/). the method of sharing the actual visualisation package also influences how an interactive tour can be distributed. to make tours reproducible, they are internally represented by a textual list of actions. this script can be copied directly into the source code of the file main.html of the visualisation package. this method works both for server-based and for file-based distribution of the package. if the visualisation package is hosted on a web server, the tours can also be shared simply with a custom url or qr code that encodes a tour’s actions.
however, the length of such tours is restricted: qr codes are limited in the amount of information they can store, and urls are usually limited as well (but typically this limit can be configured in the web server's settings). the commands for camera motion and parameter adjustment (e.g. changing the colour) are concise and only require a few bytes of the url or qr code. in contrast, textual annotations and especially spatial selections require considerably more space. thus, sharing a tour by qr codes or urls usually works for tours without selections and without extensive text annotations.

specific considerations for virtual reality devices
the virtual reality mode works only when the visualisation package is hosted on a web server. further, the way of navigation changes slightly because the head position takes over the task of the camera. for convenience, we introduce the possibility to adjust the height of the dataset and to rotate the data horizontally. inside the vr environment, no gui is rendered. to allow controlling the gui, the user can switch between " d mode" and "vr mode" instantly.

export of trajectories
the user can select trajectories and download this selection. the download may take several minutes as the data must internally be converted into csv format. the result is a zip folder containing one folder for each data set (usually a single folder), each containing a separate folder for each state of the data (e.g. "original" and "bundled"). each trajectory is saved as a separate csv file.
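the nested layout of the exported archive (data set folder, then state folder, then one csv file per trajectory) can be indexed with a few lines of python. the sketch below is only an illustration under assumptions: the function name `index_export` is hypothetical, the folder and file names inside a real export may differ from those used in any example, and the csv files are only listed, not parsed, because their exact column layout is not specified here.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def index_export(zip_source):
    """Group the csv members of an export archive by (dataset, state).

    zip_source may be a file path or a binary file-like object.
    """
    index = defaultdict(list)
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            parts = PurePosixPath(name).parts
            # expect at least dataset/state/file.csv; skip everything else
            if len(parts) < 3 or not name.lower().endswith(".csv"):
                continue
            dataset, state = parts[-3], parts[-2]
            index[(dataset, state)].append(name)
    return dict(index)
```

the returned dictionary maps each (data set, state) pair to the archive paths of its trajectory files, which can then be opened individually with `ZipFile.open`.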
it should be noted, however, that the user can only download the resampled trajectories and not trajectories in the raw (temporal or spatial) resolution before the data preprocessing.

screenshots and videos
at any time, the user can take screenshots and record videos with the respective buttons in the bottom left corner. video recording requires an up-to-date chrome-based browser (chrome version or later; other browsers might support it as well but only with enabled experimental features). the output format is webm, which is currently the only file type that can be directly saved from webgl.

additional technical limitations
in order to offer the tool for a broader range of platforms, we decided to utilise webgl . . this web standard provides the feature set of opengl es . (https://www.khronos.org/webgl/), which is limited compared to regular opengl versions. webgl . is implemented by a wide range of browsers, such as chrome version , firefox . , safari . , ios , chrome mobile (or newer, respectively). when rendering a scene containing both trajectories and context, our application must render two different types of geometric primitives (lines and triangles) simultaneously. this can only be performed by two consecutive draw calls: the program first renders all triangles and then renders the line segments. since we need to support transparent rendering, we cannot rely on the z-buffer for determining the spatial order of the segments as this works only for non-transparent geometries (the z-buffer usually tells us if a segment should be drawn or not by checking whether a closer segment that would cover the new segment has already been drawn). thus, we use an alternative to the z-buffer: we sort the geometry first and render it starting with the most distant element. step by step, we draw elements that are closer to the observer over more distant ones, ensuring the correct depth ordering of elements.
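the back-to-front ordering described above can be sketched in a few lines of python. this is a simplified illustration, not the linus implementation (which runs in the browser): the function name `sort_back_to_front` is hypothetical, and using the distance from the camera to each segment's midpoint as the depth key is an assumption made for this example.

```python
import math


def sort_back_to_front(segments, camera):
    """Return segments ordered from farthest to nearest to the camera.

    Each segment is a pair of (x, y, z) points. Its depth is approximated
    by the distance between the camera and the segment midpoint.
    """

    def depth(segment):
        start, end = segment
        midpoint = [(a + b) / 2 for a, b in zip(start, end)]
        return math.dist(midpoint, camera)

    # farthest first, so that nearer segments are drawn over distant ones
    return sorted(segments, key=depth, reverse=True)
```

drawing the segments in the returned order lets transparent geometry blend correctly without a z-buffer, at the cost of re-sorting whenever the camera moves.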
however, we cannot use this idea to compute the overlap between the set of triangles and the set of line segments since they are different types of primitives and as such, require separate draw calls. as webgl currently does not have a geometry shader, we cannot mix triangles and lines in one draw call. a consequence is that context can only be rendered as a background silhouette. our internal resorting procedure can require a noticeable amount of time (e.g. around . s for . trajectories). to ensure a fluent user experience, we use an adaptive strategy and only sort the data when the user stops moving the camera. this can lead to some visual artifacts during the rotation of the camera, but after stopping the motion, the correct rendering order is established quickly. for huge amounts of data, or for devices with low cpu performance (the sorting happens on the cpu, not on the gpu), it is also possible to completely disable the sorting. in that case, we shuffle the rendering order, which at least avoids distracting global patterns introduced by these artifacts.

data availability
exemplary visualizations are available by scanning the qr codes in fig. directly or by visiting https://imb-dev.gitlab.io/linus-manuscript/

code availability
the linus software including source code and documentation is freely available at our repository at https://gitlab.com/imb-dev/linus.

acknowledgments
the authors are grateful to gopi shah and konstantin thierbach for sharing data and contributing useful feedback. j.w. received funding from the international max planck research school on neuroscience of communication: function, structure, and plasticity (leipzig, germany; https://imprs-neurocom.mpg.de ). k.a. and v.t. acknowledge funding from the european molecular biology laboratory (embl) barcelona and thank the mesoscopic imaging facility, embl barcelona, for help with imaging.
author contributions
n.s., j.h., and i.r. conceived the project. j.w. wrote the software code. m.h. and n.s. supervised the project. n.s. and j.w. wrote the manuscript. k.a. and v.t. generated the dataset on zebrafish blastoderm explants. all authors read, edited, and approved the manuscript.

references
amat f, lemon w, mossing dp, mcdole k, wan y, branson k, myers ew, keller pj. . fast, accurate reconstruction of cell lineages from large-scale fluorescence microscopy data. nat methods : – .
bailey h, mate br, palacios dm, irvine l. . behavioural estimation of blue whale movements in the northeast pacific from state-space model analysis of satellite tracks. endanger species res.
callaway e. . the visualizations transforming biology. nature news : .
egevang c, stenhouse ij, phillips ra, petersen a, fox jw, silk jrd. . tracking of arctic terns sterna paradisaea reveals longest animal migration. proc natl acad sci u s a : – .
gonzalez-beltran an, masuzzo p, ampe c, bakker g-j, besson s, eibl rh, friedl p, gunzer m, kittisopikul m, le dévédec se, leo s, moore j, paran y, prilusky j, rocca-serra p, roudot p, schuster m, sergeant g, strömblad s, swedlow jr, van erp m, van troys m, zaritsky a, sansone s-a, martens l. . community standards for open cell migration data. biorxiv. doi: . /
imirzian n, zhang y, kurze c, loreto rg, chen dz, hughes dp. . automated tracking and analysis of ant trajectories shows variation in forager exploration. sci rep : .
kwok r. .
deep learning powers a motion-tracking revolution. nature : – .
liu c, ye fq, newman jd, szczupak d, tian x, yen cc-c, majka p, glen d, rosa mgp, leopold da, silva ac. . a resource for the detailed d mapping of white matter pathways in the marmoset brain. nat neurosci : – .
mcdole k, guignard l, amat f, berger a, malandain g, royer la, turaga sc, branson k, keller pj. . in toto imaging and reconstruction of post-implantation mouse development at the single-cell level. cell . doi: . /j.cell. . .
mcquin c, goodman a, chernyshev v, kamentsky l, cimini ba, karhohs kw, doan m, ding l, rafelski sm, thirstrup d, wiegraebe w, singh s, becker t, caicedo jc, carpenter ae. . cellprofiler . : next-generation image processing for biology. plos biol :e .
pietzsch t, saalfeld s, preibisch s, tomancak p. . bigdataviewer: visualization and processing for large image data sets. nat methods : – .
romero-ferrero f, bergomi mg, hinz r, heras fjh, de polavieja gg. . idtracker.ai: tracking all individuals in large collectives of unmarked animals. arxiv [cscv].
royer la, weigert m, günther u, maghelli n, jug f, sbalzarini if, myers ew. . clearvolume: open-source live d visualization for light-sheet microscopy. nat methods : – .
schmid b, tripal p, fraaß t, kersten c, ruder b, grüneboom a, huisken j, palmisano r. . dscript: animating d/ d microscopy data using a natural-language-based syntax. nat methods : – .
shah g, thierbach k, schmid b, waschke j, reade a, hlawitschka m, roeder i, scherf n, huisken j. . multi-scale imaging and analysis identify pan-embryo cell dynamics of germlayer formation in zebrafish. nat commun : .
shneiderman b. . the eyes have it: a task by data type taxonomy for information visualizations. proceedings ieee symposium on visual languages. pp. – .
tinevez j-y, perry n, schindelin j, hoopes gm, reynolds gd, laplantine e, bednarek sy, shorte sl, eliceiri kw. . trackmate: an open and extensible platform for single-particle tracking. methods. doi: . /j.ymeth. . .
trivedi v, fulton t, attardi a, anlas k, dingare c, martinez-arias a, steventon b. . self-organised symmetry breaking in zebrafish reveals feedback from morphogenesis to pattern formation. biorxiv. doi: . /
wallingford jb. . the -year effort to see the embryo. science : – .
figures
figure browser-based exploration and sharing of trajectory visualizations with linus. (a) control workflow of linus. starting with the data, a python-converter is used to enrich the data with further features (e.g. numeric metrics, an edge-bundled version of the data, visual context) and to prepare the visualisation package. (b) within minutes, the data can be visualised and explored in the browser, and different aspects of the data can be interactively highlighted (example shows the effect of changing the degree of trajectory bundling).
figure configurable filters allow deep data exploration. the user can choose from a range of several visualisation methods directly in the browser interface to highlight aspects of interest in the data (zebrafish tracking results from (shah et al., ) as an example).
(a) the line data is visualized using a range of options for shading and colour mapping. (b-d) the user can filter parts of the data with respect to specific attributes, such as (b) time intervals or (c) a specific range of signals (marker expression in cells in this case). (d) the user can further create subselections of the tracks in space using cutting planes or refinable spatial selection. the visual attributes can be defined separately for the selected focus region and the non-selected context region. (e-g) the web interface can blend seamlessly between different states of the data. this feature can be used to map between (e) original tracks and their edge-bundled version, to visualize planar projections of the d data (f) locally on a definable (oblique) plane or (g) globally using a mercator projection (with definable parameters).
figure sharable interactive visualization packets for a multitude of applications ranging across a variety of sciences. the user can combine the visualization methods, annotations, and camera motion paths in a scheduled tour that can be shared by a custom url or qr code generated directly in the browser interface. panels (a)-(d) demonstrate use cases for real-world datasets with different characteristics and dimensionality. (a) ant trails ( d+t) from (imirzian et al., ) . bundling and colour-coding (spatial orientation by mapping (x,y,z) to (r,g,b) values) indicate the major trails running in opposing directions.
(b) gps animal tracking data for two species (blue whales (bailey et al., ) - blue and arctic tern (egevang et al., ) - red) shown on a mercator projection of the earth’s surface. for a better orientation, the outline of the continents is included as axes into the visualization that dynamically adapt to the projections and viewpoint changes ( d surface data + t). (e) cell movements during the elongation process of zebrafish blastoderm explants ( d+t) (trivedi et al., ) . bundling, colour coding, and spatial selection highlight collective cell movements as the explant starts elongating, focusing on a subpopulation of cells driving this process. colour code shows time from early (yellow) to late (red) for selected tracks. (f) brain tractography data showing major white matter connectivity from diffusion mri ( d). the spatial selection highlights the left hemisphere while anatomical context is provided by the outline of the entire brain (from mesh data) and the defocused tracts of the right hemisphere.

supplementary figures
supplementary figure : overview of data structure. the coordinate list holds the x/y/z values for each supporting point of the trajectories. for each such point, an arbitrary number (only limited by the graphics card's capabilities) of attributes can be stored. the attributes must be provided in the same order as the points. to create trajectories from the point set, an index list is provided as well.
each pair of indices describes one segment of a trajectory. the number of such segments is not restricted, as any point (and its respective attributes) can be used multiple times.
supplementary figure : overview of settings. an overview of the different visualisation settings available to the user from the gui (two screenshots merged). for explanations regarding different settings, see text or documentation at https://gitlab.com/imb-dev/linus.
supplementary figure : tour editor. the tour actions can be organised by drag and drop (reading order: from left to right, top to bottom). every action can be scheduled with a time delay with respect to the end of the previous action. some actions use transitions (e.g. camera motions or the adjustment of numeric values) whose duration can be configured as well. eventually, a url or a qr code can be created.
review and performance evaluation of trait-based between-community dissimilarity measures

title
review and performance evaluation of trait-based between-community dissimilarity measures

author details
attila lengyel * & zoltán botta-dukát *
*centre for ecological research, institute of ecology and botany, alkotmány u. - ., h- vácrátót, hungary
corresponding author, lengyel.attila@ecolres.hu
botta-dukat.zoltan@ecolres.hu

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission.

abstract
1. in recent years a variety of indices have been proposed with the aim of quantifying functional dissimilarity between communities. these indices follow different approaches to account for between-species similarities in the calculation of community dissimilarity, yet they all have been proposed as straightforward tools.
2. in this paper we reviewed the trait-based dissimilarity indices available in the literature, contrasted the approaches they follow, and evaluated their performance in terms of correlation with an underlying environmental gradient using individual-based community simulations with different gradient lengths. we tested how strongly dissimilarities calculated by different indices correlate with environmental distances. using random forest models we tested the importance of gradient length, the choice of data type (abundance vs. presence/absence), the transformation of between-species similarities (linear vs. exponential), and the dissimilarity index in predicting the correlation value.
3. we found that many indices behave very similarly and reach high correlation with environmental distances. there were only a few indices (e.g.
rao’s dq, and representatives of the nearest neighbour approach) which regularly performed worse than the others. by far the strongest determinant of correlation with environmental distance was the gradient length, followed by the data type. the choice of dissimilarity index and transformation method did not seem to be crucial when correlation with an underlying gradient is to be maximized.
4. synthesis: we provide a framework of functional dissimilarity indices and discuss the approaches they follow. although these indices are formulated in different ways and follow different approaches, most of them perform similarly well. at the same time, sample properties (e.g. gradient length) determine the correlation between trait-based dissimilarity and environmental distance more fundamentally.

keywords
beta diversity, dissimilarity index, distance metric, community ecology, functional traits

abbreviations
cdf = cumulative distribution function, cwm = community-weighted mean, fdissim = functional dissimilarity, vis = variable importance score

introduction
understanding and explaining the variation of living communities along dimensions of space and time have long been a focus of ecological research. the widely applied scheme by whittaker ( , ) to tackle questions of different aspects of community variation divides community diversity into alpha (within-community), beta (between-community) and gamma (across-community) components. it is no exaggeration to say that among these three, beta diversity sparked the most controversy due to the multitude of ways in which it can be formulated (tuomisto a,b, anderson et al. , podani & schmera , baselga & leprieur ).
one of the most popular approaches to beta diversity builds upon quantification of variation between pairs of communities using dissimilarity indices (anderson et al. , legendre & de cáceres , ricotta ). a broad spectrum of such dissimilarity indices is available for many specific purposes, providing elementary tools for different fields of ecology and beyond (see reviews by legendre & legendre , podani ). nevertheless, choosing from so many options requires a more or less subjective decision from the researcher, which may affect the final result of the analysis. comparative reviews of dissimilarity indices (faith et al. , koleff et al. ) and evaluations of the effects of methodological decisions (lengyel & podani ) are therefore helpful in making these decisions. the most popular, yet not exclusive, interpretations of diversity have long considered species as variables that are unrelated to each other. in the last two decades, however, the functional approach to ecological questions has gained unprecedented attention (díaz & cabido , mcgill et al. ). this approach relies on the fact that species are not all maximally different from each other; rather, they can be considered related with respect to similarities in the traits thought to represent their roles in ecosystems (violle et al. ). the need to account explicitly for between-species relatedness generated a wave of methodological improvements that introduced new methods for the calculation of diversity. next to a lively scientific discussion on how functional alpha diversity can be appropriately quantified (mason et al. , petchey & gaston , villéger et al. , mouchet et al.
), suggestions were also made for the expression of functional beta diversity (swenson , botta-dukát , chao et al. ). among them, a large variety of indices for calculating dissimilarity between pairs of communities on the basis of the traits of their species have been proposed (e.g. ricotta & burrascano , cardoso et al. , ricotta & pavoine ). although these indices have been introduced as straightforward measures for revealing between-community dissimilarity on the basis of traits, they rest on very different concepts, and we still lack a comparative review of them. in this paper we aim to provide an overview and a conceptual framework for the pairwise functional dissimilarity (hereafter called fdissim) measures available in the literature, to the best of our knowledge. we start with (1) a short overview of the concept and indices of ecological (dis-)similarity without accounting for relatedness of species, then (2) we review and classify fdissim indices according to their conceptual basis, and (3) we test the performance of fdissim indices.

short overview of taxon-based (dis-)similarity methods
most fdissim measures are generalizations of simple indices which were originally designed for expressing dissimilarity based on species composition (that is, omitting similarities between species). we start the review of trait-based (dis-)similarity measures with a brief summary of these species-based indices. then, we present a framework of approaches including several families of trait-based dissimilarity indices.

species-based indices
most indices can be written in either similarity (s) or dissimilarity (d = 1 − s) form, but when it is not necessary to specify the form, we call them ‘resemblances’.
in the case of presence/absence data, these indices are based on the well-known 2 × 2 contingency table whose cells represent the number of species shared (denoted by a), as well as the number of species occurring only in one of the communities (b and c). the fourth cell of the contingency table, quantifying the number of shared absences, is disregarded by these indices and rarely used in ecological analyses (but see tamás et al. ). all these indices express similarity as the proportion of shared diversity to total diversity. hence, all of them range between 0 and 1. in the case of presence/absence data the number of shared species, a, in the numerator stands for shared diversity for all indices, while the denominators are different. in the sørensen index (ss) the denominator is the arithmetic mean of the species numbers of the two communities, in the ochiai index (so) it is their geometric mean, in the kulczynski index (sk) it is their harmonic mean, while in the simpson index (ssi) it is the richness of the species-poorer community. if the two communities are equally species-rich, then these indices are equal, otherwise ss < so < sk < ssi. in the jaccard index (sj), the denominator is the total number of species in the two communities, while in the sokal & sneath index (sss) species occurring in a single community are taken into account with double weight. there is a direct and monotonic relationship between the jaccard, sørensen, and sokal & sneath indices (see appendix s ). table summarizes the similarity and dissimilarity forms of the above indices.
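the verbal definitions above can be sketched numerically. the following python snippet is only an illustration: the function name, the dictionary keys and the example species lists are our own, and the formulas are the textbook forms implied by the description (shared species a divided by the arithmetic, geometric or harmonic mean of the two richness values, by the smaller richness, by the pooled richness, or by a denominator with doubled b and c).

```python
import math


def presence_absence_similarities(x, y):
    """Similarity indices for two species lists, from the 2 x 2 table counts.

    x, y: iterables of species names present in each community.
    """
    x, y = set(x), set(y)
    a = len(x & y)   # species shared by both communities
    b = len(x - y)   # species only in the first community
    c = len(y - x)   # species only in the second community
    n1, n2 = a + b, a + c   # species richness of each community
    if n1 == 0 or n2 == 0:
        raise ValueError("both communities must contain at least one species")
    return {
        "sorensen": a / ((n1 + n2) / 2),                 # arithmetic mean
        "ochiai": a / math.sqrt(n1 * n2),                # geometric mean
        "kulczynski": a / (2 * n1 * n2 / (n1 + n2)),     # harmonic mean
        "simpson": a / min(n1, n2),                      # poorer community
        "jaccard": a / (a + b + c),                      # pooled richness
        "sokal_sneath": a / (a + 2 * (b + c)),           # b, c double weight
    }


# Hypothetical example: 4 and 6 species, 2 of them shared.
sims = presence_absence_similarities("ABCD", "CDEFGH")
for name, value in sims.items():
    print(f"{name:13s} {value:.4f}")
```

with unequal richness the printed values follow the ordering stated in the text (sørensen, ochiai, kulczynski, simpson in increasing order), and the sørensen and sokal & sneath values can be recovered from the jaccard value alone, which is what the monotonic relationship implies.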
for abundance data, the resemblance of two communities is derived from the summation of species-wise differences, with the simplest interpretations being the euclidean and the manhattan distances, respectively:

eq. 1. $ED_{jk} = \sqrt{\sum_{i=1}^{S_{jk}} (x_{ij} - x_{ik})^2}$

eq. 2. $MD_{jk} = \sum_{i=1}^{S_{jk}} |x_{ij} - x_{ik}|$

where $x_{ij}$ and $x_{ik}$ are the abundances of species i in communities j and k, and $S_{jk}$ is the total number of species in j and k. for both indices, the minimum is 0, but the maximum of the euclidean distance is the square root of the sum of squared abundances, while for the manhattan distance the maximum is the sum of abundances. their dependence on total abundance makes these index values difficult to compare across samples; therefore, indices including a standardization have become more popular in ecological studies. the standardization is possible in several ways. the first option is to standardize the raw species contributions to between-community dissimilarity ($x_{ij} - x_{ik}$), and then to sum them. each species-level difference in abundance should therefore be divided by a scaling factor such that the maximal species-level difference is 1, and this maximum is reached if the species is present in only one of the compared communities. summing $x_{ij}$ and $x_{ik}$ in the denominator satisfies this requirement and gives a well-known distance measure, the canberra index:

eq. 3. $CD_{jk} = \sum_{i=1}^{S_{jk}} \frac{|x_{ij} - x_{ik}|}{x_{ij} + x_{ik}}$

however, the canberra index still ranges between 0 and $S_{jk}$. according to ricotta & podani ( ), the normalized canberra index can be derived by unweighted averaging of species contributions:

eq. 4. $NCD_{jk} = \frac{1}{S_{jk}} \sum_{i=1}^{S_{jk}} \frac{|x_{ij} - x_{ik}|}{x_{ij} + x_{ik}}$

alternatively, species-level differences can be divided by $\max(x_{ij}, x_{ik})$. this also yields unity if a species occurs in only one of the plots.
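as a worked illustration of the euclidean, manhattan, canberra and normalized canberra distances, the sketch below evaluates them for two short abundance vectors. the function names and the example data are ours, and we assume that sjk counts only species present in at least one of the two communities, so joint absences are skipped.

```python
import math


def _species_pairs(xj, xk):
    """Abundance pairs for the S_jk species present in at least one community."""
    if len(xj) != len(xk):
        raise ValueError("abundance vectors must list the same species")
    pairs = [(a, b) for a, b in zip(xj, xk) if a > 0 or b > 0]
    if not pairs:
        raise ValueError("no species present in either community")
    return pairs


def euclidean(xj, xk):
    return math.sqrt(sum((a - b) ** 2 for a, b in _species_pairs(xj, xk)))


def manhattan(xj, xk):
    return sum(abs(a - b) for a, b in _species_pairs(xj, xk))


def canberra(xj, xk):
    # Each term lies in [0, 1]; it equals 1 when the species occurs in one
    # community only, so the sum ranges between 0 and S_jk.
    return sum(abs(a - b) / (a + b) for a, b in _species_pairs(xj, xk))


def normalized_canberra(xj, xk):
    # Unweighted average of the species-level contributions.
    return canberra(xj, xk) / len(_species_pairs(xj, xk))


# Hypothetical abundances of three species in communities j and k.
xj, xk = [4, 0, 2], [1, 3, 2]
print(euclidean(xj, xk), manhattan(xj, xk))
print(canberra(xj, xk), normalized_canberra(xj, xk))
```

doubling all abundances doubles the euclidean and manhattan values but leaves both canberra variants unchanged, which is the comparability argument made in the text.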
ricotta & podani ( ) called the variant scaled by $\max(x_{ij}, x_{ik})$ the modified canberra index, whose normalized version follows:

eq. 5. $NMCD_{jk} = \frac{1}{S_{jk}} \sum_{i=1}^{S_{jk}} \frac{|x_{ij} - x_{ik}|}{\max(x_{ij}, x_{ik})}$

calculated from binary data, both the normalized canberra and the normalized modified canberra index result in jaccard dissimilarity. a different way of standardization is possible if raw species-level differences are summed and divided by the sum of their theoretical maxima. in this case, the denominator can follow the logic of the canberra index, thus leading to the bray-curtis index:

eq. 6. $BC_{jk} = \frac{\sum_{i=1}^{S_{jk}} |x_{ij} - x_{ik}|}{\sum_{i=1}^{S_{jk}} (x_{ij} + x_{ik})}$

analogously with the normalized modified canberra index, instead of the sum, the denominator may contain the maximum of the abundances, resulting in the formula known as the marczewski-steinhaus index:

eq. 7. $MS_{jk} = \frac{\sum_{i=1}^{S_{jk}} |x_{ij} - x_{ik}|}{\sum_{i=1}^{S_{jk}} \max(x_{ij}, x_{ik})}$

it is worth noting that the bray-curtis and marczewski-steinhaus indices calculated on presence/absence data return the values of the sørensen index and the jaccard index in dissimilarity form, respectively. moreover, several abundance-based indices can be expressed if we generalize the a, b, and c quantities used in the definition of indices for presence/absence data (tamás et al. ):

eq. 8. $a' = \sum_{i=1}^{S_{jk}} \min(x_{ij}, x_{ik})$

eq. 9. $b' = \sum_{i=1}^{S_{jk}} [\max(x_{ij}, x_{ik}) - x_{ik}]$

eq. 10. $c' = \sum_{i=1}^{S_{jk}} [\max(x_{ij}, x_{ik}) - x_{ij}]$

substituting a, b and c with a’, b’ and c’ in the formula of the sørensen index gives the bray-curtis index, and doing so with the jaccard index results in the marczewski-steinhaus index. abundance versions of all other presence/absence indices can be created in the same manner.

a classification of fdissim indices
fdissim indices incorporate trait information into the calculation of dissimilarity in different ways.
the simplest solution is when summary statistics or distributions are calculated for the two communities and a measure of distance or segregation is calculated between them. we call this the summary-based class, and in our review we include two approaches within it: the typical value approach and the distribution-based approach. in the second class we include indices which utilize a symmetrical species by species (dis-)similarity matrix and link it directly with the compositional matrix through matrix operations. we call this the dissimilarity-based class, which includes the probabilistic, the ordinariness-based, the diversity partitioning, and the nearest neighbour approaches. the third class includes methods which make use of between-species (dis-)similarities for the classification of species; therefore, we call it the classification-based class. the classification either transforms the original structure of the dissimilarity matrix into discrete groups of species which can be used as functional types, or expresses dissimilarities in the form of a tree graph where between-species dissimilarities are organized in an inclusive hierarchy. this is a widespread approach to accounting for phylogenetic relatedness, since phylogenies are commonly summarized in the form of cladograms. such methods rely heavily on the algorithm chosen for the classification, including the decisions about the number of clusters and the method for breaking tied values. examples are provided by hérault & honnay ( ), nipperess et al. ( ), and cardoso et al. ( ), while a review is available by pavoine ( ).
as there is no general recommendation for the classification method, we omit this class from the framework detailed below and from the comparative test. the classification of trait-based dissimilarity indices and their main properties are summarized in table .

typical value approach
indices following this approach represent each community with a typical trait value, and calculate a distance metric between them. the most commonly applied typical trait value is the community-weighted mean (cwm; garnier et al. ). the rationale behind the cwm can be linked with the mass ratio hypothesis (grime ), which states that the effect of species on ecosystem functioning is proportional to their relative abundances. although several issues have emerged regarding its limited applicability in statistical inference (hawkins et al. , peres-neto et al. , zeleny ) and its neglect of within-community variation (muscarella & uriarte ), difference in cwm is still considered a reliable indicator of robust changes in trait composition induced by selective forces like environmental matching or succession (de bello et al. , , kleyer et al. ). ricotta et al. ( ) investigated the relationship between the distance between cwms and the probabilistic approach (see below) and showed its applicability to phylogenetic data. due to its modest computational requirements, lengyel et al. ( ) used the euclidean distance between trait cwms of phytosociological relevés for the trait-based numerical classification of grasslands of poland with a sample size of sites and species. another advantage of this method is its euclidean property. besides the community-weighted mean, other typical values, e.g.
the median or the mode, might be considered depending on the scaling of the trait variable and on specific research aims.

distribution-based approach
instead of typical values, the distribution of trait values can be considered a more reliable representation of the trait composition and variability of a community. continuous distributions can be defined by a density function and discrete distributions by the probabilities of the possible values, while both types can be characterized by a cumulative distribution function (cdf). a useful analogue of the distance between typical values might be the distance between discrete distributions, density functions or cdfs. if data are available on intraspecific trait variation, trait values form a continuous distribution. first, separate density functions have to be fitted within each species. then, the density function of the community-level distribution can be calculated as the weighted sum of the species-level density functions (carmona, de bello, mason, & lepš, ). if such data are not available, we can use relative abundances as estimates of the probabilities of the corresponding trait values. pairs of trait values and their probabilities form a discrete distribution. similarity of density functions can be measured by their overlap (see appendix s for an overview of overlap measures). overlap functions between within-species trait distributions have already proved useful in the quantification of between-species niche segregation (macarthur & levins , mouillot et al. ) or trait-based dissimilarity of species (lepš et al. , de bello et al. ). nevertheless, they are perfectly applicable at the community level as well. gregorius et al.
( ) proposed an index called delta for the quantification of differences between discrete trait distributions. delta is the minimal sum of frequencies shifted from one trait state to another trait state, weighted by the differences between the respective states. minimizing the sum of shifted frequencies is known in linear programming as the transportation problem (hitchcock ). due to its relatively high computational demand, it is unfeasible for the large compositional and trait data matrices typically used in ecological research; therefore, we exclude this index from our comparison. the difference between two cdfs can be calculated at each possible trait value (i.e. not only the observed ones), and the sum of these differences can then be used as a trait-based dissimilarity measure. in appendix s we introduce the distance between cdfs in more detail.

maximally distinct communities
species-based dissimilarities, except the euclidean, manhattan and (non-normalized) canberra distances, equal unity, which is their maximum, when the two compared communities do not share any species. in this context, we could call such communities maximally distinct. however, when traits are considered, two communities can be similar even if they do not share any species. for example, if each species of community a is replaced by a similar species in community b, the two communities have no shared species, but from a functional point of view they are similar. in this context, two communities are maximally distinct when the similarity of any species from the first community to any species in the other community is zero. it is a desirable property for a functional similarity index to take the value 0 if and only if the two compared communities are maximally distinct.
probabilistic approach
this approach can be traced back to the diversity framework proposed by rao ( ), and recently extended by pavoine & ricotta ( ). rao’s within-community diversity is defined as the expected dissimilarity between two randomly drawn individuals from a single community:

eq. 11. $Q(p) = \sum_{i} \sum_{j} p_i p_j \delta_{ij}$

where $p_i$ is the relative abundance of the ith species in the community and $\delta_{ij}$ is the dissimilarity between species i and j. this has become a widely used index of functional alpha diversity (botta-dukát ). likewise, a between-community component of diversity, $Q(p,q)$, can be defined as the expected dissimilarity between two random individuals, each selected from a different community:

eq. 12. $Q(p,q) = \sum_{i} \sum_{j} p_i q_j \delta_{ij}$

between-community diversity can be expressed using the within-community diversity of the two original communities ($Q(p)$ and $Q(q)$) and that of the community with mean relative abundances, $Q\left(\frac{p+q}{2}\right)$:

eq. 13. $Q\left(\frac{p+q}{2}\right) = \sum_{i} \sum_{j} \frac{p_i + q_i}{2} \cdot \frac{p_j + q_j}{2} \, \delta_{ij} = \frac{Q(p) + Q(q)}{4} + \frac{Q(p,q)}{2}$

subtracting the mean within-community diversity from the between-community diversity leads to rao’s dissimilarity (also called disc):

eq. 14. $D_Q = \sum_{i} \sum_{j} p_i q_j \delta_{ij} - \frac{\sum_{i} \sum_{j} p_i p_j \delta_{ij} + \sum_{i} \sum_{j} q_i q_j \delta_{ij}}{2} = Q(p,q) - \frac{Q(p) + Q(q)}{2}$

where $p_i$ and $q_i$ are the relative abundances of species i in the two communities. champely and chessel ( ) proved that if δ has the squared euclidean property, rao’s quadratic entropy is a concave function, i.e. $Q\left(\frac{p+q}{2}\right)$ is higher than or equal to the mean of $Q(p)$ and $Q(q)$. thus, under this condition, $D_Q \ge 0$. if $0 \le \delta_{ij} \le 1$, then $\sum_{i} \sum_{j} p_i q_j \delta_{ij}$, which is the weighted average of between-species distances, also has to be within this range. therefore, $0 \le D_Q \le 1$. however, $D_Q$ may be much less than 1, even if the two communities are completely distinct, when $Q(p)$ and $Q(q)$ are high.
therefore, pavoine & ricotta ( ) suggested dividing dq by its theoretical maximum (see equations and in pavoine & ricotta ). they recognized that the resulting indices are representatives of a broader family of indices, hereafter called dsimcom, which implement rao’s between-community and within-community components of diversity in the similarity formulae designed for presence/absence data. for this family, it is necessary to introduce the similarity between species, $\varepsilon_{ij} = 1 - \delta_{ij}$. the expected similarity between individuals of different communities, $A = \sum_{i} \sum_{j} p_i q_j \varepsilon_{ij}$, is taken as analogous with the shared diversity, a, in the similarity indices for presence/absence data that disregard species properties, while the expected similarities within communities ($A + B = \sum_{i} \sum_{j} p_i p_j \varepsilon_{ij}$ and $A + C = \sum_{i} \sum_{j} q_i q_j \varepsilon_{ij}$) are analogous with the species numbers (a+b, a+c). in this way, pavoine & ricotta ( ) presented formulae following the sokal & sneath, jaccard, sørensen, and ochiai indices. additionally, a formula analogous with whittaker’s effective species turnover (β = γ/α − 1; whittaker , tuomisto a) is suggested for two communities, which in similarity form is shown to be identical with the overlap index of chiu et al. ( ). in this formulation γ = $A + B + C$ and α = $(2A + B + C)/2$. pavoine & ricotta ( ) showed that members of the dsimcom family provide meaningful values also if absolute abundances, percentage values or binary occurrences are used instead of relative abundances. when $\varepsilon_{ij}$ contains taxonomical similarities, its off-diagonal elements are 0, and $A = a$, $B = b$, and $C = c$.
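a small numerical sketch may help to see why rao’s dissimilarity stays below its nominal maximum for maximally distinct communities, and how the dsimcom quantities collapse to the presence/absence counts. the python functions below are illustrative only; their names and the toy data are ours, and they implement the double sums exactly as described in the text (expected dissimilarity or similarity between two randomly drawn individuals).

```python
def expected_value(p, q, m):
    """Double sum of p[i] * q[j] * m[i][j] over all species pairs."""
    return sum(p[i] * q[j] * m[i][j]
               for i in range(len(p)) for j in range(len(q)))


def rao_dq(p, q, delta):
    """Between-community diversity minus mean within-community diversity."""
    between = expected_value(p, q, delta)
    within = (expected_value(p, p, delta) + expected_value(q, q, delta)) / 2
    return between - within


def dsimcom_abc(x, y, eps):
    """Analogues of a, b, c built from expected similarities."""
    A = expected_value(x, y, eps)
    B = expected_value(x, x, eps) - A
    C = expected_value(y, y, eps) - A
    return A, B, C


# Four species, all maximally dissimilar from each other.
delta = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
eps = [[1 - d for d in row] for row in delta]   # taxonomic similarity

# Two communities without shared species, two equally abundant species each.
p = [0.5, 0.5, 0.0, 0.0]
q = [0.0, 0.0, 0.5, 0.5]
print(rao_dq(p, q, delta))   # 0.5, although the communities are fully distinct

# Binary occurrences: 2 shared species, 1 unique species in each community.
print(dsimcom_abc([1, 1, 1, 0], [0, 1, 1, 1], eps))   # (2, 1, 1)
```

in the first example the within-community diversities are both 0.5, which pulls the dissimilarity down to 0.5 even though no species is shared; this is the behaviour that motivates the rescaling by the theoretical maximum.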
it is worth noting the inherent link between dq and cwmdis on the basis of the geometric interpretation by pavoine ( ) and ricotta et al. ( ). pavoine ( ) showed that if between-species dissimilarities are in the form $\delta_{ij} = d_{ij}^2/2$ and $d_{ij}$ is euclidean embeddable, dq is half the squared euclidean distance between the centroids of the two communities, a function monotonically related with cwmdis, the simple euclidean distance between centroids of communities. as ricotta et al. ( ) argue, if species relatedness is only described by a dissimilarity matrix, which is the common case in phylogenetic analyses, species can be mapped into a principal coordinate analysis ordination using $d_{ij}$. given the euclidean embeddable property of $d_{ij}$, this ordination should produce s − 1 or fewer ordination axes, all with positive eigenvalues. ordination scores for species can be used as traits, and therefore centroids of communities, and (squared) euclidean distances between communities, can be calculated. in the special case when between-species dissimilarities are euclidean distances, dq must be equal to the euclidean distance between the weighted averages of traits, that is, cwmdis. it is also notable that swenson et al. ( ) and swenson ( ) use the quantity $Q(p,q)$ as a standalone index of pairwise beta diversity and call it dpw or “rao’s d”. the latter name is misleading, since rao ( ) himself denoted the disc (or dq) index by dij. $Q(p,q)$ measures dissimilarity between two communities, but the dissimilarity of a community from itself is not zero. swenson ( ) also presents a standardized version of $Q(p,q)$ under the name rao’s h. with this formula the dissimilarity of a community to itself is scaled to ; however, its
transformation to a meaningful scale where each community has dissimilarity value zero towards itself is not elaborated. due to this drawback, we do not consider these indices in our review of functional dissimilarity measures. schmidt et al. ( ) proposed probabilistic indices with weighted and unweighted versions for expressing community similarity on the basis of taxa interaction networks (called tina, taxa interaction-adjusted) and phylogenetic relatedness (pina, phylogenetic interaction-adjusted). tina and pina differ only in the type of data the interaction matrix contains. notably, the functional formula of weighted tina is identical with the ochiai version of dsimcom. however, the unweighted tina, abbreviated tu, is not a special case of tina, which we consider an inconsistency. therefore, we did not include tu as a separate index.

ordinariness-based approach
with respect to functional alpha diversity, leinster & cobbold ( ) introduced the concept of species ordinariness, defined as the weighted sum of relative abundances of species similar to a focal species within the same community, or in other words, the expected similarity of an individual of the focal species and an individual chosen randomly from the same community. according to ricotta & pavoine ( ) it is straightforward to replace abundances with ordinariness values in the species-based (dis-)similarity indices. following this concept, ricotta & pavoine ( ) introduced a new family of trait-based similarity measures called dissabc. dissabc applies the schemes of the jaccard, sørensen, ochiai, kulczynski, sokal & sneath, and simpson indices. either relative or absolute abundances can be chosen as input values. species ordinariness values can be calculated either with respect to the pooled species list of the two communities under comparison, or with respect to the total species list of the data matrix.
for species-based analyses, ricotta & podani ( ) suggested a general formula of distance measures in which community dissimilarity is calculated by the weighted averaging of species-level differences in abundance. from this formula, a normalized canberra distance, the bray-curtis distance, the marczewski-steinhaus index, and an evenness-based dissimilarity index (ricotta ) can be derived. according to pavoine & ricotta ( ), a meaningful dissimilarity index, called generalized_tradidiss, can be designed by replacing species abundances with species ordinariness values. additionally, this index contains a factor which weights the contribution of each species to the overall dissimilarity between the two communities. this weight can be set to give equal weight to all species or to weight them proportionally to their relative abundance in the pooled communities.

diversity partitioning approach
following the work of hill ( ), a community with diversity of order q, qd, is as diverse as a theoretical community containing qd equally abundant species. the order of diversity, q, expresses the weight given to differences in species abundance, with q = 0 representing the presence/absence case and q = ∞ considering only the relative abundance of the most abundant species in the community. without accounting for interspecific similarities, there is an emerging consensus that using effective numbers (also called numbers of equivalents) is a straightforward way of partitioning diversity into within-community (alpha), between-community (beta) and across-community (gamma) components (jost ). of these three, the between-community component, beta diversity, can be interpreted as a form of dissimilarity when applied to two communities (ricotta ).
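the notion of an effective number can be made concrete with a few lines of python. the sketch below uses the standard definition of hill numbers, which is not written out in the text above, so the formula and the function name should be read as our addition: for q different from 1 the diversity of order q is the sum of the q-th powers of the relative abundances raised to 1/(1 − q), with the usual limits at q = 1 (exponential of shannon entropy) and q = ∞ (reciprocal of the largest relative abundance).

```python
import math


def hill_number(p, q):
    """Effective number of species of order q for relative abundances p."""
    p = [pi for pi in p if pi > 0]          # absent species do not count
    if abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("relative abundances must sum to 1")
    if q == 1:                              # limit case: exp(Shannon entropy)
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    if math.isinf(q):                       # only the dominant species matters
        return 1.0 / max(p)
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))


even = [0.25, 0.25, 0.25, 0.25]
uneven = [0.7, 0.1, 0.1, 0.1]
for q in (0, 1, 2, math.inf):
    print(q, round(hill_number(even, q), 4), round(hill_number(uneven, q), 4))
```

for the even community every order returns 4, matching the interpretation as a number of equally abundant species; for the uneven community the value falls from 4 at q = 0 towards 1/0.7 at q = ∞ as more weight is put on the dominant species.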
beta diversity can be derived from alpha and gamma diversity in a multiplicative (beta = gamma/alpha) or an additive way (beta = gamma − alpha). jost ( ) and chao et al. ( ) argued that multiplicative beta diversity is a useful way of quantifying community differentiation; however, because it scales between 1 and n (n being the number of communities), it is not comparable across samples containing different numbers of communities. to remove this dependence, they offer three solutions with which the value of multiplicative beta can be normed. although n is always 2 for pairwise comparisons, it seems straightforward to follow these recommendations, since scaling between 0 and 1 has several advantages, and most other indices also share this property. the rescaling formulae of chao et al. ( ) embody different concepts of community (dis-)similarity, which together we call the family of multiplicative beta indices. the first formula is the relative turnover rate per community, which is a linear transformation of beta to the normed scale.

eq. 15. $\beta_{turnover}\langle q\rangle = \frac{{}^{q}\beta - 1}{N - 1}$

here 0 means identical species composition, while 1 indicates totally distinct communities. in the pairwise comparison (n = 2), $\beta_{turnover}\langle q\rangle = {}^{q}\beta - 1$. the second index measures homogeneity, and is a linear transformation of the inverse of beta. since the complement of homogeneity is heterogeneity, we call its dissimilarity form βheterogeneity:

eq. 16. $\beta_{het}\langle q\rangle = \frac{1 - 1/{}^{q}\beta}{1 - 1/N}$

when n = 2, $\beta_{het}\langle q\rangle = 2 - 2/{}^{q}\beta$. with q = 0 (presence/absence case) the index is identical with the jaccard index, while with q = ∞ (abundance case) it is the morisita & horn index.
the third index measures overlap between communities, whose counterpart is segregation, thus we call it βsegr:

eq. 17. $\beta_{segr}\langle q\rangle = 1 - \frac{(1/{}^{q}\beta)^{q-1} - (1/N)^{q-1}}{1 - (1/N)^{q-1}}$

with q = 0, $\beta_{segr}\langle q\rangle = \beta_{turnover}\langle q\rangle$, and both give the sørensen index. according to leinster & cobbold ( ), it is possible to implement species similarities in the calculation of effective numbers. this way, the meaning of qdz is the diversity of a theoretical community with qdz equally abundant and maximally different species. hence, both unevenness in the abundance structure and between-species similarities decrease the value of the effective species number. because diversity is measured in effective numbers, it is possible to partition it into alpha, beta, and gamma fractions (leinster & cobbold ; botta-dukát ) in the multiplicative way. then, this multiplicative beta can be rescaled using the formulae proposed by chao et al. ( ). these indices behave consistently only if abundances are taken into account as relative abundances.

nearest neighbour approach
the earliest representatives of this family were shown by clarke & warwick ( ) and izsák & prince ( ), then ricotta & burrascano ( ), and ricotta & bacaro ( ; see dcw and dip indices). later ricotta et al. ( ) introduced a new, general family called paddis. all these indices were primarily defined for the presence-absence data type. the approach is based on a re-definition of the b and c quantities of the 2 × 2 contingency table. looking at species as maximally different, and taking x and y as the two communities under comparison, b can be viewed as the total uniqueness of community x. the uniqueness of a single species in x is 1 if it is absent in y, otherwise it is 0.
therefore, b is the sum of species uniqueness values. however, from a functional perspective, the uniqueness of a species present only in x should be between 0 and 1 if it is absent in y but a similar species is present there. therefore, it is possible to define the analogue of b, denoted $B$, which accounts for similarities between species:

eq. 18. $B = \sum_{i \in X} \left(1 - \max_{j \in Y} \varepsilon_{ij}\right)$

the same logic applies for c, which is the uniqueness of community y, where $C$ expresses the degree of uniqueness:

eq. 19. $C = \sum_{j \in Y} \left(1 - \max_{i \in X} \varepsilon_{ij}\right)$

ricotta et al. ( ) define the $A$ quantity as follows:

eq. 20. $A = a + (b - B) + (c - C)$

having $A$, $B$, and $C$ defined as analogues of a, b, and c, it is now possible to design trait-based similarity measures following the logic of the jaccard, sørensen, sokal & sneath, kulczynski, ochiai and simpson indices. it is notable that ricotta et al. ( ) define $A$ as a quantity that ensures that the components $A$, $B$ and $C$ add up to a + b + c, but with no explicit biological interpretation. notably, dip and dcw are identical with the sørensen and kulczynski forms of paddis. the generalization of dip and dcw to relative abundances, dcw(q), was also derived by ricotta & bacaro ( ). for these two versions, it is not necessary to define the $A$ component explicitly. using the relationships between the jaccard, sørensen, kulczynski, ochiai and sokal & sneath indices, it is theoretically possible to derive the extension of paddis to relative abundances from dcw(q); however, the biological interpretation of $A$ remains dubious in this framework.

methods
the performance of fdissim indices can be reliably tested only on data sets with known background processes driving community assembly, a requirement that can hardly be satisfied with real data.
therefore, we compared the performance of fdissim indices using simulated data sets. the data sets were generated using the comm.simul function of the comsimitv r package (botta-dukát & czúcz , botta-dukát ). this function follows an individual-based model for a meta-community comprising n communities and a regional pool of s species. local communities include j individuals, and are distributed equidistantly along a continuous environmental gradient (with gradient values between and ). each individual possesses three traits: an ‘environmental’ trait, a ‘competitive’ trait, and a neutral trait, all ranging on [ ; ]. intraspecific variation in trait values is neglected in the simulation, that is, individuals belonging to the same species are identical. the environmental trait defines the optimum of the species along the environmental gradient. the closer the position of a community along the environmental gradient to the environmental trait value of a species, the more suitable it is for that species: eq. . :; �:<:;= � � -��, � � � � -��, � �� "� . where σ (sigma) is adjustable so as to change the niche width of the species, and hence, the length of the gradient (see later). the competitive trait represents the resource acquisition strategy of the individual. the more similar this trait value is between two individuals, the stronger the competition between them, which means that intraspecific competition is the strongest. the neutral trait has no effect on community assembly, thus it is not considered in our study. the simulation starts with the random assignment of all individuals of all communities to species. the second step is a ‘disturbance’ event, when one individual ‘dies’ in each community.
this individual is to be replaced by an offspring of other individuals within the same community or those of other communities. each individual produces one offspring or does not reproduce. the probability of reproduction depends on the strength of competition. the offspring remains in the same community or randomly disperses into any of the other communities. finally, the dead individual is replaced by one new individual from the seeds produced and dispersed. the probability that an individual of a certain species replaces the dead individual is defined by the number of seeds of that species and the suitability of the habitat. steps between the disturbance event and the establishment of a new individual constitute a single ‘generation’. community composition is evaluated after a large number of generations. the strength of the environmental filtering can be adjusted by the sigma parameter. when sigma is 0, all species are maximally specialist, which means that they can occur only at the optimum point of the gradient (that is, at the exact value for the environmental trait). if sigma is infinity, species are maximally generalist and all points along the environmental gradient are equally suitable for them. therefore, sigma is the parameter which defines the suitability of each point of the gradient for each species based on its distance from the respective optima. we generated data sets with sigma values . , . , . , . , and in order to simulate situations with different strength of environmental filtering.
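the role of sigma described above can be illustrated with a minimal sketch. the gaussian form used here is our assumption for illustration only (the exact suitability function of comsimitv is not reproduced here), and the function name `suitability` is hypothetical. it only shows how sigma = 0 yields maximal specialists and sigma = infinity yields maximal generalists.

```python
import math


def suitability(gradient_position, env_trait, sigma):
    """Illustrative suitability of a site for a species.

    ASSUMPTION: a Gaussian curve centred on the species' environmental
    trait value; this is a sketch, not the comsimitv implementation.
    """
    if sigma == 0:
        # maximal specialist: suitable only exactly at the optimum
        return 1.0 if gradient_position == env_trait else 0.0
    if math.isinf(sigma):
        # maximal generalist: every point of the gradient is equally suitable
        return 1.0
    distance = gradient_position - env_trait
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))


# a narrower niche (smaller sigma) penalises the same distance more strongly
narrow = suitability(0.2, 0.5, sigma=0.05)
wide = suitability(0.2, 0.5, sigma=0.5)
```

with these inputs, `narrow` is much smaller than `wide`, which mirrors the statement that small sigma values correspond to strong environmental filtering.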
the number of communities was , each community comprised individuals, the number of species in the species pool was , the simulation iterated for generations, and we allowed no intraspecific trait variation. for all the other parameters, we used the default options. however, what real situations the six simulated levels of environmental filtering represent needs further explanation. to provide a reference and assist interpretation, we calculated two species-based beta-diversity measures, the multiplicative beta (whittaker ) and the gradient length of the first axis of a detrended correspondence analysis (dca) ordination (hill & gauch ; appendix s , fig. s . ). the former gives the number of distinct communities present in the total species pool of the gradient, while the latter is the minimal number of average niche breadths (also called turnover units) necessary for covering the total gradient length. moreover, we plotted the abundance of species in the sample units along the gradient as a visual tool for assessing gradient length (appendix s , fig. s . ). all these methods indicated that with sigma = . the gradient is extremely long: there are more than distinct communities and near turnover units along the gradient. samples with such high beta diversity are very rare and special in real ecological research; therefore, findings from simulations with sigma = . are mostly of theoretical importance. beta diversity values from sigma = . to sigma = are more similar to real study situations, hence they should be more relevant for practice. at sigma = , environmental filtering is practically not operating; between-community variation is driven by interspecific relations and chance.
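as a brief illustration of the first reference measure, whittaker's multiplicative beta is the total (gamma) species richness divided by the mean local (alpha) richness. the sketch below uses presence/absence only; the function name and the toy communities are hypothetical and are not taken from the simulated data.

```python
def whittaker_beta(communities):
    """Multiplicative beta: gamma richness divided by mean alpha richness.

    `communities` is a list of collections of species names
    (presence/absence only; duplicates within a community are ignored).
    """
    species_sets = [set(c) for c in communities]
    gamma = len(set().union(*species_sets))
    mean_alpha = sum(len(s) for s in species_sets) / len(species_sets)
    return gamma / mean_alpha


# two communities sharing no species behave like two distinct communities
disjoint = whittaker_beta([{"sp1", "sp2"}, {"sp3", "sp4"}])
# two identical communities behave like a single community
identical = whittaker_beta([{"sp1", "sp2"}, {"sp1", "sp2"}])
```

the two toy cases bracket the interpretation used in the text: the value can be read as the number of distinct communities along the gradient.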
we calculated between-species dissimilarities as the gower distance between their environmental trait values, which in this case equals the euclidean distance scaled to [0; 1]. these distances had to be transformed to similarities according to the requirements of the fdissim indices. several formulae are available for this; however, they may assume different functional relationships between similarity and distance. one formula we used is the linear transformation, similarity = 1 − distance. besides this, we also used similarity = e^(−u × distance), which supposes a curvilinear function between similarity and distance (leinster & cobbold ). with this exponential formula, it is possible to weight the importance of small gower distances between species relative to large distances. by changing the parameter u it is possible to adjust how steeply similarity decreases with increasing distance. we set u = which leads to a relatively steep decline. although the minimal value for similarity is higher than zero after this transformation, we considered it negligibly low (e- ≈ . ) so we did not apply the transformation proposed by botta-dukát ( ). for all fdissim indices where it was necessary we used the similarity matrix or a dissimilarity matrix calculated as dissimilarity = 1 − similarity as input. the dissimilarity matrix is identical with the gower distance matrix if the similarities were calculated in a linear way, but in the other case, it keeps the exponential relationship between distance and (dis-)similarity. dissimilarity matrices were calculated for the four community data sets with different sigma values, with the two functions transforming gower distances, and across a broad range of
available fdissim indices. for indices where absolute or relative abundances could have been taken into account, we opted for relative abundance for the sake of better comparability. with generalized_tradidiss, we calculated the ‘even’ and the ‘uneven’ weighting versions. the entire analysis was run with abundance and presence/absence data. some fdissim indices are only suitable for binary data, thus the number of indices applied for relative abundance and binary data were and , respectively. in cases of indices handling both data types, we used exactly the same version of the index as with abundance data, hence communities with different numbers of species were given equal weight due to division by community totals. additionally, dissimilarity matrices were also calculated using the bray-curtis index (for binary data: sørensen index in dissimilarity form) to provide a contrast against the case disregarding between-species dissimilarities. then, for each dissimilarity matrix, we conducted two types of analyses. firstly, we compared how strongly the dissimilarity indices correlate with the environmental distance using kendall tau rank correlation. this gives an estimate of how well a dissimilarity index reveals the monotonic relationship between trait composition of local communities and the environmental gradient. we visually assessed the shape of the relationship between dissimilarity and environmental distance in the case of the lowest sigma (i.e., longest gradient), when the distortion of the linear relationship between the two is supposed to be the strongest. then, to disentangle the effects of different methodological decisions and the sigma parameter on the correlation between fdissim indices and environmental distance, we fitted a random forest model. in this model the dependent variable was the kendall tau correlation coefficient, while the independent variables were the sigma, the data type (abundance vs.
presence/absence), the transformation method for gower distances (linear vs. exponential), and the fdissim method. within approaches, fdissim methods were often strongly correlated, which resulted in very similar kendall’s tau values. therefore, only the sørensen/bray-curtis versions of dsimcom, dissabc, paddis/dcw, generalized_tradidiss with uneven weights, as well as βturnover, cwmdis, and the cdfdis were included into this analysis. variable importance scores (vis) in the random forest were estimated by the permutation approach based on mean decrease in log-likelihood using the varimp function of the partykit package. the effects of the model terms were also illustrated by heat-maps. all statistical analyses were done in r (r core team ) using the fd (laliberté & legendre , laliberté et al. ), adiv (pavoine a,b), comsimitv (botta-dukát ), vegan (oksanen et al. ), desctools (signorell et al. ), and partykit (hothorn et al. , strobl et al. , strobl et al. , hothorn & zeileis ) packages.

results

kendall tau correlation coefficients decreased as the strength of environmental filtering decreased (that is, with increasing sigma) in all examined cases. for fdissim indices which handled both data types, presence/absence data resulted in lower correlations than abundance data for all indices. for most indices, this difference was highest at intermediate values for sigma. these trends were consistent between the linear and the exponential transformations. correlations for all indices at all sigma values with linear transformation are shown in table for abundance data and in table for presence/absence data.
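to make the quantities behind these correlations concrete, the following minimal sketch shows the two distance-to-similarity transformations named in the methods and a plain rank correlation between dissimilarity and environmental distance. the function names are hypothetical, the steepness parameter `u` is left without a default because its value is not recoverable here, and the tau-b variant of kendall's coefficient is our assumption rather than a confirmed detail of the analysis.

```python
import math
from itertools import combinations


def linear_similarity(distance):
    """similarity = 1 - distance, for distances scaled to [0, 1]."""
    return 1.0 - distance


def exponential_similarity(distance, u):
    """similarity = exp(-u * distance); larger u gives a steeper decline."""
    return math.exp(-u * distance)


def kendall_tau_b(x, y):
    """Kendall rank correlation (tau-b, ASSUMED variant), O(n^2) sketch."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    denominator = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denominator if denominator else float("nan")


# toy example: dissimilarity rises monotonically with environmental distance
env_distance = [0.1, 0.2, 0.3, 0.4]
dissimilarity = [0.05, 0.20, 0.45, 0.50]
tau = kendall_tau_b(env_distance, dissimilarity)
```

because kendall's tau depends only on ranks, any strictly monotonic distortion of the dissimilarity values (such as an asymptotic curve) leaves it unchanged, which is why the shape of the relationship is examined separately from the correlation value.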
in most simulation scenarios, the fdissim indices correlated more strongly with the environmental gradient than the species-based bray-curtis index. however, on several occasions, indices belonging to the nearest neighbour family performed worse than the species-based dissimilarity. notably, at the highest sigma and with presence/absence data, all indices showed correlation near zero, but among them the bray-curtis index had the highest correlation with environmental distance. as expected, we found perfect rank correlations among jaccard, sørensen, sokal-sneath and whittaker’s beta versions of dsimcom, among jaccard, sørensen and sokal-sneath forms of dissabc, between dip and sørensen form of paddis (only for presence-absence data), between dcw and kulczynski form of paddis (only for presence-absence data), and between dip and dcw (for abundance data type). dissimilarity indices showed various shapes of relationship with environmental distance (appendix s ). at strongest environmental filtering, all fdissim indices had dissimilarity values near zero at minimal environmental distance; only the species-based bray-curtis index had dissimilarity near . at the smallest environmental distances. in case of linear transformation of gower distances and presence/absence data, an approximately linear relationship was found for cwmdis, cdfdis, dq, sørensen and ochiai forms of dsimcom, jaccard form of dissabc, marczewski-steinhaus form of generalized_tradidiss with both weighting versions, βheterogeneity and βsegregation; most other indices showed only a small degree of distortion of the linear function (figure s . ).
an exponential relationship was found for the evenness-based (pe) form of generalized_tradidiss. notably, the taxon-based bray-curtis index had the steepest asymptotic function among all. in case of exponential transformation all other indices relying on between-species dissimilarities showed an asymptotic curve (figure s . ). in the random forest, niche width (that is, sigma) acquired by far the highest variable importance score (vis= . ). the less important variables were the data type (vis= . ), the dissimilarity method (vis= . ) and the transformation (vis=- . ). the heat map (figure ) also revealed a strong decrease in correlation along increasing sigma. it also clearly shows that in most cases abundance data resulted in significantly higher correlation than presence/absence data. the difference between linear and exponential transformation methods was not always visible. regarding variation between dissimilarity indices, the most striking pattern was the relatively poor performance of the paddis/dcw indices. all indices but the latter, combined with abundance data and linear transformation of dissimilarities, led to the highest correlation with environmental distance.

discussion

general patterns in the correlation with environmental distance

we ran different simulation scenarios with varying strength of environmental filtering. we expected the correlation between fdissim indices and environmental distance to be the highest when the environmental filtering is the strongest, and the correlation to become neutral when environmental filtering is not effective.
when environmental filtering was strongest (that is, minimal overlap of species niches along the environmental gradient), all fdissim indices correlated highly with the environmental gradient. as expected, correlation between trait dissimilarity and environmental distance decreased as filtering weakened; moreover, differences between families of indices became more apparent. this result suggests that all tested methods are able to reveal strong environmental filtering processes. as the contribution of competitive exclusion and stochastic processes approaches or overrides environmental filtering, the correlation between fdissim indices and the background gradient becomes weaker. this decrease itself is not a drawback of the fdissim methods, rather it is a consequence of our study design, since we applied a series of scenarios where the effect of niche filtering was decaying. however, we think that the degree of the decrease reflects the sensitivity of the fdissim indices to the underlying trait-environmental relationship. indices that showed high correlation with environmental distance could be capable of revealing the environmental signal even when it is weak. actually, in our tests, most indices reached similarly high correlation, and there were only a few combinations of simulation parameters which resulted in a decreased correlation with environmental distance for some dissimilarity indices.

determinants of the correlation based on the random forest model

the random forest model revealed that the effect of gradient length is the most important determinant of the correlation between dissimilarity and environmental distance, while methodological decisions had much lower variable importance.
these observations suggest that the absolute value of the correlation between dissimilarity and environmental distance is primarily dependent on the sample in hand, and can be influenced by methodological decisions to a limited extent. correlations were stronger with abundance than with presence/absence data. this finding is at least partly attributable to our simulation design where community composition was driven by individual-based processes: birth, fitness difference, reproduction, and death. as a result, species relative abundances had to be proportional to their environmental suitability in the local community. transforming such data to binary scale loses meaningful information and weakens the correlation between dissimilarity in trait composition and environmental background. in cases when presences and absences of species respond more robustly to the main environmental gradient, while relative abundances change stochastically, or abundance estimations are inaccurate, the binary data type might be more straightforward. transforming between-species dissimilarities has the potential to conform to distributional requirements, to approximate expert intuitions about relatedness of species or to customize sensitivity to functional difference with respect to specific research aims. for most indices across the tested range of gradient length and data type, the exponential transformation resulted in a somewhat lower correlation than the linear transformation. more insight is provided by examining the shape of the relationships besides the pure correlation value. after
linear transformation of gower distances, most dissimilarity indices showed a linear or slightly curved function along environmental distance, although the scatter of the evenness-based generalized_tradidiss differed considerably from the straight line towards an exponentially increasing one. after exponential transformation of between-species trait dissimilarities, all indices in the direct dissimilarity-based class showed a rather steeply increasing asymptotic function. this result suggests that with the exponential transformation of between-species dissimilarities, it is possible to make fdissim indices more sensitive to smaller differences in functional composition. certainly, summary-based indices (cwmdis, cdfdis) are not affected by this transformation, since they are not based on between-species dissimilarities.

comparison of taxon-based vs. trait-based dissimilarity

the basic assumption of functional ecology is that the traits of individuals should be in closer relationship with ecological properties than their taxonomical status. following this argument, we expected that trait-based dissimilarity measures correlate more strongly with the environmental background than species-based indices. in contrast, higher correlation of species-based dissimilarity than trait-based dissimilarity indicates loss of information with the introduction of between-species similarity, which is nonsensical since our data was simulated in a way to possess a strong pattern in trait-environment relationship. we used the sørensen/bray-curtis index in a dissimilarity form as a reference method representing species-based dissimilarity calculations disregarding traits. our expectation was fulfilled by all indices with the exception of the members of the nearest neighbour family (dip, dcw and paddis). we suspect two potential reasons behind the low performance of these latter groups of indices.
the first one is the improper scaling factor used for standardizing the ‘operational part’ of the indices (see the description of the paddis family and the discussion about it under the paragraph “within-family variation of indices”). second, these indices rely on the quantities of minimally different species in the two communities under comparison. however, the minimum is a less robust descriptor of any sample distribution because of its dependency on sampling error; therefore, it might provide a poor representation of total community dissimilarity. although we did not include dissimilarity values at exactly zero distance, the y-intercept (also called ‘nugget’) of the dissimilarity vs. environmental distance functions can be extrapolated with negligible error (fortin & dale ). brownstein et al. ( ) argued that the nugget of the distance decay relationship is a direct estimate of the amount of chance in the variation between local communities. in this respect, it is worth noting that the nugget with the species-based bray-curtis index was near . , while with all trait-based indices the nugget was near zero. this suggests that without accounting for species similarities, environmental distance between communities can be overestimated due to similar species replacing each other.

within-family variation of indices

the perfect correlation between jaccard, sørensen and sokal-sneath forms of dsimcom and dissabc families was expected, since the original, taxon-based jaccard, sørensen and sokal-sneath indices are algebraically related, too (janson & vegelius ). however, for paddis jaccard, sørensen and sokal-sneath forms showed correlation below .
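the algebraic relationship behind the expected perfect rank correlation can be checked numerically. the sketch below uses the standard taxon-based dissimilarity forms built from the contingency-table quantities a (shared species), b and c (species unique to each community); the function names are ours. jaccard and sokal-sneath dissimilarities are strictly increasing functions of the sørensen dissimilarity, so all three rank community pairs identically whenever a, b and c are defined consistently.

```python
import math


def sorensen_dissimilarity(a, b, c):
    return (b + c) / (2 * a + b + c)


def jaccard_dissimilarity(a, b, c):
    return (b + c) / (a + b + c)


def sokal_sneath_dissimilarity(a, b, c):
    return 2 * (b + c) / (a + 2 * b + 2 * c)


def check_relations(a, b, c):
    """True if the monotone conversions between the three indices hold."""
    d_s = sorensen_dissimilarity(a, b, c)
    d_j = jaccard_dissimilarity(a, b, c)
    d_ss = sokal_sneath_dissimilarity(a, b, c)
    jaccard_from_sorensen = 2 * d_s / (1 + d_s)
    sokal_sneath_from_jaccard = 2 * d_j / (1 + d_j)
    return (math.isclose(d_j, jaccard_from_sorensen)
            and math.isclose(d_ss, sokal_sneath_from_jaccard))


# toy contingency tables (hypothetical counts)
all_hold = all(check_relations(a, b, c)
               for a, b, c in [(5, 2, 3), (1, 4, 4), (10, 0, 1), (3, 7, 0)])
```

the conversions rely on the numerator and the denominator being built from the same a, b and c; an index that mixes differently defined quantities in its two parts is not covered by this argument.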
in this family, the B and C components of the 2 × 2 contingency table are defined as measurable quantities with clear interpretation: the sum of species uniqueness values within each community. the total diversity (A+B+C) is defined to be equal with the species richness of the pooled pair of communities (a+b+c), and the quantity A is derived by subtracting (B+C) from it. with this definition, A remains a virtual quantity with no biological interpretation. in paddis indices, trait-based quantities B and C appear in the numerator (the ‘operational part’ sensu ricotta et al. ) of the indices, while in the denominators (i.e., in the ‘scaling factor’) the taxon-based quantities, a, b and c are used. we argue that the inconsistent behaviour of paddis is due to the application of taxon-based quantities for scaling factors of trait-based operational parts. at the same time, we acknowledge that we also see no obvious solution to define total diversity or shared diversity according to the uniqueness-based idea behind paddis in a more realistic way. in the generalized_tradidiss family, the trait-based analogue of bray-curtis index can be achieved by calculating generalized canberra distance with uneven weighting of species. we expected this to be perfectly correlated with marczewski-steinhaus form of generalized_tradidiss index with uneven weighting, since bray-curtis and marczewski-steinhaus indices are the abundance forms of sørensen and jaccard indices, respectively. however, the correlation between them was lower. in the generalized_tradidiss family, between-community dissimilarity is calculated as a weighted sum of standardized differences in species ordinariness values.
species ordinariness is calculated on the basis of species abundance and trait values; however, weights used for adjusting species-level contributions are derived solely from abundances. therefore, generalized_tradidiss also follows a ‘hybrid’ approach in accounting for taxon-based vs. trait-based information. we argue that this is the reason why the algebraic relationships between the original sørensen and jaccard indices do not apply to its sørensen/bray-curtis-type and jaccard/marczewski-steinhaus-type forms. to sum up, we point to our observation that jaccard, sørensen and sokal-sneath forms of certain families of indices do not satisfy the algebraic relationships they are supposed to, opening space for potential confusion. these algebraic relations hold only if a, b and c quantities are explicitly and consistently defined. families of fdissim indices combine abundance difference of species between plots and interspecific trait differences in a unique way, while indices belonging to the same family differ in how they relate this amount of ‘unshared’ variation (summarized as the b and c portions of the contingency table) to the shared (a) variation. some indices are able to handle abundances either as absolute or relative abundance (e.g. dsimcom, generalized_tradidiss, dissabc), while others divide absolute abundances by their sum over the respective community, thus they work only with relative abundances. when indices in the former group are set to consider absolute abundances, they become sensitive to variation in the summed abundances of the communities under comparison.
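the sensitivity to summed abundances can be seen with the taxon-based bray-curtis index as a simple stand-in (the fdissim indices themselves are not reimplemented here; the function names and the toy abundances are hypothetical). two communities with identical species proportions but different totals are dissimilar on absolute abundances and identical on relative abundances.

```python
def relative_abundances(abundances):
    """Divide each abundance by the community total."""
    total = sum(abundances)
    return [x / total for x in abundances]


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    numerator = sum(abs(xi - yi) for xi, yi in zip(x, y))
    return numerator / (sum(x) + sum(y))


# same proportions (50 % / 50 %), but community y has twice as many individuals
community_x = [10, 10]
community_y = [20, 20]

absolute = bray_curtis(community_x, community_y)
relative = bray_curtis(relative_abundances(community_x),
                       relative_abundances(community_y))
```

here `absolute` is one third while `relative` is zero, so the whole dissimilarity on the absolute scale stems from the difference in total abundance.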
to place our tests on a common ground, we simulated communities with equal total number of individuals, and set all indices, where relevant, to work with relative abundances. hence, we removed the effect of differences in total abundance. the constant number of individuals might have increased the similarity between fdissim indices belonging to the same family and the correlation with the environmental gradient. the sum of abundances, on whatever quantitative scale it is measured, may vary considerably in real study situations due to aggregated distribution of individuals or uneven sampling effort. therefore, our findings are more likely valid for settings when the sum of abundances is relatively stable, e.g. when sampling effort is controlled and individuals are dispersed evenly, or when abundances are recorded on percentage scale.

limitations of our study

in our study, we simulated a research situation in a simplistic way. we applied only one environmental gradient which operated as an environmental filter driving convergence on a single trait. besides this, we applied another trait which was constantly affected by a low level of competitive exclusion. these two traits were uncorrelated. nevertheless, there was some effect of random drift on community composition due to the probabilistic components of the simulation algorithm. we varied the strength of environmental filtering, thus it had different relative contribution compared with competitive exclusion and stochasticity. in real research situations local trait composition is influenced by a wide range of processes, including several abiotic and biotic filters acting simultaneously. unless they are manipulated as parts of an experimental system, the full set of such filters is usually unknown to the researchers. the
multiplicity of filters may reduce the ability of fdissim indices to recover trait-environment relationships. further research should clarify how increasing complexity of the sample affects the behaviour of fdissim indices.

conclusions

considering the diversity of concepts they are built upon, fdissim indices showed unexpectedly low variation in performance. cwmdis, dsimcom, generalized_tradidiss acquired the highest correlation with environmental distance in all simulation scenarios, therefore they seem to be equally suitable for quantifying pairwise beta diversity based on traits. nevertheless, the most important determinant of the matching between trait-based dissimilarity and environmental distance is the length of the trait gradient. besides this, the data type (presence/absence vs. abundance) also affected the correlation more strongly than the choice of fdissim method. extending the comparative tests of fdissim measures to more complex gradients and real data sets could offer further insight into their behaviour.

data availability

simulated data was generated using the comsimitv r package. our own functions for functional dissimilarity indices are made available through the zenodo public repository: . /zenodo. .

author contributions

a.l. designed and carried out the analysis and led the writing; z.b.d. discussed the concept and the results, wrote parts of and commented on the manuscript.

references

anderson, m. j., crist, t. o., chase, j. m., vellend, m., inouye, b. d., freestone, a. l., sanders, n. j., cornell, h. v., comita, l. s., davies, k. f., harrison, s. p., kraft, n. j. b., stegen, j. c. & swenson, n. g. ( ).
navigating the multiple meanings of β diversity: a roadmap for the practicing ecologist. ecology letters, ( ), - . doi: . /j. - . . .x
anderson, m. j., ellingsen, k. e. & mcardle, b. h. ( ). multivariate dispersion as a measure of beta diversity. ecology letters, ( ), - . doi: . /j. - . . .x
baselga, a. & leprieur, f. ( ). comparing methods to separate components of beta diversity. methods in ecology and evolution, : - . doi: . / - x.
botta-dukát, z. & czúcz, b. ( ). testing the ability of functional diversity indices to detect trait convergence and divergence using individual-based simulation. methods in ecology and evolution, , - . https://doi.org/ . / - x.
botta-dukát, z. ( ). rao's quadratic entropy as a measure of functional diversity based on multiple traits. journal of vegetation science, , - . https://doi.org/ . /j. - . .tb .x
botta-dukát, z. ( ). the generalized replication principle and the partitioning of functional diversity into independent alpha and beta components. ecography, : - . doi: . /ecog.
botta-dukát, z. ( ). comsimitv: flexible framework for simulating community assembly. r package version . . . https://cran.r-project.org/package=comsimitv
brownstein, g., steel, j.b., porter, s., gray, a., wilson, c., wilson, p.g. & wilson, j. b. ( ). chance in plant communities: a new approach to its measurement using the nugget from spatial autocorrelation. journal of ecology, , - . https://doi.org/ . /j. - . . .x
cardoso, p., rigal, f., carvalho, j.c., fortelius, m., borges, p.a.v., podani, j. & schmera, d. ( ). partitioning taxon, phylogenetic and functional beta diversity into replacement and richness difference components. journal of biogeography, , - . doi: . /jbi.
carmona, c.
p., de bello, f., mason, n. w. h. & lepš, j. ( ). traits without borders: integrating functional diversity across scales. trends in ecology and evolution, ( ), - . doi: . /j.tree. . .
champely, s. & chessel, d. ( ). measuring biological diversity using euclidean metrics. environmental and ecological statistics, , – . https://doi.org/ . /a:
chao, a., chiu, c. & hsieh, t.c. ( ). proposing a resolution to debates on diversity partitioning. ecology, , - . https://doi.org/ . / - .
chao, a., chiu, c.-h., villéger, s., sun, i-f., thorn, s., lin, y.-c., chiang, j.-m. & sherwin, w. b. ( ). an attribute-diversity approach to functional diversity, functional beta diversity, and related (dis)similarity measures. ecological monographs, ( ), e . . /ecm.
chiu, c.-h., jost, l. & chao, a. ( ). phylogenetic beta diversity, similarity, and differentiation measures based on hill numbers. ecological monographs, ( ), - .
clarke, k.r. & warwick, r.m. ( ). quantifying structural redundancy in ecological communities. oecologia, ( ), - .
de bello, f., carmona, c.p., mason, n.w.h., sebastià, m.-t. & lepš, j. ( ). which trait dissimilarity for functional diversity: trait means or trait overlap? journal of vegetation science, , - . doi: . /jvs.
de bello, f., lepš, j., lavorel, s. & moretti, m. ( ). importance of species abundance for assessment of trait composition: an example based on pollinator communities. community ecology, ( ), – . https://doi.org/ . /comec. . . .
díaz, s. & cabido, m. ( ). vive la différence: plant functional diversity matters to ecosystem processes. trends in ecology and evolution, ( ), – . https://doi.org/ . /s - ( ) -
faith, d. p., minchin, p. r. & belbin, l. ( ).
compositional dissimilarity as a robust measure of ecological distance. vegetatio, , - .
fortin, m.-j. & dale, m.r.t. ( ). spatial data analysis: a guide for ecologists. cambridge university press, cambridge.
garnier, e., cortez, j., billès, g., navas, m., roumet, c., debussche, m., laurent, g., blanchard, a., aubry, d., bellmann, a., neill, c. & toussaint, j. ( ). plant functional markers capture ecosystem properties during secondary succession. ecology, , - . doi: . / -
gregorius, h.-r., gillet, e.m. & ziehe, m. ( ). measuring differences of trait distributions between populations. biometrical journal, , - . https://doi.org/ . /bimj.
grime, j. p. ( ). benefits of plant diversity to ecosystems: immediate, filter and founder effects. journal of ecology, , – .
hawkins, b.a., leroy, b., rodríguez, m.Á., singer, a., vilela, b., villalobos, f., wang, x. & zelený, d. ( ). structural bias in aggregated species-level variables driven by repeated species co-occurrences: a pervasive problem in community and assemblage data. journal of biogeography, , - .
hérault, b. & honnay, o. ( ). using life-history traits to achieve a functional classification of habitats. applied vegetation science, ( ), – . https://doi.org/ . /j. - x. .tb .x
hill, m. o. & gauch, h. g. ( ). detrended correspondence analysis: an improved ordination technique. vegetatio, , – .
hill, m. o. ( ). diversity and evenness: a unifying notation and its consequences. ecology, ( ), – .
hitchcock, f.l. ( ). distribution of a product from several sources to numerous localities. journal of mathematical physics, : - .
hothorn, t., hornik, k., van de wiel, m. a. & zeileis, a. ( ). a lego system for conditional inference.
the american statistician, ( ), – .
hothorn, t. & zeileis, a. ( ). partykit: a modular toolkit for recursive partytioning in r. journal of machine learning research, , - . url http://jmlr.org/papers/v /hothorn a.html
izsák, c. & price, r. g. ( ). measuring b-diversity using a taxonomic similarity index, and its relation to spatial scale. marine ecology progress series, , – .
janson, s. & vegelius, j. ( ). measures of ecological association. oecologia, ( ), - .
jost, l. ( ). partitioning diversity into independent alpha and beta components. ecology, , – .
kleyer, m., dray, s., bello, f., lepš, j., pakeman, r.j., strauss, b., thuiller, w. & lavorel, s. ( ). assessing species and community functional responses to environmental gradients: which multivariate methods? journal of vegetation science, , - . doi: . /j. - . . .x
koleff, p., gaston, k. j. & lennon, j. j. ( ). measuring beta diversity for presence–absence data. journal of animal ecology, , - . doi: . /j. - . . .x
laliberté, e. & legendre, p. ( ). a distance-based framework for measuring functional diversity from multiple traits. ecology, , - .
laliberté, e., legendre, p. & shipley, b. ( ). fd: measuring functional diversity from multiple traits, and other tools for functional ecology. r package version . - .
legendre, p. & legendre, l. ( ). numerical ecology. elsevier, amsterdam, nl.
legendre, p. & de cáceres, m. ( ). beta diversity as the variance of community data: dissimilarity coefficients and partitioning. ecology letters, , – .
leinster, t. & cobbold, c.a. ( ). measuring diversity: the importance of species similarity. ecology, , - . doi: . / - .
lengyel, a. & podani, j. ( ).
assessing the relative importance of methodological decisions in classifications of vegetation data. journal of vegetation science, , - . doi: . /jvs.
lengyel, a., swacha, g., botta-dukát, z. & kacki, z. ( ). trait-based numerical classification of mesic and wet grasslands in poland. journal of vegetation science, , – . https://doi.org/ . /jvs.
lepš, j., de bello, f., lavorel, s. & berman, s. ( ). quantifying and interpreting functional diversity of natural communities: practical considerations matter. preslia, , – .
macarthur, r. & levins, r. ( ). the limiting similarity, convergence, and divergence of coexisting species. american naturalist, , – .
mason, n. w. h., mouillot, d., lee, w. g. & wilson, j. b. ( ). functional richness, functional evenness and functional divergence: the primary components of functional diversity. oikos, , - . doi: . /j. - . . .x
mcgill, b., enquist, b. j., weiher, e. & westoby, m. ( ). rebuilding community ecology from functional traits. trends in ecology and evolution, ( ), - .
mouchet, m.a., villéger, s., mason, n.w.h. & mouillot, d. ( ). functional diversity measures: an overview of their redundancy and their ability to discriminate community assembly rules. functional ecology, , - . doi: . /j. - . . .x
mouillot, d., stubbs, w., faure, m., dumay, o., tomasini, j.a., wilson, j.b. & chi, t.d. ( ). niche overlap estimates based on quantitative functional traits: a new family of non-parametric indices. oecologia, , – .
muscarella, r. & uriarte, m. ( ). do community-weighted mean functional traits reflect optimal strategies? proceedings of the royal society b, , . https://doi.org/ . /rspb. .
nipperess, d.a., faith, d.p. & barton, k. ( ). resemblance in phylogenetic diversity among ecological assemblages. journal of vegetation science, , - . doi: . /j. - . . .x
oksanen, j., blanchet, f.g., friendly, m., kindt, r., legendre, p., mcglinn, d., minchin, p. r., o'hara, r. b., simpson, g. l., solymos, p., stevens, m. h. m., szoecs, e. & wagner, h. ( ). vegan: community ecology package. r package version . - . https://cran.r-project.org/package=vegan
pavoine, s. & ricotta, c. ( ). functional and phylogenetic similarity among communities. methods in ecology and evolution, , - .
pavoine, s. & ricotta, c. ( ). measuring functional dissimilarity among plots: adapting old methods to new questions. ecological indicators, , - .
pavoine, s. ( ). clarifying and developing analyses of biodiversity: towards a generalisation of current approaches. methods in ecology and evolution, , - . doi: . /j. - x. . .x
pavoine, s. ( ). a guide through a family of phylogenetic dissimilarity measures among sites. oikos, , - . doi: . /oik.
pavoine, s. ( ). adiv: an r package to analyse biodiversity in ecology. methods in ecology and evolution, , – . https://doi.org/ . / - x.
peres-neto, p.r., dray, s. & ter braak, c.j.f. ( ). linking trait variation to the environment: critical issues with community-weighted mean correlation resolved by the fourth-corner approach. ecography, , - .
petchey, o. l. & gaston, k. j. ( ). functional diversity: back to basics and looking forward. ecology letters, , - . doi: . /j. - . . .x
podani, j.
& schmera, d. ( ). a new conceptual and methodological framework for exploring and explaining pattern in presence–absence data. oikos, , - . doi: . /j. - . . .x
podani, j. ( ). introduction to the exploration of multivariate biological data. backhuys, leiden, nl.
r core team ( ). r: a language and environment for statistical computing. r foundation for statistical computing, vienna, austria. https://www.r-project.org/.
rao, c. r. ( ). diversity and dissimilarity coefficients: a unified approach. theoretical population biology, , - .
ricotta, c. & burrascano, s. ( ). beta diversity for functional ecology. preslia, , – .
ricotta, c. & bacaro, g. ( ). on plot-to-plot dissimilarity measures based on species functional traits. community ecology, , – .
ricotta, c. & podani, j. ( ). on some properties of the bray-curtis dissimilarity and their ecological meaning. ecological complexity, , – .
ricotta, c. & pavoine, s. ( ). measuring similarity among plots including similarity among species: an extension of traditional approaches. journal of vegetation science, , - . doi: . /jvs.
ricotta, c. ( ). of beta diversity, variance, evenness, and dissimilarity. ecology and evolution, , – . https://doi.org/ . /ece .
ricotta, c. ( ). a family of (dis)similarity measures based on evenness and its relationship with beta diversity. ecological complexity, , - . doi: . /j.ecocom. . .
ricotta, c., bacaro, g., caccianiga, m., cerabolini, b.e.l. & moretti, m. ( ). a classical measure of phylogenetic dissimilarity and its relationship with beta diversity. basic and applied ecology, ( ), - . https://doi.org/ . /j.baae. . .
ricotta, c., podani, j. & pavoine, s. ( ).
a family of functional dissimilarity measures for presence and absence data. ecology and evolution, , – . doi: . /ece .
schmidt, t., matias rodrigues, j. & von mering, c. ( ). a family of interaction-adjusted indices of community similarity. isme journal, , – . https://doi.org/ . /ismej. .
signorell, a. et mult. al. ( ). desctools: tools for descriptive statistics. r package version . . .
strobl, c., boulesteix, a.l., kneib, t., augustin, t. & zeileis, a. ( ). conditional variable importance for random forests. bmc bioinformatics, ( ). http://www.biomedcentral.com/ - / /
strobl, c., boulesteix, a.l., zeileis, a. & hothorn, t. ( ). bias in random forest variable importance measures: illustrations, sources and a solution. bmc bioinformatics, , . http://www.biomedcentral.com/ - / /
swenson, n. g., anglada-cordero, p. & barone, j. a. ( ). deterministic tropical tree community turnover: evidence from patterns of functional beta diversity along an elevational gradient. proceedings of the royal society b, , – .
swenson, n. g. ( ). phylogenetic beta diversity metrics, trait evolution and inferring the functional beta diversity of communities. plos one, ( ), e . https://doi.org/ . /journal.pone.
tamás, j., podani, j. & csontos, p. ( ). an extension of presence/absence coefficients to abundance data: a new look at absence. journal of vegetation science, , - . doi: . /
tuomisto, h. ( a). a diversity of beta diversities: straightening up a concept gone awry. part . defining beta diversity as a function of alpha and gamma diversity. ecography, , - . doi: . /j. - . . .x
tuomisto, h. ( b). a diversity of beta diversities: straightening up a concept gone awry. part . quantifying beta diversity and related phenomena.
ecography, , - . doi: . /j. - . . .x
villéger, s., mason, n.w.h. & mouillot, d. ( ). new multidimensional functional diversity indices for a multifaceted framework in functional ecology. ecology, , - . doi: . / - .
violle, c., navas, m.-l., vile, d., kazakou, e., fortunel, c., hummel, i. & garnier, e. ( ). let the concept of trait be functional! oikos, , - . doi: . /j. - . . .x
whittaker, r. h. ( ). vegetation of the siskiyou mountains, oregon and california. ecological monographs, , – .
whittaker, r. h. ( ). evolution and measurement of species diversity. taxon, , - . doi: . /
zelený, d. ( ). which results of the standard test for community weighted mean approach are too optimistic? journal of vegetation science, , - .

tables and figures

table . similarity and dissimilarity forms of resemblance indices for presence-absence data.
columns: name of the index; similarity version; dissimilarity version.
indices listed: sørensen, ochiai, kulczynski, simpson, jaccard, sokal & sneath.
[the formulas for the similarity and dissimilarity versions were not recoverable from the extracted text]

table . classification of trait-based dissimilarity indices. in the input data type columns, an x indicates whether abundance (a), relative abundance (r) and presence-absence (p/a) data can be used as input.
columns: class; approach; family; references; input data type (a, r, p/a); r function.
summary-based; typical value; cwm-based; ricotta et al. ( ); x x x; fd:::functcomp
distribution-based; cdf-based; appendix s ; x x x; our new functions, see data availability
direct dissimilarity; probabilistic; disc/dq; rao , pavoine & ricotta ( ); x x x; adiv::sq
dsimcom; pavoine & ricotta ( ); x x x; adiv:::dsimcom
ordinariness-based; dissabc; pavoine & ricotta ( ); x x x; adiv:::dissabc
generalized_tradidiss; pavoine & ricotta ( ); x x; adiv:::generalized_tradidiss
diversity partitioning; multiplicative beta; chao et al. ( ); x; our new functions, see data availability
nearest neighbour; dcw, dcw(q); clarke & warwick ( ), ricotta & bacaro ( ); x x; our new functions, see data availability
dip; izsák & price ( ), ricotta & bacaro ( ); x x; our new functions, see data availability
paddis; ricotta et al. ( ); x; adiv:::paddis
classification-based; not discussed; not discussed; hérault & honnay ( ), nipperess et al. ( ), cardoso et al. ( ), pavoine ( )

table . kendall tau correlations between environmental distance and the functional dissimilarity measures at different values of sigma and with abundance data.
type sigma= . sigma= . sigma= . sigma= . sigma= sigma=
cwmdis . . . . . .
cdfdis . . . . . .
d(q) . . . . . .
dsimcom.ss . . . . . .
dsimcom.jac . . . . . .
dsimcom.sor . . . . . .
dsimcom.och . . . . . .
dsimcom.beta . . . . . .
dissabc.jac . . . . . .
dissabc.sor . . . . . .
dissabc.ss . . . . . .
dissabc.och . . . . . .
dissabc.kul . . . . . .
dissabc.si . . . . . .
tradidiss.gc.even . . . . . .
tradidiss.ms.even . . . . . .
tradidiss.pe.even . . . . . .
tradidiss.gc.uneven . . . . . .
tradidiss.ms.uneven . . . . . .
tradidiss.pe.uneven . . . . . .
βturnover . . . . . .
βheterogeneity . . . . . .
βsegregation . . . . . .
dip . . . . . .
dcw . . . . . .
bray-curtis (species-based) . . . . . .

table . kendall tau correlations between environmental distance and the functional dissimilarity measures at different values of sigma and with presence/absence data.
type sigma= . sigma= . sigma= . sigma= . sigma= sigma=
cwmdis . . . . . .
cdfdis . . . . . - .
d(q) . . . . . - .
dsimcom.ss . . . . . - .
dsimcom.jac . . . . . - .
dsimcom.sor . . . . . - .
dsimcom.och . . . . . .
dsimcom.beta . . . . . - .
dissabc.jac . . . . . - .
dissabc.sor . . . . . - .
dissabc.ss . . . . . - .
dissabc.och . . . . . - .
dissabc.kul . . . . . - .
dissabc.si . . . . . .
tradidiss.gc.even . . . . . - .
tradidiss.ms.even . . . . . - .
tradidiss.pe.even . . . . . - .
tradidiss.gc.uneven . . . . . - .
tradidiss.ms.uneven . . . . . - .
tradidiss.pe.uneven . . . . . - .
βturnover . . . . . - .
βheterogeneity . . . . . - .
βsegregation . . . . . - .
dip . . . . . - .
dcw . . . . . - .
paddis.jac . . . . . - .
paddis.sor . . . . . - .
paddis.ss . . . . . - .
paddis.och . . . . . - .
paddis.simp . . . . . .
paddis.kul . . . . . - .
sørensen (species-based) . . . . . .

figure . heat maps showing the interactive effects of niche width (sigma), transformation of between-species dissimilarities (lin = linear, exp = exponential), data type (abund = abundance, p/a = presence/absence), and dissimilarity index ( – cwmdis, – cdfdis, – dq, – dsimcom/sørensen, – dissabc/sørensen, – generalized_tradidiss/generalized canberra, uneven weighting, – βturnover, – dcw) on the correlation with environmental distance.

competitive binding of stats to receptor phospho-tyr motifs accounts for altered cytokine responses in autoimmune disorders

stephan wilmes*, polly-anne jeffrey*, jonathan martinez-fabregas, maximillian hafer, paul fyfe, elizabeth pohler, silvia gaggero, martín lópez-garcía, grant lythe, thomas guerrier, david launay, mitra suman, jacob piehler, carmen molina-parís# and ignacio moraga#

division of cell signalling and immunology, school of life sciences, university of dundee, dundee, uk.
department of applied mathematics, school of mathematics, university of leeds, leeds, uk.
department of biology and centre of cellular nanoanalytics, university of osnabrück, osnabrück, germany.
université de lille, inserm umr cnrs umr –canther and institut pour la recherche sur le cancer de lille (ircl), lille, france.
univ. lille, inserm, chu lille, u - infinite - institute for translational research in inflammation, f- lille, france.
* these authors contributed equally to this work
# these authors share senior authorship

abstract
cytokines elicit pleiotropic and non-redundant activities despite strong overlap in their usage of receptors, jak and stat molecules. we use il- and il- to ask how two cytokines that activate the same signaling pathway can have different biological roles. we found that il- induces more sustained stat phosphorylation than il- , with the two cytokines inducing comparable levels of stat phosphorylation. mathematical and statistical modelling of il- and il- signaling identified stat binding to gp , and stat binding to il- ra, as the main dynamical processes contributing to sustained pstat by il- . mutation of tyr on il- ra decreased il- -induced stat phosphorylation by % but had limited effect on stat phosphorylation. strong receptor/stat coupling by il- initiated a unique gene expression program, which required sustained stat phosphorylation and irf expression and was enriched in classical interferon-stimulated genes. interestingly, the stat/receptor coupling exhibited by il- /il- was altered in patients with systemic lupus erythematosus (sle). il- /il- induced a more potent stat activation in sle patients than in healthy controls, which correlated with higher stat expression in these patients. partial inhibition of jak activation by sub-saturating doses of tofacitinib specifically lowered the levels of stat activation by il- .
our data show that receptor and stat concentrations critically contribute to shaping cytokine responses and generate functional pleiotropy in health and disease.

the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by . international license. this version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint

introduction
il- and il- both have intricate functions regulating inflammatory responses ( ). il- is a hetero-dimeric cytokine comprised of p and ebi subunits ( ). il- exerts its activities by binding gp and il- rα receptor subunits on the surface of responsive cells, triggering the activation of the jak /stat /stat signaling pathway. il- elicits both pro- and anti-inflammatory responses, although the latter activity seems to be the dominant one ( ). il- stimulation inhibits rorgt expression, thereby suppressing th- commitment and limiting subsequent production of pro-inflammatory il- ( , ). moreover, il- induces strong production of anti-inflammatory il- in (tbet+ and foxp -) tr- cells ( - ), further contributing to limiting the inflammatory response. il- engages a hexameric receptor complex comprised of two copies each of il- ra, gp and il- ( ), triggering the activation, as il- does, of the jak /stat /stat signaling pathway. however, in contrast to il- , il- is known as a paradigmatic pro-inflammatory cytokine ( , ). il- inhibits lineage differentiation to treg cells ( ) while promoting th- ( , ), thus supporting its pro-inflammatory role. how il- and il- elicit opposite immuno-modulatory activities despite activating almost identical signaling pathways is currently not completely understood. the relative and absolute stat activation levels seem to have intricate roles, which leads to strong signaling and functional plasticity by cytokines.
although il- robustly activates stat , it is capable of mounting a considerable stat response as well ( ). moreover, in the absence of stat , il- induces a strong stat response comparable to ifng, a prototypic stat activating cytokine ( ). likewise, the absence of stat potentiates the stat response for il- , which normally elicits a strong stat response, enabling it to mount an il- -like response ( ). furthermore, negative feedback mechanisms like socs proteins and phosphatases have been described as critical players influencing stat and stat phosphorylation kinetics and thereby shaping their signal integration for gp -utilizing cytokines ( - ). yet, how all these molecular components are integrated by a given cell to produce the desired response is still an open question. among the il- /il- cytokine family, il- exhibits a unique stat activation pattern. the majority of gp -engaging cytokines preferentially activate stat , with activation of stat being an accessory or balancing component ( , ). il- , however, triggers stat and stat activation with high potency ( ). indeed, different studies have shown that il- responses rely on either stat ( - ) or stat activation ( , ). moreover, recent transcriptomics studies showed that in the absence of stat , il- and il- lost more than % of target gene induction. yet, stat was the main factor driving the specificity of the il- versus the il- response, highlighting a critical interplay of stat and stat engagement ( ). while the biological responses induced by il- and il- have been extensively studied ( , ), the very initial steps of signal activation and kinetic integration by these two cytokines have not been comprehensively analysed. since the different biological outcomes elicited by il- and il- are most likely encoded in the early events of cytokine stimulation, here we specifically aimed to identify the molecular determinants underlying functional selectivity by il- in human t-cells.
we asked how a defined cytokine stimulus is propagated in time over multiple layers of signaling to produce the desired response. to this end, we probed il- and il- signaling at different scales, ranging from cell surface receptor assembly and early stat / effector activation to an unbiased and quantitative multi-omics approach: phospho-proteomics after early cytokine stimulation, kinetics of transcriptomic changes and alteration of the t-cell proteome upon prolonged cytokine exposure.

il- and il- induced similar levels of assembly of their respective receptor complexes, which resulted in comparable phosphorylation of stat by the two cytokines. il- , on the other hand, triggered a more sustained stat phosphorylation. to decipher the molecular events which determine sustained stat phosphorylation by il- , we mathematically modelled the stat and stat signaling kinetics induced by each of these cytokines. we identified differential binding of stat and stat to il- ra and gp , respectively, as the main factor contributing to a sustained stat activation by il- . at the transcriptional level, il- triggered the expression of a unique gene program, which strictly required the cooperative action between sustained pstat and irf expression to drive the induction of an interferon-like gene signature that profoundly shaped the t-cell proteome. interestingly, our mathematical models of il- and il- signaling predicted that changes in receptor and stat expression could fundamentally change the magnitude and timescale of the il- and il- responses.
we found high levels of stat expression in sle patients when compared to healthy donors, which correlated with biased stat responses induced by il- and il- in these patients. strikingly, we could specifically inhibit stat activation by il- using suboptimal doses of the jak inhibitor tofacitinib. this could provide a new strategy to specifically target individual stats engaged by cytokines.

results

il- induces a more sustained stat activation than hypil- in human th- cells

il- and il- are critical immuno-modulatory cytokines. while il- engages a hexameric surface receptor comprised of two molecules of il- ra and two molecules of gp to trigger the activation of stat and stat transcription factors (figure a), il- binds gp and il- ra to trigger activation of the same stat molecules (figure a). despite sharing a common receptor subunit, gp , and activating similar signaling pathways, these two cytokines exhibit non-redundant immuno-modulatory activities, with il- eliciting a potent pro-inflammatory response and il- acting more as an anti-inflammatory cytokine. here, we set out to investigate the molecular rules that determine the functional specificity elicited by il- and il- using human th- cells as a model experimental system. because recombinant expression of human il- is challenging, we recombinantly produced a murine single-chain variant of il- (p and ebi ) which cross-reacts with the human receptors and triggers potent signaling, comparable to the signaling output produced by commercial human il- ( ) (supp. fig. a).
in addition, we have used a linker-connected single-chain fusion protein of il- ra and il- termed hyperil- (hypil- ) ( ) to diminish il- signaling variability due to changes in il- ra expression during t cell activation ( ). cd + t cells from human buffy coat samples were isolated by magnetic activated cell sorting (macs) and grown under th- polarizing conditions. th- cells were then used to study in vitro signaling by il- and il- (supp. fig. b). we took advantage of a barcoding methodology allowing high-throughput multiparameter flow cytometry to perform detailed dose/response and kinetics studies induced by hypil- and il- in th- cells ( ) (supp. fig. b). dose-response experiments with il- and hypil- on th- cells showed concentration-dependent phosphorylation of stat and stat . phosphorylation of stat / was more sensitive to activation by il- with an ec of ~ pm compared to ~ pm for hypil- (figure b). despite this difference in sensitivity, both cytokines yielded the same activation amplitude for pstat . for pstat , however, we observed a significantly reduced maximal amplitude for hypil- relative to il- (figure b). we next performed kinetic studies to assess whether the poor stat activation by hypil- was the result of different activation kinetics. for stat , we saw the peak of phosphorylation after ~ - minutes, followed by a gradual decline. both cytokines exhibited an almost identical sustained pstat profile, with ~ % of activation still seen after h of continuous stimulation. interestingly, il- not only activated stat with higher amplitude but also in a more sustained manner than hypil- (figure c). this could be better appreciated when pstat levels were normalized to maximal mfi for each cytokine, with il- clearly inducing a more sustained phosphorylation of stat than hypil- (supp. fig. c). the same phenotype was observed in other t-cell subsets of activated pbmcs (supp. fig. d).
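the dose-response and normalization steps described above can be illustrated with a minimal sketch. it assumes a simple hill curve and a log-linear interpolation for the half-maximal dose; neither is stated in the text as the method actually used, and the function names, the titration values and the hill coefficient are all hypothetical.

```python
import math


def hill_response(dose, ec50, top, hill=1.0):
    """Response of a simple Hill curve; `top` is the plateau amplitude."""
    if dose <= 0:
        return 0.0
    return top * dose ** hill / (ec50 ** hill + dose ** hill)


def normalize_to_max(values):
    """Scale a time course or dose series so its largest value equals 1."""
    peak = max(values)
    if peak <= 0:
        raise ValueError("series has no positive signal to normalize")
    return [v / peak for v in values]


def estimate_ec50(doses, responses):
    """Dose at half of the largest observed response (log-linear interpolation).

    Assumes doses are positive and ascending and responses rise with dose.
    """
    half = max(responses) / 2.0
    for i in range(len(doses) - 1):
        r0, r1 = responses[i], responses[i + 1]
        if r0 <= half <= r1 and r1 > r0:
            frac = (half - r0) / (r1 - r0)
            log_d0 = math.log10(doses[i])
            log_d1 = math.log10(doses[i + 1])
            return 10 ** (log_d0 + frac * (log_d1 - log_d0))
    raise ValueError("half-maximal response is not bracketed by the data")


# Hypothetical titration in arbitrary units, not data from the study.
doses = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
responses = [hill_response(d, ec50=1.0, top=100.0) for d in doses]
ec50_estimate = estimate_ec50(doses, responses)
```

normalizing each cytokine's time course with `normalize_to_max` before comparison separates differences in amplitude from differences in how long the signal is sustained, which is the distinction the text draws.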
as cell surface gp levels are significantly reduced upon t-cell activation ( ), we next investigated whether the transient stat activation profile induced by hypil- resulted from limited availability of gp . for that we generated an rpe cell clone stably expressing ten times higher levels of gp on its surface (figure d, right panel). stimulation of this rpe clone with hypil- resulted in a more sustained activation of stat , with very little effect on stat activation kinetics when compared to rpe wild type cells, suggesting that gp receptor density does not contribute to the transient stat activation kinetics elicited by hypil- (figure d).

ligand-induced cell-surface receptor assembly by il- and hypil-

we next investigated whether il- and hypil- elicited differential cell surface receptor engagement that could explain their distinct signaling output. for that, we measured the dynamics of receptor assembly in the plasma membrane of live cells by simultaneous dual-colour total internal reflection fluorescence (tirf) imaging. rpe cells were chosen as a model experimental system since they do not express endogenous il- ra (supp. fig. e). we used previously described rpe gp ko cells (supp. fig. a) ( ) to transfect and express tagged variants of il- ra and gp , to allow quantitative site-specific fluorescence cell surface labelling by dye-conjugated nanobodies (nbs) (figure e) as recently described in ( ). for both il- ra and gp we found a random distribution and unhindered lateral diffusion of individual receptor monomers (figure f).
single molecule co-localization combined with co-tracking analysis was then used to identify correlated motion of il- ra and gp , which was taken as a readout for receptor heterodimer formation ( ) (figure f, figure supp. movie ). in the resting state, we did not observe pre-assembly of il- ra and gp . however, after stimulation with il- we found substantial heterodimerization (figure f & g, supp. fig. b, figure supp. movie & ). at elevated laser intensities, bleaching analysis of individual complexes confirmed a one-to-one ( : ) complex stoichiometry of il- ra and gp , whereas single-molecule förster resonance energy transfer (fret) further corroborated close molecular proximity of the two receptor chains (figure h). we also observed association and dissociation events of receptor heterodimers, pointing to a dynamic equilibrium between monomers and dimers as proposed for other heterodimeric cytokine receptor systems ( , ) (figure supp. movie ). to measure homodimerization of gp by hypil- , we stochastically labelled gp with equal concentrations of the same nb species conjugated to either of the two dyes ( ). we saw strong homodimerization of gp after stimulation with hypil- (figure g, supp. fig. b, figure supp. movie ). homodimerization was confirmed either by single-color dual-step bleaching or dual-color single-step bleaching as shown for other homodimeric cytokine receptors (supp. fig. c) ( ). for both cytokine receptor systems, we saw a cytokine-induced reduction of the diffusion mobility, which has been ascribed to increased friction of receptor dimers diffusing in the plasma membrane. however, we note that hypil- stimulation impaired diffusion of gp more strongly than il- did, possibly indicating faster receptor internalization (supp. fig. d). based on the dimerization data, we were able to calculate the two-dimensional equilibrium dissociation constants ($K_D^{2D}$)
according to the law of mass action for a dynamic monomer-dimer equilibrium. for il- -induced heterodimerization of il- ra and gp , we calculated a $K_D^{2D}$ of ~ . µm- . in activated t-cells with high levels and a significant excess of il- ra over gp , this $K_D^{2D}$ ensures strong receptor assembly by il- ( ). the $K_D^{2D}$ for gp homodimerization by hypil- was ~ . µm- . this higher affinity is most likely due to the two high-affinity binding sites engaged in the hexameric receptor complex ( ). however, in t-cells the expression of gp can be particularly low, thus probably limiting hypil- . taken together, these experiments marked ligand-induced receptor assembly as the initial step triggering downstream signaling for both il- and hypil- , with no obvious differences in their receptor activation mechanism which could support the observed more sustained stat activation elicited by il- .

mathematical and statistical analysis of hypil- and il- induced stat kinetic responses

to gain further insight into the molecular rules and kinetics that define il- sustained stat phosphorylation, we developed two mathematical models of the initial steps of hypil- and il- receptor-mediated signaling, respectively. the mathematical model for each cytokine considers the following events: i) cytokine association and dissociation to a receptor chain (figure a, supp. fig. a and b, top panel), ii) cytokine-induced dimer association and dissociation (supp. fig. a and b, bottom panel), iii) stat (or stat ) binding and unbinding to dimer (supp. fig. c and d), iv) stat (or stat ) phosphorylation when bound to dimer (supp.
fig. c and d), v) internalisation/degradation of complexes (supp. fig. e and f), and vi) dephosphorylation of free stat (or stat ) (supp. fig. g). details of model assumptions, model parameters and parameter inference are provided in the material and methods under mathematical models and bayesian inference. we first wanted to explore whether a feedback mechanism might govern the way in which receptor molecules are internalised/degraded over time. to this end, and for each cytokine model, we considered two hypotheses: hypothesis assumes that receptor complexes (supp. fig. e and f) are internalised at a rate proportional to the concentration of the species in which they are contained (e.g., different dimer types), and hypothesis , that receptor complexes are internalised at a rate proportional to the product of the concentration of the species in which they are contained and the sum of the concentrations of free phosphorylated stat and stat . hypothesis is consistent with a negative feedback mechanism in which pstat molecules translocate to the nucleus, where they increase the production of negative feedback proteins such as socs . as described in the material and methods (mathematical models and bayesian inference), we made use of the rpe experimental data set to carry out mathematical model selection for the two different hypotheses. we found that hypothesis could explain the data better than hypothesis , with a probability of %. this result can be seen in figure b, in which we plot, for different values of the distance threshold between the mathematical model output and the data (see mathematical models and bayesian inference in material and methods, for details), the relative probability of each hypothesis, where hypothesis is denoted $H_1$ and hypothesis is denoted $H_2$.
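the two internalisation hypotheses, and the way a distance threshold can turn simulation results into relative probabilities, can be sketched as follows. this is an illustration only: it assumes a rejection-style approximate bayesian computation with equal prior weight and equal numbers of simulations per hypothesis, which this section does not spell out, and every name and number below is hypothetical.

```python
def internalisation_rate_linear(beta, complex_conc):
    """Rate proportional to the concentration of the receptor complex."""
    return beta * complex_conc


def internalisation_rate_feedback(beta, complex_conc, free_pstat_a, free_pstat_b):
    """Rate proportional to the complex concentration multiplied by the
    summed concentrations of the two free phosphorylated STAT species."""
    return beta * complex_conc * (free_pstat_a + free_pstat_b)


def relative_probabilities(distances_a, distances_b, threshold):
    """Share of accepted simulations contributed by each hypothesis.

    Each list holds one model-to-data distance per simulated parameter draw.
    A draw is accepted when its distance is at or below the threshold.
    Returns None when nothing is accepted under either hypothesis.
    """
    accepted_a = sum(1 for d in distances_a if d <= threshold)
    accepted_b = sum(1 for d in distances_b if d <= threshold)
    total = accepted_a + accepted_b
    if total == 0:
        return None
    return accepted_a / total, accepted_b / total


# Hypothetical distances, chosen only to show the effect of the threshold.
distances_a = [0.1, 0.2, 0.9]
distances_b = [0.5, 0.8, 0.95]
strict = relative_probabilities(distances_a, distances_b, 0.3)
loose = relative_probabilities(distances_a, distances_b, 1.0)
```

with a loose threshold both hypotheses are accepted equally often, so the comparison is uninformative; tightening the threshold is what lets the data discriminate between them.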
it can be observed that for smaller values of the distance threshold, which indicate better support from the data to the mathematical model, the relative probability of hypothesis is higher than that of hypothesis . we then made use of this result to explore the mathematical models for both cytokines under hypothesis ; in particular, we performed parameter calibration. to this end (and as described in material and methods under mathematical models and bayesian inference), we carried out bayesian inference together with the mathematical models (hypothesis ) and the experimental data sets to quantify the reaction rates (see supp. fig. ) and initial molecular concentrations (see table and table ). the bayesian parameter calibration of the two models of cytokine signaling allows one to quantify the observed kinetics of pstat / phosphorylation induced by hypil- and il- in rpe and th- cells (figure c). substantial differences in stat association rates to and dissociation rates from the dimeric complexes were inferred to critically contribute to defining pstat / kinetics. figure d shows the kernel density estimates (kdes) for the posterior distributions of the rate constants and initial concentrations in the models. $k^{+}_{i,\mathrm{gp}}$ denotes the rate at which stat$i$ binds to gp and $k^{+}_{i,\mathrm{ra}}$ denotes the rate at which stat$i$ binds to il- ra, for $i \in \{ , \}$. our results indicate that stat and stat exhibit different binding preferences towards il- ra and gp , respectively. while stat exhibits stronger binding to il- ra than gp ($k^{+}_{1,\mathrm{ra}} > k^{+}_{1,\mathrm{gp}}$), stat exhibits stronger binding to gp than il- ra ($k^{+}_{3,\mathrm{gp}} > k^{+}_{3,\mathrm{ra}}$), in agreement with previous observations ( ).
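a comparison of two posterior rate constants of the kind described above can be sketched from posterior samples. this assumes paired samples drawn jointly from the posterior and a gaussian kernel with a user-chosen bandwidth; both are illustrative choices rather than the procedure used in the study, and the function names are hypothetical.

```python
import math


def gaussian_kde(samples, x, bandwidth):
    """Kernel density estimate at point x using a Gaussian kernel."""
    if bandwidth <= 0 or not samples:
        raise ValueError("need at least one sample and a positive bandwidth")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)


def posterior_prob_greater(samples_a, samples_b):
    """Fraction of paired posterior draws in which rate a exceeds rate b."""
    if len(samples_a) != len(samples_b) or not samples_a:
        raise ValueError("sample lists must be non-empty and of equal length")
    hits = sum(1 for a, b in zip(samples_a, samples_b) if a > b)
    return hits / len(samples_a)


# Hypothetical posterior draws for two binding rates (arbitrary units).
rate_to_chain_a = [3.0, 4.0, 5.0, 1.0]
rate_to_chain_b = [1.0, 2.0, 3.0, 2.0]
prob_a_stronger = posterior_prob_greater(rate_to_chain_a, rate_to_chain_b)
```

a value of `prob_a_stronger` close to one would support a statement such as "binding to one chain is stronger than to the other", whereas a value near one half would indicate that the posterior does not separate the two rates.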
il- rα cytoplasmic domain is required for sustained pstat kinetics

the bayesian inference carried out with the experimental data and the mathematical models clearly indicated statistically significant differences in the binding rates of stat /stat to gp and il- ra, to account for the different phosphorylation kinetics exhibited by hypil- and il- . thus, we next investigated whether the more sustained stat activation by il- resulted from its specific engagement of il- ra. for that, we used rpe cells, which do not express il- ra (supp. fig. e), to systematically dissect the contribution of the il- ra cytoplasmic domain to the differential pstat activation by il- . il- ra’s intracellular domain is very short and only contains two tyr residues that can be phosphorylated in response to il- stimulation, i.e., tyr and tyr (figure a). we mutated these two tyr to phe to analyse their contribution to il- induced signaling. we stably expressed wt il- ra as well as different il- ra tyr mutants in rpe cells with comparable cell surface expression levels (figure b). importantly, this reconstituted experimental system mimicked the pstat / activation kinetics of t-cells (supp. fig. a). as the endogenous gp expression levels remain unaltered, all generated clones exhibited very comparable responses to hypil- (figure b, bottom panels). il- triggered comparable levels of stat and stat activation in rpe cells reconstituted with il- ra wt and il- ra y f mutant, suggesting that this tyr residue does not contribute to signaling by this cytokine (figure b and supp. fig. b). in rpe cells reconstituted with the il- ra y f or y f-y f mutants, il- stimulation resulted in % of the stat activation, but only % of the stat activation levels induced by this cytokine relative to il- ra wt (figure b) ( ). these observations suggest a tight coupling of stat phosphorylation to one of the receptor chains; namely, il- ra with pstat and gp with pstat , respectively.
we next tested how the cytoplasmic domains of gp and il- ra shape the pstat kinetic profiles. thus, we generated a stable rpe clone expressing a chimeric construct comprised of the extracellular and transmembrane domain of il- ra but the cytoplasmic domain of gp (figure c, supp. fig. a). again, as both cell lines express unaltered endogenous gp levels, they exhibited comparable responses to hypil- (figure c). strikingly, this domain-swap resulted in a transient pstat kinetic response by il- comparable to hypil- stimulation. stat activation, on the other hand, remained unaltered, suggesting that the cytoplasmic domain of il- ra is essential for a sustained pstat response but not for pstat . two plausible scenarios could explain the observed pstat / activation differential by hypil- and il- : i) the il- ra-jak complex phosphorylates stat faster than the gp -jak complex or ii) pstat is more quickly dephosphorylated in the il- /gp receptor homodimer. in the latter case, pstat deactivation by constitutively expressed phosphatases could be an additional factor of regulation. indeed, shp- has been described to bind to gp and shape il- responses ( ). however, our bayesian inference results (together with the mathematical models and the experimental data) identified the stat/receptor association rates as the only rates that could account for the greater and more sustained activation of stat by il- . we note (as described in the material and methods) that the phosphorylation rate, denoted by $q$, of stat and stat when bound to a dimer (homo- or hetero-) has been assumed to be independent of the stat type and the receptor chain. moreover, the model also included dephosphorylation of free pstat molecules, and predicted that the rates at which these reactions occur ($d_1$ and $d_3$) had rather similar posterior distributions, hence arguing against the potential role of phosphatases to specifically target stat upon hypil- stimulation.
to distinguish between the two plausible scenarios, we next determined the rates of pstat / dephosphorylation by blocking jak activity upon cytokine stimulation, making use of the jak inhibitor tofacitinib in rpe cells. tofacitinib was added minutes after stimulation with either cytokine and pstat and pstat levels were measured at the indicated times. jak inhibition markedly shortened the pstat / activation profiles induced by both cytokines (figure d, supp. fig. b). the relative dephosphorylation rates could then be determined by the signal intensity ratio of +/- tofacitinib. even though pstat levels were more affected by jak inhibition than those of pstat , the observed relative changes were nearly identical for il- and hypil- . these findings were also confirmed for th- cells (supp. fig. c & d) and indicate that selective phosphatase activity cannot serve as an explanation for the pstat / differential by hypil- and il- , in agreement with our mathematical modelling predictions. similarly, we tested whether neosynthesis of feedback inhibitors such as socs ( ) would selectively impair signaling by hypil- but not by il- . to this end we pre-treated cells with cycloheximide (chx) and followed the pstat / kinetics induced by the two cytokines (supp. fig. a & b). chx treatment resulted in more sustained pstat activity for both cytokines. to our surprise, stat phosphorylation by il- was even more sustained while pstat levels induced by il- remained unaffected.
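the tofacitinib comparison described earlier in this paragraph reduces to a ratio of signal with and without inhibitor at matched time points. the sketch below shows that ratio and, under the added assumption of first-order decay once jak activity is blocked (an assumption of this example, not a statement from the text), an apparent rate constant. all names and values are hypothetical.

```python
import math


def inhibitor_ratio(with_inhibitor, without_inhibitor):
    """Per-time-point ratio of signal measured with and without inhibitor."""
    if len(with_inhibitor) != len(without_inhibitor):
        raise ValueError("time courses must have the same length")
    return [w / wo for w, wo in zip(with_inhibitor, without_inhibitor)]


def apparent_decay_rate(ratio, minutes_since_inhibitor):
    """Rate constant k satisfying ratio = exp(-k * t), assuming first-order decay."""
    if not 0.0 < ratio <= 1.0 or minutes_since_inhibitor <= 0:
        raise ValueError("ratio must lie in (0, 1] and time must be positive")
    return -math.log(ratio) / minutes_since_inhibitor


# Hypothetical intensities (arbitrary units) at two time points.
ratios = inhibitor_ratio([50.0, 25.0], [100.0, 100.0])
rate = apparent_decay_rate(ratios[0], 10.0)
```

if the ratios obtained for the two cytokines are nearly the same for a given stat, as the text reports, a cytokine-specific phosphatase effect is an unlikely explanation for the different kinetics.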
these observations rule out that feedback inhibitors selectively impair stat activation kinetics by hypil- , and thus such inhibitors do not account for the faster stat dephosphorylation kinetics observed under hypil- stimulation. overall, our data from the chimera and mutant experiments, which were not used in the bayesian calibration, provide strong and independent support, as well as validation, to the mathematical models of hypil- and il- signaling, and point to the differential association/dissociation of stat and stat to il- ra and gp , respectively, as the main factor defining stat phosphorylation kinetics in response to hypil- and il- stimulation.

unique and overlapping effects of il- and hypil- on the th- phosphoproteome

thus far, we have investigated the differential activation of stat /stat induced by hypil- and il- . next, we asked whether il- and il- induced the activation of additional and specific intracellular signaling programs that could contribute to their unique biological profiles. to this end, we investigated the il- and hypil- activated signalosome using quantitative mass-spectrometry-based phospho-proteomics. macs-isolated cd + t cells were polarized into th- cells and expanded in vitro for stable isotope labelling by amino acids in cell culture (silac). cells were then stimulated for min with saturating concentrations of il- , hypil- or left untreated. samples were enriched for phosphopeptides (ti-imac), subjected to mass spectrometry and raw files analysed by maxquant software (supp. fig. a). in total we could quantify ~ phosphopeptides from proteins, identified across all conditions (unstimulated, il- , hypil- ) for at least two out of three tested donors. for il- and hypil- we detected similar numbers of significantly upregulated ( vs. ) and downregulated ( vs. ) phosphorylation events (figure a) and systematically categorized them in context with their cellular location and ascribed biological functions (supp. fig. b & c) ( ).
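the donor criterion mentioned above ("across all conditions ... for at least two out of three tested donors") can be read in more than one way. the sketch below implements one reading, in which a donor counts only if the phosphopeptide was quantified in every condition for that donor. the data layout, condition labels and function names are hypothetical and do not reflect the maxquant output format.

```python
# Placeholder condition labels; the study compares unstimulated cells
# with two cytokine treatments.
CONDITIONS = ("unstimulated", "cytokine_a", "cytokine_b")


def complete_donors(per_donor):
    """Count donors in which the peptide was quantified in every condition.

    `per_donor` maps a donor id to a dict of condition -> intensity,
    where None (or a missing key) means "not quantified".
    """
    return sum(
        1
        for values in per_donor.values()
        if all(values.get(condition) is not None for condition in CONDITIONS)
    )


def passes_donor_filter(per_donor, min_donors=2):
    """True when enough donors have complete data for this peptide."""
    return complete_donors(per_donor) >= min_donors


# Hypothetical peptide: donor 3 lacks a value in one condition.
peptide = {
    "donor_1": {"unstimulated": 1.0, "cytokine_a": 2.1, "cytokine_b": 1.8},
    "donor_2": {"unstimulated": 0.9, "cytokine_a": 2.4, "cytokine_b": 1.7},
    "donor_3": {"unstimulated": 1.1, "cytokine_a": None, "cytokine_b": 1.9},
}
keep_peptide = passes_donor_filter(peptide)
```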
the two cytokines shared approximately half of the upregulated and one third of the downregulated phospho-peptides (supp. fig. a) but also exhibited differential target phosphorylation (figure b and supp. fig. b). as expected, we found multiple members of the stat protein family among the top phosphorylation hits by the two cytokines, validating our study (figure b & c). in line with our previous observations, we detected the same relative amplitudes for tyrosine phosphorylated stat and stat . in addition to tyrosine-phosphorylation, we detected robust serine-phosphorylation on s for stat and stat (figure c). while ps-stat activity correlated with py-stat with il- being more potent than hypil- , this was not the case for stat . despite an identical py-stat phosphorylation profile, hypil- induced a ~ % higher ps-stat relative to il- (figure c). these results were corroborated by following the phosphorylation kinetics of ps-stat and ps-stat by flow cytometry (figure d). given the overlapping phospho-proteomic changes, gene ontology (go) analysis associated several sets of phosphopeptides with biological processes that were mostly shared between both cytokines (figure e, supp. fig. c). a large set of phospho-peptides was linked to transcription initiation (including jak/stat signaling) or mrna modification (figure e). interestingly, il- stimulation was associated with negative regulation of rna polymerase ii, whereas a positive regulation was detected for hypil- .
a closer look into the functional regulation of rna-pol ii activity by the two cytokines revealed that multiple proteins involved in this process were differentially regulated by hypil- and il- (figure f). while positive regulators of rna-pol ii transcription, such as negative elongation factor a (nelfa), ppm g, rchy and pol ra, were much more phosphorylated in response to hypil- than il- , negative regulators of rna-pol ii transcription, such as larp , were much more engaged by il- treatment than by hypil- (figure f). interestingly, in a previous study we linked rna-pol ii regulation with the levels of stat s phosphorylation induced by hypil- via recruitment of cdk to stat dependent genes ( ). our phospho-proteomic analysis thus suggests that il- and hypil- recruit different transcriptional complexes that could ultimately contribute to the specificity of gene expression induced by the two cytokines. additionally, we identified several interesting il- -specific phosphorylation targets. one example was ubiquitin protein ligase e component n-recognin (ubr ). phosphorylated ubr leads to ubiquitination and subsequent degradation of rorgc ( ), the key transcription factor required for th- lineage commitment, thus limiting th- differentiation (supp. fig. d). a second example is pak , which phosphorylates and stabilizes foxp leading to higher levels of treg cells (supp. fig. d) ( ). moreover, il- stimulation led to a very strong phosphorylation of bcl -associated agonist of cell death (bad), a critical regulator of t-cell survival and a well-known substrate of the pak kinase ( ). overall, our data show a large overlap between the il- and il- signaling program, with a strong focus on jak/stat signaling. however, il- engages additional signaling intermediaries that could contribute to its unique immuno-modulatory activities. further studies will be required to assess how these il- specific signaling pockets contribute to shape il- responses.
kinetic decoupling of gene induction programs depends on sustained stat activation and irf expression by il-

next, we investigated how the different kinetics of stat activation induced by hypil- and il- ultimately modulated gene expression by these two cytokines. to this end, we performed rna-seq analysis of th- cells stimulated with hypil- or il- for h, h and h to obtain a dynamic perspective of gene regulation. we identified ~ shared genes that could be quantified for all three donors and throughout all tested experimental conditions. in a first step, we compared how similar the gene programs induced by hypil- and il- were. principal component analysis (pca) was run for a subset of genes found to be significantly up- (total ~ ) or downregulated (total ~ ) in any of the experimental conditions (p value ≤ . , fold change ≥ + or ≤ - ). at one hour of stimulation hypil- and il- induced very similar gene programs, with the two cytokines clustering together in the pca analysis regardless of whether we focused on the subsets of upregulated or downregulated genes (figure a). however, the similarities between the two cytokines changed dramatically in the course of continuous stimulation. while the two cytokines induced the downregulation of comparable gene programs at h and h stimulation, as denoted by the close clustering in the pca analysis (figure a, right panel) and the fraction of shared genes (~ %, figure b, supp. fig. a-c, supp. fig. a), this was not observed for upregulated genes.
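the gene selection rule and the "fraction of shared genes" used above can be sketched in a few lines. the numeric cut-offs are not legible in this copy of the text, so they are passed in as parameters, and the shared fraction is defined here as intersection over union, which is an assumption of this example rather than the definition used in the study.

```python
def is_regulated(p_value, fold_change, p_max, fc_cutoff):
    """True when a gene passes both the p-value and the signed fold-change cut-off."""
    significant = p_value <= p_max
    large_change = fold_change >= fc_cutoff or fold_change <= -fc_cutoff
    return significant and large_change


def shared_fraction(genes_a, genes_b):
    """Fraction of genes regulated by both treatments (intersection over union)."""
    set_a, set_b = set(genes_a), set(genes_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical cut-offs and gene lists, for illustration only.
P_MAX = 0.05
FC_CUTOFF = 2.0
up_by_first = {"gene_a", "gene_b", "gene_c"}
up_by_second = {"gene_b", "gene_c", "gene_d"}
overlap = shared_fraction(up_by_first, up_by_second)
```

computing `overlap` separately for each time point and for up- and downregulated genes gives the kind of comparison the text uses to show that the two gene programs diverge over time.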
although the two cytokines induced comparable gene upregulation programs after h of stimulation (~ % shared genes), this trend almost completely disappeared at later stimulation times (figure a & b, supp. fig. b). this is well-reflected by the absolute numbers of up- or downregulated genes observed for il- and hypil- (figure c). stimulation with both cytokines yielded a similar trend of gene downregulation (figure c, right panel). however, while hypil- stimulation resulted in a spike of gene upregulation at h that quickly disappeared at later stimulation times, il- stimulation was able to increase the number of upregulated genes beyond h of stimulation and maintained it even after h (figure c, left panel). this “kinetic decoupling” of gene induction seems to have a striking functional relevance. gene set enrichment analysis (gsea) ( ) identified several reactome pathways to be enriched for il- over the course of stimulation, most of them linked with interferon signaling and immune responses (figure d). in contrast, for hypil- stimulation no pathway enrichment was detected. most importantly, the vast majority of il- -induced genes that were associated with these pathways belonged to genes upregulated by il- treatment that have been previously linked to stat activation ( , ) (supp. fig. c). although hypil- treatment resulted in the induction of some of these genes, their expression was very transient in time, in agreement with the short stat activation kinetic profile exhibited by hypil- (supp. fig. b & c). next, we performed cluster analysis to find further similarities and discrepancies between the gene expression programs engaged by hypil- and il- (figure e). since genes downregulated by il- and hypil- showed overall good similarity throughout the whole kinetic series, we mainly focused on differences in upregulated gene induction. we identified three functionally relevant gene clusters.
the first gene cluster corresponds to genes that are transiently and equally induced by hypil- and il- . these genes peak after one hour and return to basal levels after h and h of stimulation (figure e). interestingly, this cluster contains classical il- -induced and stat -dependent genes, such as members of the nfkb and jun/fos transcriptional complex ( ), as well as the feedback inhibitor suppressor of cytokine signaling (socs ) ( ) and t-cell early activation marker cd (figure e). a second cluster of genes corresponded to genes that were persistently activated by il- but only transiently by hypil- (figure e). among these genes we found classical stat -dependent genes, such as socs , programmed cell death ligand (pdl = cd ) ( ) and members of the interferon-induced protein with tetratricopeptide repeats (ifit) family. the third cluster of genes corresponded to genes exhibiting strong and sustained activation by il- after h and h stimulation but no activation by hypil- at all. this “ nd wave” of gene induction by il- was composed almost exclusively of classical interferon stimulated genes (isgs) (supp. fig. c), such as stat & , guanylate binding protein (gbp ), gbp , & , and irf & . it is worth mentioning that genes in the third cluster appear to require persistent stat activation ( , ) and were the basis for the ifn signature identified in our reactome pathway analysis.

still, we were surprised by the magnitude of this nd gene wave. even though il- exerts a sustained pstat kinetic profile, pstat levels were down to ~ % of maximal amplitude after h of stimulation. we reasoned that additional factors could further amplify the stat response for il- but not for hypil- . within the st wave of stat -dependent genes, we also spotted the transcription factor interferon response factor (irf ), which was continuously induced throughout the kinetic series in response to il- but only transiently spiked after h of hypil- stimulation (figure e). irf expression was shown to prolong pstat kinetics ( ) and to be required for il- -dependent tr- differentiation and function ( ). we confirmed the kinetics of irf protein expression by flow cytometry and showed higher and more sustained protein levels after il- stimulation relative to hypil- (figure a). next, we tested in our rpe cell system whether sirna-mediated knockdown of irf would alter the gene induction profiles of certain stat - or stat -dependent marker genes. in rpe cells reconstituted with il- ra, irf protein levels peaked around h after stimulation with il- , and transfection with irf -targeting sirna knocked down expression by > % (figure b). importantly, knockdown of irf did not alter the overall kinetics of pstat and pstat activation (figure c). induction of the stat -dependent genes stat , gbp and oas as well as the stat -dependent gene socs was followed by rt qpcr (figure d). interestingly, up to h of stimulation, the gene induction curves were identical for control- and irf -sirna treated cells. later than h, that is, when irf protein levels are peaking, the gene induction was decreased by - % in the absence of irf . strikingly, expression of socs , a classical stat -dependent reporter gene, was transient and independent of irf levels, highlighting that irf selectively amplifies stat -dependent gene induction.
taken together, our data support a scenario whereby il- , by exhibiting a kinetic decoupling of stat and stat activation, is capable of triggering independent gene expression waves, which ultimately contribute to shaping its distinct biology.

il- -induced stat response drives global proteomic changes in th- cells

next, we aimed to uncover how the distinct gene expression programs engaged by hypil- and il- ultimately relate to alterations of the th- cell proteome. for that, we continuously stimulated silac labelled th- cells for h with saturating doses of il- and hypil- and compared quantitative proteomic changes to unstimulated controls (figure a). we quantified ~ proteins present in all three biological replicates and in all tested conditions (unstimulated/il- /hypil- ). both cytokines downregulated a similar number of proteins (il- : , hypil- : ) (figure b) with approximately half of them being shared by the two cytokines, mimicking our observations in the rna-seq studies (figure c, supp. fig. a). with upregulated proteins, il- was almost twice as potent as hypil- ( proteins) with very little overlap. among the proteins upregulated by il- but not hypil- , we detected several proteins with described immune-modulatory functions on t-cells. one of these proteins was transforming growth factor b (tgf-b), which is a key regulator with pleiotropic functions on t-cells ( ). tgf-b has been identified to act synergistically with il- to induce il- secretion from tr- cells, thus accounting for one of the key anti-inflammatory functions of il- ( ). on the other hand, we also found the selplg-encoded protein psgl- , which is critically required for efficient migration and adhesion of th- cells to inflamed intestines ( , ). interestingly, we found larp moderately upregulated by il- . this negative regulator of rna pol ii was also identified in our phospho-target screening and selectively engaged by il- (figure f).
il- and hypil- share ~ % of downregulated proteins, but without strong functional patterns. both cytokines downregulated several proteins related to mitotic cell cycle (lig , csnk b, psmb ) and mrna processing and splicing (ncbp , pcbp , nudt ) ( ). strikingly, a significant number (~ %) of proteins upregulated by il- belong to the group of isgs (figure b & c, supp. fig. b). this particular set of proteins, including stat , stat , mx dynamin like gtpase (mx ), interferon stimulated gene (isg ) or poly(adp-ribose) polymerase family member (parp ), was not markedly altered by hypil- . of note, the overall expression patterns of the most significantly altered proteins are congruent with the gene induction patterns observed after h and h (figure d & e, supp. fig. b). similarly, gsea reactome analysis again identified pathways associated with interferon signaling and cytokine/immune system for il- but failed to detect any significant functional enrichment by hypil- (figure e, supp. fig. b & c). finally, we correlated rnaseq-based gene induction patterns with detected proteomic changes. to our surprise we only found a relatively low number of shared hits. however, the identified proteins belong exclusively to a group upregulated by il- (figure f). they are all located in the “ nd gene wave” cluster and all of them are regulated by isgs (figure e). taken together, these results provide compelling evidence that sustained pstat activation by il- accounts for its gene induction and proteomic profiles, thus giving a mechanistic explanation for the diverse biological outcomes of il- and il- .
our observations are in good agreement with previous findings in cancer cells, showing that particularly the involvement of stat activation is responsible for proteomic remodeling by il- ( ).

receptor and stat concentrations determine the nature of the il- /il- response

our data suggest that stat molecules compete for binding to a limited number of phospho-tyr motifs in the intracellular domains of cytokine receptors. a direct consequence derived from this hypothesis is that cells can adjust and change their responses to cytokines by altering their concentrations of specific stats or receptor molecules. to assess to what degree immune cells differ in their expression of cytokine receptors and stats, we investigated levels of il- ra, gp , il- ra, stat and stat protein expression across different immune cell populations making use of the immunological proteomic resource (immpres - http://immpres.co.uk) database. strikingly, the level of expression of these proteins changes dramatically across the populations studied (figure a), suggesting that these cells could potentially produce very different responses to hypil- and il- stimulation. in order to quantify (and predict) how changes in expression levels of different proteins modify the kinetics of pstat, we made use of the two mathematical models of hypil- and il- stimulation and the parameters inferred with bayesian methods. our mathematical models could accurately reproduce the experimental results generated across our study, i.e., signaling by the il- ra chimeric and il- ra-y f mutant receptors and dose/response studies (supp. fig. a-c), making use of the posterior parameter distributions generated from the bayesian parameter calibration. having developed mathematical models which are able to accurately explain the experimental data (supp. fig. b and c) and reproduce independent experiments (fig.
b and c), we then sought to use the models to predict pstat signaling kinetics under different concentration regimes of receptors and stats. to simplify the simulations, we focused our analysis on gp and stat proteins, two of the proteins that greatly vary in the different immune populations (figure a). as baseline values for the concentrations [gp ( )], [il ra( )], [stat ( )] and [stat ( )] we used approximately the median values from the posterior distributions for each parameter: [gp ( )] = nm, [il ra( )] = nm and [stat ( )] = [stat ( )] = nm. to see the effect of varying gp concentrations on pstat signaling, we decreased the initial concentration of gp and simulated the model using the accepted parameter sets from the abc-smc to inform the other parameter values. a tenfold reduction in gp concentration ([gp ( )] = . nm) resulted in a striking loss in pstat levels induced by hypil- , with very little effect on pstat levels induced by this cytokine (figure b). pstat / kinetics induced by il- , however, were not affected by this decrease in gp concentration (figure b). interestingly, the hypil- signaling profile predicted by our model at low gp concentrations strongly resembles the one induced by hypil- in th- cells (figure c), where very low levels of gp are found, further confirming the robustness of the predictions generated by our mathematical models. when the concentration of stat was increased by a factor of ten ([stat ( )] = nm), both hypil- and il- induced significantly higher levels of pstat activation (figure b).
pstat levels were not affected for hypil- stimulation but were decreased for il- stimulation (figure b), further indicating the competitive nature of the binding of stat and stat to il- ra and gp . overall, our mathematical model predicts that changes in gp and stat expression produce a substantial remodeling of the hypil- and il- signalosome, which ultimately could lead to aberrant responses.

stat protein levels in sle patients modify hypil- and il- signaling responses

stat is a classical ifn responsive gene and stat levels are highly increased in environments rich in ifns ( ). thus, we next asked whether stat levels would be increased in sle patients, an example of a disease where ifns have been shown to correlate with a poor prognosis, making use of available gene expression datasets ( ). we did not find differences in the expression of gp , il- ra or il- ra in sle patients (figure c). however, we detected a significant increase in the levels of stat and stat transcripts in these patients when compared to healthy controls, with the increase in stat expression being significantly more pronounced (figure c). since our mathematical model predicted that increases in stat expression could significantly change cytokine-induced cellular responses by hypil- and il- , we next experimentally tested this prediction. for that, we primed th- cells with ifna overnight to increase total stat levels (and to a lower extent stat ) in these cells (supp. fig. a). while both hypil- and il- induced comparable levels of pstat in primed and non-primed th- cells, levels of pstat induced by the two cytokines were significantly upregulated in primed th- cells, resulting in a biased stat response and confirming our model predictions (figure d). we next investigated whether this biased stat activation by hypil- and il- observed in ifna -primed th- cells was also present in sle patients.
for that we collected pbmcs from six sle patients and five age-matched healthy controls and measured stat and stat expression, as well as pstat and pstat induction by hypil- and il- after min treatments in cd t cells. importantly, comparable results to those obtained with ifn-primed th- cells were obtained, with signaling bias towards pstat in cd + t cells from sle patients stimulated with hypil- and il- (figure e, supp. fig. b & c), further supporting the fact that stat concentrations play a critical role in defining cytokine responses in autoimmune disorders. our data show that stat and stat compete for phospho-tyr motifs in gp , with stat having an advantage resulting from its tighter affinity to gp . finally, we asked whether crippling jak activity by using sub-saturating doses of jak inhibitors could differentially affect stat and stat activation by hypil- and therefore rescue the altered cytokine responses found in sle patients. to test this, rpe and th- cells were stimulated with saturating concentrations of hypil- and titrated concentrations of tofacitinib, a clinically approved jak inhibitor. strikingly, tofacitinib inhibited hypil- induced pstat more efficiently than pstat in both rpe cells and th- cells (figure f). at nm concentration, tofacitinib inhibited pstat levels induced by hypil- by %, while it inhibited pstat levels by only % (figure f), an effect that we did not observe for il- stimulation (supp. fig. d). overall, our results show that the changes in stat concentrations found in autoimmune disorders shape cytokine signaling responses and could contribute to disease progression.

discussion:

cytokine pleiotropy is the ability of a cytokine to exert a wide range of biological responses in different cell types. this functional pleiotropy has made the study of cytokine biology extremely challenging given the strong cross-talk and shared usage of key components of their signaling pathways, leading to a high degree of signaling plasticity, yet still allowing functional selectivity ( , ). here we aimed to identify the underlying determinants that define cytokine functional selectivity by comparing il- and il- at multiple scales, ranging from cell surface receptors to proteomic changes. we show that il- triggers a more sustained stat phosphorylation than il- , via a high affinity stat /il- ra interaction centered around tyr on il- ra. this in turn results in a more sustained irf expression induced by il- , which leads to the upregulation of a second wave of gene expression unique to il- and composed of classical isgs. we go one step further and show that this strong receptor/stat coupling is altered in autoimmune disorders where stat concentrations are often dysregulated. increased expression of stat in sle patients biases hypil- and il- responses towards stat activation, further contributing to the worsening of the disease. by using suboptimal doses of the jak inhibitor tofacitinib we show that specific stat proteins engaged by a given cytokine can be targeted.
overall, our study highlights a new layer of cytokine signaling regulation, whereby stat affinity to specific cytokine receptor phospho-tyr motifs controls stat phosphorylation kinetics and the identity of the gene expression program engaged, ultimately ensuring the generation of functional diversity through the use of a limited set of signaling intermediaries.

the tight coupling of one receptor subunit to one particular stat that we have identified in our study is a rather unusual phenomenon for heterodimeric cytokine receptor complexes, which was first suggested by owaki et al. ( ). generally, the entire signaling output driven by a cytokine-receptor complex emanates from a dominant receptor subunit, which carries several tyr residues susceptible to being phosphorylated ( , ). this in turn results in competition between different stats for binding to shared phospho-tyr motifs in the dominant receptor chain, leading to different kinetics of stat phosphorylation as observed for il- stimulation ( ) (figure b). moreover, this localized signaling quantum allows phosphatases and feedback regulators, induced upon cytokine stimulation, to act in synergy to reset the system to its basal state, generating a very synchronous and coordinated signaling wave. although very effective, this molecular paradigm has its limitations. stat competition for the same pool of phospho-tyr makes the system very sensitive to changes in stat concentration. ifng primed cells, which exhibit increased stat levels, trigger an ifng-like stat response upon il- stimulation ( ). il- anti-inflammatory properties are lost in cells with high levels of stat expression, as a result of a pro-inflammatory environment rich in ifns ( ). indeed, we show that stat transcript levels are increased in crohn’s disease and sle patients and that they contribute to altered il- responses. strikingly, il- appears to have evolved away from this general model of cytokine signaling activation.
our results show that stat activation by il- is tightly coupled to il- ra, while stat activation by this cytokine mostly depends on gp . this decoupled stat and stat activation by il- is possible thanks to the presence of a putative high affinity stat binding site on il- ra that resembles the one present in ifngr ( ). as a result of this, il- can trigger sustained and independent phosphorylation of both stat and stat . this unique feature of il- allows it to induce robust responses in dynamic immune environments. indeed, our mathematical models of cytokine signaling and bayesian inference, together with the experimental observations, show that changes in receptor concentration minimally affect pstat / induced by il- , while they fundamentally alter il- responses. overall, our data show that cytokine responses are versatile and adapt to the continuously changing cell proteome, highlighting the need to measure cytokine receptor and stat expression levels, in addition to cytokine levels, in disease environments to better understand and predict altered responses elicited by dysregulated cytokines.

in recent years, it has become apparent that the stability of the cytokine-receptor complex influences signaling identity by cytokines ( ). short-lived complexes less efficiently activate those stat molecules that bind with low affinity to phospho-tyr motifs in a given cytokine receptor ( ). our current results further support this kinetic discrimination mechanism for stat activation.
our statistical inference identified differences in stat recognition of the cytokine receptor phospho-tyr motifs as one of the major determinants of stat phosphorylation kinetics. this parameter alone was sufficient to explain transient and sustained stat phosphorylation induced by il- and il- , respectively, without the need to invoke the action of phosphatases or negative feedback regulators such as socs proteins. indeed, our results indicate that the rate of stat dephosphorylation is similar between the il- and il- systems, suggesting that phosphatases do not contribute to these early kinetic differences. moreover, blocking protein translation, and therefore the upregulation of negative feedback regulators by il- treatment, did not result in a more sustained stat phosphorylation by il- , again indicating that the transient kinetics of stat phosphorylation by il- is encoded at the receptor level and does not require further regulation. however, recent reports have found that the amplitude of stat phosphorylation in response to il- is regulated by levels of ptpn expression, suggesting that phosphatases can play additional roles in shaping il- responses beyond controlling the kinetics of stat activation ( ). stat phosphorylation levels by il- , on the other hand, were significantly more sustained in the absence of protein translation, suggesting that negative feedback mechanisms are required to downmodulate signaling emanating from high affinity stat-receptor interactions. overall, our results suggest that while phosphatases and negative feedback regulators play an important role in maintaining cytokine signaling homeostasis ( ), the kinetics of stat activation appears to be already encoded at the level of receptor engagement, thus ensuring maximal efficiency and signal robustness.

cytokine signaling plasticity can occur at the level of receptor activation.
in the past years, a scenario has emerged suggesting that the absolute number of signaling active receptor complexes is a critical determinant for signal output integration. accordingly, specific biological responses were shown to be tuned either by abundance of cell surface receptors ( , ) or by the level of receptor assembly ( , , ). here, we show for the first time il- -induced dimerization of il- ra and gp at the cell surface of live cells, in good agreement with previous studies on heterodimeric cytokine receptor systems ( , ). for il- , the receptor subunits il- ra and gp can be expressed at different ratios as seen for naïve vs. activated t-cells ( ) as well as intestinal cells ( ). on t-cells, particularly after activation, il- ra is expressed in strong excess over gp , rendering gp the limiting factor for receptor complex assembly ( ). interestingly, we observe that in addition to a faster kinetic of stat phosphorylation, hypil- treatment induces a lower maximal amplitude in pstat activation in t cells. this is in stark contrast to our results in rpe cells, where high abundance of gp (~ - copies of cell surface gp ) is found. in these cells both cytokines elicited similar amplitudes of stat phosphorylation. our results suggest that surface receptor density, in synergy with stat binding dynamics at phospho-tyr motifs on cytokine receptors, acts to define the amplitude and kinetics of stat activation in response to cytokine stimulation.
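the interplay between receptor density and stat competition for shared phospho-tyr motifs can be illustrated with a toy equilibrium calculation. this is a minimal sketch and not the bayesian-calibrated kinetic model used in the study: it assumes a single class of docking sites, that each stat is in excess over the sites so that free and total concentrations are approximately equal, and it ignores phosphorylation, dephosphorylation and feedback. the function name, the labels `stat_x` and `stat_y`, and all concentrations and dissociation constants are hypothetical.

```python
def bound_stats(sites_nM, stat_nM, kd_nM):
    """Equilibrium occupancy of shared docking sites by competing STATs.

    sites_nM : total concentration of docking sites (nM)
    stat_nM  : dict of STAT name -> concentration (nM), assumed in excess
    kd_nM    : dict of STAT name -> dissociation constant for the site (nM)

    Returns a dict of STAT name -> site-bound concentration (nM), using the
    standard competitive-binding expression
        bound_i = sites * (S_i / K_i) / (1 + sum_j S_j / K_j).
    """
    weights = {name: stat_nM[name] / kd_nM[name] for name in stat_nM}
    denom = 1.0 + sum(weights.values())
    return {name: sites_nM * w / denom for name, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical values chosen only to show the qualitative behaviour.
    kd = {"stat_x": 50.0, "stat_y": 200.0}
    base = bound_stats(10.0, {"stat_x": 100.0, "stat_y": 100.0}, kd)
    more_x = bound_stats(10.0, {"stat_x": 1000.0, "stat_y": 100.0}, kd)
    fewer_sites = bound_stats(1.0, {"stat_x": 100.0, "stat_y": 100.0}, kd)
    print("baseline:          ", base)
    print("tenfold more stat_x:", more_x)
    print("tenfold fewer sites:", fewer_sites)
```

in this simplified picture, raising one stat displaces the other from the shared sites, while lowering the number of sites scales both occupancies down together. a stat with its own dedicated high-affinity site would escape this competition, which is the qualitative distinction the text draws between the two cytokines.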
the distinct stat and stat kinetic profiles induced by il- and il- are the prerequisite for time-correlated decoupling of genetic programs: a “shared gp /stat -dependent wave” and an il- -unique “il- ra/stat -dependent wave”. however, pstat levels induced by il- at h were down to ~ % of maximal amplitude, suggesting that additional factors would be required to amplify the initial stat response elicited by il- . we observed that il- induces the expression of an early wave of classical stat -dependent genes, which is also shared by il- . however, while il- induces the upregulation of these genes throughout the entire duration of the experiment, il- only resulted in a transient spike. we reasoned that this additional factor required for il- signal amplification would be among these early stat -dependent genes. among this set of genes we found the transcription factor irf , which had been shown to act as a feedback amplifier for pstat activity ( ). importantly, irf protein levels have been shown to be upregulated in response to il- and ifng but not to il- stimulation in hepatocytes ( ). irf plays a key role in chromatin accessibility, which is critically required for il- -induced differentiation of tr cells and subsequent il- secretion ( ). here, we could prove that the contribution of irf to stat - but not stat -dependent genes is a generic feature of il- signaling. this readily explains the significant transcriptomic overlap of il- with type i ( ) or type ii interferons ( ) after long-term stimulation with these cytokines. along this line, it is not surprising that il- , beyond its well-described effects on t-cell development, can also mount a considerable antiviral response as shown in hepatic cells and pbmcs ( , ).
our results suggest that by modulating the kinetics of stat phosphorylation, cytokines can modulate the expression of accessory transcription factors, such as irf , that act in synergy with stats to fine-tune gene expression and provide functional diversity.

acknowledgments

we thank members of the moraga, molina-parís, piehler and mitra laboratories for helpful advice and discussion. we thank g. hikade and h. kenneweg for technical support, c. p. richter for providing software for single-molecule image analysis, r. kurre (integrated bioimaging facility osnabrück) for support with fluorescence microscopy and the fingerprints proteomics facility (dundee) for support with the mass spectrometry data. this work was supported by the stg, ls , wellcome-trust- /z/ /z (im ep), erc- -stg grant (im jmf ep pkf), embo (sw – ), dfg (sfb , p /z, jp), national heart, lung and blood institute (k hl , mk) and contrat de plan etat région hauts de france and institut pour la recherche sur le cancer de lille (sm sg). cmp and gl were supported by h , quantii. pj is supported by the epsrc, astrazeneca and smith institute (smith institute case studentship, award reference ). numerical work was undertaken on arc , which is part of the high performance computing facilities at the university of leeds, uk.

competing interests

the authors declare that they have no competing interests.

material and methods

protein expression and purification: murine il- was cloned as a linker-connected single-chain variant (p +ebi ) as described in ( ). human hyperil- (hypil- ) and murine single-chain il- were cloned into the pacgp -a vector (bd biosciences) in frame with an n-terminal gp signal sequence and a c-terminal hexahistidine tag, and produced using the baculovirus expression system, as described in ( ). baculovirus stocks were prepared by transfection and amplification in spodoptera frugiperda (sf ) cells grown in sf ii media (invitrogen) and protein expression was carried out in suspension trichoplusia ni (high five) cells grown in insectxpress media (lonza). purification was performed using the method described in ( ). for il- , the cells were pelleted by centrifugation at rpm, prior to a precipitation step through addition of tris ph . , cacl and nicl to final concentrations of mm, mm and mm respectively. the precipitate formed was then removed through centrifugation at rpm. nickel-nta agarose beads (qiagen) were added and the target proteins purified through batch binding followed by column washing in hbs-hi buffer (hbs buffer supplemented to mm nacl and % glycerol, ph . ). elution was performed using hbs-hi buffer plus mm imidazole. final purification was performed by size exclusion chromatography on an enrich sec column (biorad), again equilibrated in hbs-hi. concentration of the purified sample was carried out using kda millipore amicon-ultra spin concentrators. for hypil- , proteins were purified likewise, but in mm hepes (ph . ) containing mm nacl. recombinant cytokines were purified to greater than % homogeneity.
for cell surface labeling, the anti-gfp nanobody (nb) “enhancer” and “minimizer” were used, which bind megfp with subnanomolar binding affinity ( ). nb was cloned into pet- a with an additional cysteine at the c-terminus for site-specific fluorophore conjugation in a : fluorophore:nanobody stoichiometry. furthermore, a (pas) sequence to increase protein stability and a his-tag for purification were fused at the c-terminus. protein expression in e. coli rosetta (de ) and purification by immobilized metal ion affinity chromatography were carried out by standard protocols. purified protein was dialyzed against hepes ph . and reacted with a two-fold molar excess of dy maleimide (dyomics), atto maleimide (at ) and atto rho maleimide (rho ) (atto-tec gmbh), respectively. after h, a -fold molar excess (with respect to the maleimide) of cysteine was added to quench excess dye. protein aggregates and free dye were subsequently removed by size exclusion chromatography (sec). a labeling degree of . - : fluorophore:protein was achieved as determined by uv/vis spectrophotometry.

cd + t cell purification and th- differentiation: human buffy coats were obtained from the scottish blood transfusion service and peripheral blood mononuclear cells (pbmcs) of healthy donors were isolated from buffy coat samples by density gradient centrifugation according to manufacturer’s protocols (lymphoprep, stemcell technologies). from each donor, x pbmcs were used for isolation of cd + t-cells. cells were decorated with anti-cd fitc antibodies (biolegend, # ) and isolated by magnetic separation according to manufacturer’s protocols (macs miltenyi) to a purity > % cd +. freshly isolated resting cd + t cells ( x per donor) were activated under th- polarizing conditions using immunocult™ human cd /cd t cell activator (stemcell, cat# ) following manufacturer instructions for days in rpmi- , % v/v fbs, u/ml penicillin-streptomycin (gibco) in the presence of the cytokines il- (novartis, # , ng/ml), anti-il- antibody ( ng/ml, bd biosciences, # ), il- ( ng/ml, biolegend, # ). after three days of priming, cells were expanded for another days in the presence of il- ( ng/ml).

human sle patient samples: this study was authorized by the french competent authority dealing with research on human biological samples, namely the french ministry of research. the authorization number is ech / . to issue such authorization, the ministry of research sought the advice of an independent ethics committee, namely the “comité de protection des personnes,” which voted positively, and all patients gave their written informed consent. healthy volunteers were recruited to serve as healthy control individuals. healthy and patients’ blood samples were collected in heparinized tubes (bd vacutainer , bd biosciences san jose, ca, usa) and pbmc samples were isolated using ficoll (pancoll, pan biotech #p - ) density gradient centrifugation. the isolated pbmcs were washed with pbs and the remaining red blood cells were lysed using rbc lysis buffer (ack lysing buffer, gibco #a - ), with incubation for min at room temperature. cells were washed in pbs, resuspended in ml of freezing medium (with dmso, pan biotech, #p - ) and transferred to a cryotube. cryotubes were placed in a freezing container (nalgene) at - °c and then transferred into a liquid nitrogen container for long term storage.
classification and demographic information about sle patients and healthy controls: sle patients were included if they fulfilled the american college of rheumatology (acr) classification criteria (hochberg mc. updating the american college of rheumatology revised criteria for the classification of systemic lupus erythematosus ( )). exclusion criteria were current intake of mg or more of prednisone or equivalent and/or use of immunosuppressants within the months before inclusion. use of hydroxychloroquine was not an exclusion criterion. patients were mostly in clinical remission, half with biological remission, half with persistent anti-native dna autoantibodies. all sle patients and healthy controls were females between and years old.

(phospho-) proteomics: for (phospho-) proteomic experiments, th- cells from each donor were split into three different conditions after initial expansion: light silac media ( mg/ml l-lysine k (sigma, #l ) and mg/ml l-arginine r (sigma, #a )), medium silac media ( mg/ml l-lysine u- c k (ckgas, #clm- - . ) and mg/ml l-arginine u- c r (ckgas, #clm- - . )) and heavy silac media ( . mg/ml l-lysine u- c ,u- n k (ckgas, #cnlm- -h- . ) and . mg/ml l-arginine u- c ,u- n r (ckgas, #cnlm- -h- . )). all three were prepared in rpmi silac media (thermo scientific, # ) supplemented with % dialyzed fbs (hyclone, #sh . ), ml l-glutamine (invitrogen, # ), ml pen/strep (invitrogen, # ), ml mem vitamin solution (thermo scientific, # ) and ml selenium-transferrin-insulin (thermo scientific, # ), and cells were expanded in the presence of ng/ml il- and ng/ml anti-il for another days in order to achieve complete labelling. media was exchanged every two days. incorporation of the medium and heavy versions of lysine and arginine was checked by mass spectrometry and samples with an incorporation greater than % were used. after expansion, cells were starved without il- for hours before stimulation with nm il- or nm hypil- for minutes (phosphoproteomics) or h (global proteomic changes).
cells were then washed three times in ice-cold pbs, mixed in a : : ratio, resuspended in sds-containing lysis buffer ( % sds in mm triethylammonium bicarbonate buffer (teab)) and incubated on ice for min to ensure cell lysis. cell lysates were then centrifuged at g for minutes at + °c and the supernatant was transferred to a clean tube. protein concentration was determined using a bca protein assay kit (thermo, # ), and mg of protein per experiment were reduced with mm dithiothreitol (dtt, sigma, #d ) for h at °c and alkylated with mm iodoacetamide (iaa, sigma, #i ) for min at rt. protein was then precipitated using six volumes of chilled (- °c) acetone overnight. after precipitation, the protein pellet was resuspended in ml of mm teab and digested with trypsin ( : w/w, thermo, # ) overnight at °c. samples were then cleared by centrifugation at g for min at + °c, and peptide concentration was quantified with a quantitative colorimetric peptide assay (thermo, # ). phosphopeptide enrichment in the peptide fractions generated as described above was carried out using magresyn ti-imac following the manufacturer’s instructions ( bscientific, mrtim ).

high ph reverse phase fractionation for phosphoproteomics: samples were dissolved in μl of mm ammonium formate buffer ph . and peptides were fractionated using high ph rp chromatography. a c column from waters (xbridge peptide beh, Å, . µm . x mm, ireland) with a guard column (xbridge, c , . µm, . x mm, waters) was used on an ultimate hplc (thermo-scientific).
buffers a and b used for fractionation consisted, respectively, of mm ammonium formate in milliq water (buffer a) and mm ammonium formate in % acetonitrile (buffer b); both buffers were adjusted to ph . with ammonia. fractions were collected using a wps- fc autosampler (thermo-scientific) at min intervals. column and guard column were equilibrated with % buffer b for min at a constant flow rate of . ml/min and a constant temperature of °c. samples ( µl) were loaded onto the column at . ml/min, and the separation gradient ran from % buffer b to % b in min, then from % b to % b within min and finally from % b to % b in min. the column was washed for min at % buffer b and equilibrated at % buffer b for min as mentioned above. fraction collection started min after injection and stopped after min (total of fractions, µl each). each peptide fraction was acidified immediately after elution from the column by adding to µl % formic acid to each tube in the autosampler. the total number of fractions concatenated was set to . the content of the fractions from each set was dried prior to further analysis.

lc-ms/ms analysis: lc-ms analysis was done at the fingerprints proteomics facility (university of dundee). analysis of peptide readout was performed on a q exactive™ plus mass spectrometer (thermo scientific) coupled with a dionex ultimate rs (thermo scientific). the lc buffers used were the following: buffer a ( . % formic acid in milli-q water (v/v)) and buffer b ( % acetonitrile and . % formic acid in milli-q water (v/v)). dried fractions were resuspended in µl of % formic acid and aliquots of μl of each fraction were loaded at μl/min onto a trap column ( μm × cm, pepmap nanoviper c column, μm, Å, thermo scientific) equilibrated in . % tfa. the trap column was washed for min at the same flow rate with . % tfa and then switched in-line with a thermo scientific resolving c column ( μm × cm, pepmap rslc c column, μm, Å). the peptides were eluted from the column at a constant flow rate of nl/min with a linear gradient from % buffer b to % buffer b in min, then from % buffer b to % buffer b in min, and finally from % buffer b to % buffer b in min. the column was then washed with % buffer b for min and re-equilibrated in % buffer b for min. the column was kept at a constant temperature of °c. the q exactive plus was operated in data-dependent positive ionization mode. the source voltage was set to . kv and the capillary temperature was °c. a scan cycle comprised ms scan (m/z range from - , ion injection time of ms, resolution and automatic gain control (agc) x ) acquired in profile mode, followed by sequential dependent ms scans (resolution ) of the most intense ions fulfilling predefined selection criteria (agc x , maximum ion injection time ms, isolation window of . m/z, fixed first mass of m/z, spectrum data type: centroid, intensity threshold x , exclusion of unassigned, singly and > charged precursors, peptide match preferred, exclude isotopes on, dynamic exclusion time s). the hcd collision energy was set to % of the normalized collision energy. mass accuracy was checked before the start of sample analysis.

mass spectrometry data analysis: q exactive plus mass spectrometer .raw files were analyzed, and peptides and proteins were quantified, using maxquant ( ) with the built-in search engine andromeda ( ). all settings were left at their defaults, except for the minimal peptide length of , and the andromeda search engine was configured for the uniprot homo sapiens protein database (release date: _ ).
only peptide and protein ratios quantified in at least two of the three replicates were considered, and p-values were determined by student’s t test and corrected for multiple testing using the benjamini–hochberg procedure (benjamini and hochberg, ).

plasmid constructs: for single molecule fluorescence microscopy, a monomeric non-fluorescent (y f) variant of egfp was n-terminally fused to gp . this tag (mxfpm) was engineered to specifically bind the anti-gfp nanobody “minimizer” (agfp-minb). this construct was inserted into a modified version of psems- m (covalys) using a signal peptide of igk. the orf was linked to a neomycin resistance cassette via an ires site. a mxfpe-il- ra construct was designed likewise but is recognized by the agfp nanobody “enhancer” (mxfpe). the chimeric construct mxfp-il- ra (ecd & tmd)-gp (icd) was a fusion construct of il- ra (aa - ) and gp (aa - ).

cell lines and media: hela cells were grown in dmem containing % v/v fbs, penicillin-streptomycin, and l-glutamine ( mm). rpe cells were grown in dmem/f containing % v/v fbs, penicillin-streptomycin, and l-glutamine ( mm). rpe cells were stably transfected with mxfpe-il- ra, mutants and the chimeric construct by the pei method according to standard protocols. using g selection ( . mg/ml), individual clones were selected, proliferated and characterized. for comparing receptor cell surface expression levels of stable clones expressing variants of il- ra, cells were detached using pbs+ mm edta, spun down ( g, min) and incubated with “enhancer” agfp-ennbdy ( nm, min on ice). after incubation, cells were washed with pbs and run on the cytometer.

flow cytometry staining and antibodies: for measuring dose-response curves of stat / phosphorylation (either th- cells or rpe clones), -well plates were prepared with µl of cell suspensions at x cells/ml/well for th- and x cells/ml/well for rpe . the latter were detached using accutase (sigma). to obtain dose-response curves, cells were stimulated with a set of different concentrations of the respective cytokines for min at °c, followed by pfa fixation ( %) for min at rt. for kinetic experiments, cell suspensions were stimulated with a defined, saturating concentration of cytokines ( nm il- , nm hypil- , nm wt-il- ) in reverse order so that all cell suspensions were pfa-fixed ( %) simultaneously. for pstat / kinetic experiments at jak inhibition, tofacitinib ( μm, stratech, #s -sel) was added after min of stimulation and cells were pfa-fixed in correct order. after fixation ( min at rt), cells were spun down at g for min at °c. cell pellets were resuspended and permeabilized in ice-cold methanol and kept for min on ice. after permeabilization, cells were fluorescently barcoded according to ( ). in brief, using two nhs-dyes (pacificblue, # , dylight , # , thermo scientific), individual wells were stained with a combination of different concentrations of these dyes. after barcoding, cells were pooled and stained with anti-pstat alexa (cell signaling technologies, # ) and anti-pstat alexa (biolegend, # ) at a : dilution in pbs+ . %bsa for h at rt. t cells were also stained with anti-cd alexafluor ( : , biolegend, # ), anti-cd pe ( : , biolegend, # ) and anti-cd brilliantviolet ( : , biolegend, # ). cells were analyzed on the flow cytometer (beckman coulter, cytoflex s) and individual cell populations were identified by their barcoding pattern. mean fluorescence intensity (mfi) of pstat and pstat was measured for all individual cell populations.
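the two-dye barcoding step described above can be pictured as a lookup from dye-level combinations to wells. the following is only a toy sketch under stated assumptions: the dye levels, well names and the nearest-level decoding rule are hypothetical and are not taken from the protocol in ( ); in practice, populations are gated on the cytometer rather than decoded cell by cell.

```python
from itertools import product


def make_barcodes(levels_dye_a, levels_dye_b, wells):
    """Assign each well a unique (dye A level, dye B level) combination.

    The level values are hypothetical staining intensities, not real
    concentrations from the protocol.
    """
    combos = list(product(levels_dye_a, levels_dye_b))
    if len(wells) > len(combos):
        raise ValueError("not enough dye combinations for the wells")
    return dict(zip(combos, wells))


def decode(barcodes, measured_a, measured_b, levels_dye_a, levels_dye_b):
    """Map a measured dye pair back to a well by snapping to the nearest level."""
    nearest_a = min(levels_dye_a, key=lambda lv: abs(lv - measured_a))
    nearest_b = min(levels_dye_b, key=lambda lv: abs(lv - measured_b))
    return barcodes.get((nearest_a, nearest_b))


# hypothetical example: 3 levels of dye A x 2 levels of dye B label 6 wells
LEVELS_A = [0.0, 1.0, 2.0]
LEVELS_B = [0.0, 1.0]
WELLS = ["A1", "A2", "A3", "A4", "A5", "A6"]
BARCODES = make_barcodes(LEVELS_A, LEVELS_B, WELLS)
```

with n levels of one dye and m levels of the other, at most n × m wells can be pooled into a single tube and still be told apart, which is why the pooled sample can be stained and acquired once.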
for measuring total stat levels, methanol-permeabilized cells were stained with anti-stat alexa ( : , biolegend, # ) or anti-stat apc ( : , biolegend, # ). for measuring total irf levels, methanol-permeabilized cells were stained with anti-irf alexa ( : , biolegend, # ). for measuring cell surface levels of gp , cells were detached with accutase (sigma) and stained with anti-gp apc ( : , biolegend, # ) for h on ice.

rna transcriptome sequencing: human th- cells from three donors each (stemcell technologies) were cultivated and stimulated as described above. cells were washed in hank’s balanced salt solution (hbss, gibco) and snap frozen for storage. rna was isolated using the rneasy kit (qiagen) according to the manufacturer’s protocol. all rna / ratios were above . . from each sample, μg of rna was used. transcriptomic analysis was done by novogene as follows. sequencing libraries were generated using the nebnext® ultra™ rna library prep kit for illumina® (neb, usa) following the manufacturer’s recommendations, and index codes were added to attribute sequences to each sample. briefly, mrna was purified from total rna using poly-t oligo-attached magnetic beads. fragmentation was carried out using divalent cations under elevated temperature in nebnext first strand synthesis reaction buffer ( x). first strand cdna was synthesized using random hexamer primer and m-mulv reverse transcriptase (rnase h-). second strand cdna synthesis was subsequently performed using dna polymerase i and rnase h. remaining overhangs were converted into blunt ends via exonuclease/polymerase activities. after adenylation of ’ ends of dna fragments, nebnext adaptors with hairpin loop structure were ligated to prepare for hybridization. in order to select cdna fragments of preferentially ~ bp in length, the library fragments were purified with the ampure xp system (beckman coulter, beverly, usa). then μl user enzyme (neb, usa) was used with size-selected, adaptor-ligated cdna at °c for min followed by min at °c before pcr. pcr was then performed with phusion high-fidelity dna polymerase, universal pcr primers and index (x) primer. finally, pcr products were purified (ampure xp system) and library quality was assessed on the agilent bioanalyzer system.

rna sequencing data analysis: primary data analysis for quality control, mapping to reference genome and quantification was conducted by novogene as outlined below.

quality control: raw data (raw reads) in fastq format were first processed through in-house scripts. in this step, clean data (clean reads) were obtained by removing reads containing adapter and poly-n sequences and reads with low quality from the raw data. at the same time, q , q and gc content of the clean data were calculated. all downstream analyses were based on the clean, high-quality data.

mapping to reference genome: reference genome and gene model annotation files were downloaded from the genome website browser (ncbi/ucsc/ensembl) directly. paired-end clean reads were mapped to the reference genome using hisat software. hisat uses a large set of small gfm indexes that collectively cover the whole genome. these small indexes (called local indexes), combined with several alignment strategies, enable rapid and accurate alignment of sequencing reads.

quantification: htseq was used to count the number of reads mapped to each gene, including known and novel genes. the rpkm of each gene was then calculated based on the length of the gene and the read count mapped to this gene.
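as a minimal illustration of the quantification step, the standard rpkm definition (reads per kilobase per million mapped reads) can be written as a single function. this is a sketch of the textbook formula, not the novogene in-house implementation, and the function and argument names are our own.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads.

    rpkm = read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# hypothetical numbers: 500 reads on a 2 kb gene in a library of 10 million reads
example_value = rpkm(500, 2000, 10_000_000)
```

dividing by gene length and by library size is what makes values comparable between genes of different lengths and between samples sequenced to different depths.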
rpkm (reads per kilobase of exon model per million mapped reads) accounts for the effects of sequencing depth and gene length on the read count at the same time and is currently the most commonly used method for estimating gene expression levels. for each identified gene, the fold change was calculated as the ratio of cytokine stimulated/unstimulated expression levels within each donor and an unpaired, two-tailed t test was applied to calculate p values. genes were considered to be significantly altered if the p value was ≤ . and the log fold change was ≥ + or ≤ - . genes with an rpkm of less than in two or more donors were excluded from analysis so as to remove genes with abundance near the detection limit. genes without annotated function were also removed. functional annotation of genes (kegg pathways, go terms) was done using the david bioinformatics resource functional annotation tool ( , ). the clustered heatmap was generated using the pheatmap package in r studio.

sirna-mediated knockdown of irf in rpe cells: a set of four irf -sirnas was purchased from dharmacon and tested individually to determine the level of knockdown achieved. the sirna providing the highest level of irf knockdown (horizon, lq- - - , sirna # : ugaacucccugccagauau) was subsequently used in all the experiments. rpe -il ra cells were plated in -well dishes ( . x cells per well) and transfected the next day with irf -sirna or control-gapdh sirna (horizon, d- - - ) (dharmacon) using dharmafect transfection reagent (dharmacon) following the manufacturer’s instructions for h. at different timepoints of il- ( nm) or hypil- ( nm) stimulation, samples were collected from one well each. cells were trypsinized, each sample was spun down, and pellets were either snap-frozen in liquid nitrogen for subsequent rna isolation ( %) or pfa-fixed for total irf staining ( %) by flow cytometry.

real-time quantitative pcr: cells were subjected to rna isolation using the qiagen rneasy kit. rna ( ng) was reverse transcribed to complementary dna (cdna) using an iscript cdna synthesis kit (biorad, # ), which was used as template for quantitative pcr. powertrack™ sybr green master mix (takara, #a ) was used for the reaction with the following primers:

target | for | rev | size
b-actin | catgtacgttgctatccaggc | ctccttaatgtcacgcacgat | bp
stat | ctagtggagtggaagcggag | caccacaaacgagctctgaa | bp
gbp | tcctcggattattgctcggc | cctttgcgcttcagcctttt | bp
oas | gaaggcagctcacgaaacc | aggcctcagcctcttgtg | bp
socs | gtccccccagaagagcctatta | ttgacggtcttccgacagagat |

b-actin was used as housekeeping gene for normalization. each sirna knockdown experiment was performed in three replicates, with each sample for qpcr being done in two technical replicates.

mathematical models and bayesian inference: we developed two new mathematical models, making use of ordinary differential equations (odes), for the initial steps of cytokine-receptor binding, dimer formation and signal activation by hypil- and il- , respectively; namely, a set of odes for the hypil- system and a separate set of odes for the il- system (see end of this section for the set of odes included in each model). these odes describe the rate of change of the concentration of each molecular species considered in the receptor-ligand systems (hypil- and il- ) over time. by solving these odes, a time-course for the concentration of total (free and bound) phosphorylated stat and stat can be obtained and compared to the experimental data (supp. fig. b & c). the hypil- and il- mathematical models differ in the reactions involved in the formation of the signaling dimer for each cytokine. under stimulation with hypil- , two hypil- bound gp monomers are required to form the homodimer (supp. fig. a), whereas under il- stimulation, we assume that il- binds to the il- ra chain and not to gp (supp. fig. b), and hence the heterodimer is comprised of an il- molecule bound to an il- ra monomer and one gp chain. in the mathematical models, we assume that upon formation of the dimers (homo- or heterodimer), these receptor chains become immediately phosphorylated. the models do not consider jak molecules explicitly. we assume that these molecules are constitutively bound to their corresponding receptor chains and that they phosphorylate immediately upon receptor phosphorylation (dimer formation). after the formation of the dimer, which we denote by 𝐷) or 𝐷"*, formed by hypil- or il- respectively, the biochemical reactions included in each mathematical model are similar, and are summarized as follows. table provides a description of the rates for each reaction considered in each (and both) mathematical model(s). in what follows we assume mass action kinetics for all the reactions. a free cytoplasmic unphosphorylated stat or stat molecule can bind to either receptor chain in the dimer, provided that the intracellular tyrosine residue of the receptor in the dimer is free (supp. fig. c & d). the stat or stat molecule can subsequently dissociate from the receptor chain in the dimer or can become phosphorylated (with rate 𝑞) whilst bound to the dimer.
we have assumed that the rate of stat or stat phosphorylation when bound does not depend on the stat type ( or ) or on the receptor chain (supp. fig. c & d). phosphorylated stat (pstat ) and stat (pstat ) molecules can dissociate from the dimer. once free in the cytoplasm, they can then dephosphorylate (supp. fig. g). we have assumed that this rate of stat dephosphorylation only depends on the concentration of the respective pstat type, free in the cytoplasm. we note that no allostery has been considered in the models and hence, phosphorylated and unphosphorylated stat molecules dissociate from the receptor with the same rate (supp. fig. c & d). finally, any molecular species containing receptor molecules can be removed from the system, due to internalisation or degradation, via one of two hypothesised mechanisms (supp. fig. e & f):
• hypothesis (h ): receptors (free or bound, phosphorylated or unphosphorylated) are internalised/degraded with a rate proportional to the concentration of the species in which they are contained, or
• hypothesis (h ): receptors (free or bound, phosphorylated or unphosphorylated) are internalised/degraded with a rate proportional to the product of the concentration of the species in which they are contained and the sum of the concentrations of free cytoplasmic phosphorylated stat and stat .
we note that hypothesis assumes that receptor molecules (free or bound, phosphorylated or unphosphorylated) are being internalised/degraded as part of the natural cellular trafficking cycle. hypothesis is consistent with a potential feedback mechanism, whereby the free cytoplasmic pstat molecules would migrate to the nucleus and increase the production of negative feedback proteins, such as socs , which down-regulate cytokine signaling.
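the difference between the two removal hypotheses can be made concrete with a small sketch. the function and argument names below are our own illustrative choices; beta and gamma stand in for the internalisation/degradation rate constants of the first and second hypothesis, respectively, and the numerical values are arbitrary.

```python
def removal_rate_h1(beta, species_conc):
    """First hypothesis: removal is proportional to the concentration of
    the receptor-containing species only (constitutive trafficking)."""
    return beta * species_conc


def removal_rate_h2(gamma, species_conc, free_pstat_a, free_pstat_b):
    """Second hypothesis: removal is proportional to the species
    concentration times the summed concentration of the two free
    cytoplasmic phosphorylated STAT types (feedback-driven removal)."""
    return gamma * (free_pstat_a + free_pstat_b) * species_conc


# arbitrary illustrative values (concentrations in nM)
h1_rate = removal_rate_h1(0.1, 2.0)
h2_rate_no_signal = removal_rate_h2(0.5, 2.0, 0.0, 0.0)
h2_rate_signal = removal_rate_h2(0.5, 2.0, 1.0, 3.0)
```

under the second hypothesis nothing is removed before any phosphorylated stat has accumulated, whereas under the first hypothesis removal starts immediately; this is the qualitative difference the two model variants are designed to distinguish.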
thus, the internalisation/degradation rate of receptor molecules (free or bound, phosphorylated or unphosphorylated) under hypothesis increases with the total amount of free cytoplasmic phosphorylated stat and stat , to account for this surface receptor down-regulation. a depiction of the reactions in both the hypil- and il- mathematical models and under each hypothesis is given in supp. fig. , where a), c), e) and g) describe the hypil- model and b), d), f) and g) describe the il- model. in this figure, 𝑖 ∈ { , } so that the reactions shown can either involve stat or stat . above or below the reaction arrows is a symbol which represents the rate at which the reaction occurs (under the assumption of mass action kinetics). the notation for the rate constants and initial concentrations in the models, along with their descriptions and units, is given in table .

parameter | description | unit
𝑟#,) & ,𝑟#,"* & | rate of receptor-ligand binding | nm- s-
𝑟#,) , ,𝑟#,"* , | rate of receptor-ligand dissociation | s-
𝑟",) & ,𝑟","* & | rate of monomers binding to form a dimer | nm- s-
𝑟",) , ,𝑟","* , | rate of dissociation of the dimer | s-
𝑘$% & | rate of stat𝑖 binding to gp | nm- s-
𝑘$' & | rate of stat𝑖 binding to il- ra | nm- s-
𝑘$% , | rate of stat𝑖 dissociating from gp | s-
𝑘$' , | rate of stat𝑖 dissociating from il- ra | s-
𝑞 | rate of stat phosphorylation on the dimer | s-
𝑑$ | rate of free pstat𝑖 dephosphorylation | s-
𝛽),𝛽"* | rate of receptor internalisation/degradation under hypothesis | s-
𝛾),𝛾"* | rate of receptor internalisation/degradation under hypothesis | nm- s-
[𝑅#( )] | initial concentration of gp | nm
[𝑅"( )] | initial concentration of il- rα | nm
[𝑆$( )] | initial concentration of stat𝑖 | nm

table : notation, definitions and units for the parameter values used in the mathematical models, where 𝑖 ∈ { , } so that stat𝑖 corresponds to stat or stat .

the hypil- mathematical model was formulated based on reactions involving the following species:
• 𝐿) = hypil- ,
• 𝑅# = gp ,
• 𝐶# = gp - hypil- monomer,
• 𝐷) = phosphorylated gp - hypil- - hypil- - gp homodimer,
• 𝑆# = unbound cytoplasmic unphosphorylated stat ,
• 𝑆( = unbound cytoplasmic unphosphorylated stat ,
• 𝐷) ⋅ 𝑆# = dimer bound to stat ,
• 𝐷) ⋅ 𝑆( = dimer bound to stat ,
• 𝐷) ⋅ 𝑝𝑆# = dimer bound to pstat ,
• 𝐷) ⋅ 𝑝𝑆( = dimer bound to pstat ,
• 𝑆# ⋅ 𝐷) ⋅ 𝑆# = dimer bound to two molecules of stat ,
• 𝑝𝑆# ⋅ 𝐷) ⋅ 𝑆# = dimer bound to two molecules of stat , one of which is phosphorylated,
• 𝑝𝑆# ⋅ 𝐷) ⋅ 𝑝𝑆# = dimer bound to two molecules of pstat ,
• 𝑆( ⋅ 𝐷) ⋅ 𝑆( = dimer bound to two molecules of stat ,
• 𝑝𝑆( ⋅ 𝐷) ⋅ 𝑆( = dimer bound to two molecules of stat , one of which is phosphorylated,
• 𝑝𝑆( ⋅ 𝐷) ⋅ 𝑝𝑆( = dimer bound to two molecules of pstat ,
• 𝑆# ⋅ 𝐷) ⋅ 𝑆( = dimer bound to one molecule of stat and one of stat ,
• 𝑝𝑆# ⋅ 𝐷) ⋅ 𝑆( = dimer bound to one molecule of pstat and one of stat ,
• 𝑆# ⋅ 𝐷) ⋅ 𝑝𝑆( = dimer bound to one molecule of stat and one of pstat ,
• 𝑝𝑆# ⋅ 𝐷) ⋅ 𝑝𝑆( = dimer bound to one molecule of pstat and one of pstat ,
• 𝑝𝑆# = unbound cytoplasmic phosphorylated stat ,
• 𝑝𝑆( = unbound cytoplasmic phosphorylated stat .
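the number of dimer-containing species in the list above follows from simple counting, which the sketch below reproduces. it assumes, as in the model description, that each of the two receptor chains in a dimer is either free or carries one stat molecule of either type, phosphorylated or not. the labels "Sx" and "Sy" are placeholders of our own for the two stat types, and the function names are hypothetical.

```python
from itertools import combinations_with_replacement, product

# possible states of one receptor chain in the dimer: free, or bound to an
# unphosphorylated / phosphorylated STAT of either type (placeholder labels)
SITE_STATES = ("empty", "Sx", "Sy", "pSx", "pSy")


def homodimer_states():
    """Occupancy states when the two chains are indistinguishable
    (order of the two sites does not matter)."""
    return list(combinations_with_replacement(SITE_STATES, 2))


def heterodimer_states():
    """Occupancy states when the two chains are distinguishable
    (order of the two sites matters)."""
    return list(product(SITE_STATES, repeat=2))


homo = homodimer_states()
hetero = heterodimer_states()
singly_bound = [s for s in homo if s.count("empty") == 1]
doubly_bound = [s for s in homo if "empty" not in s]
```

for the homodimer this gives 15 states (1 unoccupied, 4 singly occupied, 10 doubly occupied), matching the 15 dimer-containing species listed above; together with the ligand, the receptor monomer, the ligand-bound monomer and the four free stat forms this makes 22 species. a heterodimer with two distinguishable chains has 25 occupancy states by the same counting.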
the initial reactions in the hypil- signaling pathway can then be described by the odes ( ) – ( ), under the law of mass action, where the terms involving the parameter 𝛽) apply only to the model under hypothesis and the terms involving the parameter 𝛾) apply only to the model under hypothesis . square brackets around a species denote the concentration of this species in units of nm, and “⋅” implies a reaction bond between two molecules/species. the odes are valid for any time 𝑡, with 𝑡 ≥ , but the time argument has been omitted from the species concentrations for ease of notation. we note here that, for example, [𝑅#] = [𝑅#](𝑡) for all 𝑡 ≥ .
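to illustrate how mass action kinetics turns a reaction scheme into odes of this kind, the sketch below integrates a deliberately reduced system: reversible receptor-ligand binding with linear removal of receptor-containing species. it is not the full model of this section (dimerisation and all stat reactions are left out), the names r_on, r_off and beta and all numerical values are hypothetical, and a hand-written fourth-order runge-kutta step is used only to keep the example dependency-free.

```python
def rhs(state, r_on, r_off, beta):
    """Mass-action right-hand side for R + L <-> C with linear removal
    (rate constant beta) of the receptor-containing species R and C."""
    R, L, C = state
    bind = r_on * R * L      # formation of the receptor-ligand monomer
    unbind = r_off * C       # dissociation of the receptor-ligand monomer
    return (-bind + unbind - beta * R,
            -bind + unbind,
            bind - unbind - beta * C)


def rk4_step(state, dt, r_on, r_off, beta):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(k, h):
        return tuple(s + h * ki for s, ki in zip(state, k))
    k1 = rhs(state, r_on, r_off, beta)
    k2 = rhs(shifted(k1, 0.5 * dt), r_on, r_off, beta)
    k3 = rhs(shifted(k2, 0.5 * dt), r_on, r_off, beta)
    k4 = rhs(shifted(k3, dt), r_on, r_off, beta)
    return tuple(s + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def simulate(state, t_end, dt, r_on, r_off, beta):
    """Integrate from t = 0 to t_end with a fixed step and return the final state."""
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(state, dt, r_on, r_off, beta)
    return state


# hypothetical initial concentrations (nM): receptor, ligand, complex
INITIAL = (1.0, 10.0, 0.0)
no_removal = simulate(INITIAL, 200.0, 0.01, 0.1, 0.05, 0.0)
with_removal = simulate(INITIAL, 10.0, 0.01, 0.1, 0.05, 0.01)
```

two sanity checks follow directly from the equations: without removal the total receptor [R] + [C] is conserved and the system relaxes to the point where binding and unbinding fluxes balance, and with removal the total receptor decays exponentially at rate beta.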
/ 𝑑[𝑅 ] 𝑑𝑡 = −𝑟 , + [𝑅 ][𝐿)] + 𝑟 , − [𝐶 ] − 𝛽 [𝑅 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑅 ] ( ) 𝑑[𝐿)] 𝑑𝑡 = −𝑟 , + [𝑅 ][𝐿)] + 𝑟 , − [𝐶 ] ( ) 𝑑[𝐶 ] 𝑑𝑡 = 𝑟 , + [𝑅 ][𝐿)] − 𝑟 , − [𝐶 ] − 𝑟 , + [𝐶 ] + 𝑟 , − [𝐷 ] − 𝛽 [𝐶 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐶 ] ( ) 𝑑[𝐷 ] 𝑑𝑡 = 𝑟 , + [𝐶 ] − 𝑟 , − [𝐷 ] − 𝑘 𝑎 + [𝐷 ][𝑆 ] + 𝑘 𝑎 − ([𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ]) − 𝑘 𝑎 + [𝐷 ][𝑆 ] + 𝑘 𝑎 − ([𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ]) − 𝛽 [𝐷 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ] ( ) 𝑑[𝑆 ] 𝑑𝑡 = −𝑘 𝑎 + [𝑆 ]( [𝐷 ] + [𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ]) + 𝑘 𝑎 − ([𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ]) + 𝑑 [𝑝𝑆 ] ( ) 𝑑[𝑆 ] 𝑑𝑡 = −𝑘 𝑎 + [𝑆 ]( [𝐷 ] + [𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ]) + 𝑘 𝑎 − ([𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆# ⋅ 𝐷) ⋅ 𝑆(]) + 𝑑 [𝑝𝑆 ] ( ) 𝑑[𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ] − 𝑘 𝑎 − [𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 + [𝐷 ⋅ 𝑆 ][𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 + [𝐷 ⋅ 𝑆 ][𝑆 ] + 𝑘 𝑎 − [𝑆# ⋅ 𝐷 ⋅ 𝑆(] − 𝑞[𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝐷 ⋅ 𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ⋅ 𝑆 ] ( ) 𝑑[𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ] − 𝑘 𝑎 − [𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 + [𝐷 ⋅ 𝑆 ][𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 + [𝐷 ⋅ 𝑆 ][𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑞[𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑝𝑆# ⋅ 𝐷) ⋅ 𝑆(] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝐷 ⋅ 𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ⋅ 𝑆 ] ( ) 𝑑[𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = −𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑝𝑆# ⋅ 𝐷) ⋅ 𝑆(] + 𝑞[𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝐷 ⋅ 𝑝𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ⋅ 𝑝𝑆 ] ( ) 𝑑[𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = −𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑞[𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝐷 ⋅ 𝑝𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ⋅ 𝑝𝑆 ] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑞[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛽 [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑞[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛽 [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] − 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑞[𝑆) ⋅ 𝐷* ⋅ 𝑆)] − 
𝑞[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑆)] − 𝛽*[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑆)] ( ) .cc-by . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . / −𝛾*([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑆)] 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] − 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑞[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑞[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛽 [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑞[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] −𝛽*[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆)] − 𝛾*([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆)] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑞[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] −𝛽*[𝑝𝑆+ ⋅ 𝐷* ⋅ 𝑝𝑆+] − 𝛾*([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆+ ⋅ 𝐷* ⋅ 𝑝𝑆+] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 + [𝑆 ⋅ 𝐷 ][𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑞[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛽 [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑞[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] −𝑘+,- [𝑝𝑆) ⋅ 𝐷* ⋅ 𝑆+] − 𝑞[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑆+] − 𝑘),- [𝑝𝑆) ⋅ 𝐷* ⋅ 𝑆+] −𝛽*[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑆+] − 𝛾*([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑆+] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑞[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑝𝑆 ] −𝑘),- [𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆+] − 𝑞[𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆+] − 𝑘+,- [𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆+] −𝛽*[𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆+] − 𝛾*([𝑝𝑆)] + [𝑝𝑆+])[𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆+] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑞([𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ]) −[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆+](𝑘),- + 𝑘+,- ) − 𝛽*[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆+] −𝛾*([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆+] ( ) 𝑑[𝑝𝑆 ] 𝑑𝑡 = 𝑘 𝑎 − ([𝐷 ⋅ 𝑝𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ]) − 𝑑 [𝑝𝑆 ] ( ) 𝑑[𝑝𝑆 ] 𝑑𝑡 = 𝑘 𝑎 − ([𝐷 ⋅ 𝑝𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ]) − 𝑑 [𝑝𝑆 ] ( ) similarly, and with some species in common with the hypil- model, the il- model has been formulated based on reactions involving the following species: • 𝐿"* = il- , • 𝑅# = gp , • 𝑅" = il- ra, • 𝐶" = il- ra - il- monomer, • 𝐷"* = phosphorylated il- ra - il- 
- gp heterodimer,
• $S_1$ = unbound cytoplasmic unphosphorylated stat ,
• $S_3$ = unbound cytoplasmic unphosphorylated stat ,
• $S_1 \cdot D_{27}$ = dimer bound to stat via $R_1$,
• $S_3 \cdot D_{27}$ = dimer bound to stat via $R_1$,
• $pS_1 \cdot D_{27}$ = dimer bound to pstat via $R_1$,
• $pS_3 \cdot D_{27}$ = dimer bound to pstat via $R_1$,
• $D_{27} \cdot S_1$ = dimer bound to stat via $R_2$,
• $D_{27} \cdot S_3$ = dimer bound to stat via $R_2$,
• $D_{27} \cdot pS_1$ = dimer bound to pstat via $R_2$,
• $D_{27} \cdot pS_3$ = dimer bound to pstat via $R_2$,
• $S_1 \cdot D_{27} \cdot S_1$ = dimer bound to two molecules of stat ,
• $pS_1 \cdot D_{27} \cdot S_1$ = dimer bound to two molecules of stat , one of them phosphorylated on $R_1$,
• $S_1 \cdot D_{27} \cdot pS_1$ = dimer bound to two molecules of stat , one of them phosphorylated on $R_2$,
• $pS_1 \cdot D_{27} \cdot pS_1$ = dimer bound to two molecules of pstat ,
• $S_3 \cdot D_{27} \cdot S_3$ = dimer bound to two molecules of stat ,
• $pS_3 \cdot D_{27} \cdot S_3$ = dimer bound to two molecules of stat , one of them phosphorylated on $R_1$,
• $S_3 \cdot D_{27} \cdot pS_3$ = dimer bound to two molecules of stat , one of them phosphorylated on $R_2$,
• $pS_3 \cdot D_{27} \cdot pS_3$ = dimer bound to two molecules of pstat ,
• $S_1 \cdot D_{27} \cdot S_3$ = dimer bound to stat via $R_1$ and stat via $R_2$,
• $S_3 \cdot D_{27} \cdot S_1$ = dimer bound to stat via $R_2$ and stat via $R_1$,
• $pS_1 \cdot D_{27} \cdot S_3$ = dimer bound to pstat via $R_1$ and stat via $R_2$,
• $S_3 \cdot D_{27} \cdot pS_1$ = dimer bound to pstat via $R_2$ and stat via $R_1$,
• $S_1 \cdot D_{27} \cdot pS_3$ = dimer bound to stat via $R_1$ and pstat via $R_2$,
• $pS_3 \cdot D_{27} \cdot S_1$ = dimer bound to stat via $R_2$ and pstat via $R_1$,
• $pS_1 \cdot D_{27} \cdot pS_3$ = dimer bound to pstat via $R_1$ and pstat via $R_2$,
• $pS_3 \cdot D_{27} \cdot pS_1$ = dimer bound to pstat via $R_2$ and pstat via $R_1$,
• $pS_1$ = unbound cytoplasmic phosphorylated stat
, • 𝑝𝑆( = unbound cytoplasmic phosphorylated stat . again, under the law of mass action, the initial reactions in the il- signaling pathway can be described by the odes ( ) – ( ). 𝑑[𝑅 ] 𝑑𝑡 = −𝑟 , + [𝐶 ][𝑅 ] + 𝑟 , − [𝐷 ] − 𝛽 [𝑅 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑅 ] ( ) 𝑑[𝑅 ] 𝑑𝑡 = −𝑟 , + [𝑅 ][𝐿 ] + 𝑟 , − [𝐶 ] − 𝛽 [𝑅 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑅 ] ( ) 𝑑[𝐿 ] 𝑑𝑡 = −𝑟 , + [𝑅 ][𝐿 ] + 𝑟 , − [𝐶 ] ( ) 𝑑[𝐶 ] 𝑑𝑡 = 𝑟 , + [𝑅 ][𝐿 ] − 𝑟 , − [𝐶 ] − 𝑟 , + [𝐶 ][𝑅 ] + 𝑟 , − [𝐷 ] − 𝛽 [𝐶 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐶 ] ( ) 𝑑[𝐷 ] 𝑑𝑡 = 𝑟 , + [𝐶 ][𝑅 ] − 𝑟 , − [𝐷 ] − m𝑘 𝑎 + + 𝑘 𝑏 + n[𝐷 ][𝑆 ] + 𝑘 𝑎 − ([𝑆 ⋅ 𝐷 ] + [𝑝𝑆 ⋅ 𝐷 ]) + 𝑘 𝑏 − ([𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ]) − m𝑘 𝑎 + + 𝑘 𝑏 + n[𝐷 ][𝑆 ] + 𝑘 𝑎 − ([𝑆 ⋅ 𝐷 ] + [𝑝𝑆 ⋅ 𝐷 ]) + 𝑘 𝑏 − ([𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ]) − 𝛽 [𝐷 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ] ( ) .cc-by . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . 
/ 𝑑[𝑆 ] 𝑑𝑡 = −𝑘 𝑎 + [𝑆 ]([𝐷 ] + [𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ] + [𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ]) + 𝑘 𝑎 − ([𝑆 ⋅ 𝐷 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ]) − 𝑘 𝑏 + [𝑆 ]([𝐷 ] + [𝑆 ⋅ 𝐷 ] + [𝑝𝑆 ⋅ 𝐷 ] + [𝑆 ⋅ 𝐷 ] + [𝑝𝑆 ⋅ 𝐷 ]) + 𝑘 𝑏 − ([𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ]) + 𝑑 [𝑝𝑆 ] ( ) 𝑑[𝑆 ] 𝑑𝑡 = −𝑘 𝑎 + [𝑆 ]([𝐷 ] + [𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ] + [𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ]) + 𝑘 𝑎 − ([𝑆 ⋅ 𝐷 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ]) − 𝑘 𝑏 + [𝑆 ]([𝐷 ] + [𝑆 ⋅ 𝐷 ] + [𝑝𝑆 ⋅ 𝐷 ] + [𝑆 ⋅ 𝐷 ] + [𝑝𝑆 ⋅ 𝐷 ]) + 𝑘 𝑏 − ([𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ]) + 𝑑 [𝑝𝑆 ] ( ) 𝑑[𝑆 ⋅ 𝐷 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ] − 𝑞[𝑆 ⋅ 𝐷 ] − 𝑘 𝑏 + [𝑆 ][𝑆 ⋅ 𝐷 ] + 𝑘 𝑏 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑏 + [𝑆 ][𝑆 ⋅ 𝐷 ] + 𝑘 𝑏 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑘 𝑏 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑏 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝑆 ⋅ 𝐷 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑆 ⋅ 𝐷 ] ( ) 𝑑[𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑏 + [𝑆 ][𝐷 ] − 𝑘 𝑏 − [𝐷 ⋅ 𝑆 ] − 𝑞[𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛽 [𝐷 ⋅ 𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ⋅ 𝑆 ] ( ) 𝑑[𝑆 ⋅ 𝐷 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ] − 𝑞[𝑆 ⋅ 𝐷 ] − 𝑘 𝑏 + [𝑆 ][𝑆 ⋅ 𝐷 ] + 𝑘 𝑏 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑏 + [𝑆 ][𝑆 ⋅ 𝐷 ] + 𝑘 𝑏 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑘 𝑏 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑏 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝑆 ⋅ 𝐷 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑆 ⋅ 𝐷 ] ( ) 𝑑[𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑏 + [𝑆 ][𝐷 ] − 𝑘 𝑏 − [𝐷 ⋅ 𝑆 ] − 𝑞[𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛽 [𝐷 ⋅ 𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ⋅ 𝑆 ] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ] 𝑑𝑡 = −𝑘 𝑏 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] + 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑏 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] + 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑞[𝑆 ⋅ 𝐷 ] − 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ] + 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝑝𝑆 ⋅ 𝐷 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑝𝑆 ⋅ 𝐷 ] ( ) .cc-by . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. 
it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . / 𝑑[𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = −𝑘 𝑎 + [𝐷 ⋅ 𝑝𝑆 ][𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝑘 𝑎 + [𝐷 ⋅ 𝑝𝑆 ][𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑞[𝐷 ⋅ 𝑆 ] − 𝑘 𝑏 − [𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝐷 ⋅ 𝑝𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ⋅ 𝑝𝑆 ] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ] 𝑑𝑡 = −𝑘 𝑏 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] + 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑏 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] + 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑞[𝑆 ⋅ 𝐷 ] − 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ] + 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝑝𝑆 ⋅ 𝐷 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑝𝑆 ⋅ 𝐷 ] ( ) 𝑑[𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = −𝑘 𝑎 + [𝐷 ⋅ 𝑝𝑆 ][𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝑘 𝑎 + [𝐷 ⋅ 𝑝𝑆 ][𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑞[𝐷 ⋅ 𝑆 ] − 𝑘 𝑏 − [𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝐷 ⋅ 𝑝𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ⋅ 𝑝𝑆 ] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] +𝑘) [𝑆) ⋅ 𝐷 ][𝑆)] − 𝑘) - [𝑆) ⋅ 𝐷 ⋅ 𝑆)] − 𝑞[𝑆) ⋅ 𝐷 ⋅ 𝑆)] −𝛽 [𝑆) ⋅ 𝐷 ⋅ 𝑆)] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑆) ⋅ 𝐷 ⋅ 𝑆)] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑏 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] − 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] +𝑞[𝑆) ⋅ 𝐷 ⋅ 𝑆)] − 𝑞[𝑝𝑆) ⋅ 𝐷 ⋅ 𝑆)] − 𝑘),- [𝑝𝑆) ⋅ 𝐷 ⋅ 𝑆)] −𝛽 [𝑝𝑆) ⋅ 𝐷 ⋅ 𝑆)] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆) ⋅ 𝐷 ⋅ 𝑆)] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑝𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] +𝑞[𝑆) ⋅ 𝐷 ⋅ 𝑆)] − 𝑞[𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆)] − 𝑘) - [𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆)] −𝛽 [𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆)] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆)] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑞([𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ]) −[𝑝𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆)](𝑘),- + 𝑘) - ) − 𝛽 [𝑝𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆)] −𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆)] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] +𝑘+ [𝑆+ ⋅ 𝐷 ][𝑆+] − 𝑘+ - [𝑆+ ⋅ 𝐷 ⋅ 𝑆+] − 𝑞[𝑆+ ⋅ 𝐷 ⋅ 𝑆+] −𝛽 [𝑆+ ⋅ 𝐷 ⋅ 𝑆+] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑆+ ⋅ 𝐷 ⋅ 𝑆+] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑏 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] − 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] +𝑞[𝑆+ ⋅ 𝐷 ⋅ 𝑆+] − 𝑞[𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑆+] − 𝑘+,- [𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑆+] −𝛽 [𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑆+] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑆+] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑝𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] ( ) .cc-by . 
international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . / +𝑞[𝑆+ ⋅ 𝐷 ⋅ 𝑆+] − 𝑞[𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆+] − 𝑘+ - [𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆+] −𝛽 [𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆+] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆+] 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑞([𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ]) −[𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆+](𝑘+,- + 𝑘+ - ) − 𝛽 [𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆+] −𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆+] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] +𝑘+ [𝑆) ⋅ 𝐷 ][𝑆+] − 𝑘+ - [𝑆) ⋅ 𝐷 ⋅ 𝑆+] − 𝑞[𝑆) ⋅ 𝐷 ⋅ 𝑆+] −𝛽 [𝑆) ⋅ 𝐷 ⋅ 𝑆+] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑆) ⋅ 𝐷 ⋅ 𝑆+] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] +𝑘) [𝑆+ ⋅ 𝐷 ][𝑆)] − 𝑘) - [𝑆+ ⋅ 𝐷 ⋅ 𝑆)] − 𝑞[𝑆+ ⋅ 𝐷 ⋅ 𝑆)] −𝛽 [𝑆+ ⋅ 𝐷 ⋅ 𝑆)] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑆+ ⋅ 𝐷 ⋅ 𝑆)] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑏 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] − 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] +𝑞[𝑆) ⋅ 𝐷 ⋅ 𝑆+] − 𝑞[𝑝𝑆) ⋅ 𝐷 ⋅ 𝑆+] − 𝑘),- [𝑝𝑆) ⋅ 𝐷 ⋅ 𝑆+] −𝛽 [𝑝𝑆) ⋅ 𝐷 ⋅ 𝑆+] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆) ⋅ 𝐷 ⋅ 𝑆+] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑏 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] − 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] +𝑞[𝑆+ ⋅ 𝐷 ⋅ 𝑆)] − 𝑞[𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑆)] − 𝑘+,- [𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑆)] −𝛽 [𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑆)] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑆)] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑝𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] +𝑞[𝑆) ⋅ 𝐷 ⋅ 𝑆+] − 𝑞[𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆+] − 𝑘+ - [𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆+] −𝛽 [𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆+] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆+] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑝𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] +𝑞[𝑆+ ⋅ 𝐷 ⋅ 𝑆)] − 𝑞[𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆)] − 𝑘) - [𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆)] −𝛽 [𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆)] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆)] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑞([𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ]) −[𝑝𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆+](𝑘),- + 𝑘+ - ) − 𝛽 [𝑝𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆+] −𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆+] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑞([𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ]) −[𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆)](𝑘+,- + 𝑘) - ) − 𝛽 [𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆)] −𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆)] ( ) .cc-by . 
$$\frac{d[pS_1]}{dt} = k_{1a}^{-}\big([pS_1 \cdot D_{27}] + [pS_1 \cdot D_{27} \cdot S_1] + [pS_1 \cdot D_{27} \cdot pS_1] + [pS_1 \cdot D_{27} \cdot S_3] + [pS_1 \cdot D_{27} \cdot pS_3]\big) + k_{1b}^{-}\big([D_{27} \cdot pS_1] + [S_1 \cdot D_{27} \cdot pS_1] + [pS_1 \cdot D_{27} \cdot pS_1] + [S_3 \cdot D_{27} \cdot pS_1] + [pS_3 \cdot D_{27} \cdot pS_1]\big) - d_1 [pS_1]$$

$$\frac{d[pS_3]}{dt} = k_{3a}^{-}\big([pS_3 \cdot D_{27}] + [pS_3 \cdot D_{27} \cdot S_3] + [pS_3 \cdot D_{27} \cdot pS_3] + [pS_3 \cdot D_{27} \cdot S_1] + [pS_3 \cdot D_{27} \cdot pS_1]\big) + k_{3b}^{-}\big([D_{27} \cdot pS_3] + [S_3 \cdot D_{27} \cdot pS_3] + [pS_3 \cdot D_{27} \cdot pS_3] + [S_1 \cdot D_{27} \cdot pS_3] + [pS_1 \cdot D_{27} \cdot pS_3]\big) - d_3 [pS_3]$$

similarly to the hypil- model, the terms in equations ( ) - ( ) involving the parameter $\beta_{27}$ apply only to the model under hypothesis and the terms involving the parameter $\gamma_{27}$ apply only to the model under hypothesis .

we now describe how we used the experimental data (fig. b and c supp.) to parameterise the mathematical models described above. since the experimental outputs are levels of pstat and pstat as a function of time under hypil- and il- stimulation (fig. b and c supp.), we consider two model outputs of interest for the hypil- and il- mathematical models, which are proportional to the experimental data in supp. figure b and c; namely, the sum of all molecular species (variables) containing phosphorylated stat (free or bound) ($[pS_1]_{T,j}$, for $j \in \{ , \}$) and the sum of all species (variables) containing phosphorylated stat (free or bound) ($[pS_3]_{T,j}$, for $j \in \{ , \}$).
the concentrations of the two model outputs of interest at any time $t$ are given by

$$[pS_1]_{T,6}(t) = [D_6 \cdot pS_1](t) + [pS_1 \cdot D_6 \cdot S_1](t) + [pS_1 \cdot D_6 \cdot pS_1](t) + [pS_1 \cdot D_6 \cdot S_3](t) + [pS_1 \cdot D_6 \cdot pS_3](t) + [pS_1](t),$$

$$[pS_3]_{T,6}(t) = [D_6 \cdot pS_3](t) + [pS_3 \cdot D_6 \cdot S_3](t) + [pS_3 \cdot D_6 \cdot pS_3](t) + [pS_3 \cdot D_6 \cdot S_1](t) + [pS_3 \cdot D_6 \cdot pS_1](t) + [pS_3](t),$$

for the hypil- model, and by

$$[pS_1]_{T,27}(t) = [pS_1 \cdot D_{27}](t) + [D_{27} \cdot pS_1](t) + [pS_1 \cdot D_{27} \cdot S_1](t) + [S_1 \cdot D_{27} \cdot pS_1](t) + [pS_1 \cdot D_{27} \cdot pS_1](t) + [pS_1 \cdot D_{27} \cdot S_3](t) + [S_3 \cdot D_{27} \cdot pS_1](t) + [pS_1 \cdot D_{27} \cdot pS_3](t) + [pS_3 \cdot D_{27} \cdot pS_1](t) + [pS_1](t),$$

$$[pS_3]_{T,27}(t) = [pS_3 \cdot D_{27}](t) + [D_{27} \cdot pS_3](t) + [pS_3 \cdot D_{27} \cdot S_3](t) + [S_3 \cdot D_{27} \cdot pS_3](t) + [pS_3 \cdot D_{27} \cdot pS_3](t) + [pS_3 \cdot D_{27} \cdot S_1](t) + [S_1 \cdot D_{27} \cdot pS_3](t) + [pS_1 \cdot D_{27} \cdot pS_3](t) + [pS_3 \cdot D_{27} \cdot pS_1](t) + [pS_3](t),$$

for the il- model.

having developed two mathematical models for the stimulation of the experimental system with hypil- and il- , our objective was then to parameterise these models using approximate bayesian computation sequential monte carlo (abc-smc). firstly, a bayesian model selection was carried out to determine which hypothesis (mechanism) of internalisation/degradation of receptor molecules is most likely given the data. once a hypothesis was selected, the abc-smc method, together with the experimental data, yields posterior distributions for each of the parameter values and initial concentrations in the mathematical models. in this way, we can learn which reactions and parameters in the models cause the differential signaling by pstat observed when stimulating with hypil- and il- . the experimental data we used to compare with the mathematical model
outputs was the mean relative fluorescence intensity of total phosphorylated stat and total phosphorylated stat in both rpe and th- cells (supp. figure b and c). we normalised the data to obtain dimensionless values, which can be compared with the mathematical model outputs. firstly, we constructed a linear model for the fluorescence intensity (background fluorescence) of antibodies for phosphorylated stat and stat in unstimulated cells. we subtracted the value of this linear model at each time point from the corresponding fluorescence intensity in hypil- and il- stimulated cells, for each repeat of the experiment and each cell type. denoting by $f$ the experimental fluorescence intensity, $f(r, i, tp, j, d)$ corresponds to the fluorescence intensity for the $r$th repeat, $r \in R = \{ , , , \}$, with antibody for stat$i$, $i \in I = \{ , \}$, at time point $tp \in TP = \{$ min, min, min, min, min, min, min, min$\}$, under stimulation by cytokine il-$j$ (hypil-$j$ when $j = $ ), with $j \in J = \{ , \}$, and in cell type $d \in D = \{$rpe , th- $\}$. each data point $data(r, i, tp, j, d)$ to be used in the bayesian inference and bayesian model selection was then computed as

$$data(r, i, tp, j, d) = \frac{f(r, i, tp, j, d)}{f(r, i, tp = \ \text{min}, j = \ , d)}.$$

to compare the model output, $sim$, with the data, the output was normalised in the same way as the data, i.e.,

$$sim(i, tp, j, d) = \frac{[pS_i]_{T,j}(tp, d)}{[pS_i]_{T,27}(\ \text{min}, d)},$$

where $[pS_i]_{T,j}(tp, d)$ denotes the total concentration of phosphorylated stat$i$ at time $tp$ (see equations - ) when considering cell type $d$. in this way, experimental data and the mathematical model outputs are comparable. the similarity between the model output and the data points is then computed by the introduction of a distance measure $\delta(sim, data)$.
here, this distance measure was chosen as a generalisation of the euclidean distance, where

$$\delta_d(sim, data)^2 = \sum_{j \in J} \sum_{tp \in TP} \sum_{i \in I} \big( sim(i, tp, j, d) - \mu_{data}(i, tp, j, d) \big)^2,$$

for $d \in D = \{$rpe , th- $\}$, where $\mu_{data}(i, tp, j, d)$ is the mean of the four repeats of the data and is given by

$$\mu_{data}(i, tp, j, d) = \frac{1}{|R|} \sum_{r \in R} data(r, i, tp, j, d).$$

to carry out the bayesian model selection and bayesian parameter inference, prior beliefs about the parameters were firstly defined. each of the parameters (reaction rates) and initial concentrations in the model were sampled from a prior distribution, where the distribution was informed by experimental data or values from the literature, when possible. the choice of prior distributions is given in table .

| parameter | prior distribution | reference |
|---|---|---|
| $r_{1,6}^{+}$ | for r ∼ N(− , . ) | * |
| $r_{1,6}^{-}$ | for r ∼ N(− . , . ) | * |
| $r_{1,27}^{+}$ | for r ∼ N(− . , . ) | * |
| $r_{1,27}^{-}$ | for r ∼ N(− . , . ) | * |
| $r_{2,j}^{+}$ for j ∈ { , } | for r ∼ Unif(− , ) | ( ) |
| $r_{2,j}^{-}$ for j ∈ { , } | for r ∼ Unif(− , ) | ( ) |
| $k_{ia}^{+}$, $k_{ib}^{+}$ for i ∈ { , } | for r ∼ Unif(− , ) | ** |
| $k_{ia}^{-}$, $k_{ib}^{-}$ for i ∈ { , } | for r ∼ Unif(− , ) | ** |
| $q$ | for r ∼ Unif(− , ) | assumed |
| $d_i$ for i ∈ { , } | for r ∼ Unif(− ,− ) | *** |
| $\beta_j$ for j ∈ { , } | for r ∼ Unif(− ,− ) | † |
| $[R_1(0)]$ | N( . , . ) | ‡ |
| $[R_2(0)]$ | N( . , . ) | ‡ |
| $[S_1(0)]$ | N( , ) | ( ) |
| $[S_3(0)]$ | N( , ) | ( ) |

table : prior distributions assigned to each parameter and initial concentration in the model.
* these distributions are centred around measurements obtained from cell surface receptor quantification experiments.
** these distributions were derived based on $K_D$ values obtained from the literature ( ).
*** these distributions are based on values derived from experimental data in which the cells were treated with tofacitinib.
† these distributions were based on values derived from experimental data in which the cells were treated with cycloheximide.
‡ these distributions were based on computations involving approximate cell sizes and average numbers of molecules per cell.

we then used the prior distributions from table to carry out a bayesian model selection to determine which hypothesis is most likely given the rpe data for both hypil- and il- signaling. we ran ) simulations for each mathematical model (hypil- and il- ) and for each hypothesis, sampling model parameters from their prior distributions. we then computed a summary statistic for varying values of $\delta^{*}$, the distance threshold between the mathematical model and data at which parameters are accepted (or rejected) in the abc. finally, we computed $f(H_k)$, the number of accepted parameter sets for hypothesis $k$, where a parameter set is accepted if it results in a distance value less than or equal to the distance threshold $\delta^{*}$. this allowed us to compute the relative probability, $p(H_k)$, for each hypothesis, as defined by the following equation

$$p(H_k \mid \delta^{*}) = \frac{f(H_k \mid \delta^{*})}{f(H_1 \mid \delta^{*}) + f(H_2 \mid \delta^{*})},$$

for $k \in \{ , \}$. the results of the model selection analysis for rpe are shown in figure d, where the relative probability of hypothesis increases as $\delta^{*}$ tends to , whilst the relative probability of hypothesis decreases as a function of $\delta^{*}$. we hence concluded that the experimental data together with the mathematical models for hypil- and il- signaling provide greater support to hypothesis (around %) when compared to hypothesis (around %). we note that as the distance threshold, $\delta^{*}$, is increased, both hypotheses become equally likely, as is to be expected.
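the normalisation, distance and model-selection steps described above can be sketched in a few lines of python. this is only an illustration under stated assumptions: the array layout (repeat, stat antibody, time point, cytokine), the function names `normalise`, `distance` and `relative_probabilities`, and all toy numbers are hypothetical and are not taken from the original analysis; the sketch also assumes the same number of simulations was run for each hypothesis, as stated in the text.

```python
import numpy as np

def normalise(f, ref_tp, ref_j):
    """Divide f[r, i, tp, j] by the reference value f[r, i, ref_tp, ref_j].

    f holds background-subtracted intensities for one cell type, indexed as
    (repeat, STAT antibody, time point, cytokine). This layout is an
    assumption of the sketch, not something specified by the source.
    """
    f = np.asarray(f, dtype=float)
    ref = f[:, :, ref_tp, ref_j]           # shape (repeat, antibody)
    return f / ref[:, :, None, None]       # broadcast over tp and j

def distance(sim, data):
    """Euclidean-type distance between a model output and the repeat mean.

    sim has shape (i, tp, j); data has shape (r, i, tp, j).
    """
    mu = np.mean(data, axis=0)             # mean over repeats
    return float(np.sqrt(np.sum((np.asarray(sim) - mu) ** 2)))

def relative_probabilities(distances_by_hypothesis, threshold):
    """Relative probability of each hypothesis from accepted-sample counts."""
    counts = np.array([np.sum(np.asarray(d) <= threshold)
                       for d in distances_by_hypothesis], dtype=float)
    total = counts.sum()
    if total == 0:
        # no parameter set accepted at this threshold
        return np.full(len(counts), np.nan)
    return counts / total

# toy data: 2 repeats, 1 antibody, 3 time points, 2 cytokines
f = np.array([[[[1.0, 2.0], [2.0, 4.0], [3.0, 8.0]]],
              [[[2.0, 4.0], [4.0, 8.0], [6.0, 16.0]]]])
data = normalise(f, ref_tp=1, ref_j=1)

# toy distances standing in for prior simulations under two hypotheses
d_h1 = [0.2, 0.4, 0.9, 1.5]
d_h2 = [0.8, 1.1, 1.6, 2.0]
p = relative_probabilities([d_h1, d_h2], threshold=1.0)
```

with a very large threshold every parameter set is accepted under both hypotheses, so the two relative probabilities coincide, which mirrors the behaviour noted in the text.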
given the results of the model selection, the bayesian parameter inference for the mathematical models of hypil- and il- signaling was only carried out for hypothesis . we used the abc sequential monte carlo (abc-smc) approach ( ) to obtain posterior distributions for the parameters in table , making use of the prior distributions in table . all model parameters in table were estimated for the rpe data set. a subset of the parameters, which we would expect may vary with cell type, were then estimated for the th- data set. in particular, the parameters not being estimated for th- were sampled from the posterior distributions obtained via the abc-smc for rpe , and those parameters estimated separately for th- were: $q$, $d_1$, $d_3$, $\beta_6$, $\beta_{27}$, $[R_1(0)]$, $[R_2(0)]$, $[S_1(0)]$ and $[S_3(0)]$.

to further validate the two mathematical models of cytokine signaling, we aimed to reproduce additional experimental results making use of the posterior parameter predictions from the rpe data abc-smc. firstly, in order to replicate the experimental dose response curve seen in supp. fig. a, we ran both models using the accepted parameter sets from the abc-smc for different values of cytokine concentration, within the range [ , – "] log nm. the results of this analysis are seen in supp. fig. b. we also modified the mathematical models to allow them to describe the il- rα-gp chimera experiments (fig. c).
in particular, a new mathematical model for the chimera experiments was developed as follows: it consisted of the odes from the il- model which are involved in the formation of the dimer (equations ( ) – ( )) and the odes from the hypil- model post-dimer formation (equations ( ) – ( )), in which $D_6$ was replaced by $D_{27}$. the ode for the il- induced dimer in the chimera model was as follows

$$\frac{d[D_{27}]}{dt} = r_{2,27}^{+}[C_2][R_1] - r_{2,27}^{-}[D_{27}] - k_{1a}^{+}[D_{27}][S_1] + k_{1a}^{-}\big([S_1 \cdot D_{27}] + [pS_1 \cdot D_{27}]\big) - k_{3a}^{+}[D_{27}][S_3] + k_{3a}^{-}\big([S_3 \cdot D_{27}] + [pS_3 \cdot D_{27}]\big) - \beta_{27}[D_{27}].$$

we simulated both the original mathematical model of il- and the chimera model using the accepted parameter sets from the abc-smc. the results can be seen in supp. fig. a.

finally, we focussed on one of the mutant varieties of il- rα, y f, and sought to reproduce the results of fig. b making use of the mathematical model of il- signaling. since the mutation decreases the affinity of stat to il- rα, we fixed the association and dissociation rates of stat to the il- rα chain, $k_{1b}^{+}$ and $k_{1b}^{-}$, at values which resulted in a high µm affinity. the specific values chosen were $k_{1b}^{+}$ = ,> nm- s- and $k_{1b}^{-}$ = # s- which yields an affinity of " µm. the rate $k_{1b}^{-}$ was chosen as approximately the median of the posterior distribution for this parameter from the abc-smc, and the rate $k_{1b}^{+}$ was then significantly decreased in order to increase the affinity value (that is, the dissociation constant, which corresponds to weaker binding). we simulated the mathematical model of il- signaling using the accepted parameter sets from the abc-smc, but where the rates $k_{1b}^{+}$ and $k_{1b}^{-}$ were fixed as described above. the pointwise medians and % credible intervals of these simulations are plotted in supp. fig. c, as well as the simulations for the wt, without altering any of the parameter values from the posterior distributions.
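the pointwise summaries mentioned above can be computed from a stack of simulated time courses, one per accepted parameter set. the following python sketch shows one way to do this; the function name `pointwise_summary`, the array layout and the toy trajectories are assumptions for illustration, and the credible level is left as an argument because the value used in the study is not recoverable from the text.

```python
import numpy as np

def pointwise_summary(trajectories, level):
    """Pointwise median and central credible band across simulations.

    trajectories: array of shape (n_parameter_sets, n_time_points), one
        simulated time course per accepted parameter set.
    level: credible level in (0, 1), e.g. 0.5 for the central 50% band.
    """
    x = np.asarray(trajectories, dtype=float)
    tail = 100.0 * (1.0 - level) / 2.0     # percentage left in each tail
    lower, median, upper = np.percentile(
        x, [tail, 50.0, 100.0 - tail], axis=0)
    return median, lower, upper

# toy example: five simulated trajectories at two time points
sims = np.array([[0.0, 1.0],
                 [1.0, 2.0],
                 [2.0, 3.0],
                 [3.0, 4.0],
                 [4.0, 5.0]])
median, lower, upper = pointwise_summary(sims, level=0.5)
```

the summaries are taken independently at each time point, so the median curve need not coincide with any single simulated trajectory.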
altering the binding affinity of stat to il- rα in this way in the mathematical model allows us to generate results which replicate the experimental observations for the y f mutant in figure b reasonably well.

live-cell dual-color single-molecule imaging studies: single molecule imaging experiments were carried out by total internal reflection fluorescence (tirf) microscopy with an inverted microscope (olympus ix ) equipped with a triple-line total internal reflection (tir) illumination condenser (olympus) and a back-illuminated electron-multiplying (em) ccd camera (ixon du d, x pixel, andor technology) as recently described ( - ). a x magnification objective with a numerical aperture of . (uapo / . tirfm, olympus) was used for tir illumination. all experiments were carried out at room temperature in medium without phenol red supplemented with an oxygen scavenger and a redox-active photoprotectant to minimize photobleaching ( ). for heterodimerization experiments of il- ra and gp , cell surface labeling of rpe gp ko, co-transfected with mxfpe-il- ra and mxfpm-gp , was achieved by adding agfp-ennbrho and agfp-minbdy to the medium at equal concentrations ( nm) and incubating for at least min prior to stimulation with il- ( nm) or hypil- ( nm). for homodimerization experiments with mxfpm-gp , agfp-minbdy and agfp-minbrho ( ) were used for cell surface receptor labelling as described above. the nanobodies were kept in the bulk solution during the whole experiment in order to ensure high equilibrium binding to mxfp- gp .
for simultaneous dual color acquisition, agfp-nbrho was excited by a nm diode-pumped solid-state laser at . mw (~ w/cm ) and agfp-nbdy by a nm laser diode at . mw (~ w/cm ). fluorescence was detected using a spectral image splitter (dualview, optical insight) with a dcxr dichroic beam splitter (chroma) in combination with the bandpass filter / (semrock) for detection of rho and / (chroma) for detection of dy , dividing each emission channel into x pixel. image stacks of frames were recorded at ms/frame.

single molecule localization and single molecule tracking were carried out using the multiple-target tracing (mtt) algorithm ( ) as described previously ( ). step-length histograms were obtained from single molecule trajectories and fitted by a two-fraction mixture model of brownian diffusion. average diffusion constants were determined from the slope ( - steps) of the mean square displacement versus time lapse diagrams. immobile molecules were identified by the density-based spatial clustering of applications with noise (dbscan) algorithm as described recently ( ). for comparing diffusion properties and for co-tracking analysis, immobile particles were excluded from the data set.

prior to co-localization analysis, imaging channels were aligned with sub-pixel precision by using a spatial transformation. to this end, a transformation matrix was calculated based on a calibration measurement with multicolour fluorescent beads (tetraspeck microspheres . µm, invitrogen) visible in both spectral channels (cp tform of type ‘affine’, the mathworks matlab a). individual molecules detected in both spectral channels were regarded as co-localized if a particle was detected in both channels of a single frame within a distance threshold of nm radius. for single molecule co-tracking analysis, the mtt algorithm was applied to this dataset of co-localized molecules to reconstruct co-locomotion trajectories (co-trajectories) from the identified population of co-localizations.
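the per-frame co-localization criterion described above can be illustrated with a short python sketch. it is not the mtt-based pipeline used in the study: the function name `colocalized_pairs`, the mutual-nearest-neighbour rule used to avoid assigning one detection to several partners, and the toy coordinates are all assumptions made for illustration. the only element taken from the text is that two detections in the same frame count as co-localized when they lie within a distance threshold.

```python
import numpy as np

def colocalized_pairs(xy_a, xy_b, radius):
    """Pair detections from two aligned channels within a single frame.

    A pair (a, b) is kept when a and b are mutual nearest neighbours and
    their separation is at most `radius` (same unit as the coordinates).
    Returns a list of (index_in_a, index_in_b) tuples.
    """
    xy_a = np.asarray(xy_a, dtype=float).reshape(-1, 2)
    xy_b = np.asarray(xy_b, dtype=float).reshape(-1, 2)
    if len(xy_a) == 0 or len(xy_b) == 0:
        return []
    # pairwise distance matrix, shape (n_a, n_b)
    d = np.linalg.norm(xy_a[:, None, :] - xy_b[None, :, :], axis=2)
    nearest_b = d.argmin(axis=1)   # closest channel-B detection for each A
    nearest_a = d.argmin(axis=0)   # closest channel-A detection for each B
    return [(int(a), int(b)) for a, b in enumerate(nearest_b)
            if nearest_a[b] == a and d[a, b] <= radius]

# toy frame: coordinates in arbitrary units after channel alignment
channel_a = [[0.0, 0.0], [10.0, 10.0], [50.0, 50.0]]
channel_b = [[0.5, 0.0], [30.0, 30.0]]
pairs = colocalized_pairs(channel_a, channel_b, radius=1.0)
```

in a full analysis the pairs found in consecutive frames would then be linked into co-trajectories by a tracking algorithm, and short co-trajectories discarded to remove random co-localizations.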
for the co-tracking analysis, only trajectories with a minimum of steps (~ ms) were considered in order to robustly remove random receptor co-localizations ( ). for heterodimerization experiments of mxfpe-il- ra and mxfpm-gp , the relative fraction of dimerized receptors was calculated from the number of co-trajectories relative to the number of il- ra trajectories. gp was expressed in moderate excess (~ . - fold), so that maximal receptor assembly was not limited by abundance of the low-affinity subunit gp . for homodimerization experiments with gp , the relative fraction of co-tracked molecules was determined with respect to the absolute number of trajectories and corrected for gp stochastically double-labelled with the same fluorophore species as follows:

$$AB^{*} = \frac{AB}{2 \times \left[ \left( \frac{A}{A+B} \right) \times \left( \frac{B}{A+B} \right) \right]}, \qquad \text{rel. co-locomotion} = \frac{2 \times AB^{*}}{(A+B)},$$

where a, b, ab and ab* are the numbers of trajectories observed for rho , dy , co-trajectories and corrected co-trajectories, respectively.

the two-dimensional equilibrium dissociation constants ($K_D^{2D}$) were calculated according to the law of mass action for a monomer-dimer equilibrium:

heterodimerization (il- ra+gp ):

$$K_D^{2D} = \frac{\big([\mathrm{GP}\,] - (\alpha \times [\mathrm{IL\,Ra}])\big) \times \big([\mathrm{IL\,Ra}] - (\alpha \times [\mathrm{IL\,Ra}])\big)}{(\alpha \times [\mathrm{IL\,Ra}])}$$

or

$$K_D^{2D} = [\mathrm{GP}\,] \times \left( \frac{1}{\alpha} - 1 \right) + [\mathrm{IL\,Ra}] \times (\alpha - 1)$$

with: $\alpha$ = fraction of il bound il ra in complex with gp

homodimerization (gp +gp ):

$$K_D^{2D} = \frac{[M]^2}{[D]} = \frac{([M]_0 - 2[D])^2}{[D]}$$

$K_D^{2D}$
= k[l #(m],"×(n×[l #(m])o % "×(n×[l #(m]) with: 𝛼 = 𝑓𝑟𝑎𝑐𝑡𝑖𝑜𝑛 𝑜𝑓 𝐺𝑃 ℎ𝑜𝑚𝑜𝑑𝑖𝑚𝑒𝑟𝑠 𝑟𝑒𝑙𝑎𝑡𝑖𝑣𝑒 𝑡𝑜 [𝐺𝑃 ]/ where [m] and [d] are the concentrations of the monomer and the dimer, respectively, and [m] is the total receptor concentration. .cc-by . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . / references: . j. j. o'shea, r. plenge, jak and stat signaling molecules in immunoregulation and immune-mediated disease. immunity , - ( ). . s. pflanz et al., il- , a heterodimeric cytokine composed of ebi and p protein, induces proliferation of naive cd + t cells. immunity , - ( ). . h. yoshida, c. a. hunter, the immunobiology of interleukin- . annu rev immunol , - ( ). . j. s. stumhofer et al., interleukin negatively regulates the development of interleukin -producing t helper cells during chronic inflammation of the central nervous system. nat immunol , - ( ). . c. diveu et al., il- blocks rorc expression to inhibit lineage commitment of th cells. j immunol , - ( ). . d. c. fitzgerald et al., suppression of autoimmune inflammation of the central nervous system by interleukin secreted by interleukin -stimulated t cells. nat immunol , - ( ). . j. s. stumhofer et al., interleukins and induce stat -mediated t cell production of interleukin . nat immunol , - ( ). . c. pot, l. apetoh, a. awasthi, v. k. kuchroo, induction of regulatory tr cells and inhibition of t(h) cells by il- . semin immunol , - ( ). . m. j. boulanger, d. c. chow, e. e. brevnova, k. c. garcia, hexameric structure and assembly of the interleukin- /il- alpha-receptor/gp complex. science , - ( ). . s. rose-john, interleukin- family cytokines. cold spring harb perspect biol , ( ). . c. a. hunter, s. 
a. jones, il- as a keystone cytokine in health and disease. nature immunology , - ( ). . t. korn et al., il- controls th immunity in vivo by inhibiting the conversion of conventional t cells into foxp + regulatory t cells. proc natl acad sci u s a , - ( ). . a. kimura, t. kishimoto, il- : regulator of treg/th balance. eur j immunol , - ( ). . g. w. jones et al., loss of cd + t cell il- r expression during inflammation underlines a role for il- trans signaling in the local maintenance of th cells. j immunol , - ( ). . c. rolvering et al., crosstalk between different family members: il recapitulates ifn gamma responses in hcc cells, but is inhibited by il -type cytokines. bba-mol cell res , - ( ). . a. p. costa-pereira et al., mutational switch of an il- response to an interferon- gamma-like response. p natl acad sci usa , - ( ). . j. schmitz, m. weissenbach, s. haan, p. c. heinrich, f. schaper, socs exerts its inhibitory function on interleukin- signal transduction through the shp recruitment site of gp . journal of biological chemistry , - ( ). . h. yasukawa et al., il- induces an anti-inflammatory response in the absence of socs in macrophages. nat immunol , - ( ). . b. a. croker et al., socs negatively regulates il- signaling in vivo. nat immunol , - ( ). . c. brender et al., suppressor of cytokine signaling regulates cd t-cell proliferation by inhibition of interleukins and . blood , - ( ). .cc-by . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . / . a. camporeale, v. poli, il- , il- and stat : a holy trinity in auto-immunity? front biosci (landmark ed) , - ( ). . g. regis, s. pensa, d. boselli, f. novelli, v. 
poli, ups and downs: the stat :stat seesaw of interferon and gp receptor signalling. semin cell dev biol , - ( ). . s. lucas, n. ghilardi, j. li, f. j. de sauvage, il- regulates il- responsiveness of naive cd (+) t cells through stat -dependent and -independent mechanisms. p natl acad sci usa , - ( ). . s. kamiya et al., an indispensable role for stat in il- -induced t-bet expression but not proliferation of naive cd (+) t cells. journal of immunology , - ( ). . a. takeda et al., cutting edge: role of il- /wsx- signaling for induction of t-bet through activation of stat during initial th commitment. journal of immunology , - ( ). . c. neufert et al., il- controls the development of inducible regulatory t cells and th cells via differential effects on stat . eur j immunol , - ( ). . t. owaki et al., stat is indispensable to il- -mediated cell proliferation but not to il- -induced th differentiation and suppression of proinflammatory cytokine production. journal of immunology , - ( ). . k. hirahara et al., asymmetric action of stat transcription factors drives transcriptional outputs and cytokine specificity. immunity , - ( ). . s. oniki et al., interleukin- and interleukin- exert quite different antitumor and vaccine effects on poorly immunogenic melanoma. cancer res , - ( ). . m. fischer et al., i. a bioactive designer cytokine for human hematopoietic progenitor cell expansion. nat biotechnol , - ( ). . h. h. oberg, d. wesch, s. grussel, s. rose-john, d. kabelitz, differential expression of cd and cd mediates different stat- phosphorylation in cd +cd - and cd high regulatory t cells. int immunol , - ( ). . p. o. krutzik, m. r. clutter, a. trejo, g. p. nolan, fluorescent cell barcoding for multiplex flow cytometry. curr protoc cytom chapter , unit ( ). . u. a. betz, w. muller, regulated expression of gp and il- receptor alpha chain in t cell maturation and activation. int immunol , - ( ). . j. 
martinez-fabregas et al., kinetics of cytokine receptor trafficking determine signaling and functional selectivity. elife , ( ). . c. gorby et al., engineered il- variants elicit potent immunomodulatory effects at low ligand doses. sci signal , ( ). . v. ruprecht, weghuber, j., wieser, s., schütz, g. j, in advances in planar lipid bilayers and liposomes. ( ), vol. ,, pp. - . . i. moraga et al., instructive roles for cytokine-receptor binding parameters in determining signaling and functional potency. science signaling , ( ). . s. wilmes et al., receptor dimerization dynamics as a regulatory valve for plasticity of type i interferon signaling. j cell biol , - ( ). . s. wilmes et al., mechanism of homodimeric cytokine receptor activation and dysregulation by oncogenic mutations. science , - ( ). . i. moraga et al., tuning cytokine receptor signaling by re-orienting dimer geometry with surrogate ligands. cell , - ( ). . s. pflanz et al., wsx- and glycoprotein constitute a signal-transducing receptor for il- . j immunol , - ( ). . m. wiederkehr-adam et al., characterization of phosphopeptide motifs specific for the src homology domains of signal transducer and activator of transcription (stat ) and stat . j biol chem , - ( ). . a. pradhan, q. t. lambert, l. n. griner, g. w. reuther, activation of jak -v f by components of heterodimeric cytokine receptors. j biol chem , - ( ). . h. kim, t. s. hawley, r. g. hawley, h. baumann, protein tyrosine phosphatase (shp- ) moderates signaling by gp but is not required for the induction of acute-phase plasma protein genes in hepatic cells.
mol cell biol , - ( ). . d. w. huang, b. t. sherman, r. a. lempicki, systematic and integrative analysis of large gene lists using david bioinformatics resources. nat protoc , - ( ). . j. bancerek et al., cdk kinase phosphorylates transcription factor stat to selectively regulate the interferon response. immunity , - ( ). . s. rutz et al., deubiquitinase duba is a post-translational brake on interleukin- production in t cells. nature , - ( ). . k. l. o'hagan, s. d. miller, h. phee, pak is essential for the function of foxp +regulatory t cells through maintaining a suppressive treg phenotype. sci rep- uk , ( ). . d. z. ye, j. field, pak signaling in cancer. cell logist , - ( ). . y. liao, j. wang, e. j. jaehnig, z. shi, b. zhang, webgestalt : gene set analysis toolkit with revamped uis and apis. nucleic acids res , w -w ( ). . j. satoh, h. tabunoki, a comprehensive profile of chip-seq-based stat target genes suggests the complexity of stat -mediated gene regulatory mechanisms. gene regul syst bio , - ( ). . i. rusinova et al., interferome v . : an updated database of annotated interferon- regulated genes. nucleic acids res , d - ( ). . h. n. suh et al., role of interleukin- in the control of dna synthesis of hepatocytes: involvement of pkc, p / mapks, and ppardelta. cell physiol biochem , - ( ). . a. v. villarino et al., il- limits il- production during th differentiation. j immunol , - ( ). . k. hirahara et al., interleukin- priming of t cells controls il- production in trans via induction of the ligand pd-l . immunity , - ( ). . x. hu et al., sensitization of ifn-gamma jak-stat signaling during macrophage activation. nat immunol , - ( ). . v. francois-newton, m. livingstone, b. payelle-brogard, g. uze, s. pellegrini, usp establishes the transcriptional and anti-proliferative interferon alpha/beta differential. biochem j , - ( ). . k. zenke, m. muroi, k. i. tanamoto, irf supports dna binding of stat by promoting its phosphorylation. immunol cell biol , - ( ). . k. 
karwacz et al., critical role of irf and batf in forming chromatin landscape during type regulatory cell differentiation. nat immunol , - ( ). . a. yoshimura, y. wakabayashi, t. mori, cellular and molecular basis for the regulation of inflammation by tgf-beta. j biochem , - ( ). . a. awasthi et al., a dominant function for interleukin in generating interleukin -producing anti-inflammatory t cells. nat immunol , - ( ). . j. b. brown et al., p-selectin glycoprotein ligand- is needed for sequential recruitment of t-helper (th ) and local generation of th t cells in dextran sodium sulfate (dss) colitis. inflamm bowel dis , - ( ). . m. matsumoto et al., cd collaborates with p-selectin glycoprotein ligand- to mediate e-selectin-dependent t cell migration into inflamed skin. j immunol , - ( ). . d. n. slenter et al., wikipathways: a multifaceted pathway database bridging metabolomics to other omics research. nucleic acids res , d -d ( ). . a. petretto et al., proteomic analysis uncovers common effects of ifn-gamma and il- on the hla class i antigen presentation machinery in human cancer cells. oncotarget , - ( ). . l. h. wong, i. hatzinisiriou, r. j. devenish, s. j. ralph, ifn-gamma priming up-regulates ifn-stimulated gene factor (isgf ) components, augmenting responsiveness of ifn-resistant melanoma cells to type i ifns. j immunol , - ( ). . m. tokuyama et al., ervmap analysis reveals genome-wide transcription of human endogenous retroviruses. proc natl acad sci u s a , - ( ). . c. garbers et al., plasticity and cross-talk of interleukin -type cytokines. cytokine growth factor rev , - ( ).
. s. kang, m. narazaki, h. metwally, t. kishimoto, historical overview of the interleukin- family cytokine. j exp med , ( ). . r. umeshita-suyama et al., characterization of il- and il- signals dependent on the human il- receptor alpha chain : redundancy of requirement of tyrosine residue for stat activation. int immunol , - ( ). . o. w. nadeau et al., the proximal tyrosines of the cytoplasmic domain of the beta chain of the type i interferon receptor are essential for signal transducer and activator of transcription (stat) activation. evidence that two stat sites are required to reach a threshold of interferon alpha-induced stat tyrosine phosphorylation that allows normal formation of interferon-stimulated gene factor . j biol chem , - ( ). . m. n. sharif et al., ifn-alpha priming results in a gain of proinflammatory function by il- : implications for systemic lupus erythematosus pathogenesis. j immunol , - ( ). . d. richter et al., ligand-induced type ii interleukin- receptor dimers are sustained by rapid re-association within plasma membrane microcompartments. nat commun , ( ). . j. p. twohig et al., activation of naive cd (+) t cells re-tunes stat signaling to deliver unique cytokine responses in memory cd (+) t cells. nat immunol , - ( ). . p. c. heinrich et al., principles of interleukin (il)- -type cytokine signalling and its regulation. biochem j , - ( ). . d. levin, d. harari, g. schreiber, stochastic receptor expression determines cell fate upon interferon treatment. mol cell biol , - ( ). . i. moraga, d. harari, g. schreiber, g. uze, s. pellegrini, receptor density is key to the alpha /beta interferon differential activities. mol cell biol , - ( ). . c. c. m. ho et al., decoupling the functional pleiotropy of stem cell factor by tuning c-kit signaling. cell , - e ( ). . p. charlot-rabiega, e. bardel, c. dietrich, r. kastelein, o. devergne, signaling events involved in interleukin (il- )-induced proliferation of human naive cd + t cells and b cells. 
j biol chem , - ( ). . j. diegelmann, t. olszak, b. goke, r. s. blumberg, s. brand, a novel role for interleukin- (il- ) as mediator of intestinal epithelial barrier protection mediated via differential signal transducer and activator of transcription (stat) protein signaling and induction of antibacterial and anti-inflammatory proteins. journal of biological chemistry , - ( ). . h. bender et al., interleukin- displays interferon-gamma-like functions in human hepatoma cells and hepatocytes. hepatology , - ( ). . t. imamichi, j. yang, w. huang da, b. sherman, r. a. lempicki, interleukin- induces interferon-inducible genes: analysis of gene expression profiles using affymetrix microarray and david. methods mol biol , - ( ). . j. m. fakruddin et al., noninfectious papilloma virus-like particles inhibit hiv- replication: implications for immune control of hiv- infection by il- . blood , - ( ). . a. c. frank et al., interleukin- , an anti-hiv- cytokine, inhibits replication of hepatitis c virus. j interferon cytokine res , - ( ). . s. l. laporte et al., molecular and structural basis of cytokine receptor pleiotropy in the interleukin- / system. cell , - ( ). . j. b. spangler, i. moraga, k. m. jude, c. s. savvides, k. c. garcia, a strategy for the selection of monovalent antibodies that span protein dimer interfaces. j biol chem , - ( ). . a. kirchhofer et al., modulation of protein properties in living cells using nanobodies. nat struct mol biol , - ( ). . m. c. hochberg, updating the american college of rheumatology revised criteria for the classification of systemic lupus erythematosus.
arthritis rheum , ( ). . j. cox, m. mann, maxquant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. nat biotechnol , - ( ). . j. cox et al., andromeda: a peptide search engine integrated into the maxquant environment. j proteome res , - ( ). . p. o. krutzik, g. p. nolan, fluorescent cell barcoding in flow cytometry allows high- throughput drug screening and signaling profiling. nat methods , - ( ). . w. huang da, b. t. sherman, r. a. lempicki, bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. nucleic acids res , - ( ). . w. huang da, b. t. sherman, r. a. lempicki, systematic and integrative analysis of large gene lists using david bioinformatics resources. nat protoc , - ( ). . n. kozer et al., exploring higher-order egfr oligomerisation and phosphorylation--a combined experimental and theoretical approach. mol biosyst , - ( ). . d. n. itzhak, s. tyanova, j. cox, g. h. borner, global, quantitative and dynamic mapping of protein subcellular localization. elife , ( ). . t. toni, d. welch, n. strelkowa, a. ipsen, m. p. stumpf, approximate bayesian computation scheme for parameter inference and model selection in dynamical systems. j r soc interface , - ( ). . j. vogelsang et al., a reducing and oxidizing system minimizes photobleaching and blinking of fluorescent dyes. angew chem int ed engl , - ( ). . a. kirchhofer et al., modulation of protein properties in living cells using nanobodies. nat struct mol biol , -u ( ). . a. serge, n. bertaux, h. rigneault, d. marguet, dynamic multiple-target tracing to probe spatiotemporal cartography of cell membranes. nat methods , - ( ). . c. you et al., receptor dimer stabilization by hierarchical plasma membrane microcompartments regulates cytokine signaling. sci adv , e ( ). . f. roder, a. lubk, d. wolf, t. niermann, noise estimation for off-axis electron holography. ultramicroscopy , - ( ). 
figure legends:

figure cytokine receptor activation by il- and (hyp)il- : a) cartoon model of stepwise assembly of the il- and hypil- -induced receptor complex and subsequent activation of stat and stat . b) dose-dependent phosphorylation of stat and stat as a response to il- and hypil- stimulation in th- cells, normalized to maximal il- stimulation. data was obtained from three biological replicates with each two technical replicates, showing mean ± std dev. c) phosphorylation kinetics of stat and stat followed after stimulation with saturating concentrations of il- ( nm) and hypil- ( nm) or unstimulated th- cells, normalized to maximal il- stimulation. data was obtained from five biological replicates with each two technical replicates, showing mean ± std dev. d) top: phosphorylation kinetics of stat and stat followed after stimulation with hypil- ( nm) or left unstimulated, comparing wt rpe and rpe gp ko reconstituted with high levels of mxfpm-gp (= x [gp ]). data was normalized to maximal stimulation levels of wt rpe . left: cell surface gp levels comparing rpe gp ko, wt rpe and rpe gp ko stably expressing mxfpm-gp measured by flow cytometry.
data was obtained from one biological replicate with each two technical replicates, showing mean ± std dev. bottom right: cell surface levels of gp measured by flow cytometry for indicated cell lines. e) cartoon model of cell surface labeling of mxfp-tagged receptors by dye-conjugated anti-gfp nanobodies (nb) and identification of receptor dimers by single molecule dual-colour co-localization. f) raw data of dual-colour single-molecule tirf imaging of mxfpe-il- rαnb-rho and gp nb-dy after stimulation with il- . particles from the insets (il- ra: red & gp : blue) were followed by single molecule tracking ( frames ~ . s) and trajectories > steps ( ms) are displayed. receptor heterodimerization was detected by co-localization/co-tracking analysis. g) relative number of co-trajectories observed for heterodimerization of il- rα and gp as well as homodimerization of gp for unstimulated cells or after indicated cytokine stimulation. each data point represents the analysis from one cell with a minimum of cells measured for each condition. *p < . , **p ≤ . ,***p ≤ . ; n.s., not significant. h) stoichiometry of the il- –induced receptor complex revealed by bleaching analysis. left: intensity traces of mxfpe-il rαnb-rho and gp nb-dy were followed until fluorophore bleaching. middle: merged imaging raw data for selected timepoints. right: overlay of the trajectories for il- rα (red) and gp (blue).

figure : mathematical modelling results in rpe and th- cells. a) simplified cartoon model of il- /hypil- signal propagation layers and coverage of the mathematical modelling approach. b) model selection results showing the relative probabilities of each hypothesis, for different values of the distance threshold, 𝛿∗, in rpe cells. c) pointwise median and % credible intervals of the predictions from the mathematical model, calibrated with the experimental data, using the posterior distributions for the parameters from the abc-smc. d) kernel density estimates of the posterior distributions for the parameters 𝑝 ∈ {𝑟#,. & ,𝑟#,. , ,𝑟",. & ,𝑟",. , ,𝑘$% & ,𝑘$% , ,𝑘$' & ,𝑘$' , ,𝑞,𝑑$,𝛽., [𝑅#( )],[𝑅"( )],[𝑆#( )],[𝑆(( )]} in the mathematical models where 𝑗 ∈ { , } and 𝑖 ∈ { , }.

figure : il- rα cytoplasmic domain is required for sustained pstat kinetics. a) representation of the cytoplasmic domain of il- rα with its highlighted tyrosine residues y and y . b) stat and stat phosphorylation kinetics of rpe clones stably expressing wt and mutant il- rα after stimulation with il- ( nm, top panels) or after stimulation with hypil- ( nm, bottom panels), normalized to maximal levels of wt il- rα stimulated with il- (top) or hypil- (bottom). data was obtained from three experiments with each two technical replicates, showing mean ± std dev. bottom right: cell surface levels of the il- rα variants measured by flow cytometry for the indicated cell lines. c) cytoplasmic domain of il- rα is required for sustained pstat activation. left: cartoon representation of receptor complexes. right: stat and stat phosphorylation kinetics of rpe clones stably expressing wt il- rα and il- rα-gp chimera after stimulation with il- ( nm, top panels) or after stimulation with hypil- ( nm, bottom panels). data was normalized to maximal levels for each cytokine and cell line. data was obtained from two experiments with each technical replicates, showing mean ± std dev. d) phosphatases do not account for differential pstat / activity induced by il- and hypil- . left: schematic representation of workflow using jak inhibitor tofacitinib.
right: mfi ratio of tofacitinib-treated and non-treated rpe mxfpe-il- rα cells for pstat and pstat after stimulation with il- ( nm) and hypil- ( nm). data was obtained from two experiments with each two technical replicates, showing mean ± std dev.

figure : unique and overlapping effects of il- and hypil- on the phosphoproteome of th- cells. a) volcano plot of the phospho-sites regulated (p value ≤ . , fold change ≥+ . or ≤- . ) by il- (left) and hypil- (right). data was obtained from three biological replicates. b) heatmap representation (examples) of shared and differentially up- (left) and downregulated (right) phospho-sites after il- and hypil- stimulation. data represents the mean (log ) fold change of three biological replicates. c) tyrosine and serine phosphorylation of selected stat proteins after stimulation with il- (red) and hypil- (blue). *p < . , **p ≤ . ,***p ≤ . ; n.s., not significant. d) ps -stat and ps -stat phosphorylation kinetics in th- cells after stimulation with il- or hypil- , normalized to maximal il- stimulation. data was obtained from three biological replicates with each two technical replicates, showing mean ± std dev. e) go analysis “biological processes” of the phospho-sites regulated by il- (red) and hypil- (blue) represented as bubble-plots. f) phosphorylation of target proteins associated with stat /cdk transcription initiation complex after stimulation with il- (blue) and hypil- (red) and schematic representation of transcription regulation of rna polymerase ii with identified phospho-sites (red flags).

figure : kinetic decoupling of gene induction programs depends on sustained stat activation by il- . a) principal component analysis for genes found to be significantly upregulated (left) or downregulated (right) for at least one of the tested conditions (time & cytokine). data was obtained from three biological replicates. b) kinetics of gene induction shared between il- and hypil- (relative to il- ) for upregulated genes (red) or downregulated genes (green). c) kinetics of gene numbers induced after il- and hypil- stimulation for upregulated genes (left) and downregulated genes (right). d) gsea reactome analysis of selected pathways with significantly altered gene induction in response to il- or hypil- stimulation. data represents the mean (log ) fold change of three biological replicates. e) cluster analysis comparing the gene induction kinetics after il- or hypil- stimulation. gene induction heatmaps for example genes as well as induction kinetics (mean) are shown for highlighted gene clusters. data represents the mean (log ) fold change of three biological replicates.

figure : il- -induced upregulation of irf amplifies induction of stat -dependent genes a) kinetics of irf protein expression as a response to continuous il- and hypil- stimulation in th- cells. data was obtained from three biological replicates with each two technical replicates, showing mean ± std dev. dotted line indicates baseline level. b) kinetics of irf protein expression and sirna-mediated irf knockdown in rpe il- rα cells stimulated with il- ( nm). data was obtained from one representative experiment with each two technical replicates, normalized to maximal irf induction ( h), showing mean ± std dev. c) kinetics of stat (left) and stat (right) phosphorylation after sirna-mediated irf knockdown in rpe il- rα cells stimulated with il- ( nm). data was obtained from one representative experiment with each two technical replicates, showing mean ± std dev.
d) kinetics of gene induction (stat , gbp , oas , socs ) followed by rt qpcr in rpe il- rα cells stimulated with il- ( nm) with and without sirna-mediated knockdown of irf . data was obtained from three experiments with each two technical replicates, showing mean ± sem.

figure : il- -induced stat response drives global proteomic changes in th- cells. a) workflow for quantitative silac proteomic analysis of th- cells continuously stimulated ( h) with il- ( nm), hypil- ( nm) or left untreated. b) global proteomic changes in th- cells induced by il- (left) or hypil- (right) represented as volcano plots. proteins significantly up- or downregulated are highlighted in red (p value ≤ . , fold change ≥+ . or ≤- . ). significantly altered isg-encoded proteins by il- are highlighted in yellow. data was obtained from three biological replicates. c) venn diagrams comparing unique upregulated (left) and downregulated (right) proteins by il- (blue) and hypil- (red) as well as shared altered proteins. isg-encoded proteins are highlighted in yellow. d) heatmaps of the top up- and downregulated proteins by il- compared to hypil- . data represents the mean (log ) fold change of three biological replicates. e) heatmap representation and enrichment plot of proteins identified by gsea reactome pathway enrichment analysis “cytokine signaling and immune system” induced by il- . data represents the mean (log ) fold change of three biological replicates. f) correlation of il- and hypil- -induced rna-seq transcript levels (≥+ or ≤- fc) with quantitative proteomic data (≥+ . or ≤- . fc).
data represents the mean (log ) fold change of three biological replicates.

figure : receptor and stat concentrations determine the nature of the cytokine response. a) copy numbers of indicated proteins determined for different t-cell subsets using mass-spectrometry based proteomics (immpres - http://immpres.co.uk). b) model predictions for varying levels of stat and stat (left panel) or il- rα and gp (right panel) for phosphorylation kinetics of stat and stat . c) gene expression profiles determined by rnaseq analysis comparing indicated genes of a cohort of sle risk patients with a cohort of healthy controls. data obtained from: proc natl acad sci u s a , - . *p < . , **p ≤ . ,***p ≤ . ; n.s., not significant. d) dose-dependent phosphorylation of stat and stat as a response to il- and hypil- stimulation in naive and ifnα -primed ( nm, h) th- cells, normalized to maximal il- stimulation (ctrl). data was obtained from four biological replicates with each two technical replicates, showing mean ± std dev. e) phosphorylation of stat (left) and stat (right) as a response to il- ( nm, min) and hypil- ( nm, min) stimulation in healthy control (ctrl) and sle patient cd + t-cells. data was obtained from five healthy control donors ( ) and six sle patients. *p < . , **p ≤ . ,***p ≤ . ; n.s., not significant. f) tofacitinib titration to inhibit stat and stat phosphorylation by hypil- ( nm, min) in th- cells (left) and rpe cells stably expressing wt il- rα (right).

supp. figure : a) comparison of dose-dependent phosphorylation (stat / ) of purchased il- and mil- sc in activated cd + cells, normalized to maximal mfi levels. data was obtained from one (purchased) or two (mil- sc) biological replicates with each two technical replicates, showing mean ± std dev. b) schematic workflow of t-cell isolation, th differentiation, fluorescence barcoding and gating strategy for high throughput flow cytometry.
c) phosphorylation kinetics of stat and stat followed after stimulation with il- ( nm) and hypil- ( nm) or unstimulated th cells. data (from fig. c) was normalized to maximal mfi levels for each cytokine. data was obtained from five biological replicates with each two technical replicates, showing mean ± std dev. d) phosphorylation kinetics of activated pbmcs (cd +, cd +) of stat and stat followed after stimulation with il- ( nm) and hypil- ( nm) or unstimulated cells. data was normalized to maximal il- stimulation. data was obtained from two biological replicates with each two technical replicates, showing mean ± std dev. e) dose-response experiments in wt rpe cells for pstat (left) and pstat (right), stimulated with il- or hypil- , normalized to maximal hypil- stimulation. data was obtained from one representative experiment with each two technical replicates, showing mean ± std dev.

supp. figure : a) dose-response experiments for pstat and pstat comparing rpe gp ko cells (left), wt rpe (middle) and rpe mxfpe-il ra (right) after stimulation with il- or hypil- , normalized to maximal hypil- stimulation. data was obtained from one representative experiment with each two technical replicates, showing mean ± std dev. b) ligand-induced receptor dimerization: top panel: dual-colour co-tracking of il- rα and gp in the absence (top) and presence (bottom) of il- ( nm). trajectories ( frames, ~ . s) of individual mxfpe-il rαnb-rho (red) and gp nb-dy (blue) and co-trajectories (magenta) are shown for a representative cell.
bottom panel: dual-colour co-tracking of gp in the absence (top) and presence (bottom) of hypil- ( nm). trajectories ( frames, ~ . s) of individual mxfpe-il rαnb-rho (red) and gp nb-dy (blue) and co-trajectories (magenta) are shown for a representative cell. c) top: cartoon model of cell surface labeling of mxfp-tagged gp by dye-conjugated anti-gfp nanobodies (nb) and formation of single-colour homodimers (left) or dual-colour homodimers (right). below: examples for intensity traces of single-colour dual-step bleaching (left) or dual-colour single-step bleaching (right). insets show raw data for selected timepoints and corresponding trajectories. d) top: comparison of diffusion coefficients (d) for mxfpe-il- rαnb-rho (red) and mxfpm-gp nb-dy (blue) in presence and absence of il- stimulation ( nm), as well as co-trajectories after il- stimulation (magenta). bottom: comparison of diffusion coefficients for mxfpm-gp nb-rho (red) in presence and absence of hypil- stimulation ( nm), as well as co-trajectories after hypil- stimulation (magenta). each data point represents the analysis from one cell with a minimum of cells measured for each condition. *p < . , **p ≤ . ,***p ≤ . ; n.s., not significant.

supp. figure : a) reactions involving ligand binding and dimerization in the hypil- model. b) reactions involving ligand binding and dimerization in the il- model. c) reactions involving the stat molecules (𝑆. for 𝑗 ∈ { , }) in the hypil- model. d) reactions involving the stat molecules (𝑆. for 𝑗 ∈ { , }) in the il- model. e) reactions involving receptor internalisation/degradation in the hypil- model. here 𝐻 = 𝛽) and 𝐻 = 𝛾)([𝑝𝑆 ] + [𝑝𝑆 ]). f) reactions involving receptor internalisation/degradation in the il- model. here 𝐻 = 𝛽"* and 𝐻 = 𝛾"*([𝑝𝑆 ] + [𝑝𝑆 ]). g) dephosphorylation of (𝑆. for 𝑗 ∈ { , }) in the cytoplasm. this reaction occurs in both models. h) key for the molecules in the reactions.

supp. figure : a) stat (left) and stat (right) phosphorylation kinetics of rpe clones stably expressing wt il- rα after stimulation with il- or after stimulation with hypil- normalized to maximal il- stimulation. data was obtained from three experiments with each two technical replicates, showing mean ± std dev. b) dose-response experiments for pstat (left) and pstat (right) in rpe cells stably expressing wt il- rα or tyrosine-mutants after stimulation with il- , normalized to maximal stimulation of wt il- rα. data was obtained from one representative experiment with each two technical replicates, showing mean ± std dev.

supp. figure : a) dose-response experiments for pstat (left) and pstat (right) in rpe cells stably expressing wt il- rα or il- ra-gp chimera after stimulation with il- . data normalized to maximal stimulation of wt il- rα. data was obtained from one representative experiment with each two technical replicates, showing mean ± std dev. b) stat (left) and stat (right) phosphorylation kinetics in rpe il- rα cells stimulated with il- or hypil- with and without jak inhibition by tofacitinib. data was normalized to maximal il- stimulation. data was obtained from two experiments with each two technical replicates, showing mean ± std dev. c) stat (left) and stat (right) phosphorylation kinetics in th- cells stimulated with il- or hypil- with and without jak inhibition by tofacitinib. data was normalized to maximal il- stimulation. data was obtained from two biological replicates with each two technical replicates, showing mean ± std dev.
d) mfi ratio of tofacitinib-treated and non-treated th- cells for pstat (left) and pstat (right) after stimulation with il- ( nm) and hypil- ( nm). data was obtained from two biological replicates with each two technical replicates, showing mean ± std dev. supp. figure : a) stat (left) and stat (right) phosphorylation kinetics in rpe il- rα cells stimulated with il- or hypil- with and without pretreatment with cycloheximide (chx). data was normalized to to maximal il- stimulation. data was obtained from two experiments with each two technical replicates, showing mean ± std dev. b) stat (left) and stat (right) phosphorylation kinetics in th cells stimulated with il- or hypil- with and without pretreatment with cycloheximide (chx). data was normalized to to maximal il- stimulation. data was obtained from two biological replicates with each two technical replicates, showing mean ± std dev. supp. figure : a) workflow for quantitative silac phospho-proteomic analysis of th- cells stimulated ( min) with il- ( nm), hypil- ( nm) or left untreated. b) schematic representation of the main go terms regulated by il as inferred from our p-proteomics studies. red represents downregulated p-sites and blue represents upregulated p-sites upon il stimulation of human primary th- cells. c) schematic representation of the main go terms regulated by hyil as inferred from our p-proteomics studies. red represents downregulated p-sites and blue upregulated p-sites upon hyil stimulation of human primary th- cells. supp. figure : .cc-by . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . 
a) venn diagrams comparing the numbers of unique upregulated (left) and downregulated (right) phospho-sites by il- (blue) and hypil- (red) as well as the number of shared phospho-sites. b) list of the most strongly altered phospho-sites (downregulated: green; upregulated: red) in response to il- (left) or hypil- (right). c) go analysis “cellular location” and “up keywords” of the phospho-sites regulated by il- (red) and hypil- (blue) represented as bubble-plots. d) phosphorylation of target proteins related to treg functions and schematic representation of their activity on t-cells.
supp. figure : a) kinetics of gene induction in th- cells induced by il- represented as volcano plots. genes significantly up- or downregulated are highlighted in red (p value ≤ . , fold change ≥+ or ≤- ). data was obtained from three biological replicates. b) kinetics of gene induction in th- cells induced by hypil- represented as volcano plots. genes significantly up- or downregulated are highlighted in red (p value ≤ . , fold change ≥+ or ≤- ). data was obtained from three biological replicates. c) kinetics of gene induction in th- cells induced by hypil- represented as volcano plots. genes identified to be significantly up- or downregulated by il- are highlighted in red (p value ≤ . , fold change ≥+ or ≤- ). data was obtained from three biological replicates.
supp. figure : a) gene induction kinetics represented as pie-charts, separated for upregulated genes (top panel) and downregulated genes (bottom panel). b) kinetics of isg induction (examples) as heatmap representation comparing il- with hypil- (top) and gsea reactome pathway enrichment “ifn signaling” for genes induced by il- after h (bottom). data represents the mean (log ) fold change of three biological replicates. c) heatmaps of the top up- and downregulated genes by il- compared to hypil- for h, h and h. data represents the mean (log ) fold change of three biological replicates.
d) kinetics of irf protein expression as a response to continuous il- and hypil- stimulation in th- cells. data was obtained from three biological replicates, each with two technical replicates, showing mean ± std dev.
supp. figure : a) pie charts of proteomic changes (unique & shared) for upregulated (left) and downregulated (right) proteins in response to il- or hypil- stimulation in th- cells. b) left: gsea reactome pathway enrichment analysis “interferon signaling” for proteins induced by il- . middle: heatmap representation of pathway-associated proteins comparing il- with hypil- stimulation. data represents the mean (log ) fold change of three biological replicates. right: localization of the identified proteins in the context of the data distribution of il- -induced proteomic changes. pathway-associated proteins are highlighted for il- (blue) and hypil- (red) as well as non-significant data distribution (grey). c) left: gsea reactome pathway enrichment analysis “cytokine signaling and immune system” for proteins induced by il- . middle: heatmap representation of pathway-associated proteins comparing il- with hypil- stimulation. data represents the mean (log ) fold change of three biological replicates. right: localization of the identified proteins in the context of the data distribution of il- -induced proteomic changes. pathway-associated proteins are highlighted for il- (blue) and hypil- (red) as well as non-significant data distribution (grey). d) average intensity distribution of untreated proteomic data.
top up- and downregulated proteins (≥ + x or ≤ - x change) altered by il- (left) or hypil- (right) stimulation are indicated.
supp. figure : a) pointwise median and % credible intervals of the wt and chimera mathematical models, using the posterior distributions for the parameters from the abc-smc. b) dose-response curve in rpe using the posterior distributions from the abc-smc and varying the concentrations of hypil- and il- in the model. c) pointwise median and % credible intervals of the wt mathematical model and simulations of a mutant model with 𝑘#' & = ,> nm- s- and 𝑘#' , = m s- , using the posterior distributions from the abc-smc for the other parameters.
supp. figure : a) fold induction of total stat and stat levels in th- measured by flow cytometry. data was obtained from two biological replicates. b) total levels of stat and stat measured in cd + by flow cytometry for healthy control (ctrl) and lupus patients (sle). data was obtained from five (ctrl) and six (sle) biological replicates. *p < . , **p ≤ . , ***p ≤ . ; n.s., not significant. c) ratio of pstat and pstat after il- ( min, nm) or hypil- ( min, nm) stimulation measured in cd + by flow cytometry for healthy control (ctrl) and lupus patients (sle). data was obtained from five (ctrl) and six (sle) biological replicates normalized to mean ratio of healthy control samples. d) tofacitinib titration to inhibit stat and stat phosphorylation by il- ( nm) in th- cells (left) and rpe cells stably expressing wt il- rα (right).
supp.
movie : single-molecule co-tracking as a readout for dimerization of cytokine receptors. cell surface labelling of mxfpe-il- rα by enbrho (left, top) and mxfpm-gp by mnbdy (left, bottom) after stimulation with il- ( nm). in the overlay of the zoomed section of both spectral channels (mxfpe-il- rαrho : red, mxfpm-gp dy : blue), yellow lines indicate co-locomotion of il- rα and gp (≥ steps). acquisition frame rate: hz, playback: real time.
supp. movie : dynamics of il- -induced receptor assembly. formation of a single-molecule heterodimer of mxfpe-il- rαrho (red) and mxfpm-gp dy (blue) in presence of il- . yellow lines indicate co-locomotion of il- rα and gp (≥ steps). acquisition frame rate: hz, playback: real time with break at time of receptor dimerization.
supp. movie : ligand-induced heterodimerization of il- rα and gp . overlay of the two spectral channels (mxfpe-il- rαrho : red, mxfpm-gp dy : blue) in absence (left) or presence (right) of il- ( nm). yellow lines indicate co-locomotion of il- rα and gp (≥ steps). acquisition frame rate: hz, playback: real time.
supp. movie : ligand-induced homodimerization of gp . overlay of the two spectral channels (mxfpm-gp rho : red, mxfpm-gp dy : blue) in absence (left) or presence (right) of hypil- ( nm). yellow lines indicate co-locomotion of il- rα and gp (≥ steps). acquisition frame rate: hz, playback: real time.
[figure panels for the main figures (fig.) and supplementary figures (supp. fig.): plotted content, tick values and heatmap entries are not recoverable from the extracted text. surviving panel titles include: receptor expression; heterodimerization (il- rα + gp ) and homodimerization (gp + gp ); shared and differentially regulated p-sites; interferon signature; stat dependent genes; irf protein levels; interferon stimulated genes (isgs); gsea pathway reactome: interferon signalling; gsea pathway reactome: cytokine signaling and immune system; tofacitinib titration – il- signaling. recurring axes: time / min, time / h, c / log nm, pstat / rel. mfi, fold change / log , p value / -lg.]
a mammalian methylation array for profiling methylation levels at conserved sequences
adriana arneson, amin haghani, michael j. thompson, matteo pellegrini, soo bin kwon, ha vu, caesar z. li, ake t. lu, bret barnes, kasper d. hansen, wanding zhou, charles e. breeze, jason ernst #, steve horvath #
affiliations bioinformatics interdepartmental program, university of california, los angeles, ca , usa; department of biological chemistry, university of california, los angeles, los angeles, california, usa; dept. of human genetics, david geffen school of medicine, university of california los angeles, los angeles, ca , usa; molecular, cell and developmental biology, university of california los angeles, los angeles, ca , usa; dept.
of biostatistics, fielding school of public health, university of california los angeles, los angeles, ca , usa; illumina, inc, illumina way, san diego, ca , usa; department of biostatistics, johns hopkins bloomberg school of public health, baltimore, maryland, usa; department of genetic medicine, johns hopkins school of medicine, baltimore, maryland, usa; van andel research institute, grand rapids, michigan, usa; altius institute for biomedical sciences, seattle, wa, usa; eli and edythe broad center of regenerative medicine and stem cell research at university of california, los angeles, los angeles, california, usa; computer science department, university of california, los angeles, los angeles, california, usa; department of computational medicine, university of california, los angeles, los angeles, california, usa; jonsson comprehensive cancer center, university of california, los angeles, los angeles, california, usa; molecular biology institute, university of california, los angeles, los angeles, california, usa.
# joint senior authorship
correspondence: shorvath@mednet.ucla.edu and jason.ernst@ucla.edu
biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc . international license (http://creativecommons.org/licenses/by-nc/ . /).
summary
infinium methylation arrays are widely used to robustly measure methylation of dna in humans. however, such arrays are not available for the vast majority of non-human mammals. moreover, even if species-specific arrays were available, probe differences between them would confound cross-species comparisons.
to address these challenges, we developed the mammalian methylation array, a single custom infinium array that measures cytosine methylation levels of over thousand cpg sites that are well conserved across species within the mammalian class. by design, the probes on the array tolerate cross-species mutations. to design the array, we developed the conserved methylation array probe selector (cmaps) algorithm, which takes as input a multi-species sequence alignment and probe design constraints. a greedy search algorithm was used to identify oligonucleotide sequences (probes) with high coverage across different mammalian species. we annotate the probes on the array with respect to genes in different species and provide details on the sequence context including cpg island status and chromatin states. our calibration experiments demonstrate the high fidelity of this array in humans, rats, and mice. the mammalian methylation array has several strengths: it applies to all mammalian species, even those that have not yet been sequenced; it provides deep coverage of specific cytosines, facilitating the development of highly robust epigenetic biomarkers; and it covers highly conserved cpgs, which greatly increases the probability that biological insights gained in one species will readily translate to others. the mammalian methylation array is expected to find many applications in preclinical studies, comparative biology, and epigenetic studies of aging and development.
introduction

methylation of dna by the attachment of a methyl group to cytosines is one of the most widely studied epigenetic modifications in vertebrates, due to its implications in regulating gene expression across many biological processes including disease (ooi et al., ; robertson, ; smith and meissner, ). a variety of different assays have been proposed for measuring dna methylation including microarray based methylation arrays (bibikova et al., , ) and sequencing based assays such as whole genome bisulfite sequencing (wgbs) (cokus et al., ; lister et al., ) and reduced representation bisulfite sequencing (rrbs) (meissner et al., ). despite the availability of sequencing based assays, array based technology remains widely used for measuring dna methylation due to its low cost and high reproducibility and reliability (pidsley et al., ). the first human methylation array (illumina infinium k) was introduced by illumina inc in (bibikova et al., ), which was followed by the k (bibikova et al., ) and epic arrays with larger coverage (pidsley et al., ). more recently, illumina released a mouse methylation array (infinium mouse methylation beadchip) that profiles over k markers across diverse murine strains. it will probably not be economical to develop similar methylation arrays for less frequently studied mammalian species (e.g. elephants or marine mammals) due to insufficient demand. moreover, even if costs were no impediment, species-specific arrays would likely be sub-optimal in comparative studies across different species as the measurement platforms would be different.
to address these challenges, we developed a single mammalian methylation array designed to measure dna methylation across mammals. the array targets cpgs for which the cpg and flanking sequence are highly conserved across many mammals so that the methylation of many of these cpgs can be measured in each mammal. the design repurposes the degenerate base technology (originally used by illumina infinium probes to tolerate within-human variation) to tolerate cross-species mutations across mammalian species. to select the specific probe sequences, including tolerated mutations, that appear on the array we developed the conserved methylation array probe selector (cmaps). cmaps takes as input a multiple sequence alignment to a reference genome and a set of probe design constraints, and selects a set of probe sequences including tolerated mutations, which can be used to query methylation in many species. we apply cmaps to select over thousand cpgs for the mammalian methylation array, which we complemented with close to two thousand known human biomarker cpgs. we characterize the cpgs on the mammalian methylation array with various genomic annotations. further, we use calibration data to evaluate the fidelity of individual probes in humans, mice, and rats. cmaps has led to the design of the mammalian methylation array, which will facilitate the study of cytosine methylation at conserved loci across all mammal species.

results

designing the mammalian methylation array

the cmaps algorithm is designed to select a set of illumina infinium array probes such that for a target set of species many probes are expected to work in each species (methods). array probes are sequences of length bp flanking a target cpg based on the human reference genome. selecting sequences present in the human reference genome increases the likelihood that measurements in other species will transfer to human.
the mammalian methylation array adapts the degenerate base technology for tolerating human snps so that probes can tolerate a limited number of cross-species mutations. the cmaps algorithm takes as input a multiple-species sequence alignment to a reference genome. cmaps uses these inputs to select the cpgs to target on the array. as part of selecting the cpgs, cmaps also selects the probe sequence design to target them, including the specific set of degenerate bases. for designing the mammalian methylation array, cmaps was applied to the subset of mammals within a -way alignment of vertebrate genomes with the human genome (haeussler et al., ), but we note the cmaps method is general. in designing a probe for a cpg, cmaps considers multiple different options. one option is the type of probe. illumina’s current methylation array technology allows up to two types of probes: infinium i and infinium ii. the latter is newer technology requiring only one silica bead to query the methylation of a cpg, while the former requires two beads. because they require only one bead, infinium ii probes allow more cpgs to be interrogated under fixed array capacity limits, though infinium i probes are better able to query cpgs in cpg rich regions (bibikova et al., ). another option for each of these two types of probes is whether the probe is on the forward or reverse genomic strand, giving four total combinations of options for probe type and strand for each cpg. in addition, cmaps has options for the position and nucleotide identity of the tolerated mutations, which correspond to degenerate bases.
the array degenerate base technology allows for potentially up to three degenerate bases per probe sequence, which are positions that can be designed to tolerate variation in the sequence being interrogated. for some probes fewer than three degenerate bases could be designed, which was determined based on a design score computed by illumina for each probe and, in the case of infinium ii probes, also the number of cpgs within the probe sequence. cmaps uses a greedy algorithm to select the tolerated mutations for each combination of probe type and strand. the algorithm aims to maximize the number of species in the alignment in which the probe is expected to work based on local alignment information alone, that is, without considering how uniquely mappable the probe is across the genome. a probe for a cpg is expected to work in a non-human species based on local alignment information if there are no differences in the alignment between the human genome sequence and the other species excluding those accounted for by the probe’s degenerate bases (figure a, methods). for each cpg site in the human genome, cmaps retained for further consideration the infinium i probe out of the two options (forward or reverse of the cpg) which had the greater number of species for which the probe was expected to work, and likewise for infinium ii. we next applied a series of rules to identify a reduced subset of candidate probes.
first, we included all , infinium ii probes that were expected to work in mouse (based on the mm genome), which maximizes the expected array utility for one of the most widely used model organisms. for the remaining set of cpgs not selected in the previous step, we sorted them in descending order of the number of species for which an infinium ii probe was expected to work. we then added the top , cpg sites for a total of , cpg sites. next, we ranked the cpgs targeted on the illumina epic array (pidsley et al., ) in descending order of the number of species for which a probe targeting the cpg is expected to work. for this the probe was required to be of the same probe type and strand as on the epic array, but used the degenerate bases picked by the cmaps algorithm. the probe was allowed to differ in terms of degenerate base positions, as epic probes typically do not account for degenerate bases across species. for this we selected the top , ranked cpg sites that had not already been picked based on the earlier criteria. lastly, we sorted the cpg sites in descending order of number of species they can target and picked the top , cpgs targeted by infinium i probes that had not already been included. the infinium i probes were selected to allow querying cpg dense regions such as cpg islands, as cpgs do not count towards the limited number of positions of variation as for infinium ii probes. this resulted in a set targeting , cpgs (figure b).
for some of these , cpgs, the sequence of the probe targeting it can map to multiple locations in a genome, which could result in a confounded signal coming from multiple cpg sites. this issue is compounded by individual probes corresponding to multiple sequences reflecting different possible combinations of the degenerate bases. to identify a subset of probes less susceptible to such confounders, for high quality genomes, we computed for each probe how many of its versions map uniquely in that genome (see methods). we then filtered cpgs down by requiring that all versions of a probe targeting it map uniquely in at least % of the species they are expected to target out of the high quality genomes, unless the probe is expected to target at least mammals from the alignment, in which case the mapping criterion was discarded. this reduced the set of candidate cpgs to , cpgs. we added probes targeting cpgs to the mammalian methylation array based on their utility for human biomarker studies (supplementary data). these probes, which were previously implemented in human illumina infinium arrays (epic, k, k), were selected due to their utility for human biomarker studies estimating age, blood cell counts, or the proportion of neurons in brain tissue (guintivano et al., ; hannum et al., ; horvath, ; horvath and levine, ; horvath et al., ; houseman et al., ; levine et al., ). the final manufactured mammalian methylation array measures cytosine methylation levels of , cytosines: , of these cytosines are followed by a guanine (cpgs) and are followed by another nucleotide (non-cpgs). the probe identifiers (cg numbers) of of these cytosines end with either ". " or ". ", i.e. these are duplicate probes for genomic locations. a detailed analysis of the infinium probe context of the mammalian array and relation to human and mouse arrays is presented in supplementary figure s .
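The mappability filter described in this section can be illustrated with a small sketch. It is not the authors' code: the function name, argument layout, and the two thresholds (`min_fraction`, `exempt_at`) are hypothetical, and the actual cutoff values used for the array are not reproduced here. The handling of probes with no expected high quality genome is also an assumption.

```python
def passes_mappability_filter(expected_species, uniquely_mapped_species,
                              n_alignment_targets, min_fraction, exempt_at):
    """Return True if a candidate cpg survives the mappability filter.

    expected_species: set of high quality genomes the probe is expected to target
    uniquely_mapped_species: set of genomes in which ALL expanded versions of
        the probe map uniquely
    n_alignment_targets: number of mammals in the alignment the probe is
        expected to target
    min_fraction, exempt_at: hypothetical thresholds supplied by the caller
    """
    # Probes expected to work in very many mammals skip the mapping criterion.
    if n_alignment_targets >= exempt_at:
        return True
    # Assumption: with no high quality genome to check, the probe is kept.
    if not expected_species:
        return True
    unique = expected_species & uniquely_mapped_species
    return len(unique) / len(expected_species) >= min_fraction


if __name__ == "__main__":
    expected = {"mouse", "rat", "dog", "cow"}
    unique = {"mouse", "rat", "dog"}
    # Placeholder thresholds, chosen only for illustration.
    print(passes_mappability_filter(expected, unique, 10, 0.75, 50))
```

Keeping the thresholds as arguments makes it easy to see how the size of the candidate set depends on them.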
the mammalian methylation array's focus on highly conserved regions led to an array that is distinct from other currently available infinium arrays that focus on specific species. for example, the mammalian array only shares probes with the illumina mousemethylation array and only cpgs with the illumina epic array.

mappability analysis

all cpgs profiled on the mammalian methylation array apply to humans, but only a subset of these cpgs applies to other species. when conducting analyses in a specific species it can thus be desirable to restrict analyses to the subset of cpgs that apply in that species. one approach for doing this is to simply omit cpgs whose detection p-values from normalization methods (methods) are insignificant. this approach has the advantage of being applicable to species that have not yet been sequenced. mapping sequences to genomes has the added benefit of providing a candidate position of the sequence in the target genome from which other information about the cpg can be inferred such as the nearest gene or cpg island status. we have mapped the array cpgs to species, which also provides a candidate position from which a gene for the cpg can be associated. as expected, the closer a species is to humans, the more cpgs map to the genome of this species. over k cpgs on the array map to most placental mammalian genomes (eutherians, figure a, supplementary data). roughly k cpgs map to most non-placental mammalian genomes (marsupials), such as kangaroos or opossums. far fewer cpgs map to egg-laying mammalian genomes (monotremes), such as platypus (figure ).
a cpg that is adjacent to a given gene in humans may not map to a position adjacent to the corresponding (orthologous) gene in another species. between k and k cpgs (over %) were assigned to human orthologous genes based on their mapped position in most phylogenetic orders (rodents, bats, carnivores, figure b,c and supplementary data). these numbers surrounding orthologous genes are probably overly conservative (i.e. lower than the true numbers) because we found the majority of cpgs (about %) that do not map to orthologous genes in the non-human species are located in intergenic regions outside of promoters (methods), which suggests that one of the gene assignments was inaccurate.

chromosome and gene region coverage of array

we analyzed the chromosome and gene region coverage of the mammalian methylation array for human and mouse. the mammalian methylation array has substantial coverage of all chromosomes (human, - ; and mouse, - probes per chromosome), with the exception of chry, which only has probes in both species (supplementary figure s a). when we assign the probes to the closest gene neighbor, around % of the probes are proximal to a gene in both of these species (supplementary figure s b). the remaining % of probes are neither aligned to a promoter nor a gene body. the distribution of gene region and the distances to transcriptional start sites are comparable between human and mouse (supplementary figure s b). cpgs on the mammalian array cover human and mouse genes when each cpg is assigned uniquely to its closest gene neighbor (supplementary figure s c).
the gene coverage is uneven: while on average a gene is covered by cpgs, some genes are covered by as many as cpgs. in mouse, % of cpgs ( , ) were assigned to human orthologous genes (supplementary figure s d), suggesting many cpg measurements from the array in mice will be informative to humans (and vice versa).

gene sets represented in mammalian array

we analyzed gene set enrichments of all genes that are represented on the mammalian array using great (mclean et al., ). the significant gene sets included those found to play a role in development, growth, transcriptional regulation, metabolism, cancer, mortality, aging, and survival (supplementary figure s ). we also used the tissueenrich (jain and tuteja, ) software to analyze gene expression (methods). the majority of mammalian methylation array probes (~ %) are adjacent to genes that are expressed in all considered human and mouse tissues (supplementary figure s a,b). however, the mammalian array also contains cpgs that are adjacent to genes that are expressed in a tissue-specific manner, notably testis and cerebral cortex (supplementary figure s c).

cpg island and methylation status

we analyzed the cpg island and dna methylation properties of cpgs on the mammalian array. in general, an average of ( %) of probes in the mammalian array are located in cpg islands depending on the species (figure a).
we used a cpg island detection algorithm (gcluster software (li et al., )) that additionally provided several species-level quantitative measures for each cpg island including the length, gc content, and cpg density that we provide as a resource (supplementary data). we also analyzed the dna methylation levels in human for fractional methylation called from whole genome bisulfite sequencing data across human tissues (roadmap epigenomics consortium et al., ) (supplementary figure ). this confirmed that the mammalian methylation array targets cpgs across a wide range of fractional methylation levels.

chromatin state annotation of array probes

we analyzed the overlap of human cpgs targeted on the mammalian methylation array with chromatin states for cell and tissue types. the cpgs cover all available chromatin states including different types of promoters (including bivalent promoters), regions repressed by polycomb group proteins, transcription start and end sites, and enhancer regions (figure b). among enhancers, cpgs had greater overlap with brain and neurosphere than other tissue groups. in addition to analyzing the array cpgs' overlap for cell and tissue specific chromatin states, we also analyzed them for a universal chromatin state annotation, which provides a single annotation to the genome per position based on data from more than cell and tissue types (vu and ernst, ) (supplementary figure s ). this revealed the greatest enrichment for bivalent promoter states and also strong enrichment for other promoter states and a state associated with polycomb repression.
while the mammalian methylation array was specifically designed to profile cpgs in highly conserved stretches of dna based on sequence conservation, we assessed whether there was also evidence of conservation at the functional genomics level using human-mouse lecif scores (kwon and ernst, ). the human-mouse lecif score quantifies evidence of conservation between human and mouse at the functional genomics level using chromatin state and other functional genomic annotations. probes on the array generally had higher lecif scores than regions that align between human and mouse (figure c).

mammalian array study of calibration data

to validate the accuracy of the mammalian methylation array we applied it to synthetic dna methylation samples for three species: human (n= arrays), mouse (n= ), and rat (n= ), where the methylation levels were known. the dna samples from human, mouse and rat were engineered such that the fractional methylation at all cpg sites in their genomes was approximately %, %, %, % and % (methods). the calibration data thus allow us to define a benchmark annotation measure “proportionmethylated” (with ordinal values , . , . , . , ). the distribution of the intensity of the probes in each human sample is roughly centered around the benchmark measure (proportionmethylated) (figure a). however, as expected, the distributions across all probes in the mouse and rat samples show somewhat different patterns compared to the human samples, likely because many probes in the design of our array do not map to these genomes (figure b-c). we also evaluate these for each species after removing the probes that were not designed to map to that species, and normalizing the
array data using the sesame package, which defines beta (relative intensity) values for each probe (zhou et al., ). after this procedure, we see sharper peaks close to and , though the quantification of absolute methylation levels is somewhat degraded around the beta value . as we move away from humans (figure d-f). additionally, for each species, we computed the correlation of the dna methylation levels of each cpg with the benchmark variable "proportionmethylated" across the arrays. high positive correlations would be evidence for the accuracy of the array, which is indeed what we observe. cpgs that map to the human, mouse, and rat genome have a median pearson correlation of r= . with an interquartile range of [ . , . ], r= . with iqr=[ . , . ], and r= . with iqr=[ . , . ] with the benchmark variable proportionmethylated in the respective species. the numbers of cpgs on the mammalian array that pass a given correlation threshold (irrespective of the mappability to a given species) are reported in table . we also compare the sesame normalization with the "noob" normalization that is implemented in the minfi r package (aryee et al., ; triche et al., ) (table ). we find that sesame slightly outperforms minfi when it comes to the number of cpgs that exceed a given correlation threshold with proportionmethylated.

comparison with the human epic methylation array study in calibration data

we compared the mammalian methylation array to the human epic methylation array, which profiles k cpgs in the human genome, for non-human samples. some of the epic array probes are expected to apply to the mouse and rat genomes as well (needhamsen et al., ). to facilitate a comparison between the mammalian methylation array and the human epic array for non-human samples we applied the latter to calibration data from mouse (n= arrays) and rat (n= ).
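The per-cpg correlation with the benchmark variable, and the count of cpgs exceeding a correlation threshold, can be sketched as follows. This is a minimal illustration in plain Python, not the analysis code used for the study: the probe names, beta values, benchmark levels, and threshold below are toy values invented for the example, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if undefined)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        return float("nan")  # a constant probe has no defined correlation
    return sum(a * b for a, b in zip(dx, dy)) / denom


def correlations_with_benchmark(betas_by_cpg, proportion_methylated):
    """Correlate each cpg's beta values (one per array) with the benchmark."""
    return {cpg: pearson(proportion_methylated, betas)
            for cpg, betas in betas_by_cpg.items()}


def count_passing(correlations, threshold):
    """Number of cpgs whose correlation exceeds the threshold."""
    return sum(1 for r in correlations.values() if r > threshold)


if __name__ == "__main__":
    # Toy data: three arrays with made-up benchmark levels.
    benchmark = [0.0, 0.5, 1.0]
    betas = {
        "cg_a": [0.1, 0.5, 0.9],   # tracks the benchmark closely
        "cg_b": [0.5, 0.4, 0.6],   # weakly related
        "cg_c": [0.5, 0.5, 0.5],   # constant signal
    }
    cors = correlations_with_benchmark(betas, benchmark)
    print(cors, count_passing(cors, 0.8))
```

Because a nan never compares greater than the threshold, constant probes are simply not counted as passing.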
the same engineered dna methylation data were analyzed on the human epic array as on the mammalian methylation array above. in particular, we were able to correlate each cpg on the epic array with a benchmark measure (proportionmethylated) in mice and rats (table ). only (out of k) cpgs on the human epic exceed a correlation of . with proportionmethylated in mice. by contrast, cpgs on the mammalian array exceed the same correlation threshold in mice. similarly, the mammalian array outperforms the epic array in rats: only cpgs on the epic array exceed a correlation of . with proportionmethylated compared with cpgs on the mammalian array. the results are similar for the correlation thresholds of . and . (table ). the epic array contains cpgs that were also prioritized by the cmaps algorithm based on high levels of conservation, excluding the cpgs from human biomarker studies. out of these shared cpgs, and cpgs map to the mouse and rat genome, respectively. while human epic probes target the same cpg, the corresponding mammalian probe is typically different from the epic probe due to differences in probe type (type i versus type ii probe), dna strand, or the handling of mutations across species via degenerate bases. in the following comparison, we limited the analysis to the and probes when analyzing calibration data from mice or rats, respectively. we find that the mammalian array probes are better calibrated than the corresponding epic array probes when applied to mouse and rat calibration data according to two different analyses that focus on shared cpgs between the two platforms.
first, the mammalian array outperforms the epic array when considering mean methylation levels across the shared cpgs (figure ). second, when correlating each of the shared cpgs with the benchmark value proportionmethylated we observe a median correlation of . for both mouse and rat calibration data generated on the epic array. for the same probes we observe median correlations of . and . for mouse and rat calibration data generated on the mammalian array (sesame normalization), respectively. we are distributing the methylation data and results from our calibration data analysis in three species (supplementary data). these calibration results will allow users to select cytosines whose methylation has a high correlation with the benchmark data in human, mouse or rat.

discussion

the mammalian methylation array, which was enabled by the cmaps algorithm for selecting conserved probes, is applicable to all mammals and hence drives down the cost per chip through economies of scale. the mammalian methylation array has unique strengths: it applies to all mammalian species even those that have not yet been sequenced, it provides deep coverage of specific cytosines which is a prerequisite for developing robust epigenetic biomarkers, and its focus on highly conserved cpgs increases the chances that findings in one species will translate to those in another species. we expect that the mammalian methylation array is particularly well suited for dna methylation based biomarker studies in mammals.
our calibration data demonstrate that the array largely leads to high quality measurements in three species: human, mouse and rat. our calibration data show that the mammalian methylation array greatly outperforms the human epic chip when it comes to high fidelity measurement applications to mice and rats. the array thus should be preferable for most non-human applications unless high-fidelity measurements are not needed, in which case the larger content of the epic array may make it preferable. the mammalian methylation array has several limitations. first, only a fraction of genes in a given species are represented by cpgs. second, it focuses on cpgs in highly conserved stretches of dna and hence does not cover parts that are specific to a given species. third, it provides worse coverage in more distantly related species, for example in marsupials compared with placental mammals (eutherians). finally, the calibration data suggest there are some shifts in the absolute methylation levels detected for intermediate methylation levels, but the relative order is preserved. the correct relative ordering of beta values is of primary importance in most statistical tests and analyses. several software tools have been adapted for use with the mammalian methylation array that range from normalization to higher level gene enrichment analysis. software tools for generating normalized data include sesame and the minfi r package (aryee et al., ; zhou et al., ).
the eforge software (breeze et al., ), which has been adapted for use with the mammalian array, facilitates chromatin state analysis and transcription factor binding site analysis. many researchers will be interested in genome coordinates of the mammalian cpgs in different species. toward this end, we provide genome coordinates in species. this list of species will increase as more high quality genomes become available. detailed gene annotations in many species are available including details on gene region (e.g. exon, promoter, prime untranslated region) and cpg island status (supplementary data). for human and mouse we provide chromatin state annotations (ernst and kellis, ; gorkin et al., ; roadmap epigenomics consortium et al., ; vu and ernst, ) and the lecif score on evidence of conservation at the functional genomics level between human and mouse (kwon and ernst, ). in other articles, we will describe the application of the mammalian methylation array to many different mammalian species. these upcoming studies will demonstrate that the mammalian methylation array is useful for many applications that involve mammalian species.

methods

conserved methylation array probe selector (cmaps)

given a multi-species sequence alignment and reference genome, for each cg site and each of the four different possible probe designs, cmaps computes an estimate of the number of species from the alignment that could be targeted if the use of degenerate base technology is optimized for tolerated mutations. the four probe designs involve each combination of probe type (infinium i vs. infinium ii), and whether the probe sequence is on the forward or reverse dna strand. for
each probe option, cmaps conducts a greedy search to select tolerated mutations, including position and allele that maximize species coverage for the probe. the maximum number of degenerate bases that can be included in a probe is a function of a design score provided by illumina inc. for infinium ii probes only, cpgs present in the probe sequence count as if they are a degenerate base. more specifically, the algorithm for determining the number of species and selecting the mutations to handle performs the following steps for each probe design:

1. let m be the maximum number of degenerate bases that can be designed into a specific probe, based on the design score
2. for each species s in the alignment, let ms be the number of mismatches in the alignment between that species and the human reference sequence of the probe
   a. if ms > m or the species does not have the target cpg, continue to next species
   b. if ms <= m,
      i. for each mismatch in species s, add each degenerate position to a multiset p
      ii. add the species to a set f of feasible species to target with this probe
3. for all |p| choose m combinations of possible degenerate positions:
   a. for each unique position in the combination
      i. for each possible alternate nucleotide count the number of species in f that contain that alternate nucleotide
      ii. pick the top k alternate nucleotides based on the count in i., where k is the number of occurrences of the current position in s
   b. compute the number of species that match the human reference when accounting for the degenerate substitutions handled in a
4. select the combination of positions in s that maximizes 3.b
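The steps above can be sketched in Python for a single probe design. This is an illustrative reading of the listed procedure, not the cmaps implementation: the function name, the input layout (gap-free aligned sequences of equal length keyed by species), the `cpg_pos` argument, and the toy sequences are all assumptions made for the example. Enumerating combinations of the multiset is exponential in general and is only practical here because m is small.

```python
from collections import Counter
from itertools import combinations


def select_degenerate_bases(reference, aligned, m, cpg_pos):
    """Pick degenerate positions/nucleotides that maximize species coverage.

    reference: human probe sequence
    aligned: dict mapping species -> aligned sequence (same length, no gaps)
    m: maximum number of degenerate bases allowed for this probe
    cpg_pos: index of the target cytosine within the sequence
    Returns (design, covered), where design maps position -> set of tolerated
    alternate nucleotides and covered lists the species expected to work.
    """
    # Step 2: collect feasible species F and the multiset P of mismatch positions.
    feasible = {}
    multiset_p = []
    for species, seq in aligned.items():
        if seq[cpg_pos:cpg_pos + 2] != "CG":
            continue  # species does not have the target cpg
        mismatches = {i: b for i, (a, b) in enumerate(zip(reference, seq)) if a != b}
        if len(mismatches) > m:
            continue
        feasible[species] = mismatches
        multiset_p.extend(mismatches)

    # Baseline: species that match the reference without any degenerate base.
    best_design = {}
    best_covered = [s for s, mm in feasible.items() if not mm]

    # Step 3: try every combination of positions drawn from the multiset.
    size = min(m, len(multiset_p))
    for combo in sorted(set(combinations(sorted(multiset_p), size))):
        design = {}
        for pos, k in Counter(combo).items():
            alt_counts = Counter(mm[pos] for mm in feasible.values() if pos in mm)
            # A position drawn k times may tolerate its k most common alternates.
            design[pos] = {nt for nt, _ in alt_counts.most_common(k)}
        covered = [s for s, mm in feasible.items()
                   if all(nt in design.get(pos, ()) for pos, nt in mm.items())]
        # Step 4: keep the combination covering the most species.
        if len(covered) > len(best_covered):
            best_design, best_covered = design, covered
    return best_design, best_covered


if __name__ == "__main__":
    reference = "TTACGGA"  # toy probe; target cytosine at index 3
    aligned = {
        "human": "TTACGGA",
        "mouse": "TCACGGA",
        "rat": "TCACGGA",
        "dog": "TTACGGC",
        "cow": "TGACGTA",
        "platypus": "TTATGGA",  # target cpg lost
    }
    print(select_degenerate_bases(reference, aligned, m=2, cpg_pos=3))
```

In the toy alignment, spending both degenerate bases on two different positions covers more species than spending them on two alternates at one position, which is the trade-off step 3 explores.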
our procedure for selecting the specific targeted cpgs and probe designs is described in the main text. we note that of the cpgs selected for the mammalian methylation array based on the conservation criteria (using the sequence alignment) overlap with the human biomarker cpgs. the design of the probes targeting them could differ, however. the probe names of different probes targeting the same cpg are distinguished by extensions ". " and ". ". for example, cg . and cg . target the same cytosine but use different probe chemistry. the array contains four probes, selected on the basis of human biomarkers, that measure cytosines that are not followed by a guanine; these are indicated with "ch" instead of "cg". the cmaps algorithm was applied with human hg as the reference genome and using the multiz alignment of vertebrates with the hg human genome downloaded from the ucsc genome browser (haeussler et al., ; rosenbloom et al., ). only the mammalian species in this alignment were considered for designing the mammalian array and for the mappability analysis. however, the current version of the mappability analysis provides genome coordinates for species. the mammalian methylation array includes an additional human snp markers (whose probe names start with "rs" for human studies), which can be used to detect plate map errors when dealing with multiple tissue samples collected from the same person. finally, the mammalian array also adopted a standard suite of probes from the illumina epic array for measuring bisulfite conversion efficiency in humans.

mapping probes to genomic coordinates

we used two different approaches for mapping probes to genomes. the first approach (bsbolt software) was primarily used in designing the array.
subsequently, we adopted a second mappability approach (quasr software) that allowed us to map more probes to more species.

mappability approach : bsbolt

for version of our mappability analysis (i.e. for designing the array), we applied the bsbolt mapping approach to high quality genomes from: baboon (papham ), cat (felcat ), chimp (pantro ), cow (bostau ), dog (canfam ), gibbon (nomleu ), green monkey (chlsab ), horse (equcab ), human (hg ), macaque (macfas ), marmoset (caljac ), mouse (mm ), rabbit (orycun ), rat (rn ), rhesus monkey (rhemac ), sheep (oviari ). we used the bsbolt software package (farrell et al., ) from https://github.com/nuttylogic/bsbolt to perform the alignments. for each species' genome sequence, bsbolt creates an "in silico" bisulfite-treated version of the genome. as many of the currently available genomes are in a low quality assembly state (e.g. thousands of contigs or scaffolds), we used the utility "threader" (which can be found in bsbolt's forebear bsseeker (guo et al., ) as a standalone executable) to reformat these fasta files into concatenated and padded pseudo-chromosomes. the set of nucleotide sequences of the designed probes, which includes degenerate base positions, was explicitly expanded into a larger set of nucleotide sequences representing every possible combination of those degenerate bases. for infinium i probes, which have both a methylated and an unmethylated version of the probe sequence, only the methylated version was used, as bsbolt's version of the genome treats all cg sites as methylated.
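the expansion of degenerate probes into explicit sequences can be illustrated with a short python sketch. it assumes that degenerate positions are written with standard iupac ambiguity codes; the text does not state how the probe designs encode them, and the function name is hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(probe):
    """Return every concrete sequence encoded by a probe with IUPAC codes."""
    choices = [IUPAC[base] for base in probe.upper()]
    return ["".join(combo) for combo in product(*choices)]


# A probe with two two-fold degenerate positions yields 2 * 2 = 4 sequences.
expanded = expand_degenerate("ARY")
```

the number of expanded sequences is the product of the number of alternatives at each position, which is why a modest number of degenerate probes can produce a much larger set of sequences to align.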
the initial k probe sequences resulted in a set of , sequences to be aligned against the various species genomes. we then ran bsbolt with parameters align -m –db [path to bisulfite-treated genome] -bt bowtie -bt -p -bt -k -bt -l -f [probe sequence file] –o [alignment output file] –s to align the enlarged set of probe sequences to each prepared genome. as we were not interested in the final bsbolt style output, we made a small modification to the code to retain its temporary output of alignment results in "sam" format. from these files, we collected only alignments where the entire length of the probe perfectly matched the genome sequence (i.e. the cigar string ' m' and flag "xm= "). then, for each genome we collapsed all the sequence variant alignments for each probeid down to a list of loci for that genome and for that probe.

mappability approach : quasr

for version of our mappability analysis, we aligned the probe sequences to all available mammalian genomes in the ensembl and ncbi refseq databases using the quasr package (gaidatzis et al., ). the fasta sequence files for each genome were downloaded from these public databases. the alignment assumed that the dna had been subjected to a bisulfite conversion treatment. for each species' genome sequence, quasr creates an in silico bisulfite-treated version of the genome. the probes were aligned to these bisulfite-treated genome sequences, so that c-to-t differences are not counted as mismatches.
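the idea of aligning against an in silico bisulfite-treated genome can be illustrated with a small python sketch. it uses the common three-letter approach (every c is converted to t in both the probe and the reference before comparison). this is an illustration of the principle only, not the internal implementation of quasr or bsbolt, and the function names are hypothetical.

```python
def bisulfite_convert(seq, strand="+"):
    """Fully convert a sequence in silico: C->T on the forward strand,
    G->A when the sequence represents the reverse strand."""
    seq = seq.upper()
    if strand == "+":
        return seq.replace("C", "T")
    return seq.replace("G", "A")


def matches_ignoring_c_to_t(probe, genome_window):
    """True if probe and genome window are identical once C/T differences
    are removed by converting both sequences."""
    return bisulfite_convert(probe) == bisulfite_convert(genome_window)


# A probe designed against converted DNA still matches the unconverted window.
hit = matches_ignoring_c_to_t("ATGTTT", "ACGTCC")
```

because both sequences are reduced to a three-letter alphabet, a standard aligner can then be applied to the converted sequences without any special handling of methylation.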
the alignment was run with quasr (a wrapper for bowtie ) with parameters -k --strata --best -v and bisulfite = "undir" to align the enlarged set of probe sequences to each prepared genome. from these files, we collected the best candidate unique alignment to the genome. additionally, the estimated cpg coordinates at the end of each probe were used to extract the sequence from each genome fasta file and to exclude any probes with mismatches at the target cpg location.

genomic loci annotations

gene annotations (gff ) for each genome considered were also downloaded from the same sources as the genome. following the alignment, the cpgs were annotated to genes based on the distance to the closest transcriptional start site using the chipseeker package (yu et al., ). the genomic location of each cpg was categorized as either intergenic region, ' utr, ' utr, promoter (minus kb to plus bp from the nearest tss), exon, or intron. the unique region assignment is prioritized as follows: exons, promoters, introns, ' utr, ' utr, and intergenic. additional genomic annotations, including the human ortholog ensembl id, were extracted from the biomart ensembl database (yates et al., ). the candidate gene for each probe was compared with the human orthologous ensembl id to examine the similarity of the alignment with the human one. for each probe, we examined whether the assigned species ensembl id is identical to the human-to-other-species orthologous ensembl id in the human mappability file.
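the per-probe ortholog check described above amounts to a table lookup, which can be sketched in python as follows. the data structures, function name and identifiers are hypothetical placeholders chosen for the example; the actual analysis was carried out in r using biomart.

```python
def ortholog_concordance(species_gene, human_gene, human_to_species):
    """Compare the gene assigned in a species with the human ortholog.

    species_gene     -- dict: probe id -> gene id assigned in the other species
    human_gene       -- dict: probe id -> gene id assigned in the human genome
    human_to_species -- dict: human gene id -> set of orthologous gene ids
    Returns dict: probe id -> True (concordant), False (discordant),
    or None when no human gene or no ortholog is available.
    """
    result = {}
    for probe, gene in species_gene.items():
        expected = human_to_species.get(human_gene.get(probe))
        result[probe] = None if expected is None else gene in expected
    return result


# Placeholder identifiers, not real Ensembl ids.
calls = ortholog_concordance(
    {"probe1": "SP_G1", "probe2": "SP_G9", "probe3": "SP_G3"},
    {"probe1": "HS_G1", "probe2": "HS_G2"},
    {"HS_G1": {"SP_G1"}, "HS_G2": {"SP_G2"}},
)
```

using a set of ortholog ids per human gene allows for one-to-many orthology; probes that cannot be compared are reported as none and are not counted as discordant.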
the orthologous comparison with human was done for genomes that could be matched to the human genome by "targetspecies_homolog_associated_gene_name" in biomart using the getlds() function. cell and tissue specific chromatin state annotations were based on the -state chromhmm model based on imputed data for -marks (ernst and kellis, ; roadmap epigenomics consortium et al., ). the chromatin state annotations from a chromhmm model that was not specific to a single cell or tissue type were from (vu and ernst, ). in the annotation files of the array, we also provide chromhmm chromatin state annotations for mouse from (gorkin et al., ). the human-mouse lecif score was from (kwon and ernst, ).

cpg island annotation

we called cpg islands using the "gcluster" algorithm (gómez-martín et al., ). this algorithm, run with the default parameters, uses clustering methods to identify sequences that have high g+c content and cpg density. besides cpg island status, the algorithm calculated several other attributes, including length, gc content, and cpg density, for each defined island. the outcome of this algorithm was a bed file that was used to annotate the probes using the "annotatr" package in r by checking the overlap of the aligned probes and cpg island genomic coordinates.

human dna methylation distribution
we downloaded the fractional methylation values based on whole genome bisulfite sequencing data from different cell and tissue types from the roadmap epigenomics consortium (http://egg .wustl.edu/roadmap/data/bydatatype/dnamethylation/wgbs/fractionalmethylation.tar.gz) (roadmap epigenomics consortium et al., ). for each cpg, we averaged the fractional methylation values across the roadmap samples.

great analysis

we applied the great analysis software tool (mclean et al., ) to conduct gene set enrichments for genes near cpgs on the array in human and mouse. the great software performs both a binomial test (over genomic regions) and a hypergeometric test (over genes) when using a whole genome background. we performed the enrichment based on default settings (proximal: . kb upstream, . kb downstream, plus distal: up to , kb) for gene sets associated with go terms, msigdb, panther and kegg pathway. to avoid large numbers of multiple comparisons, we restricted the analysis to the gene sets with between and , genes. we report nominal p values and two adjustments for multiple comparisons: bonferroni correction and the benjamini-hochberg false discovery rate.

tissue enrichment analysis

the enrichment of tissue specific genes was assessed with the tissueenrich r package (jain and tuteja, ) using the teenrichment() function, limited to the human protein atlas (uhlén et al., ) and mouse encode (yue et al., ) databases.

normalization methods
r software scripts implementing normalization methods can be accessed through our webpage (see the section on data availability). two software scripts are currently available for extracting beta values from raw signal intensities, based on minfi and sesame, respectively. both methods use the noob method (triche et al., ) for background subtraction. the two scripts evaluate each probe's hybridization and extension performance using normalization control probes and infinium-i probe out-of-band measurements (the poobah method (zhou et al. )), respectively. users can use the detection p-values for each cpg to filter out non-significant methylation readouts from probes unlikely to work in the target species.

calibration data

we generated methylation data on two different platforms: the mammalian methylation array (horvathmammalmethylchip ) and the human epic methylation array. the dna samples from each species were enzymatically manipulated so that they would exhibit %, %, %, % and % methylation at each cpg location, respectively. we purchased premixed dna standards from epigendx inc (products - h-premixhuman, - m-premixmouse, and standard - r-premixrat premixed calibration standard). the variable "proportionmethylated" (with ordinal values , . , . , . , ) can be interpreted as a benchmark for each cpg that maps to the respective genome. thus, the dna methylation levels of each cpg are expected to have a high positive correlation with proportionmethylated across the array measurements from a given species. the mammalian array was applied to synthetic dna data from species: human (n= mammalian arrays), mouse (n= ), and rat (n= ). similarly, the human epic array was applied to calibration data from mouse (n= epic arrays) and rat (n= ). thus, we applied epic arrays and epic arrays per value ( , . , . , . ,
) of proportionmethylated in our mouse and rat studies, respectively. the epic array data were normalized using the noob method (r function preprocessnoob in minfi).

data availability

the mammalian methylation array (horvathmammalmethylchip ) is registered at the ncbi gene expression omnibus (geo) as platform gpl . the chip manifest file, calibration data, supplementary data, and r software scripts are or will be available from https://github.com/shorvath/mammalianmethylationconsortium/ or the gene expression omnibus.

acknowledgements and funding

this work was supported by the paul g. allen frontiers group (sh) and nsf career award # , national institutes of health (dp da ) and a jccc-bscrc ablon scholars award (je).

conflict of interest statement

the regents of the university of california is the sole owner of a provisional patent application directed at this invention for which aa, je and sh are named inventors. sh is a founder of the non-profit epigenetic clock development foundation, which plans to license several patents from his employer uc regents, and distributes the mammalian methylation array. bret barnes is an employee of illumina inc, which manufactures the mammalian methylation array. the other authors declare no conflicts of interest.
no. cpgs whose correlation with proportionmethylated > threshold
species threshold mammal+sesame mammal+minfi epic+minfi
mouse . , , ,
mouse . , , ,
mouse . , ,
rat . , , ,
rat . , , ,
rat . , ,
human . , , na
human . , , na
human . , , na

table . correlating dna methylation levels with calibration data. we evaluated the mammalian methylation array with two different software methods for normalization: sesame and minfi (noob normalization). the epic array data were only normalized with the noob normalization method in minfi. as indicated in the first column, the dna samples came from three species: human (n= arrays), mouse (n= ), and rat (n= ). for each species, the "artificial" chromosomes exhibited on average %, %, %, % and % methylation at each cpg location. thus, the variable "proportionmethylated" (with ordinal values , . , . , . , ) can be considered as a benchmark/gold standard. the table reports the number of cpgs for which the pearson correlation with proportionmethylated was greater than the correlation threshold (second column).

figures

figure . overview of mammalian methylation array design process. (a) toy example of multiple sequence alignment at a cpg site considered by the cmaps algorithm. the orange coloring highlights the cpg being targeted.
positions where other species have alignment that matches the human sequence are in dark blue; positions where other species have alignment that does not match the human sequence are in neon yellow; positions where other species have no alignment are in grey. (b) flowchart detailing the selection of probes on the array by the cmaps algorithm. a small fraction of probes designed were dropped during the manufacturing process.

figure . cpg and gene coverage of probes on the mammalian methylation array across different phylogenetic orders. (a) probe localization based on the quasr package (gaidatzis et al., ). the rows correspond to different phylogenetic orders. the phylogenetic orders are ordered based on the phylogenetic tree and increasing distance to human. the boxplots report the median number of mapped probes across species from the given phylogenetic order. (b) the number of probes mapped to human orthologous genes for a subset of genomes (methods). (c) percentage of the probes associated with human orthologous genes among mapped probes in these species.

figure . cpg island and chromatin state analysis of mammalian methylation probes.
we characterize the cpgs located on the mammalian methylation array regarding (a) cpg island status in different phylogenetic orders, (b) chromatin state analysis, and (c) lecif score of evidence of human-mouse conservation at the functional genomics level. (a) the boxplots report the median number (and interquartile range) of cpgs that map to cpg islands in mammalian species of a given phylogenetic order (x-axis). the notch around the median depicts the % confidence interval. (b) the heatmap visualizes the chromhmm chromatin state annotations of the location of the cpgs on the array (rows) in different human tissues (columns) (ernst and kellis, , ). the colors correspond to human chromatin states as detailed in the right panel. the probes in the left panel heatmap are ordered by the chromatin state with the maximum median frequency across human cell and tissue types. the right panel indicates the distribution of chromatin states in each tissue type represented on the mammalian methylation array. (c) comparison of the distribution of the lecif score for probes on the array and aligning bases between human and mouse. the lecif score has been binned as shown on the x-axis, and the fraction of probes or aligning bases with scores in that bin is shown on the y-axis.

figure . distribution of probe intensities within sample, colored by the expected percentage of methylation at each site. (a-c) distribution of beta values (relative intensity) of all probes on the array before normalization for (a) human samples, (b) mouse samples, and (c) rat samples.
(d-f) distribution of probe intensity after sesame normalization and restricting probes to those that cmaps designed to (d) the human genome in human samples, (e) the mouse genome in mouse samples, and (f) the rat genome in rat samples.

figure . calibration data: mean methylation across probes shared between the human epic array and the mammalian array. the mammalian methylation array contained probes targeting cpgs that can also be found on the human epic array and that were not included on the basis of being human biomarkers. however, the mammalian array probes were engineered differently from the epic probes so that they would be more likely to work across mammals. by applying both array types to calibration data, we are able to compare the calibration of the overlapping probes in mice (a,c) and rats (b,d). upper panels (a,b) and lower panels (c,d) present the results for the mammalian array and the epic array, respectively. each panel plots the benchmark measure (proportionmethylated, x-axis) against the mean methylation value (y-axis) across roughly cpgs that map to mice (a,c) and roughly cpgs that map to rats (b,d). the mean methylation was formed across a subset of cpgs that i) are present on the human epic array, ii) are present on the mammalian array, and iii) apply to the respective species according to the mappability analysis genome coordinate file.

[figure panels: (a) mouse, mammalian array, sesame; (b) rat, mammalian array, sesame; (c) mouse dna, epic array; (d) rat dna, epic array. x-axis: proportionmethylated; y-axis: mean methylation across the probes that map to the respective species. each panel reports a correlation (cor) and p-value.]

supplementary figures

supplementary figure s : comparison of probe context between the illumina epic, k and the mammalian methylation array: (a) analysis of cpg and non-cpg (ch) probes, (b) color channel assignment, (c) type i and type ii probes, and (d) next base reveals similar percentages across probes from these three array platforms. color channel assignment and probe basepair context are important for dna methylation array analysis, and the similarity between these different arrays can facilitate the extension of published analysis and normalization methods. analysis of type i and type ii probes shows a slightly lower percentage of type i probes for the mammalian methylation array. type i probes assay dna methylation using one color channel and two bead types, i.e. one unmethylated bead type and one methylated bead type.
conversely, type ii probes assay dna methylation using one bead type and two color channels indicating methylated and unmethylated cytosines. adjustment for dna methylation signal detected by these different probe types is one of the most important steps in dna methylation array normalization, and a sufficient number of type i probes were included in the mammalian methylation array to facilitate the extension of published data normalization methods. (e) comparison of shared and non-shared probes between the mammalian methylation array and mousemethylation array loci reveals shared probes. (f) comparison of shared and non-shared probes between the epic, k and the mammalian methylation array. comparative analysis was performed using illumina probe ids, which are unique to each probe. intersection of ids between arrays reveals over , probes that are common to all platforms (center). these probes can be used to follow up published human epigenome-wide association study (ewas) results in model organisms such as mouse (mus musculus) or rat (rattus norvegicus), or across a range of other species, including all primates and other mammals.

supplementary figure s . chromosome and gene region analysis of mammalian methylation probes in humans and mice. the analysis is based on mapping probes on the mammalian methylation array to the human (hg ) and mouse (mm ) genome using the quasr package (gaidatzis et al., ). (a) the number of probes per human and mouse chromosome.
(b) the left panel reports the percentage of probes that are located in different gene regions (promoters, ' utr, ' utr, introns, exons) in humans and mice. the right panel reports the distribution of the probes relative to the nearest transcriptional start site. (c) histogram of cpg number in different gene regions in human and mouse genomes (as defined in the legend of panel d). (d) alignment to orthologous genes between humans and mice. the colors indicate the mapped gene region in the mouse genome. the unique region assignment is prioritized as follows: exons, promoters, introns, ' utr, ' utr.

supplementary figure s . great gene set enrichment analysis of all probes on the mammalian methylation array. the figure shows the top enriched pathways based on gene-level enrichment analysis for genes proximal to probes using great . the two columns correspond to enrichment analysis for human (hg ) and mouse (mm ) genomes, respectively, using the whole genome as background. the top five enriched datasets from each category (canonical pathways, diseases, gene ontology, human and mouse phenotypes, and upstream regulators) were selected and further filtered for significance at p < - . the category is indicated by the shape, the number of genes by the size of the shape, and the significance of the enrichment by the color scale.
supplementary figure s . human and mouse tissue-specific probes on mammalian methylation array. characterization of the tissue specificity of cpg probes on the mammalian methylation array using the human protein atlas (uhlén et al., ) and mouse encode gene expression data (yue et al., ). the left and right panels report results for human and mouse genomes, respectively. each probe is mapped to the closest gene, while other genes in the flanking region are ignored in this analysis. the number of genes (a) and the number of cpg probes (b) versus a categorical measure of tissue specificity. the categories on the y-axis are defined in the tissueenrich software as follows. "tissue enriched" labels genes with an expression level greater than (tpm or fpkm) that also have at least five-fold higher expression levels in a particular tissue compared to all other tissues. "group enriched" labels genes with an expression level greater than (tpm or fpkm) that also have at least five-fold higher expression levels in a group of - tissues compared to all other tissues, and that are not considered tissue enriched. "tissue enhanced" labels genes with an expression level greater than (tpm or fpkm) that also have at least five-fold higher expression levels in a particular tissue compared to the average levels in all other tissues, and that are not considered tissue enriched or group enriched. (c) the number of tissue-enriched genes represented on the mammalian array vs background in the human and mouse transcriptome.
supplementary figure s . distribution of dna methylation levels. distribution of average fractional methylation across cell and tissue types (roadmap epigenomics consortium et al., ) at cpg sites on the array (blue) and all sites in the genome (red).

supplementary figure s : mammalian methylation array enrichment for universal chromatin state annotations. (left) distribution of probe overlap with a universal chromatin state annotation by the stacked modeling approach of chromhmm applied to data from more than cell or tissue types (vu and ernst, ). (right) the same as left, but showing the fold enrichments of the state relative to a uniform background. the strongest enrichment is seen for some bivalent promoter states. a full characterization of the states can be found in (vu and ernst, ).

references

aryee, m.j., jaffe, a.e., corrada-bravo, h., ladd-acosta, c., feinberg, a.p., hansen, k.d., and irizarry, r.a. ( ).
minfi: a flexible and comprehensive bioconductor package for the analysis of infinium dna methylation microarrays. bioinformatics , – .
bibikova, m., le, j., barnes, b., saedinia-melnyk, s., zhou, l., shen, r., and gunderson, k.l. ( ). genome-wide dna methylation profiling using infinium® assay. epigenomics , – .
bibikova, m., barnes, b., tsan, c., ho, v., klotzle, b., le, j.m., delano, d., zhang, l., schroth, g.p., gunderson, k.l., et al. ( ). high density dna methylation array with single cpg site resolution. genomics , – .
breeze, c.e., reynolds, a.p., van dongen, j., dunham, i., lazar, j., neph, s., vierstra, j., bourque, g., teschendorff, a.e., stamatoyannopoulos, j.a., et al. ( ). eforge v . : updated analysis of cell type-specific signal in epigenomic data. bioinformatics , – .
cokus, s.j., feng, s., zhang, x., chen, z., merriman, b., haudenschild, c.d., pradhan, s., nelson, s.f., pellegrini, m., and jacobsen, s.e. ( ). shotgun bisulphite sequencing of the arabidopsis genome reveals dna methylation patterning. nature , – .
ernst, j., and kellis, m. ( ). chromhmm: automating chromatin-state discovery and characterization. nat. methods , – .
ernst, j., and kellis, m. ( ). large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues. nat. biotechnol. , – .
farrell, c., thompson, m., tosevska, a., oyetunde, a., and pellegrini, m. ( ). bisulfite bolt: a bisulfite sequencing analysis platform. biorxiv . . . .
gaidatzis, d., lerch, a., hahne, f., and stadler, m.b. ( ). quasr: quantification and annotation of short reads in r. bioinformatics , – .
gómez-martín, c., lebrón, r., oliver, j.l., and hackenberg, m. ( ). prediction of cpg islands as an intrinsic clustering property found in many eukaryotic dna sequences and its relation to dna methylation. methods mol. biol. clifton nj , – .
gorkin, d.u., barozzi, i., zhao, y., zhang, y., huang, h., lee, a.y., li, b., chiou, j., wildberg, a., ding, b., et al. ( ).
an atlas of dynamic chromatin landscapes in mouse fetal development. nature , – .
guintivano, j., aryee, m.j., and kaminsky, z.a. ( ). a cell epigenotype specific model for the correction of brain cellular heterogeneity bias and its application to age, brain region and major depression. epigenetics , – .
guo, w., fiziev, p., yan, w., cokus, s., sun, x., zhang, m.q., chen, p.-y., and pellegrini, m. ( ). bs-seeker : a versatile aligning pipeline for bisulfite sequencing data. bmc genomics , .
haeussler, m., zweig, a.s., tyner, c., speir, m.l., rosenbloom, k.r., raney, b.j., lee, c.m., lee, b.t., hinrichs, a.s., gonzalez, j.n., et al. ( ). the ucsc genome browser database: update. nucleic acids res. , d –d .
hannum, g., guinney, j., zhao, l., zhang, l., hughes, g., sadda, s., klotzle, b., bibikova, m., fan, j.-b., gao, y., et al. ( ). genome-wide methylation profiles reveal quantitative views of human aging rates. mol. cell , – .
horvath, s. ( ). dna methylation age of human tissues and cell types. genome biol. , r .
horvath, s., and levine, a.j. ( ). hiv- infection accelerates age according to the epigenetic clock. j. infect. dis. , – .
horvath, s., oshima, j., martin, g.m., lu, a.t., quach, a., cohen, h., felton, s., matsuyama, m., lowe, d., kabacik, s., et al. ( ). epigenetic clock for skin and blood cells applied to hutchinson gilford progeria syndrome and ex vivo studies. aging , – .
houseman, e.a., accomando, w.p., koestler, d.c., christensen, b.c., marsit, c.j., nelson, h.h., wiencke, j.k., and kelsey, k.t. ( ).
dna methylation arrays as surrogate measures of cell mixture distribution. bmc bioinformatics , .
jain, a., and tuteja, g. ( ). tissueenrich: tissue-specific gene enrichment analysis. bioinforma. oxf. engl. , – .
kwon, s.b., and ernst, j. ( ). learning a genome-wide score of human-mouse conservation at the functional genomics level. biorxiv . . . .
levine, m.e., lu, a.t., quach, a., chen, b.h., assimes, t.l., bandinelli, s., hou, l., baccarelli, a.a., stewart, j.d., li, y., et al. ( ). an epigenetic biomarker of aging for lifespan and healthspan. aging , – .
li, x., chen, f., and chen, y. ( ). gcluster: a simple-to-use tool for visualizing and comparing genome contexts for numerous genomes. bioinforma. oxf. engl. , – .
lister, r., pelizzola, m., dowen, r.h., hawkins, r.d., hon, g., tonti-filippini, j., nery, j.r., lee, l., ye, z., ngo, q.-m., et al. ( ). human dna methylomes at base resolution show widespread epigenomic differences. nature , – .
mclean, c.y., bristor, d., hiller, m., clarke, s.l., schaar, b.t., lowe, c.b., wenger, a.m., and bejerano, g. ( ). great improves functional interpretation of cis-regulatory regions. nat. biotechnol. , – .
meissner, a., gnirke, a., bell, g.w., ramsahoye, b., lander, e.s., and jaenisch, r. ( ). reduced representation bisulfite sequencing for comparative high-resolution dna methylation analysis. nucleic acids res. , – .
needhamsen, m., ewing, e., lund, h., gomez-cabrero, d., harris, r.a., kular, l., and jagodic, m. ( ). usability of human infinium methylationepic beadchip for mouse dna methylation studies. bmc bioinformatics , .
ooi, s.k.t., qiu, c., bernstein, e., li, k., jia, d., yang, z., erdjument-bromage, h., tempst, p., lin, s.-p., allis, c.d., et al. ( ). dnmt l connects unmethylated lysine of histone h to de novo methylation of dna. nature , – .
pidsley, r., zotenko, e., peters, t.j., lawrence, m.g., risbridger, g.p., molloy, p., van djik, s., muhlhausler, b., stirzaker, c., and clark, s.j. ( ). critical evaluation of the illumina methylationepic beadchip microarray for whole-genome dna methylation profiling. genome biol. , .
roadmap epigenomics consortium, kundaje, a., meuleman, w., ernst, j., bilenky, m., yen, a., heravi-moussavi, a., kheradpour, p., zhang, z., wang, j., et al. ( ). integrative analysis of reference human epigenomes. nature , – .
robertson, k.d. ( ). dna methylation and human disease. nat. rev. genet. , – .
rosenbloom, k.r., armstrong, j., barber, g.p., casper, j., clawson, h., diekhans, m., dreszer, t.r., fujita, p.a., guruvadoo, l., haeussler, m., et al. ( ). the ucsc genome browser database: update. nucleic acids res. , d –d .
smith, z.d., and meissner, a. ( ). dna methylation: roles in mammalian development. nat. rev. genet. , – .
triche, t.j., weisenberger, d.j., van den berg, d., laird, p.w., and siegmund, k.d. ( ). low-level processing of illumina infinium dna methylation beadarrays. nucleic acids res. , e .
uhlén, m., fagerberg, l., hallström, b.m., lindskog, c., oksvold, p., mardinoglu, a., sivertsson, Å., kampf, c., sjöstedt, e., asplund, a., et al. ( ). proteomics. tissue-based map of the human proteome. science , .
vu, h., and ernst, j. ( ). universal annotation of the human genome through integration of over a thousand epigenomic datasets. biorxiv . . . .
yates, a.d., achuthan, p., akanni, w., allen, j., allen, j., alvarez-jarreta, j., amode, m.r., armean, i.m., azov, a.g., bennett, r., et al. ( ). ensembl . nucleic acids res. , d –d .
yu, g., wang, l.-g., and he, q.-y. ( ).
chipseeker: an r/bioconductor package for chip peak annotation, comparison and visualization. bioinformatics , – .
yue, f., cheng, y., breschi, a., vierstra, j., wu, w., ryba, t., sandstrom, r., ma, z., davis, c., pope, b.d., et al. ( ). a comparative encyclopedia of dna elements in the mouse genome. nature , – .
zhou, w., triche, t.j., jr, laird, p.w., and shen, h. ( ). sesame: reducing artifactual detection of dna methylation by infinium beadchips in genomic deletions. nucleic acids res. , e –e .

periodicity in the embryo: emergence of order in space, diffusion of order in time

bradly alicea, ujjwal singh

keywords: periodicity, dynamical systems, c. elegans, zebrafish, developmental biology, modeling and simulation

abstract

does embryonic development exhibit characteristic temporal features? such features are quite apparent in evolution, where evolutionary change has been shown to occur in bursts of activity. using two animal models (the nematode caenorhabditis elegans and the zebrafish danio rerio) and simulated data, we demonstrate that temporal heterogeneity exists in embryogenesis at the cellular level, and may have functional consequences. cell proliferation and division data from cell tracking are analyzed to characterize specific features in each model species. simulated data is then used to understand what role this variation might play in producing variation in the adult phenotype. this goes beyond a molecular characterization of developmental regulation to provide a quantitative result at the phenotypic scale of complexity.
author affiliations: openworm foundation, boston, ma usa (balicea@openworm.org); orthogonal research and education laboratory, champaign, il usa; iiit delhi, delhi, india.

biorxiv preprint, this version posted january , . ; doi: https://doi.org/ . / . . . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc . international license ( http://creativecommons.org/licenses/by-nc/ . / ).

introduction

while the case for the effects of "tempo and mode" [ ] has been made for the evolutionary process, a similar relationship between phenotypic change, time, and space may also exist in development. one obvious way to investigate this relationship is to examine the expression and sequence variation of genes associated with cell cycle and developmental patterning [ ]. however, there is a potentially more compelling top-down explanation. we will use two model organisms to demonstrate how periodicity becomes less synchronized over developmental time and space. in the case of the nematode caenorhabditis elegans, a comparison of embryogenetic and postembryonic cells (developmental and terminally-differentiated cell birth times acquired from [ ]) reveals two general patterns. for the zebrafish ( danio rerio ), comparisons within and between embryogenesis stages based on measurements of cell nuclei in the animal hemisphere [ ] reveal patterns at multiple scales.

one of the most notable signatures is burstiness [ , ], or a large number of events occurring in a short period of time. these bursts can either be periodic or aperiodic, and these statistical features define the temporal nature of development, potentially in a universal manner across species. based on two species and a computational model, we predict that periodic changes in the frequency of new cells over developmental time represent cell proliferation without functional distinction. we also analyze the intervals between bursts in cell division (and cell differentiation in the case of c. elegans ). these bursts are derived from both time-series segmentation and decomposition in the frequency domain. we show that these results consistently point to great temporal variation at the cellular level, which may play a role in shaping morphogenesis. in addition, these changes in frequency and periodicity over time result in spatial variation (supplemental figure ).

to characterize spatial variation, we utilize embryo networks [ ]. embryo networks are complex networks based on the relative proximity of cells as they divide and migrate during the developmental process. the resulting network topologies provide information not only about spatial variation, but about cellular interactions and other signaling connections as well [ , ]. the existence of network structure in the form of modules or regions of dense connectivity can reveal a great deal about the unfolding of lineage trees in time.

returning to the first prediction, we can create computational summaries of cell division events, which we call numeric embryos, to model the proliferation of cells over time. numeric embryos can be used to model branching events in a lineage tree, and in particular the distribution of branching events in time, independent of cell identity or spatial context. approximating this distribution provides us with a periodic time-series that tells us something about the speed of embryogenesis: how quickly can different underlying distributions of cell division produce a phenotype with many undifferentiated cells? the rate at which developmental cells are produced could affect the rate of overall development, as we will see in an example from zebrafish.
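To make the idea of a numeric embryo concrete, the following Python sketch builds a zero/one vector of division events from a list of intervals and asks how long it takes to reach a given number of division rounds. It is a minimal illustration only: the function names `numeric_embryo` and `time_to_reach`, the 200-unit time span, and both interval lists are hypothetical choices and are not taken from the authors' code or data.

```python
def numeric_embryo(intervals, total_time):
    """Return a 0/1 vector with a 1 at each time unit where a division round occurs."""
    events = [0] * total_time
    t = 0
    for gap in intervals:
        t += gap                      # next event falls `gap` units after the previous one
        if t >= total_time:
            break                     # ignore events beyond the simulated time span
        events[t] = 1
    return events


def time_to_reach(events, target):
    """First time unit at which `target` division rounds have occurred, or None."""
    count = 0
    for t, event in enumerate(events):
        count += event
        if count >= target:
            return t
    return None


# Hypothetical interval lists with the same mean interval (10 time units).
regular = numeric_embryo([10] * 20, 200)            # one round every 10 units
bursty = numeric_embryo([2, 2, 2, 34] * 5, 200)     # short bursts, long pauses

print(sum(regular), time_to_reach(regular, 10))
print(sum(bursty), time_to_reach(bursty, 10))
```

In this toy case both series contain the same number of events, yet the bursty series reaches ten rounds earlier because its events are front-loaded within each cycle. The point at which two timing schemes are compared therefore matters.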
finally, we predict that the emergence and subsequent changes in spatiotemporal periodicity at the cellular level lead to regulatory phase transitions. for example, there is a one-to-one correspondence between cell division and waves of differentiation after the syncytial stage in drosophila melanogaster [ ]. in a similar fashion, amphibians exhibit a decay of synchrony of division [ , ] that corresponds to differentiation wave activity [ ]. based on data analysis, modeling, and literature review, we anticipate that further investigation could uncover whether, in regulating embryos, mitosis and cell differentiation are correlated. in interpreting the data, we discuss the potential applicability of holtzer’s quantal mitosis hypothesis [ , ] as it relates to the process of differentiation relative to the proliferation of developmental (undifferentiated) cells.

methods

a summary of the methods could be given here for smooth reading and interest. all materials are located on github: https://github.com/orthogonal-research-lab/periodicity-in-the-embryo . this repository includes processed data, supplemental materials, and associated code.

secondary datasets

the c. elegans and d. rerio data sets were acquired from the systems science of biological dynamics (ssbd) database ( http://ssbd.qbic.riken.jp/ ). the c. elegans (nematode) data [ ] is based on cell tracking of the nucleus, pmid: . the d. rerio (zebrafish) data [ ] is likewise based on cell tracking of the nucleus, pmid: . the cell tracking data is used to determine the total number of new cells (cell birth time) present at a particular time step. for the c. elegans data, cell births correspond to minutes of developmental time, and windows of size five ( minutes of developmental time) are used for the time-series plots and histograms. since lineage trees and the nature of developmental
cell identification are different in zebrafish, cell births correspond to the number of observed cells at discrete points in developmental time. windows representing a certain number of cells in the embryo observed at a given sampling point are used instead of directly converting this process to minutes of developmental time.

zebrafish developmental stages

estimates and calculations of d. rerio developmental stages are derived from [ ] and the zfin zebrafish developmental staging series web resource ( https://zfin.org/zf_info/zfbook/stages/ ). where applicable, embryo stages are approximated from the number of cells observed at any given point in developmental time.

peak-finding method

for both the c. elegans and d. rerio data, a peak-finding method is used to evaluate periodicity and to generate data points representing distinct bursts of cell birth. briefly, local peaks in the cell division series are discovered by finding the highest value around the peak over an interval of data points. the data are then visually inspected to ensure that local maximal fluctuations were not selected. using this segmentation method, we are able to define intervals between peaks in a way that allows for the aperiodic regions of our series to be compared to the highly periodic regions. the peak-finding results are supplemented by a fast fourier transform (fft) frequency analysis of cell divisions in c.
elegans embryo (supplemental figure ), cell differentiation events in c. elegans embryo (supplemental figure ), and time series for cell divisions in zebrafish embryo (supplemental figure ). the power spectra largely confirm the nature of our interval and peak analysis. while the analysis of zebrafish reveals a power spectrum at a single scale, the c. elegans embryo reveals a power spectrum of multiple time scales for both cell divisions and differentiations.

embryo networks

the full methodology for constructing and evaluating embryo networks can be found in [ ]. briefly, embryo networks are complex networks constructed from the locations of cells in an embryo. nodes are centroids representing cell nuclei, and edges represent the spatial (euclidean) distance between cells in a three- (static) or four- (dynamic) dimensional graph. all nuclei are plotted in embryo space, which is a coordinate system normalized to the center point between all cell locations in a complete embryo. for example, an edge of length . represents two centroids at opposite edges of the embryo space. a distance threshold is then derived from the length of the edge: in this paper, a distance threshold of . is used, excluding all but the cell nuclei in very close proximity to each other.

numeric embryo

numeric embryos are statistical summaries of the type of information acquired from our secondary datasets, but in a more generic manner. numeric embryos are based on generated pseudo data and are meant to capture the structure of hypothetical developmental scenarios. all analyses of our pseudo data were conducted using scilab . (paris, france). each numeric embryo consists of one or more vectors describing rounds of cell division in the embryo. briefly, each minute of developmental time is represented by either a zero or a positive non-zero value. for
purposes of temporal comparison, all non-zero values are thresholded to one. to generate cell division intervals of different sizes, we start with a uniform distribution (division events occur every n minutes) and then compare this with a distribution generated using the grand function in scilab. for the poisson distribution, we use a 𝜆 = . (except where otherwise noted), while for the binomial distribution, we use parameters n = . and p = . . this produces intervals that are variable over developmental time.

results

our analysis will proceed from c. elegans to zebrafish, to a comparison of the two species, then to a network analysis, and finally to a simulation of cell division in development. first, we plot the developmental cell division dynamics in c. elegans and zebrafish in figures and , respectively, and cell differentiation in c. elegans in figure . we then examine the intervals between cell division events ( c. elegans ) and relative frequency of birth rates across development (zebrafish) in figures and , respectively. focusing on the peaks (maximum of bursts of cell births) shown in figures and , figure shows the distribution of intervals between peak values for c. elegans and zebrafish. figure helps us extend this finding from temporal dynamics to connectivity between cells and spatial distributions of newly-born cells. we conclude with an investigation of how the intervals found between cell divisions can be modeled using various statistical distributions, shown in figure .
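The interval-generation step can be sketched in a few lines. The example below is written in Python rather than Scilab and uses only the standard library in place of the grand function. All parameter values (the Poisson rate, the binomial n and p, and the number of division rounds) are placeholders chosen for illustration, not the values used in the paper, and the helper names are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """One Poisson(lam) draw using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def sample_binomial(n, p, rng):
    """One Binomial(n, p) draw as a sum of n Bernoulli trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def threshold_to_one(values):
    """Map every non-zero entry to 1 so that only event timing is compared."""
    return [1 if v > 0 else 0 for v in values]


rng = random.Random(42)      # fixed seed so the sketch is reproducible
n_rounds = 20                # placeholder number of division rounds

uniform_gaps = [5] * n_rounds                                   # an event every 5 time units
poisson_gaps = [sample_poisson(5.0, rng) for _ in range(n_rounds)]
binomial_gaps = [sample_binomial(10, 0.5, rng) for _ in range(n_rounds)]

for name, gaps in [("uniform", uniform_gaps),
                   ("poisson", poisson_gaps),
                   ("binomial", binomial_gaps)]:
    # total elapsed time, shortest gap, and longest gap for each scheme
    print(name, sum(gaps), min(gaps), max(gaps))

print(threshold_to_one([0, 3, 0, 1, 7]))
```

The uniform list has no spread at all, while the sampled lists vary from round to round, which is the property that the comparison between timing schemes relies on.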
these simulations (called numeric embryos) can reveal properties related to the speed of development, particularly the linear and nonlinear accumulation of cells.

caenorhabditis elegans example

to understand the temporal nature of cell division and differentiation, we start by looking at patterns in c. elegans development over time. figure shows a time series of such events from zygote to adulthood. we are particularly interested in potential spikes or bursts of events in a short period of time. figure shows the fluctuations in embryonic division events (figure , top) and differentiation events (figure , bottom). differentiation events occurring after minutes of developmental time (postembryonic development) occur in a long series of bursts, likely corresponding to the differentiation of seam cells. this can be contrasted with the burstiness that occurs in embryonic development, which is similar to the burstiness of division events.

figure shows the intervals between cell division events across embryonic development in c. elegans . this plot confirms an exponential distribution with a long tail, presumably representing intervals in postembryonic development. yet this plot is also sparse, yielding only distinct intervals of cell division throughout all of c. elegans development. this is likely due to the deterministic nature of c. elegans development along with the relatively small number of cells. supplemental figures and reveal the power spectrum for cell division and cell differentiation in c. elegans, respectively. to compare, contrast, and understand these trends further, we now turn to the embryonic development of the zebrafish (d. rerio).
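The two periodicity measures used in this analysis, peak-to-peak intervals and the power spectrum, can be illustrated on a toy series. The sketch below is a stand-in built on stated assumptions: `find_peaks` uses a simple sliding-window maximum with a hypothetical `half_window` parameter, `power_spectrum` is a naive discrete Fourier transform rather than an optimized FFT routine, and the birth counts are invented.

```python
import cmath


def find_peaks(series, half_window):
    """Indices whose value is the (first) maximum within +/- half_window points."""
    peaks = []
    n = len(series)
    for i, value in enumerate(series):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = series[lo:hi]
        # keep non-zero points that equal the window maximum; requiring the first
        # occurrence of that maximum means a plateau yields a single peak
        if value > 0 and value == max(window) and window.index(value) == i - lo:
            peaks.append(i)
    return peaks


def peak_intervals(peaks):
    """Distances between consecutive peaks."""
    return [b - a for a, b in zip(peaks, peaks[1:])]


def power_spectrum(series):
    """Power at each non-negative frequency bin of the mean-subtracted series."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(centered))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


# Invented birth counts: one burst every 4 time steps.
births = [2, 8, 2, 0] * 8
peaks = find_peaks(births, half_window=1)
spectrum = power_spectrum(births)
dominant = max(range(1, len(spectrum)), key=spectrum.__getitem__)

print(peaks)
print(peak_intervals(peaks))
print(dominant, len(births) / dominant)   # dominant bin and its period in time steps
```

For a strictly periodic series such as this one, the two measures agree: every peak-to-peak interval equals the period recovered from the dominant frequency bin. On real data with aperiodic stretches the interval distribution broadens and power spreads toward low frequencies.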
figure . developmental cell births in the nematode c. elegans. cell divisions occur according to developmental time (minutes). the timeline ranges from fertilized egg (zygote) to adulthood. embryonic division events (blue), differentiation events (red).

zebrafish

in figure (top), we observe six regular bursts of cell division, followed by aperiodic cell division behavior. this transition in periodicity is observed after the embryo reaches cells in size (figure , bottom). we do not observe this in c. elegans embryos, and this may have to do with the more regulative nature of zebrafish embryogenesis [ ]. changes in periodicity may also have to do with the establishment of spatial differentiation beyond the axial variability observed in c. elegans.

to better understand the nature of periodicity in zebrafish, we examined the distribution of intervals between birth times. figure and supplemental figure confirm the bursty nature of cell division in zebrafish, in that most sampling time points only feature a few cell births, while a small number of sampling time points represent a large number of cells born. for example, a large majority of sampling time points feature fewer than new cells per time point. by contrast, there are also single sampling points where over cells are born at a single time. in terms of the
power spectrum shown in supplemental figure , there is a very high amplitude at very low frequencies, perhaps related to the significant noise and aperiodicity in the later part of the time-series shown in figure .

figure . the interval between cell division events across embryonic development in c. elegans .

considering the cell divisions for the first period of zebrafish embryogenesis, we conduct an interval analysis for each oscillation of the data shown in figure for c. elegans (top) and d. rerio (bottom). these are measured from peak to peak as described in the methods. for the analysis of c. elegans data (figure , top), our analysis yields a roughly unimodal distribution, with a mean peak interval of - minutes. in pre-hatch c. elegans embryogenesis, there are many quick bursts of cell division as confirmed in figure (top). this results in bursty behavior that is regular and perhaps even periodic. by contrast, an analysis of our zebrafish data yields three interval groups (figure , bottom): the greatest number of oscillations occurs at a period of - minutes, while a smaller number of oscillations occur with periods from - . there is also a longer -minute interval between oscillations. this is consistent with the shift from periodic bursts to aperiodic but still bursty behavior later in zebrafish development shown in figure . this multimodal distribution of peaks points to a more complex process at play, something that might be better understood by investigating morphogenesis as a spatial process.

embryo networks: an example from zebrafish

another way to identify the consequences of bursts in cell division timing and other non-uniform temporal phenomena is to utilize embryo networks. an embryo network was constructed (figure , top) for cells born during our sampling time
points of d. rerio embryogenesis. the resulting circular graph demonstrates a high degree of modularity, but only across part of the graph. a three-dimensional plot (figure , bottom) demonstrating the position of each cell born during these stages of development shows that the highest degrees of connectivity are clustered in the center of the embryo, while cells that are disconnected based on our connectivity threshold exist on the edges of the embryo. importantly, it appears that cells are more densely clustered toward the center of the embryo at the earliest stages of development. these dense clusters are likely the product of cell division fluctuations shown in figures and .

figure . cell births in zebrafish embryos during embryogenesis up to the gastrula stage. instead of developmental time, relative developmental progress is plotted as all cells observed in the embryo at each sampling time point. for figure , bottom: periodic region (red), aperiodic region (unshaded).

figure . relative frequency of birth rate across developmental time in d. rerio. histogram demonstrates the distribution of cells born during a single sampling time point.
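The embryo network construction used in this section (nuclear centroids as nodes, edges between nuclei that lie within a distance threshold in a normalized embryo space) can be sketched as follows. This is an illustration under stated assumptions rather than the published method: the normalization used here (centering on the mean position and scaling so that the two farthest nuclei are 1.0 apart), the threshold of 0.1, the function names, and the four coordinates are all hypothetical.

```python
import math
from itertools import combinations


def normalize(points):
    """Center 3-d points on their mean and scale so the largest pairwise distance is 1.0."""
    n = len(points)
    center = [sum(p[k] for p in points) / n for k in range(3)]
    centered = [[p[k] - center[k] for k in range(3)] for p in points]
    span = max(math.dist(p, q) for p, q in combinations(centered, 2))
    return [[c / span for c in p] for p in centered]


def embryo_network(points, threshold):
    """Edges (i, j, distance) for every pair of nuclei no farther apart than threshold."""
    edges = []
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        d = math.dist(p, q)
        if d <= threshold:
            edges.append((i, j, d))
    return edges


def degrees(n_nodes, edges):
    """Number of edges attached to each node."""
    deg = [0] * n_nodes
    for i, j, _ in edges:
        deg[i] += 1
        deg[j] += 1
    return deg


# Invented centroids: three nuclei close together and one far away.
nuclei = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10)]
points = normalize(nuclei)
edges = embryo_network(points, threshold=0.1)

print([(i, j) for i, j, _ in edges])
print(degrees(len(points), edges))
```

With a strict threshold, only the tightly packed nuclei are connected and the distant nucleus is left with no edges, which mirrors the contrast between the densely connected center and the disconnected cells at the edge of the embryo.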
numeric embryo experiments

a numeric embryo (or perhaps more accurately a numeric one) allows us to understand the fundamental features of cell division events relative to the efficiency of their timing. is one timing scheme superior to another? we know that in real (biological) lineage trees, cell divisions do not occur at a completely regular rate. are there advantages to one particular statistical signature over another, particularly when comparing it to an artificial (regular) scheme? table shows a summary of how this simulation is constructed.

table . an example of our numeric simulation, with variable and sample values. we use the uniform distribution as the basis for poisson noise, which helps to execute things a bit faster on average. compare this to uniform division times such as a division event occurring once every units of time. generated poisson interval represents the size of the interval between division events, while division interval represents when the event occurs in developmental time.

developmental time unit | division time (au) | generated poisson interval | division interval

our timing data can be modeled as branches of a binary tree which are generated every n units of developmental time. the intervals between n_1, n_2, n_3, ..., n_t are determined by a probability distribution, which can be uniform (every branching event occurring at completely regular intervals), or a poisson distribution (where branching events are distributed in an exponential fashion).

figure . interval size of peaks in cell division for all developmental cells in c.
elegans (top) and first minutes of zebrafish (bottom). c. elegans sampling time points correspond to most of the pre-hatch developmental period ( minutes post-fertilization), while the zebrafish sampling time points correspond roughly to the period between the zygote and the oblong/sphere stages of the blastula.

figure . top: an embryo network for the d. rerio embryo at the cell stage (all cells born during the zygote and cleavage stages), with edges. the edge threshold is an embryo distance of . . bottom: cells in developmental location color-coded by status in the network. white: all cells not above the threshold. red: all source cells with at least one edge to another cell. blue: all destination cells with at least one edge to another cell. red and blue are equivocal. black: all cells with more than eight edges to other cells.

the graphs in figure tell us that, by modeling division events using a poisson distribution, we can achieve the same number of divisions in fewer developmental time units. figure (top) shows a uniform distribution of division events, while figure (bottom) shows the uniform case as compared to other distributions (exponential, poisson, and binomial).
the poisson distribution yields the “fastest” time relative to the number of divisions produced. by contrast, the binomial distribution yields the lowest number of divisions (hence is the slowest method examined). however, none of these methods produces orders-of-magnitude differences in division rate, which is what would be expected from a bursty signature.

discussion

in this paper, we examine the periodicity of cell proliferation and division using three model systems: zebrafish (danio rerio), nematode (caenorhabditis elegans), and a simulated embryo. when we refer to periodicity in development, we mean events that reoccur over time: regular pulses of cell proliferation events in a short period of time. this leads us to propose a principle of development based on timing. there can also be a spatial component of developmental periodicity. this includes signatures of time-independent spatial periodicity such as tilings and other repeatable patterns across space.

interpretation of figures

we interpret figures and in a number of ways. the first is by looking at components of variation over time. we measure this in terms of the interval between cell birth times in c. elegans (figure ) and the frequency of cell birth rates in zebrafish (figure ). we also focus on intervals between other features in the time-series such as peaks for both species in figure . in investigating peak intervals, we discover a similar distribution of cell division events between species in figures and , but a difference between species when looking at specific time-series features (figure ). the reason for this is clear: features such as peaks (magnitude) have a different underlying mechanism than events such as cell division. while both are linked to the lineage tree, magnitude differences are linked to the synchronization of cell division due to deterministic timing.
with deterministic timing, synchronized cell divisions produce a lot of cells at any one point in developmental time, but little fluctuation between time points. in the case of stochastic timing, a lot of cells can be produced with a great degree of fluctuation between time points. there are a number of ways to interpret the embryo network and -d plot shown in figure . one interpretation is that in zebrafish, the phenotype is built from the inside out, with densely-packed cells representing fledgling anatomical structures such as the notochord and heart. these clusters may be linked to rounds of cell division (occurring in temporal bursts), while cell divisions occurring during the inter-burst intervals may contribute to cells at the outer edge of the embryo, perhaps representing the ectoderm layer [ , ]. in this way, temporal bursts of cell division lead to a spatial hierarchy of cell differentiation.

figure . comparison of cumulative cell division events and the speed of division generated by a numeric embryo. top: uniform only (blue). bottom: uniform (blue), exponential (orange), poisson (gray), and binomial (yellow).

this spatial hierarchy involves a number of evolutionary and biophysical constraints that have been demonstrated in a number of experimental settings. for example, physical confinement affects the overall axial alignment and geometry of an embryo [ ]. this includes our zebrafish embryo network. other types of fishes (astyanax, see [ ]) exhibit morphological changes in neural crest cell proliferation based on evolutionary changes due to ecological constraints.
in c. elegans, asymmetrical cells (or daughter cells with significantly different volumes) result from physical constraints and compose % of c. elegans developmental cell divisions [ , ]. asymmetric cell divisions set up key cell-cell interactions [ ] that are highlighted by the edges of embryo networks. finally, by comparing nematic alignment of liquid crystals to spindles of mitotic cells, phase transitions in actively dividing cells are found to result from the timing of centrosome separation [ ].

figure provides an introduction to the numeric embryo concept. in this figure, we focus exclusively on the timing component of lineage trees. this is essentially a version of the developmental time series shown for zebrafish and c. elegans, but with the temporal fluctuations smoothed out. these fluctuations are replaced with a cumulative sum of all cell division events occurring over a certain period of time. it is also apparent that comparisons between different distributions do not yield an appreciable difference in developmental speed (or the accumulation of x cells over a certain period of time). in figure , all simulations were run for iterations. to explore the potential of the poisson distribution further, we investigate how this distribution approximates cumulative cell division (as was done in figure ) for three values of λ ( . , . , and . ). the results of this experiment are shown in supplemental figure . as this parameter value is increased, the number of cells per developmental time point increases while the interval between cell divisions decreases.
while the function derived from λ = . is always slowest, the functions derived from λ = . and λ = . are similar for the first timepoints, then diverge to reveal that λ = . clearly results in both faster cell divisions and a larger number of total cells after iterations.

broader questions

we can ask what it means when embryogenetic systems exhibit multiple pulses of cell proliferation from division events. in particular, the intervals between pulses provide information about the generative mechanisms behind production of the embryo. our inquiry is well suited to quantitative interpretation, particularly in terms of characterizing "bursty" behaviors. these bursty behaviors are non-normally distributed generative processes [ ] that describe the tempo and mode of development. while tempo and mode is generally an evolutionary phenomenon, these concepts also yield a model of developmental regulation that is explicitly temporal. our results also suggest that developmental regulation is not simply a molecular mechanism. our network analysis also demonstrates a connection between the spatiotemporal dynamics of cell division, cell differentiation, and a systems-level view of timing. for example, we have found that structure and timing of interactions shape embryo network coherence signaling [ ], which in turn is an indicator of diffusion between developmental cells that share network connections. while it is not discussed in this paper, gene expression fluctuations and stochastic noise in gene expression drive heterogeneity in division timing and even timing of differentiation [ , ]. in particular, a focus on the molecular biology of the cell cycle across groups of developmental cells [ , ] can provide more information about how fluctuations work in general at the single-cell level. yet single cells acting in synchrony (or in the aggregate) define the patterns observed in our empirical data.
one way to generalize our results to a broader cross-species context is to examine related phenomena such as mitotic bookmarking [ ], in which heritable regulatory information is transmitted from mother to daughter cells in a cell lineage. our approach is also quite valuable [see ] for understanding this particular scale of the biological organism. to understand these results more fully in the context of groups of cells producing mean behaviors, we can appeal to the quantal mitosis hypothesis. quantal mitosis involves changes in gene expression, in which the fate depends upon mitosis. this is also a gene expression-related memory mechanism that is widespread in development [ ]. in cases of an observed wave or peak in cell divisions at a certain point in developmental time, mitosis provides an opportunity to change gene expression [ ], and ultimately serves as a collective signal for changes in cell fate [ ]. finally, the way in which we decompose the spatiotemporal dynamics of the embryo might be useful as a supplement to reaction-diffusion models of morphogenesis [ ]. future work will involve extending this type of analysis to other species, in addition to developing our numerical models to include explicitly spatial phenomena.

acknowledgements

we would like to thank members of the devoworm group for their support and feedback, particularly susan crawford-young. thanks also go to the openworm foundation for their institutional support.

supplemental figures

supplemental figure . example of an embryo network from the -cell c. elegans embryo built using cell tracking data.
data shown in the context of a cartoon showing the anterior end of the embryo. different colored edges represent cells born at different generations of the lineage tree (levels).

supplemental figure . frequency-domain plot of cell division event frequencies in c. elegans embryo. all events greater than an amplitude of are shown in red, while all events greater than an amplitude of are shown in blue.

supplemental figure . frequency-domain plot of cell differentiation event frequencies in c. elegans embryo. all events greater than an amplitude of are shown in red, while all events greater than an amplitude of are shown in blue.

supplemental figure . frequency-domain plot of cell division event frequencies in zebrafish embryo.

supplemental figure . comparison of cumulative cell division events and the speed of division generated by a numeric embryo for the poisson distribution at three different values of λ. blue: λ = . , black: λ = . , red: λ = . .
references

[ ] simpson, g.g. ( ). tempo and mode in evolution. columbia university press, new york.
[ ] ogura, y. and sasakura, y. ( ). developmental control of cell-cycle compensation provides a switch for patterned mitosis at the onset of chordate neurulation. developmental cell, ( ), p - . doi: . /j.devcel. . .
[ ] bhatla, n. ( ). an interactive visualization of the c. elegans cell lineage. wormweb, wormweb.org/celllineage
[ ] keller, p.j., schmidt, a.d., wittbrodt, j., and stelzer, e.h.k. ( ). reconstruction of zebrafish early embryonic development by scanned light sheet microscopy. science, ( ), - . doi: . /science. .
[ ] barabasi, a.l. ( ). the origin of bursts and heavy tails in human dynamics. nature, ( ), – .
[ ] abney, d.h., dale, r., louwerse, m.m., and kello, c.t. ( ). the bursts and lulls of multimodal interaction. cognitive science, ( ), - .
[ ] alicea, b. and gordon, r. ( ). cell differentiation processes as spatial networks: identifying four-dimensional structure in embryogenesis. biosystems, , - .
[ ] alicea, b. ( ). the emergent connectome in caenorhabditis elegans embryogenesis. biosystems, , - .
[ ] alicea, b. ( ). raising the connectome: the emergence of neuronal activity and behavior in c. elegans. frontiers in cellular neuroscience, doi: . / fncel. . .
[ ] foe, v.e. and alberts, b.m. ( ). studies of nuclear and cytoplasmic behaviour during the five mitotic cycles that precede gastrulation in drosophila embryogenesis. journal of cell science, , - .
[ ] boterenbrood, e.c., narraway, j.m. and hara, k. ( ). duration of cleavage cycles and asymmetry in the direction of cleavage waves prior to gastrulation in xenopus laevis. roux's archives of developmental biology, ( ), - .
[ ] boterenbrood, e.c. and narraway, j.m. ( ).
the direction of cleavage waves and the regional variation in the duration of cleavage cycles on the dorsal side of the xenopus laevis blastula. roux's archives of developmental biology, , - .
[ ] gordon, n.k. and gordon, r. ( ). embryogenesis explained. world scientific publishing, singapore.
[ ] holtzer, h., rubinstein, n., fellini, s., yeoh, g., chi, j., birnbaum, j. and okayama, m. ( ). lineages, quantal cell cycles, and the generation of cell diversity. quarterly reviews in biophysics, ( ), - .
[ ] holtzer, h., biehl, j., antin, p., tokunaka, s., sasse, j., pacifici, m. and holtzer, s. ( ). quantal and proliferative cell cycles: how lineages generate cell diversity and maintain fidelity. progress in clinical biological research, , - .
[ ] bao, z., murray, j.i., boyle, t., ooi, s.l., sandel, m.j., and waterston, r.h. ( ). automated cell lineage tracing in caenorhabditis elegans. pnas, ( ), - .
[ ] keller, p.j., schmidt, a.d., wittbrodt, j., and stelzer, e.h.k. ( ). reconstruction of zebrafish early embryonic development by scanned light sheet microscopy. science, ( ), - .
[ ] kimmel, c.b., ballard, w.w., kimmel, s.r., ullmann, b., and schilling, t.f. ( ). stages of embryonic development of the zebrafish. developmental dynamics, , - .
[ ] raible, d.w. and eisen, j.s. ( ). regulative interactions in zebrafish neural crest. development, , - .
[ ] menon, t., borbora, a.s., kumar, r., and nair, s. ( ). dynamic optima in cell sizes during early development enable normal gastrulation in zebrafish embryos. developmental biology, ( - ), - .
[ ] shah, g., thierbach, k., schmid, b., waschke, j., reade, a., hlawitschka, m., roeder, i., scherf, n., and huisken, j. ( ). multi-scale imaging and analysis identify pan-embryo cell dynamics of germ layer formation in zebrafish. nature communications, , .
[ ] desmaison, a., guillaume, l., triclin, s., weiss, p., ducommun, b., and lobjois, v. ( ). impact of physical confinement on nuclei geometry and cell division dynamics in d spheroids. scientific reports, , . doi: . /s - - - .
[ ] yoshizawa, m., hixon, e., and jeffery, w.r. ( ). neural crest transplantation reveals key roles in the evolution of cavefish development. integrative and comparative biology, ( ), - .
[ ] fickentscher, r. and weiss, m. ( ). physical determinants of asymmetric cell divisions in the early development of caenorhabditis elegans. scientific reports, , . doi: . /s - - - .
[ ] alicea, b. and gordon, r. ( ). quantifying mosaic development: towards an evo-devo postmodern synthesis of the evolution of development via differentiation trees of embryos [invited]. biology, ( ), .
[ ] leoni, m., manyuhina, o.v., bowick, m.j., and marchetti, m.c. ( ). defect driven shapes in nematic droplets: analogies with cell division. soft matter, , - . doi: . /c sm f
[ ] bono, r., blanca, m.j., arnau, j., and gómez-benito, j. ( ). non-normal distributions commonly used in health, education, and social sciences: a systematic review. frontiers in psychology, , . doi: . /fpsyg. .
[ ] akbarpour, m. and jackson, m. ( ). diffusion in networks and the virtue of burstiness. pnas, ( ), e -e .
[ ] ben-moshe, s.
and itzkovitz, s. ( ). bursting through the cell cycle. elife, , e .
[ ] wang, h., yuan, z., liu, p., and zhou, t. ( ). division time-based amplifiers for stochastic gene expression. molecular biosystems, ( ), - . doi: . /c mb a.
[ ] csikasz-nagy, a. ( ). computational systems biology of the cell cycle. briefs in bioinformatics, ( ), - . doi: . /bib/bbp .
[ ] dangarh, p., pandey, n., and vinod, p.k. ( ). modeling the control of meiotic cell divisions: entry, progression, and exit. biophysical journal, ( ), - . doi: . /j.bpj. . . .
[ ] festuccia, n., gonzalez, i., owens, n., and navarro, p. ( ). mitotic bookmarking in development and stem cells. development, , - .
[ ] alfieri, r., merelli, i., mosca, e., and milanesi, l. ( ). a data integration approach for cell cycle analysis oriented to model simulation in systems biology. bmc systems biology, , . doi: . / - - - .
[ ] halley-stott, r.p., jullien, j., pasque, v., and gurdon, j. ( ). mitosis gives a brief window of opportunity for a change in gene transcription. plos biology, ( ), e . https://doi.org/ . /journal.pbio.
[ ] perez-carrasco, r., beentjes, c. and grima, r. ( ). effects of cell cycle variability on lineage and population measurements of messenger rna abundance. journal of the royal society interface, .
[ ] green, j.b.a. and sharpe, j. ( ). positional information and reaction-diffusion: two big ideas in developmental biology combine. development, , - ; doi: . /dev. .
fibrinolysis influences sars-cov- infection in ciliated cells

yapeng hou , yan ding , hongguang nie , *, hong-long ji

department of stem cells and regenerative medicine, college of basic medical science, china medical university, shenyang, liaoning , china. department of cellular and molecular biology, university of texas health science center at tyler, tyler, tx , usa.

*address correspondence to hgnie@cmu.edu.cn

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission.

abstract

rapid spread of covid- has caused an unprecedented pandemic worldwide, and an inserted furin site in sars-cov- spike protein (s) may account for increased transmissibility. plasmin, and other host proteases, may cleave the furin site of sars-cov- s protein and γ subunits of epithelial sodium channels (γenac), resulting in an increase in virus infectivity and channel activity.
given the importance of enac in the regulation of airway surface and alveolar fluid homeostasis, whether sars-cov- will share and strengthen the cleavage network with enac proteins at the single-cell level urgently deserves consideration. to address this issue, we analyzed single-cell rna sequence (scrna-seq) datasets, and found that plau (encoding urokinase plasminogen activator), scnn g (γenac), and ace (sars-cov- receptor) were co-expressed in alveolar epithelial, basal, club, and ciliated epithelial cells. the relative expression levels of plau, tmprss , and ace were significantly upregulated in severe covid- patients and sars-cov- infected cell lines, as determined using the seurat and deseq r packages. moreover, the increments in plau, furin, tmprss , and ace were predominately observed in different epithelial cells and leukocytes. accordingly, sars-cov- may share and strengthen the enac fibrinolytic proteases network in ace positive airway and alveolar epithelial cells, which may expedite virus infusion into the susceptible cells and bring about enac associated edematous respiratory condition.

keywords: sars-cov- ; plasmin; enac; covid- ; furin

introduction

the sars-cov- infection leads to covid- with pathogenesis and clinical features similar to those of sars and shares the same receptor, angiotensin-converting enzyme (ace ), with sars-cov to enter host cells (zhou et al. , li and zheng ). by comparison, the transmission ability of sars-cov- is much stronger than that of sars-cov, owing to diverse affinity to ace (wrapp and wang ). the fusion capacity of coronavirus via the spike protein (s protein) determines infectivity (wrapp and wang , kam et al. b).
highly virulent avian and human influenza viruses bearing a furin site (rxxr) in the haemagglutinin have been described (coutard et al. ). cleavage of the furin site enhances the entry ability of ebola, hiv, and influenza viruses into host cells (claas et al. ). consisting of receptor-binding (s ) and fusion domains (s ), coronavirus s protein needs to be primed through the cleavage at s /s site and s ’ site for membrane fusion (jaimes et al. , huggins ). the newly inserted furin site in sars-cov- s protein significantly facilitated the membrane fusion, leading to enhanced virulence and infectivity (xia et al. , wang, qiu, et al. ). plasmin cleaves the furin site in sars-cov s protein (kam et al. b), which is upregulated in the vulnerable populations of covid- (ji et al. ). however, whether plasmin cleaves the newly inserted furin site in the sars-cov- s protein remains obscure. plasmin cleaves the furin site of the human γ subunit of epithelial sodium channels (enac) as demonstrated by lc-ms and functional assays (zhao, ali, and nie , sheng et al. ). very recently, it has been proposed that the global pandemic of covid- may partially be driven by the targeted mimicry of enac α subunit by sars-cov- (gentzsch and rossier , muhanna et al. ). enac are located at the apical side of the airway and alveolar cells, acting as a critical system to maintain airway surface and alveolar fluid homeostasis (ji et al. , matalon, bartoszewski, and collawn ). the luminal fluid is required for keeping normal ciliary beating to expel inhaled pathogens, allergens, and pollutants and for migration of immune cells that release pro-inflammatory cytokines and chemokines (hou et al. a). the plasmin family and ace are expressed in the respiratory epithelium (nie et al. , hanukoglu and hanukoglu , kam et al. a). however, whether the plasmin system and enac are involved in the fusion of sars-cov- into host cells is unknown.
this study aims to determine whether plau, scnn g, and ace are co-expressed in the airway and lung epithelial cells and whether sars-cov- infection alters their expression at the single-cell level. we found that these genes, especially plau, were significantly upregulated in epithelial cells of severe/moderate covid- patients and sars-cov- infected cell lines, mainly owing to ciliated cells. we conclude that the most susceptible cells for sars-cov- infection could be the ones co-expressing these genes and sharing plasmin-mediated cleavage.

results

furin sites are identified in both virus and host enac proteins

a furin site was located in the s protein of sars-cov- from arginine- to serine- (rrar|s), and a similar site was also seen in the s protein of hcov-oc , mers, and hcov-hku coronavirus (fig. a). in addition, the highly conserved rxxr motif existed in the hemagglutinin protein of influenza h n , herpes, ebola, hiv, dengue, hepatitis b, west nile, marburg, zika, epstein-barr, and respiratory syncytial virus (rsv). the furin site (rkrr|e) was found in the gating relief of inhibition by proteolysis (grip) domain of the extracellular loop of the mouse, rat, and human γenac (fig. b). the similarity of these furin sites is - %.

respiratory cells co-express plau, scnn g, and ace

to identify subpopulations of cells co-expressing plau, scnn g, and ace , we analyzed scrna-seq datasets using the nferx scrna-seq platform (https://academia.nferx.com/) (supplementary table ). all three genes were co-expressed in the following cells ranked by the expression level of plau from high to low: club cells, goblet cells, basal cells, at cells, ciliated cells, fibroblasts, mucous cells, deuterosomal cells, and at cells (fig.
c), which were supported by previous studies (sungnak et al. , wang et al. , hanukoglu and hanukoglu ). these results suggest that these cell populations co-expressing plau-enac-ace may be more susceptible to the sars-cov- infection compared with others. in addition, the top ten ranked cell sub-populations expressing plau, scnn g, or ace alone were listed in supplementary table . to compare the transcripts of the proteases in different lung epithelial cells, we analyzed the lung dataset from gene expression omnibus (geo) using seurat, and the cells were annotated by their specific markers (supplementary fig. a). the data showed that all these proteases were expressed in at cells, including plau, furin, prss (trypsin), elane (elastase), prtn (myeloblastin), cela (elastase- ), cela a (elastase- a), ctrc (chymotrypsin-c), tmprss (transmembrane protease serine ), and tmprss (transmembrane protease serine ) (supplementary fig. b). in at cells, the protease expression levels, in order, are: tmprss > furin > tmprss > plau > cela > elane > prss > prtn > ctrc > cela a. for plau, the high to low order is basal > club > ciliated > at > at . the expression levels of proteases (plau, furin, tmprss , plg), ace , and scnn g in cell types co-expressing ace , scnn g, and plau were compared in fig. . the club cells showed the highest expression level of plau, and ace , scnn g, tmprss , furin, and plg showed a higher expression level in club cells compared with other cell types. of note, ciliated cells ranked second and seventh in the expression of plau and ace , respectively.
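the rxxr furin motif reported in the first results subsection can be located in a protein sequence with a simple pattern scan. the sketch below is only an illustration of that idea and not the procedure used in this study: the function name and the two toy peptides are hypothetical, with each peptide built around one of the motifs quoted in the text (rrar and rkrr).

```python
import re

# R-X-X-R, where X is any amino acid (one-letter code).
# The lookahead lets overlapping matches be reported as well.
FURIN_MOTIF = re.compile(r"(?=(R[A-Z]{2}R))")


def find_furin_sites(sequence):
    """Return (1-based start position, matched tetrapeptide) for each RXXR hit."""
    seq = sequence.upper()
    return [(m.start() + 1, m.group(1)) for m in FURIN_MOTIF.finditer(seq)]


if __name__ == "__main__":
    # Toy peptides (hypothetical), not real protein sequences.
    print(find_furin_sites("AAARRARSAAA"))  # → [(4, 'RRAR')]
    print(find_furin_sites("GGRKRREGG"))    # → [(3, 'RKRR')]
```

a match only indicates that the minimal motif is present; whether a given site is actually cleaved by furin or plasmin depends on its structural context and has to be established experimentally.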
expression levels of plau, scnn g, and ace in sars-cov- infection

to detect the potential changes in the cell populations that co-express plau, scnn g, and ace , we analyzed the scrna-seq datasets of bronchoalveolar lavage fluid (balf) cells, which are mainly composed of epithelial cells and leukocytes. three groups were studied: healthy controls, moderate, and severe covid- patients. the expression level and the percentage of total cells expressing plau and furin were significantly upregulated in the severe group compared with controls (p < . ), and the expression levels of ace , tmprss , scnn g, and plg were also slightly upregulated (fig. a and b). the expression levels of plau, furin, tmprss , and ace and the number of cells were profiled in fig. a. the data showed that these genes were upregulated in covid- patients, and the number of cells expressing these upregulated genes increased in an almost severity-dependent manner. plau was significantly elevated in the severe group (p < . ), and the other genes also showed an increasing trend (fig. b). the increments in plau (alveolar epithelial cells, basal, and ciliated cells), plg (basal cells), furin (alveolar epithelial cells, basal, ciliated cells), tmprss (basal and ciliated cells), scnn g (alveolar epithelial cells and basal cells), and ace (alveolar epithelial cells, basal, and club) were predominately observed in different cells. in particular, a significant increase in plau expression was seen in ciliated cells, while the expression of measured genes showed a decline in covid- goblet cells (fig. c). in addition, similar changes of these genes in leukocytes were shown in supplementary fig. . to corroborate the results in covid- patients, we analyzed bulk-seq data of human respiratory epithelial cell lines infected with sars-cov- : a , calu- , and nhbe (blanco-melo et al. ).
plau transcript was significantly upregulated in all three cell lines after sars-cov- infection (multiplicity of infection = ) (fig. , p < . ). however, tmprss was only upregulated in infected calu- cells, consistent with recent studies (p < . ) (xu et al. ). similar to sars and mers infections, sars-cov- infection also increased the expression level of ace in a cells (p < . ) (smith et al. ). although sars-cov- did not change the mrna level of scnn g significantly in these cell lines as that for influenza virus, more attention should be paid to the post-translational modification of enac (hou et al. b).
discussion
the novel coronavirus, sars-cov- , was identified as the causative agent for a series of atypical respiratory diseases, and the disease termed covid- was officially declared a pandemic by the world health organization on march , (pollard, morran, and nestor-kalinoski ). sars-cov- has a great impact on human health all over the world, and its virulence and pathogenicity may be related to the inserted furin site. whilst the sars-cov- s ’ cleavage site has a similar sequence motif to sars-cov and would thus be suitable for cleavage by trypsin-like proteases, insertions of additional arginine residues at the sars-cov- s /s (rrar|s) clearly generate a furin cleavage site (zhou et al. ). interestingly, this difference has been implicated in the viral transmissibility of sars-cov- (anand et al. ). our data support the finding that furin sites (rrar|s) exist not only in human viruses but also in the -subunit of enac, which is highly expressed in alveolar epithelial cells and is a substrate for cleavage by plasmin.
plasmin has also been reported to cleave the furin site in the envelope proteins of viruses and to enhance their virulence and pathogenicity (sidarta-oliveira et al. ). sars-cov- has evolved a unique s /s cleavage site, absent in any previous coronavirus sequenced, resulting in the striking mimicry of an identical furin-cleavable peptide on αenac, a protein critical for the homeostasis of airway surface liquid (anand et al. ). taken together, these observations indicate that sars-cov- infection hijacks the enac proteolytic network, which is associated with the edematous respiratory condition (fig. ) (chen et al. , zhao, ali, and nie ). our data showed that the respiratory cells co-expressing the sars-cov- receptor, enac (scnn g), and plasmin family members were mainly alveolar type Ⅰ/Ⅱ, basal, club, and ciliated cells. the plg (plasminogen) expression in different cell types is not shown because its expression is too low to be detected in many lung scrna-seq datasets. of note, the ciliated cell is the predominant contributor to plau upregulation in severe covid- patients. as expected, plau levels, as well as tmprss , are upregulated in respiratory epithelial cell lines after sars-cov- infection, supporting the idea that sars-cov- can facilitate ace -mediated viral entry via tmprss spike glycoprotein priming (roberts et al. ). enhanced plau expression induced by sars-cov- infection will activate plasminogen, which may facilitate sars-cov- invasion by cleaving the s protein. the scrna-seq data of bronchoalveolar lavage fluid cells from covid- patients do not show a difference in the expression of scnn g (enac), which is considered to be regulated by plasmin through proteolytic hydrolysis. enac activity is determined not only by mrna/protein expression but also by cell proteases. once enac is biosynthesized and trafficked to the golgi, it is likely to be modified by an intracellular protease (furin).
after insertion into the plasma membrane, enac can be fully proteolytically activated by extracellular proteases (elastase, plasmin, chymotrypsin, and trypsin) (thibodeau and butterworth ). intriguingly, the plg gene also did not show a difference between covid- patients and healthy controls, indicating that hyperfibrinolysis in covid- patients may be induced by enhanced urokinase (ji et al. ). additional analysis of clinical studies or animal models is urgently needed to further explore the relationship between plasmin, enac, and sars-cov- receptors at the protein level. an increased incidence of thrombotic events has been reported in covid- , and tissue plasminogen activator (tpa) has been tried to treat stroke in covid- patients (vinayagam and sattu ). we did not analyze the changes of plat in balf cells of covid- patients because tpa (plat) is generally expressed in endothelial cells. similarly, the beneficial effects of plasmin on alveolar fluid clearance and novel mechanisms underlying the cleavage of human enacs at multiple sites by plasmin have been described in our recent studies (zhao, ali, and nie ). new drugs that regulate the upa/upa receptor (upar) system have been demonstrated to help treat the severe complications of pandemic covid- (d'alonzo, de fenza, and pavone ). amiloride, a prototypic inhibitor of enac, could be an ideal candidate for covid- patients, supporting the view that enac is a downstream target of plasmin and is involved in luminal fluid absorption in sars-cov- infection (adil, narayanan, and somanath ).
considering the two diametrically opposed therapeutic regimens in practice to address the complicated coagulopathic changes in covid- , fibrinolytic (alteplase, tpa) (bona et al. , ly et al. , wang, hajizadeh, et al. , barrett et al. , christie et al. , papamichalis et al. , poor et al. , arachchillage et al. ) and antifibrinolytic therapies (nafamostat and tranexamic acid) (asakura and ogawa , doi et al. , thierry ), our data provide new and comprehensive information on fibrinolytic related therapy targeting plasmin(ogen) as a promising approach to combat covid- .
methods
alignment of furin sites in viral and enac proteins
the sequences of enac proteins (rat, mouse, and humans) and human viruses were acquired from uniprot (https://www.uniprot.org/). the accession numbers were p dtc (for sars-cov- ), p (hiv), p (h n ), a a g xeb (ebola), a a ayz (mers), p (epstein-barr), p (herpes), p (dengue), p (hepatitis), q q p (west nile), a a b w (zika), p (respiratory syncytial virus), p (marburg), p (hcov-oc ), a a h h (hcov-hku ), p (human enac), q wu (mouse enac), and p (rat enac). alignment was performed using the jalview software (version: . . . ). the d structures of sars-cov- s (pdb id: x a) and enac (pdb id: bqn) were downloaded from the protein data bank (http://www.rcsb.org/) and modified.
co-expression profiles of enac, ace , and proteases
we performed a systematic expression profiling of ace and enac across published human single-cell rna sequencing (scrna-seq) studies comprising ~ . million cells using the nferx single-cell platform (https://academia.nferx.com/) (anand et al. ). the mean expression of plau, scnn g, and ace in a given cell-population (mean cp k) was z-score normalized (to ensure the standard deviation = and mean ~ for all the genes) to obtain relative expression profiles across all the samples. the expression of plau, scnn g, and ace in the respiratory system was analyzed and plotted as heatmaps using the r package pheatmap.
acquisition, filtering, and processing of scrna-seq data
the dataset downloaded from the gene expression omnibus was filtered for integration. the lung scrna-seq dataset ( healthy controls in gse ) was filtered by total number of reads (nreads > , ), number of detected genes ( < ngenes < , ), and mitochondrial percentage (mito.pc < . ). the balf scrna-seq dataset was composed of healthy controls, moderate and severe covid- patients in gse , and healthy control in gsm . these datasets were filtered by total number of reads (nreads > , ), number of detected genes ( < ngenes < , ), and mitochondrial percentage (mito.pc < . ). finally, a filtered gene-barcode matrix of all samples was integrated with seurat v to remove batch effects across different donors as described previously (stuart et al. ).
dimensionality reduction and clustering
the filtered gene-barcode matrix was first normalized using the ‘lognormalize’ method in seurat v. with default parameters. the top , variable genes were then identified using the ‘vst’ method in the seurat findvariablefeatures function. principal component analysis (pca) was performed using the top , variable genes. then uniform manifold approximation and projection for dimension reduction (umap) or t-distributed stochastic neighbor embedding (tsne) was performed on the top principal components for visualizing the epithelial cells. meanwhile, graph-based clustering was performed on the pca-reduced data with seurat v. . the resolution was set to . and . for the lung and balf datasets, respectively, to obtain a finer result. the markers used for balf cell annotation are shown by the bubble plot in supplementary fig. .
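the cell-level filtering and the ‘lognormalize’ step described above can be illustrated with a minimal python sketch. it is not the authors' seurat code: the thresholds, cell records, and gene names below are hypothetical placeholders, and the normalization only follows the usual definition of log-normalization, log(1 + count / total × scale factor), with an assumed scale factor of 10,000 (the documented seurat default).

```python
import math


def passes_qc(cell, min_reads, min_genes, max_genes, max_mito_fraction):
    """Return True if a cell satisfies all three QC criteria (reads, genes, mito)."""
    return (
        cell["n_reads"] > min_reads
        and min_genes < cell["n_genes"] < max_genes
        and cell["mito_fraction"] < max_mito_fraction
    )


def log_normalize(counts, scale_factor=10_000):
    """Log-normalize one cell: log1p(count / total_counts * scale_factor)."""
    total = sum(counts.values())
    return {gene: math.log1p(c / total * scale_factor) for gene, c in counts.items()}


# Hypothetical thresholds, for illustration only (not the values used in the study).
QC = dict(min_reads=1000, min_genes=200, max_genes=6000, max_mito_fraction=0.1)

# Hypothetical cells; "counts" holds raw counts per (made-up) gene.
cells = [
    {"id": "cell_a", "n_reads": 5000, "n_genes": 1500, "mito_fraction": 0.03,
     "counts": {"gene_1": 3, "gene_2": 1, "gene_3": 4996}},
    {"id": "cell_b", "n_reads": 400, "n_genes": 150, "mito_fraction": 0.02,
     "counts": {"gene_1": 0, "gene_2": 2, "gene_3": 398}},   # too few reads and genes
    {"id": "cell_c", "n_reads": 8000, "n_genes": 2500, "mito_fraction": 0.40,
     "counts": {"gene_1": 5, "gene_2": 0, "gene_3": 7995}},  # too much mitochondrial signal
]

kept = [cell for cell in cells if passes_qc(cell, **QC)]
normalized = {cell["id"]: log_normalize(cell["counts"]) for cell in kept}
```

in this toy input only `cell_a` survives the filter; downstream steps such as variable-gene selection, pca, and clustering would then operate on `normalized`.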
differentiation of gene expression levels
differential gene expression in balf cells among the healthy, moderate, and severe groups was assessed using the wilcoxon test in seurat v. (findmarkers function). then, we divided balf cells into epithelial cells and leukocytes and compared gene expression levels among their subgroups. both epithelial cells and leukocytes were re-clustered to detect the differences in gene expression of all cell types between healthy controls and severe/moderate covid- patients. bulk-seq data (gse ) was analyzed for differentially expressed genes in respiratory epithelial cell lines using deseq with the wald test and benjamini-hochberg correction (blanco-melo et al. , love, huber, and anders ). differences were considered significant if p < . .
acknowledgment
this study was supported by nsfc , nih grants hl , hl , and hl , aha awards aha grnt and aha grnt . we are grateful to yunlai zhou (yangzhou university) and congxi zhang (gene denovo) for their assistance with bioinformatics.
conflict of interest
the authors declare no conflicts of interest.
references
adil, m. s., s. p. narayanan, and p. r. somanath. . "is amiloride a promising cardiovascular medication to persist in the covid- crisis?" drug discov ther no. ( ): - . doi: . /ddt. . . anand, p., a. puranik, m. aravamudan, and a. j. venkatakrishnan. . "sars-cov- strategically mimics proteolytic activation of human enac." elife no. :e . doi: . /elife. . arachchillage, d.
j., a. stacey, f. akor, m. scotz, and m. laffan. . "thrombolysis restores perfusion in covid- hypoxia." no. ( ):e -e . doi: . /bjh. . asakura, h., and h. ogawa. . "potential of heparin and nafamostat combination therapy for covid- ." j thromb haemost no. ( ): - . doi: . /jth. . barrett, c. d., a. oren-grinberg, e. chao, a. h. moraco, m. j. martin, s. h. reddy, a. m. ilg, r. jhunjhunwala, m. uribe, h. b. moore, e. e. moore, e. n. baedorf-kassis, m. l. krajewski, d. s. talmor, s. shaefi, and m. b. yaffe. . "rescue therapy for severe covid- -associated acute respiratory distress syndrome with tissue plasminogen activator: a case series." j trauma acute care surg no. ( ): - . doi: . /ta. . blanco-melo, d., b. e. nilsson-payant, w. c. liu, s. uhl, d. hoagland, r. moller, t. x. jordan, k. oishi, m. panis, d. sachs, t. t. wang, r. e. schwartz, j. k. lim, r. a. albrecht, and b. r. tenoever. . "imbalanced host response to sars-cov- drives development of covid- ." cell no. ( ): - e . doi: . /j.cell. . . . bona, r. d., a. valbusa, g. malfa, d. r. giacobbe, p. ameri, n. patroniti, c. robba, v. gilad, a. insorsi, m. bassetti, p. pelosi, and i. porto. . "systemic fibrinolysis for acute pulmonary embolism complicating acute respiratory distress syndrome in severe covid- : a case series." eur heart j cardiovasc pharmacother. doi: . /ehjcvp/pvaa . chen, z., r. zhao, m. zhao, x. liang, d. bhattarai, r. dhiman, s. shetty, s. idell, and h. l. ji. . "regulation of epithelial sodium channels in urokinase plasminogen activator deficiency." am j physiol lung cell mol physiol no. ( ):l - . doi: . /ajplung. . . christie, d. b., rd, h. m. nemec, a. m. scott, j. t. buchanan, c. m. franklin, a. ahmed, m. s. khan, c. w. callender, e. a. james, a. b. christie, and d. w. ashley. . "early outcomes with utilization of tissue plasminogen activator in covid- -associated respiratory distress: a series of five cases." j trauma acute care surg no. ( ): - . doi: . /ta. . claas, e. c., a. d. osterhaus, r. 
van beek, j. c. de jong, g. f. rimmelzwaan, d. a. senne, s. krauss, k. f. shortridge, and r. g. webster. . "human influenza a h n virus related to a highly pathogenic avian influenza virus." lancet no. ( ): - . doi: . /s - ( ) - . coutard, b., c. valle, x. de lamballerie, b. canard, n. g. seidah, and e. decroly. . "the spike glycoprotein of the new coronavirus -ncov contains a furin-like cleavage site absent in cov of the same clade." antiviral res no. : . doi: . /j.antiviral. . . d'alonzo, d., m. de fenza, and v. pavone. . "covid- and pneumonia: a role for the upa/upar system." drug discov today no. ( ): - . doi: . /j.drudis. . . . doi, k., m. ikeda, n. hayase, k. moriya, and n. morimura. . "nafamostat mesylate treatment in combination with favipiravir for patients critically ill with covid- : a case series." crit care no. ( ): . doi: . /s - - -z. gentzsch, m., and b. c. rossier. . "a pathophysiological model for covid- : critical importance of transepithelial sodium transport upon airway infection." function (oxf) no. ( ):zqaa . doi: . /function/zqaa . hanukoglu, i., and a. hanukoglu. . "epithelial sodium channel (enac) family: phylogeny, structure-function, tissue distribution, and associated inherited diseases." gene no. ( ): - . doi: . /j.gene. . . . hou, y., y. cui, z. zhou, h. liu, h. zhang, y. ding, h. nie, and h. l. ji. a. "upregulation of the wnk signaling pathway inhibits epithelial sodium channels of mouse tracheal epithelial cells after influenza a infection." front pharmacol no. : . doi: . /fphar. . . hou, yapeng, yong cui, zhiyu zhou, hongfei liu, honglei zhang, yan ding, hongguang nie, and hong-long ji. b. "upregulation of the wnk signaling pathway inhibits epithelial sodium channels of mouse tracheal epithelial cells after influenza a infection." frontiers in pharmacology no. : . doi: . /fphar. . . huggins, d. j. . "structural analysis of experimental drugs binding to the sars-cov- target tmprss ." j mol graph model no. : . doi: . /j.jmgm. . . 
jaimes, j. a., n. m. andre, j. s. chappie, j. k. millet, and g. r. whittaker. . "phylogenetic analysis and structural modeling of sars-cov- spike protein reveals an evolutionary distinct and proteolytically sensitive activation loop." j mol biol no. ( ): - . doi: . /j.jmb. . . . ji, h. l., x. f. su, s. kedar, j. li, p. barbry, p. r. smith, s. matalon, and d. j. benos. . "delta-subunit confers novel biophysical features to alpha beta gamma-human epithelial sodium channel (enac) via a physical interaction." j biol chem no. ( ): - . doi: m [pii] . /jbc.m . ji, h. l., r. zhao, s. matalon, and m. a. matthay. . "elevated plasmin(ogen) as a common risk factor for covid- susceptibility." physiol rev no. ( ): - . doi: . /physrev. . . kam, y. w., y. okumura, h. kido, l. f. ng, r. bruzzone, and r. altmeyer. a. "cleavage of the sars coronavirus spike glycoprotein by airway proteases enhances virus entry into human bronchial epithelial cells in vitro." plos one no. ( ):e . doi: . /journal.pone. . kam, yiu-wing, yuushi okumura, hiroshi kido, lisa f. p. ng, roberto bruzzone, and ralf altmeyer. b. "cleavage of the sars coronavirus spike glycoprotein by airway proteases enhances virus entry into human bronchial epithelial cells in vitro." plos one no. ( ):e -e . doi: . /journal.pone. . li, t., and q. zheng. . "sars-cov- spike produced in insect cells elicits high neutralization titres in non-human primates." no. ( ): - . doi: . / . . . love, m. i., w. huber, and s. anders. . "moderated estimation of fold change and dispersion for rna-seq data with deseq ." genome biol no. ( ): . doi: . /s - - - . ly, a., c. alessandri, e. skripkina, a. meffert, s. clariot, q. de roux, o. langeron, and n. mongardon. .
"rescue fibrinolysis in suspected massive pulmonary embolism during sars-cov- pandemic." resuscitation no. : - . doi: . /j.resuscitation. . . . matalon, s., r. bartoszewski, and j. f. collawn. . "role of epithelial sodium channels in the regulation of lung fluid homeostasis." am j physiol lung cell mol physiol no. ( ):l - . doi: . /ajplung. . . muhanna, d., s. r. arnipalli, s. b. kumar, and o. ziouzenkova. . "osmotic adaptation by na(+)-dependent transporters and ace : correlation with hemostatic crisis in covid- ." no. ( ). doi: . /biomedicines . nie, h. g., t. tucker, x. f. su, t. na, j. b. peng, p. r. smith, s. idell, and h. l. ji. . "expression and regulation of epithelial na+ channels by nucleotides in pleural mesothelial cells." am j respir cell mol biol no. ( ): - . papamichalis, p., a. papadogoulas, p. katsiafylloudis, a. l. skoura, m. papamichalis, e. neou, d. papadopoulos, s. karagiannis, t. zafeiridis, d. babalis, and a. komnos. . "combination of thrombolytic and immunosuppressive therapy for coronavirus disease : a case report." int j infect dis no. : - . doi: . /j.ijid. . . . pollard, c. a., m. p. morran, and a. l. nestor-kalinoski. . "the covid- pandemic: a global health crisis." physiol genomics. doi: . /physiolgenomics. . . poor, h. d., c. e. ventetuolo, t. tolbert, g. chun, g. serrao, a. zeidman, n. s. dangayach, j. olin, r. kohli-seth, and c. a. powell. . "covid- critical illness pathophysiology driven by diffuse pulmonary thrombi and pulmonary endothelial dysfunction responsive to thrombolysis." clin transl med no. ( ). doi: . /ctm . . roberts, k. a., l. colley, t. a. agbaedeng, g. m. ellison-hughes, and m. d. ross. . "vascular manifestations of covid- - thromboembolism and microvascular dysfunction." front cardiovasc med no. : . doi: . /fcvm. . . sheng, s., m. d. carattino, j. b. bruns, r. p. hughey, and t. r. kleyman. . "furin cleavage activates the epithelial na+ channel by relieving na+ self-inhibition." am j physiol renal physiol no. 
( ):f - . doi: . /ajprenal. . . sidarta-oliveira, d., c. p. jara, a. j. ferruzzi, m. s. skaf, w. h. velander, e. p. araujo, and l. a. velloso. . "sars-cov- receptor is co-expressed with elements of the kinin-kallikrein, renin-angiotensin and coagulation systems in alveolar cells." sci rep no. ( ): . doi: . /s - - - . smith, j. c., e. l. sausville, v. girish, m. l. yuan, a. vasudevan, k. m. john, and j. m. sheltzer. . "cigarette smoke exposure and inflammatory signaling increase the expression of the sars-cov- receptor ace in the respiratory tract." dev cell no. ( ): - .e . doi: . /j.devcel. . . . stuart, t., a. butler, p. hoffman, c. hafemeister, e. papalexi, w. m. mauck, rd, y. hao, m. stoeckius, p. smibert, and r. satija. . "comprehensive integration of single-cell data." cell no. ( ): - e . doi: . /j.cell. . . . sungnak, w., n. huang, c. becavin, m. berg, r. queen, m. litvinukova, c. talavera-lopez, h. maatz, d. reichart, f. sampaziotis, k. b. worlock, m. yoshida, j. l. barnes, and h. c. a. lung biological network. . "sars-cov- entry factors are highly expressed in nasal epithelial cells together with innate immune genes." nat med no. ( ): - . doi: . /s - - - . thibodeau, p. h., and m. b. butterworth. . "proteases, cystic fibrosis and the epithelial sodium channel (enac)." cell tissue res no. ( ): - . doi: . /s - - -z. thierry, a. r. . "anti-protease treatments targeting plasmin(ogen) and neutrophil elastase may be beneficial in fighting covid- ." physiol rev no. ( ): - . doi: . /physrev. . . vinayagam, s., and k. sattu. . "sars-cov- and coagulation disorders in different organs." life sci no. : . doi: . /j.lfs. . . wang, i. m., s. stepaniants, y. boie, j. r. mortimer, b. kennedy, m. elliott, s.
hayashi, l. loy, s. coulter, s. cervino, j. harris, m. thornton, r. raubertas, c. roberts, j. c. hogg, m. crackower, g. o'neill, and p. d. paré. . "gene expression profiling in patients with chronic obstructive pulmonary disease and lung cancer." am j respir crit care med no. ( ): - . doi: . /rccm. - oc. wang, j., n. hajizadeh, e. e. moore, r. c. mcintyre, p. k. moore, l. a. veress, m. b. yaffe, h. b. moore, and c. d. barrett. . "tissue plasminogen activator (tpa) treatment for covid- associated acute respiratory distress syndrome (ards): a case series." no. ( ): - . doi: . /jth. . wang, q., y. qiu, j. y. li, z. j. zhou, c. h. liao, and x. y. ge. . "a unique protease cleavage site predicted in the spike protein of the novel pneumonia coronavirus ( -ncov) potentially related to viral transmissibility." virol sin no. ( ): - . doi: . /s - - - . wrapp, d., and n. wang. . "cryo-em structure of the -ncov spike in the prefusion conformation." no. ( ): - . doi: . /science.abb . xia, s., q. lan, s. su, x. wang, w. xu, z. liu, y. zhu, q. wang, l. lu, and s. jiang. . "the role of furin cleavage site in sars-cov- spike protein-mediated membrane fusion in the presence or absence of trypsin." signal transduct target ther no. ( ): . doi: . /s - - - . xu, j., x. xu, l. jiang, k. dua, p. m. hansbro, and g. liu. . "sars-cov- induces transcriptional signatures in human lung epithelial cells that promote lung fibrosis." no. ( ): . doi: . /s - - - . zhao, r., g. ali, and h. g. nie. . "plasmin improves blood-gas barrier function in oedematous lungs by cleaving epithelial sodium channels." br j pharmacol no. ( ): - . doi: . /bph. . zhou, p., x. l. yang, x. g. wang, b. hu, l. zhang, w. zhang, h. r. si, y. zhu, b. li, c. l. huang, h. d. chen, j. chen, y. luo, h. guo, r. d. jiang, m. q. liu, y. chen, x. r. shen, x. wang, x. s. zheng, k. zhao, q. j. chen, f. deng, l. l. liu, b. yan, f. x. zhan, y. y. wang, g. f. xiao, and z. l. shi. . "a pneumonia outbreak associated with a new coronavirus of probable bat origin." nature no. ( ): - . doi: . /s - - - .
figure . targeted molecular mimicry by sars-cov- of human enac and profiling ace -scnn g-plau/plat co-expression. (a) the cartoon shows the s-protein of sars-cov- (pdb id: x a), highlighted in green. the s /s cleavage site required for the activation of sars-cov- is enlarged and highlighted in red. furin/plasmin cleavage sites of common human viruses are shown in a box. (b) the cartoon represents the human enac protein (pdb id: bqn), highlighted in green. the furin/plasmin cleavage site is enlarged and highlighted in red. the cleavage sites of enac in other species are shown in a box. (c) summary of the single-cell transcriptomic co-expression of ace , scnn g (enac), and plau. the heatmap depicts the mean relative expression of each gene across the identified cell populations. the cell types are ranked by decreasing expression of plau. the box highlights the ace , scnn g (enac), and plau co-expressing cell types in the human respiratory system.
figure . expression of proteases, enac, and ace in the human respiratory system. violin plots showing the expression level of plau, plg, furin, tmprss , and scnn g in the nferx platform.
figure . overall expression levels of proteases, ace , and scnn g in balf bulk cells of covid- patients. (a) bubble plot of proteases, ace , and scnn g in balfs of covid- patients. the size of the dots indicates the proportion of cells in the respective cell type having a greater-than-zero expression of these genes, while the color indicates the mean expression of these genes. (b) the gene expression levels of proteases, ace , and scnn g from healthy controls (n = ), moderate cases (n = ) and severe cases (n = ). ***padj < . (wilcoxon test, padj was calculated using bonferroni correction).
figure . transcription levels of proteases, ace , and scnn g in single epithelial cells of covid- patients. (a) bubble plot of sars-cov- receptor (ace ) and proteases in balf epithelial cells of covid- patients. the size of the dots indicates the proportion of cells in the respective cell type having a greater-than-zero expression of these genes, while the color indicates the mean expression of these genes. (b) the gene expression levels of selected proteases and ace in epithelial cells from healthy controls (n = ), moderate (n = ), and severe cases (n = ). (c) the gene expression levels of selected proteases and ace in different epithelial cell types from healthy controls, moderate and severe cases. ***padj < . (wilcoxon test, padj was calculated using bonferroni correction). aec: alveolar epithelial cells.
figure . changes of proteases, ace , and scnn g in respiratory cell lines after sars-cov- infection. normal human bronchial epithelial (nhbe) and alveolar epithelial (a , calu- ) cells were infected with sars-cov- for h (infected), and control cells received culture medium only (mock). the boxplot shows the changes of proteases (plau, furin, and tmprss), scnn g, and ace in a , calu- , and nhbe after sars-cov- infection. differential genes were calculated by deseq , ***padj < . , *padj < . (wald test, padj was calculated using benjamini-hochberg correction).
figure . sars-cov- infection hijacks the enac proteolytic network. in physiological conditions, urokinase activates plasminogen to plasmin, which cleaves γenac, leading to its activation. after infection by sars-cov- , the plau (urokinase) expression level is significantly upregulated, which may facilitate invasion by other viruses by activating plasminogen to cleave the s protein. the green solid line represents the urokinase, plasminogen, and enac mrna transcripts and activation by plasmin under physiological conditions. the red solid line represents the activation process under infection conditions, while the grey dotted line denotes the repression effects.
impact of gene annotation choice on the quantification of rna-seq data
david chisanga , , , , yang liao , , , and wei shi , , , *
olivia newton-john cancer research institute, heidelberg, victoria, , australia, school of cancer medicine, la trobe university, bundoora, victoria, , australia, walter and eliza hall institute of medical research, parkville, victoria, , australia, department of medical biology, the university of melbourne, parkville, victoria, , australia and school of computing and information systems, the university of melbourne, parkville, victoria, , australia
abstract
rna sequencing is currently the method of choice for genome-wide profiling of gene expression. a popular approach to quantify expression levels of genes from rna-seq data is to map reads to a reference genome and then count mapped reads to each gene. gene annotation data, which include chromosomal coordinates of exons for tens of thousands of genes, are required for this quantification process. there are several major sources of gene annotations that can be used for quantification, such as ensembl and refseq databases. however, there is very little understanding of the effect that the choice of annotation has on the accuracy of gene expression quantification in an rna-seq analysis. in this paper, we present results from our comparison of ensembl and refseq human annotations on their impact on gene expression quantification using a benchmark rna-seq dataset generated by the sequencing quality control (seqc) consortium. we show that the use of refseq gene annotation models led to better quantification accuracy, based on the correlation with ground truths including expression data from > real-time pcr validated genes, known titration ratios of gene expression and microarray expression data. we also found that the recent expansion of the refseq annotation has led to a decrease in its annotation accuracy.
finally, we demonstrated that the rna-seq quantification differences observed between different annotations were not affected by the use of different normalization methods.
*to whom correspondence should be addressed. tel: + ; fax: + ; email: wei.shi@onjcri.org.au
.cc-by-nc-nd . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . /
introduction
gene expression profiling using rna sequencing (rna-seq) is a core activity in molecular biology. comprehensive gene expression analysis in various settings is important for generating hypotheses for ongoing research, investigating drug-effects in biological or clinical settings and as a diagnostic tool. in this paper, we focus on a popular approach to gene-level quantification from rna-seq data, which involves mapping reads to a reference genome and then counting the mapped reads associated with each gene [ , , , , ]. the process of counting mapped reads to genes requires a database of known genes. a gene is only quantified if it or its components have genomic coordinates already defined with respect to the genome sequence in a process called annotation. for each genome annotation model, a different set of annotation techniques and information sources are used and as such, these annotations vary in terms of comprehensiveness and accuracy of annotated genomic features. annotation techniques often include computer-based predictions and/or evidence-based techniques such as manual curation [ , ].
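the dependence of read counts on the annotation can be illustrated with a minimal python sketch. it is a toy model rather than the pipeline used in this study: the gene names, exon coordinates, and reads are hypothetical, and the counting rule (a read is assigned to a gene if it overlaps any of its exons, and reads that hit more than one gene are discarded) is a simplifying assumption.

```python
from collections import Counter


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def count_reads(reads, annotation):
    """Count mapped reads per gene; reads hitting zero or several genes are skipped."""
    counts = Counter({gene: 0 for gene in annotation})
    for chrom, start, end in reads:
        hits = {
            gene
            for gene, exons in annotation.items()
            for ex_chrom, ex_start, ex_end in exons
            if ex_chrom == chrom and overlaps(start, end, ex_start, ex_end)
        }
        if len(hits) == 1:  # unambiguous assignment only
            counts[hits.pop()] += 1
    return dict(counts)


# Two hypothetical annotations of the same region: gene -> list of (chrom, start, end) exons.
annotation_a = {
    "gene_a": [("chr1", 100, 200), ("chr1", 300, 400)],
    "gene_b": [("chr1", 1000, 1200)],
}
annotation_b = {
    "gene_a": [("chr1", 100, 200)],      # second exon missing
    "gene_b": [("chr1", 1000, 1200)],
    "gene_c": [("chr1", 1150, 1300)],    # extra gene overlapping gene_b
}

# Hypothetical mapped reads as (chrom, start, end).
reads = [
    ("chr1", 150, 199),
    ("chr1", 350, 399),
    ("chr1", 1010, 1059),
    ("chr1", 1160, 1209),
    ("chr1", 5000, 5049),
]

counts_a = count_reads(reads, annotation_a)
counts_b = count_reads(reads, annotation_b)
```

with the same mapped reads, `counts_a` assigns two reads each to `gene_a` and `gene_b`, whereas `counts_b` assigns only one to each, because one exon is absent and one read becomes ambiguous between `gene_b` and `gene_c`.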
computer-based predictions result in more complex gene models that have a higher proportion of predicted genomic features, while evidence-based gene models are simpler, with fewer genes and isoforms. common annotation models for human and mouse genomes include ensembl [ ], refseq [ ], gencode [ ] and ucsc [ ] annotations. annotations are, therefore, an important component in an rna-seq analysis as the results are dependent on what is known in the annotation database. despite the importance of gene annotations in rna-seq data analysis, very little research has been conducted to examine how differences in annotations affect gene expression quantification, which is crucial for downstream analyses such as discovery of differentially expressed genes and identification of perturbed pathways. previous studies compared the effect of human genome annotations from popular databases including ensembl, gencode and refseq on various aspects of rna-seq analysis, and they showed that the choice of annotations had an impact on gene-level quantification in the rna-seq analysis [ , ]. however, these studies are out of date as they were based on old annotations, and they also lacked a reliable ground truth for assessing the impact of annotation. major annotation databases have undergone significant expansions over the years, thanks to the wide application of sequencing technologies and the massive amount of sequencing data that have been generated across the world. however, it is unclear whether the quality of gene annotations has been successfully maintained. a recent study suggested that gene annotations have become less accurate and lagging during this expansion [ ]. this can be attributed to the errors from sequencing experiments, sequence analysis
or automation in the annotation process. it is important to systematically assess the accuracy of the new gene annotations generated in recent years to ensure the popular annotation databases can continue to be utilized by the community for rna-seq analysis. furthermore, the use of different annotations in different studies makes it difficult for researchers to reproduce the findings from such studies. for example, large consortia such as the european molecular biology laboratory (embl) use ensembl in their studies while the national center for biotechnology information (ncbi) tends to use refseq. since this can significantly affect gene expression data, there is a need to develop a comprehensive understanding of how these differences in annotations affect gene-level expression quantification. in this study, we compared three human gene annotations, namely a recent ensembl annotation (released in april ), a recent refseq annotation (released in august ) and an old refseq annotation (released in april ), to understand their impact on gene-level expression quantification in an rna-seq data analysis pipeline. although the old refseq annotation is no longer available at the ncbi refseq database, it has been included as part of rsubread, a popular rna-seq quantification toolkit, for quantifying human rna-seq data. we used a benchmark rna-seq dataset generated by the sequencing quality control (seqc/maqc iii) consortium for this evaluation. we show that the use of refseq gene annotations led to better quantification accuracy than the use of the ensembl annotation, based on the correlation with ground truths including expression data from > real-time pcr validated genes, known genome-wide titration ratios of gene expression and microarray gene expression data.
we also show that the older refseq annotation yielded higher quantification accuracy than the recent refseq annotation in our evaluations, suggesting that the recent expansion and changes made to the refseq annotation have led to a decline in annotation accuracy, resulting in less accurate quantification results. furthermore, we investigated whether any normalization method can mitigate the differences in quantification results caused by the annotation differences. our results show that the quantification differences remained almost the same no matter how the rna-seq data were normalized.

materials and methods

seqc/maqc data

the rna-seq data used for evaluation in this study are a benchmark dataset generated by the sequencing quality control (seqc) project [ ], the third stage of the microarray quality control (maqc) study [ , ]. the seqc dataset includes the universal human reference rna (uhrr) as sample a and the human brain reference rna (hbrr) as sample b. it also includes two other samples, c and d, which are combinations of a and b mixed in the ratios of : in c and : in d respectively. the samples were sequenced in four replicate paired-end libraries using an illumina hiseq sequencer at the australian genomics research facility (agrf). each library contains ∼ million bp read pairs. a taqman real-time polymerase chain reaction (rt-pcr) dataset with expression values measured for over , genes, which was generated in the maqc-i study [ ], was used to validate the expression of the rna-seq data in this study.
the expression values were measured for both the uhrr and hbrr samples together with their respective combinations. around – taqman rt-pcr genes, which had matching gene identifiers with expressed rna-seq genes from different annotations, were included for assessing the accuracy of rna-seq quantification. in addition, microarray data generated in the maqc-i study with samples a to d hybridized to the illumina human- beadchip microarrays were also used in the assessment. the taqman rt-pcr and illumina microarray datasets are available as part of the bioconductor package ‘seqc’ [ ].

annotations used

three human gene annotations were included in this study: a recent ensembl annotation, a recent refseq annotation and an old refseq annotation. all these annotations were generated based on the human reference genome grch /hg . the ensembl gene annotation used in this study was generated in april . its version number is . it was downloaded from ftp://ftp.ensembl.org/pub/release- / gtf/homo_sapiens/homo_sapiens.grch . .gtf.gz. the recent refseq gene annotation used was released by the ncbi in august . its release number is . and it is part of the refseq release version . it was downloaded from the ncbi ftp site ftp://ftp.ncbi.nlm.nih.gov/refseq/h_ sapiens/annotation/annotation_releases/ . /gcf_ . _grch . p /gcf_ . _grch .p _genomic.gtf.gz. we refer to this refseq annotation as ‘refseq-ncbi’ in this study. the old refseq annotation included in this study was released by the ncbi in april . it was released as part of the patch release of the grch /hg genome build. this annotation has also been included in the popular rna-seq quantification toolkit rsubread [ ] as the default annotation used for quantifying human rna-seq data. the inclusion of this old refseq annotation allowed us to investigate how the annotation changes made recently to refseq affect the quantification result of rna-seq data. the
refseq annotation in rsubread is slightly different from the original one in that the overlapping exons from the same gene were collapsed to form a single continuous exon for the gene in the rsubread annotation. however, this difference will not change the gene-level rna-seq quantification result because the set of exonic bases belonging to each gene is the same between the original annotation and the rsubread annotation. as this old refseq annotation is no longer available for download at the ncbi ftp site, we instead used the rsubread annotation in this study and we denote this annotation as ‘refseq-rsubread’. when matching genes from different annotations, we converted the gene identifiers using the bioconductor package ‘org.hs.eg.db’ [ ] and then compared them to find common genes between annotations.

mapping, quantification and normalization of rna-seq data

analysis of the rna-seq data was performed using the bioconductor r packages rsubread and limma [ , , ]. the human reference genome (grch ) from gencode (version downloaded from ftp://ftp.ebi.ac.uk/pub/databases/gencode/gencode_human/ release_ /grch .primary_assembly.genome.fa.gz) was indexed using the buildindex function in rsubread v . . [ ]. sequencing reads were then mapped to the reference genome using the align function in rsubread [ , ]. during the alignment, the ensembl, refseq-ncbi and refseq-rsubread annotations were also included as an extra parameter to improve alignment. gene-level read counts were obtained with featurecounts [ , ], a read count summarization function within the rsubread package.
the ensembl, refseq-ncbi and refseq-rsubread annotations were provided to featurecounts to generate read counts for the genes included in each annotation. the gene-level read counts were transformed using the voom function in limma [ , ] and then normalized using the library size [ ], quantile [ ] and trimmed mean of m-values (tmm) [ ] methods, respectively, prior to further analysis. library size normalization was performed by providing raw read counts to voom and running voom with the ‘normalize.method’ parameter set to ‘none’. quantile normalization was performed by providing raw read counts to voom and running voom with the ‘normalize.method’ parameter set to ‘quantile’. for tmm normalization, we first calculated the tmm normalization factor for each library using the calcnormfactors function in edger [ ]. we then provided the raw read counts and the tmm normalization factors to voom and ran it with the ‘normalize.method’ parameter set to ‘none’. the log cpm (log counts per million) values produced by the voom function for each gene in each library were converted to log fpkm (log fragments per kilo exonic bases per million mapped fragments) expression values for further analysis.
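the conversion from log cpm to log fpkm described above amounts to subtracting the log of the gene length in kilobases. the following python sketch illustrates this; it is not the authors' r code. it assumes base-2 logarithms (as produced by voom) and uses the effective gene length in bases; the function name `log2_cpm_to_log2_fpkm` is hypothetical.

```python
import math


def log2_cpm_to_log2_fpkm(log2_cpm: float, effective_length_bases: int) -> float:
    """Convert a log2-CPM value to log2-FPKM for one gene.

    FPKM = CPM / (gene length in kilobases), so on the log2 scale the
    length term is subtracted. Base 2 is an assumption of this sketch.
    """
    if effective_length_bases <= 0:
        raise ValueError("effective length must be a positive number of bases")
    return log2_cpm - math.log2(effective_length_bases / 1000.0)
```

under this definition a gene of exactly 1,000 exonic bases has identical log cpm and log fpkm values, while shorter genes receive a higher log fpkm than log cpm.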
titration monotonicity

the rna-seq data from the seqc project have titration monotonicity built into them, such that a gene is considered to preserve titration monotonicity if the expression of the gene follows a ≥ c ≥ d ≥ b when its expression in sample a is greater than or equal to that in sample b, or follows a ≤ c ≤ d ≤ b when its expression in sample a is less than or equal to that in sample b. to test if the titration monotonicity is preserved, equation ( ) was used to compute the expected log fold-change for a gene in the comparison of c vs d given the log fold-change between a vs b.

\[ e = \log_2\left(\frac{3 \times 2^{x} + 1}{2^{x} + 3}\right) \]

where e is the expected log fold-change for c vs d and x is the log fold-change for a vs b. expression levels of genes in the replicates of the same sample were averaged before fold change of gene expression was calculated between samples.

validation

gene expression data generated using taqman rt-pcr and illumina’s beadchip microarray were used to validate the gene-level quantification results from the rna-seq analysis. pearson correlation coefficients were computed to assess the concordance between the rna-seq quantification data obtained from using different annotations and the gene expression data obtained from the rt-pcr and microarray experiments. the genome-wide built-in truth of titration monotonicity of gene expression in the rna-seq data was also utilized to evaluate the quantification accuracy of rna-seq data generated from using different annotations.

access to data and code

the data and analysis code used in this study can be accessed at the following url: https://github.com/shilab-bioinformatics/geneannotation.
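returning to the titration monotonicity check in the methods above, the expected c vs d fold-change and the monotonicity condition can be sketched in python as follows. this is an illustrative sketch rather than the authors' implementation. it assumes base-2 logarithms and the published seqc mixing design (sample c is 3 parts a to 1 part b, sample d is 1 part a to 3 parts b), since the numerals are missing from this text; both function names are hypothetical.

```python
import math


def expected_log2_fc_c_vs_d(x: float) -> float:
    """Expected log2 fold-change of C vs D given x = log2(A / B).

    Assumes C = 0.75*A + 0.25*B and D = 0.25*A + 0.75*B, so that
    C / D = (3 * 2**x + 1) / (2**x + 3).
    """
    a_over_b = 2.0 ** x
    return math.log2((3.0 * a_over_b + 1.0) / (a_over_b + 3.0))


def preserves_titration_monotonicity(a: float, c: float, d: float, b: float) -> bool:
    """True if expression follows A >= C >= D >= B or A <= C <= D <= B."""
    if a >= b:
        return a >= c >= d >= b
    return a <= c <= d <= b
```

a gene with no difference between a and b (x = 0) is expected to show no difference between c and d, and the expected c vs d fold-change is bounded by log2(3) in absolute value however large the a vs b difference becomes.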
results

discrepancy between different gene annotations

the ensembl and ncbi refseq annotations are among the most widely used gene annotations for rna-seq gene expression quantification in the field. in this study, we downloaded recent ensembl and refseq annotations and also used an older version of the refseq annotation to assess the impact of gene annotation choice on the accuracy of rna-seq expression quantification. the inclusion of an older refseq annotation allowed us to investigate the accuracy of new annotation data generated in recent years, during which next-gen sequencing data have been used as a new data source for genome-wide annotation generation. the ensembl annotation used in this study was released in april and it has a version number . the recent refseq annotation included in this study was released in august . we refer to this annotation as ‘refseq-ncbi’ in this study. the older refseq annotation was released in april , and it has also been included as part of the popular rna-seq quantification toolkit ‘rsubread’ for quantifying human rna-seq data. as this annotation is no longer available in the ncbi refseq database, we instead used the rsubread refseq annotation in our evaluations and we denote this annotation as ‘refseq-rsubread’. as rna-seq gene-level expression quantification is typically performed for genes that contain exons [ , , ], in this study we only focused on the genes that have annotated exons in each annotation. figure a shows that, as expected, the ensembl annotation contains many more exon-containing genes than the two refseq annotations. the ensembl annotation is known to contain a large number of computationally predicted genes whereas refseq genes were mainly annotated based on biological evidence.
however, it is worth noting that the refseq-ncbi annotation still has > , genes that are not included in the ensembl annotation. nearly % of the ensembl genes were found to be absent from both of the two refseq annotations. in total, , common genes were found between the three annotations. most of the genes included in the refseq-rsubread annotation can be found in the refseq-ncbi or ensembl annotations. we then examined the effective gene lengths in each annotation. the effective length of a gene is the total number of unique bases included in all the exons belonging to the gene. figure b shows the distributions of effective lengths of genes in the three annotations. around half of the ensembl genes have an effective length less than , bases, whereas in the two refseq annotations only ∼ % of the genes are shorter than , bases in length. the median effective gene lengths in refseq-ncbi and refseq-rsubread are ∼ , bases, which is much larger than that in ensembl (∼ , bases). although the ensembl annotation contains many more genes than the two refseq annotations, it also contains a much higher percentage of short genes.

figure : concordance and differences between gene annotations. (a) venn diagram showing genes that are common or unique in the ensembl, refseq-ncbi and refseq-rsubread annotations. (b) boxplots showing the distribution of effective gene lengths (log scale) in each annotation. (c) boxplots showing the differences in effective lengths of common genes between each pair of annotations. values shown in the plots are the ratio of effective lengths of the same gene from two different annotations (log scale). (d) the size of transcriptome calculated from each annotation. shown are the sum of effective gene lengths in each annotation.

we further performed gene-wise comparison of effective gene lengths using common genes between each pair of annotations. although every annotation contains both longer and shorter genes in comparison to the corresponding genes from other annotations, the ensembl genes were found to have a larger effective length than genes from the two refseq annotations overall (figure c). this is in contrast to the higher proportion of short genes observed in the ensembl annotation (figure b), which indicates that the ensembl genes that are also present in refseq-ncbi or refseq-rsubread annotations tend to be longer than those ensembl genes that can only be found in the ensembl annotation. although at least half of the genes were found to have a less than -fold ( -fold at log scale) length difference between annotations (figure c), the length differences could exceed -fold ( -fold at log scale). the refseq-ncbi genes seem to
be slightly longer than the corresponding refseq-rsubread genes overall. ensembl and refseq-rsubread were found to be the least concordant annotations among the three annotations being compared. lastly, we compared the size of the transcriptome represented by each annotation. the transcriptome size of an annotation is computed as the sum of effective gene lengths from all the genes included in that annotation, which also represents the total number of exonic bases that were annotated in an annotation. figure d shows that the ensembl annotation has a larger transcriptome size than both refseq-ncbi and refseq-rsubread annotations. this is not surprising because the ensembl annotation contains more genes and also ensembl genes common to other annotations are longer in general. refseq-rsubread has a much smaller transcriptome size than refseq-ncbi, indicating a significant expansion of the refseq-ncbi annotation in the past five years. however, it is important to note that the refseq-rsubread annotation is not a subset of the refseq-ncbi annotation, as demonstrated by the existence of refseq-rsubread genes that are absent in the refseq-ncbi annotation, the difference in gene length distribution and the length differences of the same genes between the two annotations (figure a-c). this indicates that not only were new genes added to the refseq annotation during the expansion, but existing genes were also modified. it is against this background that we sought to understand how these differences in the annotations affect the overall gene-level quantification results.

fragments counted to annotated genes

we used a benchmark rna-seq dataset generated by the seqc project [ ] to evaluate the impact of gene annotation on the accuracy of rna-seq expression quantification.
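the effective gene length and transcriptome size used in the annotation comparison above can be illustrated with a short python sketch. it is a minimal illustration under stated assumptions, not the authors' code: exons are given as 1-based inclusive (start, end) pairs on a single chromosome, and the names `effective_length` and `transcriptome_size` are hypothetical.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # 1-based inclusive (start, end), same chromosome


def effective_length(exons: List[Exon]) -> int:
    """Number of unique bases covered by a gene's exons (overlaps counted once)."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(exons):
        if current_end is None or start > current_end + 1:
            # Close the previous merged block before starting a new one.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent exon: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def transcriptome_size(annotation: Dict[str, List[Exon]]) -> int:
    """Sum of effective gene lengths over all genes in an annotation."""
    return sum(effective_length(exons) for exons in annotation.values())
```

because overlapping exons are merged before counting, collapsing the overlapping exons of a gene into a single continuous exon leaves its effective length unchanged.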
this dataset contains paired-end bp read data generated for four samples including a universal human reference rna sample (sample a), a human brain reference rna sample (sample b), a mixture sample with %a and %b (sample c) and a mixture sample with %a and %b (sample d). we mapped the rna-seq reads to the human genome grch /hg using the subread aligner [ , ], and then counted the number of mapped fragments (read pairs) to each gene in each annotation using the featurecounts program [ , ]. featurecounts assigns a mapped fragment to a gene if the fragment overlaps any of the exons in the gene. figure shows that across all the libraries, the refseq-rsubread annotation consistently has substantially more fragments assigned to it than the ensembl and refseq-ncbi annotations. this is surprising because refseq-rsubread contains far fewer annotated genes and also has a significantly smaller transcriptome, compared to ensembl and refseq-ncbi (figure a,d).

figure : barplots showing the percentage of fragments successfully assigned to genes in each annotation, out of all the fragments included in each library. the horizontal axis represents the sixteen seqc rna-seq libraries generated from the four samples ‘a’, ‘b’, ‘c’ and ‘d’. each sample has four replicates that are numbered from to .
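the assignment rule stated above (a fragment is assigned to a gene if it overlaps any exon of that gene) can be sketched in python as follows. this is a simplified illustration, not the featurecounts implementation: it ignores chromosomes, strand and minimum-overlap settings, and it assumes that a fragment overlapping more than one gene is left unassigned, which is the default featurecounts behaviour. the function names are hypothetical; the status labels mirror the categories reported in the featurecounts summary.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # 1-based inclusive (start, end)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def assign_fragment(
    fragment_blocks: List[Interval],
    gene_exons: Dict[str, List[Interval]],
) -> str:
    """Return the gene a fragment is assigned to, or an 'Unassigned_*' status."""
    hits = sorted(
        gene
        for gene, exons in gene_exons.items()
        if any(overlaps(block, exon) for block in fragment_blocks for exon in exons)
    )
    if not hits:
        return "Unassigned_NoFeatures"  # overlaps no exon of any gene
    if len(hits) > 1:
        return "Unassigned_Ambiguity"  # overlaps exons of several genes
    return hits[0]
```

under this rule an annotation with many mutually overlapping genes can lose fragments to ambiguity even though it annotates more of the genome.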
we then performed a detailed investigation into the mapping and counting results to find out what enabled refseq-rsubread to achieve a higher percentage of successfully assigned fragments. although gene annotations were utilized in mapping reads to the human reference genome, the use of different annotations was not found to affect the number of successfully aligned fragments for each library (supplementary figure s ). we found that when assigning fragments to genes using the ensembl or refseq-ncbi annotation, more fragments could not be assigned because they did not overlap any genes (i.e. they failed to overlap any exons included in any genes), even though there are more genes included in these annotations compared to the refseq-rsubread annotation (supplementary figure s ). this is particularly the case for the fragment assignment in the human brain reference samples. we also found that the use of ensembl and refseq-ncbi annotations led to more fragments being unassigned due to assignment ambiguity, i.e. a fragment overlaps more than one gene (supplementary figure s ). this is likely because there are more genes that overlap with each other (i.e. exons from different genes overlap with each other) in the ensembl and refseq-ncbi annotations compared to the refseq-rsubread annotation. our investigation revealed that less gene overlap in the refseq-rsubread annotation and better compatibility of this annotation with the mapped fragments have led to more fragments being successfully counted for each library in this dataset. given that both the universal human reference and human brain reference samples used in this study are known to contain a very high number of expressed genes and the rna-seq data generated from these samples are expected to cover most of the human transcriptome, our analysis suggests that the refseq-rsubread annotation is likely to contain more transcribed regions in the genome than the other two annotations in general.

figure : boxplots comparing the intensity range of gene expression between the three annotations. all the genes from each annotation were included in the plots. raw read counts of genes were transformed to log fpkm values. a prior count of . was added to raw counts to avoid log-transformation of zero.

intensity range of gene expression

we examined if the gene annotation choice has an impact on the range of gene expression levels in the rna-seq data. raw gene counts of the seqc data were converted to log fpkm (log fragments per kilo exonic bases per million mapped fragments) values for all the genes included in each annotation. a prior count of . was added to the raw counts to avoid log-transformation of zero. figure shows that the two refseq annotations exhibit a desirable larger intensity range of gene expression than the ensembl annotation, as shown by the larger boxes in the boxplots. it is surprising to see that the ensembl genes have the smallest intensity ranges in all the libraries, given that the ensembl annotation contains the largest number of genes among the three annotations being examined. in addition to the large intensity range, the refseq-rsubread genes were also found to have a markedly higher median expression level than genes in the refseq-ncbi and ensembl annotations.
gene annotation discrepancy after expression filtering

as it is common practice in an rna-seq data analysis to filter out genes that are lowly expressed or not expressed at all [ ], we also set out to assess the differences between alternative annotations after excluding such genes. we excluded those genes that failed to have at least . cpm (counts per million) in at least four libraries (each sample has four replicates) in the analysis of the seqc dataset. the expression-filtered data were also used for comparing the accuracy of quantification from using alternative annotations presented in the following sections. the bar plot in figure a shows that ensembl has significantly more genes (and also a higher proportion of genes) filtered out due to low or no expression, compared to refseq-ncbi and refseq-rsubread. after expression filtering, the total numbers of remaining genes from the three annotations became more similar to each other. in total, , genes were found to be common between the three annotations after filtering, accounting for %, % and % of the filtered genes in the ensembl, refseq-ncbi and refseq-rsubread annotations respectively (figure b). almost all the filtered genes in the refseq-rsubread annotation can be found in the other two annotations. after expression filtering, the median effective gene length has increased to ∼ , bases for all annotations (figure c), meaning that a higher proportion of short genes were removed due to low expression in every annotation.
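the expression filter described at the start of this section can be sketched in python as follows. this is an illustrative sketch, not the authors' r code. the cpm threshold is missing from this text, so `min_cpm` is a required, hypothetical parameter with no default; cpm is computed here from raw counts and total library sizes without a prior count, which is a simplifying assumption.

```python
from typing import List


def counts_per_million(counts: List[int], library_sizes: List[int]) -> List[float]:
    """CPM of one gene in each library: 1e6 * count / library size."""
    if len(counts) != len(library_sizes):
        raise ValueError("counts and library_sizes must have the same length")
    return [1e6 * count / size for count, size in zip(counts, library_sizes)]


def keep_gene(
    counts: List[int],
    library_sizes: List[int],
    min_cpm: float,
    min_libraries: int = 4,
) -> bool:
    """Keep a gene if its CPM reaches min_cpm in at least min_libraries libraries."""
    cpm = counts_per_million(counts, library_sizes)
    return sum(value >= min_cpm for value in cpm) >= min_libraries
```

requiring four libraries matches the number of replicates per sample, so a gene expressed in only one of the four samples can still pass the filter.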
the median effective length of ensembl genes now became comparable to, or slightly higher than, those in the two refseq annotations, indicating that the ensembl annotation contained a higher proportion of lowly expressed short genes than the two refseq annotations. when comparing the effective lengths of genes common to all three annotations after filtering, the ensembl genes were found to have the largest median effective length and the refseq-rsubread genes the smallest median effective length (figure d). this is not surprising because the ensembl annotation is known to be more aggressive than the refseq annotations and the refseq-rsubread annotation is an old annotation that has not been updated in the last five years. the expression filtering did not seem to affect the distribution of differences of effective gene lengths between each pair of annotations (using genes common to each pair of annotations), with ensembl and refseq-rsubread remaining the least concordant annotations (figure e and figure c). restricting the comparison to genes common to all three annotations after filtering gave distributions of gene length differences between each pair of annotations similar to those obtained using genes common to each pair of annotations (figure f). similar to before filtering, the gene-wise length comparison performed after filtering also showed that overall the ensembl genes had the largest gene lengths and the refseq-rsubread genes had the shortest gene lengths.

figure : concordance and differences between gene annotations after filtering for lowly expressed genes. (a) bar plot showing the differences in the number of genes included in each annotation before and after filtering for lowly expressed genes. (b) venn diagram comparing genes from different annotations after filtering for lowly expressed genes. distributions of effective gene lengths after filtering are shown for all genes in each annotation (c) and for genes that are common between all three annotations (d). distributions of differences of effective gene lengths between annotations after filtering are shown for common genes between each pair of annotations (e) and for genes that are common between all three annotations (f).
comparison of titration monotonicity preservation

to assess the impact of gene annotation choice on the accuracy of rna-seq quantification results, we used three sources of ground truth: the titration monotonicity built into the seqc data, the taqman rt-pcr data and the microarray data generated for the same samples. these were used to evaluate which annotation gives the best correlation between the rna-seq quantification data and the truth. in this section, we compared the ability of ensembl and the two refseq annotations to retain the inbuilt titration monotonicity in the rna-seq dataset. in figure , the reference titration curve depicts the fold change that genes are expected to follow in sample c vs sample d based on the fold change in sample a vs sample b. this is computed using equation ( ) (see materials and methods). we then calculated the mean squared error (mse) between the reference titration monotonicity and the titration monotonicity obtained from each annotation. a smaller mse value means that the generated quantification data is closer to the truth. figure shows that the mse computed for the refseq-rsubread annotation is consistently lower than those computed for the ensembl and refseq-ncbi annotations, regardless of whether filtering was applied or only common genes were included for comparison. refseq-rsubread was also found to yield comparable or lower mse compared to the other two annotations when the data were tmm or quantile normalized (supplementary figures s and s ), in addition to the library-size normalized data shown in figure . these results demonstrated that the use of the refseq-rsubread annotation led to better quantification accuracy for the rna-seq data.
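as a rough illustration of the mse calculation described above, the following python sketch compares observed c vs d log fold changes with a reference curve. equation ( ) is not reproduced in this section, so the reference curve used here is an assumption: it takes the nominal seqc mixing proportions (sample c as 3 parts a to 1 part b, sample d as 1 part a to 3 parts b) and ignores any correction for differences in mrna content between the samples. the function names and parameters are hypothetical and are not taken from the study's code.

```python
import math


def expected_log2_cd(log2_ab, frac_a_in_c=0.75, frac_a_in_d=0.25):
    """Reference log2(C/D) for a gene with a given log2(A/B).

    Assumption: C and D are linear mixtures of A and B with the given
    fractions of A; no mRNA-content correction is applied.
    """
    ratio = 2.0 ** log2_ab  # expression in A relative to B (B scaled to 1)
    c = frac_a_in_c * ratio + (1.0 - frac_a_in_c)
    d = frac_a_in_d * ratio + (1.0 - frac_a_in_d)
    return math.log2(c / d)


def titration_mse(log2_ab_values, log2_cd_values):
    """Mean squared vertical distance between observed and reference log2(C/D)."""
    squared_errors = [
        (observed - expected_log2_cd(log2_ab)) ** 2
        for log2_ab, observed in zip(log2_ab_values, log2_cd_values)
    ]
    return sum(squared_errors) / len(squared_errors)


# Toy input: three genes with made-up log2 fold changes.
toy_log2_ab = [-2.0, 0.0, 2.0]
toy_log2_cd = [expected_log2_cd(x) + 0.1 for x in toy_log2_ab]
toy_mse = titration_mse(toy_log2_ab, toy_log2_cd)
```

in this toy input every gene sits 0.1 above the reference curve, so the mse reflects only that constant offset; with real data, a lower value for one annotation would indicate that its fold changes follow the expected titration more closely.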
figure : titration monotonicity plots. the ability of ensembl, refseq-ncbi and refseq-rsubread to retain the titration monotonicity built into the seqc rna-seq data was measured using the mean squared error (mse) between the reference titration and the actual titration obtained from each annotation. the red curve in each plot represents the reference titration calculated using equation ( ). plots in the top row include all the genes available in each annotation. plots in the middle row include those genes that remained after filtering for lowly expressed genes in each annotation. plots in the bottom row include genes that are common between the three annotations after the expression filtering was performed. in each plot, the horizontal axis represents the log fold changes of gene expression between samples a and b and the vertical axis represents the log fold changes of gene expression between samples c and d.

validation against taqman rt-pcr data

the taqman rt-pcr dataset generated in the maqc study [ , ] was used to validate the gene-level quantification results from the rna-seq dataset. this dataset contains measured expression levels for > , genes in the four seqc samples. the aim was to understand how well ensembl and refseq annotated gene expression correlated with the taqman rt-pcr data.

figure : validation of rna-seq against taqman rt-pcr dataset. shown are pearson correlation coefficients computed from comparing rna-seq data against rt-pcr data, using the rt-pcr genes matched with each individual annotation (left column, all genes after filtering) or matched with all three annotations (right column, common genes after filtering). the rows represent the different rna-seq normalization methods used (library size, quantile and tmm normalization). each panel shows the correlation coefficient for samples a, b, c and d, with one series each for ensembl, refseq-ncbi and refseq-rsubread. lowly expressed genes in the rna-seq data were filtered out before the correlation analysis was performed.

the rna-seq data generated from each annotation were filtered to remove lowly expressed genes before being compared to the rt-pcr data. numbers of matched genes between the rt-pcr data and the rna-seq data were , and for ensembl, refseq-ncbi and refseq-rsubread, respectively. rt-pcr genes were found to be common to all the three annotations. the raw taqman rt-pcr data were log -transformed before comparing to the filtered rna-seq data. pearson correlation analysis of the rna-seq gene expression (log fpkm values) and rt-pcr gene expression (log values), using the rt-pcr genes matched with each individual annotation, showed that the refseq-rsubread annotation consistently yielded a higher correlation than the ensembl and refseq-ncbi annotations, across all the samples and the three different normalization methods (left panel in figure ).
the ensembl annotation was found to produce the worst correlation in all these comparisons. when using the rt-pcr genes matched with all three annotations for comparison, refseq-rsubread was again found to yield the highest correlation (right panel in figure ). ensembl and refseq-ncbi were found to produce similar correlation coefficients. taken together, results from this evaluation showed that the use of the refseq-rsubread annotation led to a better concordance in gene expression between the rna-seq data and the rt-pcr data, compared to the use of the ensembl and refseq-ncbi annotations.

validation against microarray data

an illumina beadchip microarray dataset, which was generated by the maqc-i project [ ] for the same samples as in the rna-seq data used in this study, was used to further validate the gene-level rna-seq quantification results obtained from different annotations. the microarray dataset was background corrected and normalized using the ‘neqc’ function in the limma package [ , ]. microarray genes were then matched to the rna-seq genes included in the filtered rna-seq data. , , , and , microarray genes were matched with rna-seq genes from the ensembl, refseq-ncbi and refseq-rsubread annotations, respectively. , microarray genes were found to be present in all three annotations. for microarray genes that have more than one probe, a representative probe was selected: the probe with the highest mean expression value across the four samples among all the probes for that gene. a pearson correlation analysis was then performed between microarray data and rna-seq data for each of the three annotations. both rna-seq and microarray data include log expression values of genes.
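the probe selection and correlation steps described above can be sketched in a few lines of python. this is a minimal illustration under stated assumptions, not the analysis code used in the study: the input layout (tuples of gene id, probe id and per-sample log expression values), the function names and the toy values are all hypothetical.

```python
import math
from statistics import mean


def pick_representative_probes(probes):
    """For each gene, keep the probe with the highest mean expression.

    probes: iterable of (gene_id, probe_id, values), where values holds the
    log expression of that probe in each sample.
    Returns {gene_id: (probe_id, values)}.
    """
    best = {}
    for gene_id, probe_id, values in probes:
        probe_mean = mean(values)
        if gene_id not in best or probe_mean > best[gene_id][0]:
            best[gene_id] = (probe_mean, probe_id, values)
    return {gene: (probe, values) for gene, (_, probe, values) in best.items()}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Toy microarray probes: gene g1 has two probes, gene g2 has one.
toy_probes = [
    ("g1", "p1", [1.0, 2.0, 3.0, 4.0]),
    ("g1", "p2", [5.0, 6.0, 7.0, 8.0]),
    ("g2", "p3", [2.0, 2.5, 2.0, 2.5]),
]
representatives = pick_representative_probes(toy_probes)

# Correlate one sample's microarray values with made-up RNA-seq values
# for the same genes, in the same gene order.
genes = sorted(representatives)
microarray_sample_a = [representatives[g][1][0] for g in genes]
rnaseq_sample_a = [4.8, 2.1]
correlation = pearson(microarray_sample_a, rnaseq_sample_a)
```

the correlation is computed per sample across genes, which matches the description of one coefficient per sample and normalization method; matching of gene identifiers between platforms is assumed to have been done beforehand.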
figure shows that the use of the refseq-rsubread annotation consistently yielded the highest correlation between rna-seq and microarray data in all the comparisons, regardless of which rna-seq normalization method was used and of whether all matched genes or only common matched genes were included in the evaluation. on the other hand, the use of the ensembl annotation resulted in the worst correlation between rna-seq data and microarray data in all the comparisons.

figure : validation of rna-seq quantification results against microarray data. shown are pearson correlation coefficients computed from comparing rna-seq data against illumina beadchip microarray data, using the microarray genes matched with each individual annotation (left column, all genes after filtering) or matched with all three annotations (right column, common genes after filtering). rows in the plots represent the different rna-seq normalization methods used (library size, quantile and tmm normalization). each panel shows the correlation coefficient for samples a, b, c and d, with one series each for ensembl, refseq-ncbi and refseq-rsubread. lowly expressed genes in the rna-seq data were filtered out before the correlation analysis was performed. for those microarray genes that include more than one probe, a representative probe was selected and used for this analysis.

discussion

the rna-seq technique is currently routinely used for genome-wide profiling of gene expression in the biomedical research field. the analysis of rna-seq data relies on the accurate annotation of genes so that expression levels of genes can be accurately and reliably quantified. there are several major gene annotation sources that have been widely adopted in the field, such as the ensembl and refseq annotations. the ensembl and refseq annotations have been well maintained and are under continuous development. in particular, new gene information collected from next-generation sequencing technologies, such as rna-seq, has been incorporated into the expansion of these annotations in recent years. however, differences between these annotations have raised concerns over the quality and reproducibility of rna-seq data analyses. there are particular concerns regarding the accuracy of new gene annotations generated from the use of the sequencing technologies, due to known errors in the generation and analysis of the sequencing data.

to address these concerns, in this study we systematically assessed the differences in rna-seq quantification results attributed to the gene annotation discrepancy. the annotations evaluated in this study included recent ensembl and ncbi refseq annotations and also an older version of the refseq annotation. we compared the recent and old refseq annotations to assess the quality of the new annotations that were added when the sequencing technology was utilized at ncbi for curating refseq gene annotations.
although the ensembl annotation contains significantly more genes than both the recent and old refseq annotations, it was also found to have a much higher proportion of short genes. interestingly, we found that a much higher fraction of these short genes in ensembl were filtered out due to low or no expression in the analysis of the seqc rna-seq dataset, compared to the short genes included in the two refseq annotations. the seqc rna-seq data is a widely used benchmark dataset that includes the human brain reference rna and universal human reference rna samples, in which a very large number of genes are expressed, so that the entire human transcriptome is well covered.

the use of the refseq-rsubread annotation (the older version of the refseq annotation used in this study) led to substantially more fragments being successfully counted to genes than the use of the refseq-ncbi (the recent refseq annotation used in this study) or ensembl annotations. a detailed investigation revealed that this was because (a) there is less overlap between genes in the refseq-rsubread annotation, leading to less read assignment ambiguity, and (b) the refseq-rsubread annotation contains more genes that are compatible with mapped fragments, even though the transcriptome represented by this annotation is much smaller than those represented by the refseq-ncbi and ensembl annotations. moreover, the quantification data obtained from using refseq-rsubread exhibited a desirably larger intensity range and a higher median expression level than the quantification data obtained from using the other two annotations.
the evaluation of quantification accuracy using the genome-wide titration monotonicity truth built into the rna-seq data, the taqman rt-pcr data and the microarray data showed that overall the refseq-ncbi annotation yielded better quantification results than the ensembl annotation. this may not be surprising because the ncbi refseq annotation is a traditionally conservative annotation that is known to be highly accurate, as it uses an evidence-based approach to annotate genes. however, we also found that the refseq-rsubread annotation yielded more accurate quantification results than the refseq-ncbi annotation in almost all the comparisons, which is very surprising. we suspect that this might be due to annotation errors arising from the sequencing data recently utilized in the ncbi refseq annotation generation pipeline. it was reported that sequencing data, including rna-seq data and epigenome sequencing data, started to be utilized by ncbi for curating refseq gene annotations in around [ , ]. between march and july , the number of gene transcripts in the vertebrate mammalian organisms included in the refseq database increased significantly from . million to . million (https://www.ncbi.nlm.nih.gov/refseq/statistics/), a more than twofold increase in just around years. the use of sequencing data for annotation generation is likely to be a significant driver of this rapid expansion of the refseq database. it is known that some errors associated with the generation and analysis of sequencing data are difficult to correct, such as sample contamination, sequencing errors, read mapping errors and read assembly errors. when these errors are carried into the annotation process, they can result in incorrect gene annotations being generated and consequently lead to less accurate quantification of the rna-seq data.
conclusion

in conclusion, our findings from this study revealed that the ncbi refseq human gene annotations outperformed the ensembl human gene annotation in the quantification of rna-seq data. however, we also raised concerns over the recent changes made to the refseq database due to the use of sequencing data in the annotation generation process. these changes need to be reviewed and validated so as to ensure the refseq database continues to be a reliable and high-quality gene annotation resource for the research community. similarly, such a review should be conducted for other gene annotation databases as well.

the research findings from this study also have implications for the quantification of rna-seq data generated by the recently emerged single-cell sequencing technologies. as with the quantification of bulk rna-seq data, an accurate gene annotation is also required for quantifying single-cell rna-seq data. it is therefore important to understand if and how the annotation choice impacts the quantification accuracy of single-cell rna-seq data as well.

references

[ ] zhenqiang su, paweł p łabaj, sheng li, jean thierry-mieg, danielle thierry-mieg, wei shi, charles wang, gary p schroth, robert a setterquist, john f thompson, et al. a comprehensive assessment of rna-seq accuracy, reproducibility and information content by the sequencing quality control consortium. nature biotechnology, ( ): , .
[ ] yunshun chen, aaron tl lun, and gordon k smyth.
from reads to genes to pathways: differential expression analysis of rna-seq experiments using rsubread and the edger quasi-likelihood pipeline. f research, : , .
[ ] simon anders, paul t pyl, and wolfgang huber. htseq–a python framework to work with high-throughput sequencing data. bioinformatics, ( ): – , .
[ ] yang liao, gordon k smyth, and wei shi. featurecounts: an efficient general purpose program for assigning sequence reads to genomic features. bioinformatics, ( ): – , .
[ ] yang liao, gordon k smyth, and wei shi. the r package rsubread is easier, faster, cheaper and better for alignment and quantification of rna sequencing reads. nucleic acids research, ( ):e –e , .
[ ] steven l. salzberg. next-generation genome annotation: we still struggle to get it right. genome biology, ( ): , .
[ ] mihaela pertea, alaina shumate, geo pertea, ales varabyou, florian p breitwieser, yu-chi chang, anil k madugundu, akhilesh pandey, and steven l salzberg. chess: a new human gene catalog curated from thousands of large-scale rna sequencing experiments reveals extensive transcriptional noise. genome biology, ( ): , .
[ ] daniel r zerbino, premanand achuthan, wasiu akanni, m ridwan amode, daniel barrell, jyothish bhai, konstantinos billis, carla cummins, astrid gall, carlos garcía girón, et al. ensembl . nucleic acids research, (d ):d –d , .
[ ] nuala a o’leary, mathew w wright, j rodney brister, stacy ciufo, diana haddad, rich mcveigh, bhanu rajput, barbara robbertse, brian smith-white, danso ako-adjei, et al.
reference sequence (refseq) database at ncbi: current status, taxonomic expansion, and functional annotation. nucleic acids research, (d ):d –d , .
[ ] adam frankish, mark diekhans, anne-maud ferreira, rory johnson, irwin jungreis, jane loveland, jonathan m mudge, cristina sisu, james wright, joel armstrong, et al. gencode reference annotation for the human and mouse genomes. nucleic acids research, (d ):d –d , .
[ ] christopher m lee, galt p barber, jonathan casper, hiram clawson, mark diekhans, jairo n gonzalez, angie s hinrichs, brian t lee, luis r nassar, conner c powell, brian j raney, kate r rosenbloom, daniel schmelter, matthew l speir, ann s zweig, david haussler, maximilian haeussler, robert m kuhn, and w j kent. ucsc genome browser enters th year. nucleic acids research, (d ):d –d , .
[ ] po-yen wu, john h phan, and may d wang. assessing the impact of human genome annotation choice on rna-seq expression estimates. bmc bioinformatics, ( ):s , .
[ ] shanrong zhao and baohong zhang. a comprehensive evaluation of ensembl, refseq, and ucsc annotations in the context of rna-seq read mapping and gene quantification. bmc genomics, ( ): , .
[ ] leming shi, gregory campbell, wendell d jones, fabien campagne, zhining wen, stephen j walker, zhenqiang su, tzu-ming chu, federico m goodsaid, lajos pusztai, et al. the microarray quality control (maqc)-ii study of common practices for the development and validation of microarray-based predictive models. nature biotechnology, ( ): – , .
[ ] maqc consortium, leming shi, laura h reid, wendell d jones, richard shippy, janet a warrington, shawn c baker, patrick j collins, francoise de longueville, ernest s kawasaki, et al. the microarray quality control (maqc) project shows inter- and intraplatform reproducibility of gene expression measurements. nature biotechnology, ( ): – , .
[ ] yang liao and wei shi. seqc: rna-seq data generated from seqc (maqc-iii) study, . r package version . . .
http://bioconductor.org/packages/release/data/experiment/html/seqc.html.
[ ] marc carlson. org.hs.eg.db: genome wide annotation for human, . r package version . . . https://www.bioconductor.org/packages/release/data/annotation/html/org.hs.eg.db.html.
[ ] matthew e ritchie, belinda phipson, di wu, yifang hu, charity w law, wei shi, and gordon k smyth. limma powers differential expression analyses for rna-sequencing and microarray studies. nucleic acids research, ( ):e –e , .
[ ] wolfgang huber, vincent j carey, robert gentleman, simon anders, marc carlson, benilton s carvalho, hector corrada bravo, sean davis, laurent gatto, thomas girke, et al. orchestrating high-throughput genomic analysis with bioconductor. nature methods, ( ): , .
[ ] yang liao, gordon k smyth, and wei shi. the subread aligner: fast, accurate and scalable read mapping by seed-and-vote. nucleic acids research, ( ):e –e , .
[ ] charity w law, yunshun chen, wei shi, and gordon k smyth. voom: precision weights unlock linear model analysis tools for rna-seq read counts. genome biology, ( ):r , .
[ ] ali mortazavi, brian a williams, kenneth mccue, lorian schaeffer, and barbara wold. mapping and quantifying mammalian transcriptomes by rna-seq. nat methods, ( ): – , .
[ ] benjamin m bolstad, rafael a irizarry, magnus åstrand, and terence p. speed. a comparison of normalization methods for high density oligonucleotide array data based on variance and bias. bioinformatics, ( ): – , .
[ ] mark d robinson and alicia oshlack. a scaling normalization method for differential expression analysis of rna-seq data.
genome biology, ( ):r , .
[ ] mark d robinson, davis j mccarthy, and gordon k smyth. edger: a bioconductor package for differential expression analysis of digital gene expression data. bioinformatics, ( ): – , .
[ ] wei shi, alicia oshlack, and gordon k smyth. optimizing the noise versus bias trade-off for illumina whole genome expression beadchips. nucleic acids research, ( ):e , .
[ ] kim d pruitt, garth r brown, susan m hiatt, françoise thibaud-nissen, alexander astashyn, olga ermolaeva, catherine m farrell, jennifer hart, melissa j landrum, kelly m mcgarvey, et al. refseq: an update on mammalian reference sequences. nucleic acids research, (database issue):d –d , .

a self-supervised machine learning approach for objective live cell segmentation and analysis

michael c. robitaille, jeff m. byers, joseph a. christodoulides, marc p. raphael*

materials science and technology division, u.s. naval research laboratory, washington d.c.
* corresponding author: marc.raphael@nrl.navy.mil

abstract

machine learning algorithms hold the promise of greatly improving live cell image analysis by (1) analyzing far more imagery than can be achieved by more traditional manual approaches and (2) eliminating the subjectivity introduced when researchers and diagnosticians select the cells or cell features to be included in the analyzed data set. currently, however, even the most sophisticated model based or machine learning algorithms require user supervision, meaning the subjectivity problem is not removed but rather incorporated into the algorithm’s initial training steps and then repeatedly applied to the imagery. to address this roadblock, we have developed a self-supervised machine learning algorithm that recursively trains itself directly from the live cell imagery data, thus providing objective segmentation and quantification. the approach incorporates an optical flow algorithm component to self-label cell and background pixels for training, followed by the extraction of additional feature vectors for the automated generation of a cell/background classification model. because it is self-trained, the software has no user-adjustable parameters and does not require curated training imagery. the algorithm was applied to automatically segment cells from their background for a variety of cell types and five commonly used imaging modalities: fluorescence, phase contrast, differential interference contrast (dic), transmitted light and interference reflection microscopy (irm). the approach is broadly applicable in that it enables completely automated cell segmentation for long-term live cell phenotyping applications, regardless of the input imagery’s optical modality, magnification or cell type.
key words: live cell imaging, segmentation, phenotyping, machine learning, unsupervised, classification

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder. this article is a us government work. it is not subject to copyright under usc and is also made available for use under a cc license.

introduction

live cell phenotyping is an information rich experimental approach, capable of providing mechanistic insights into cell biology, guiding drug development and elucidating disease pathologies. the wealth of information available from live cell microscopy results from the fact that there are numerous optical modalities that can be integrated within a given experiment, from fluorescence imaging, which provides spatio-temporal information on specific signaling pathways and organelles, to label-free techniques such as phase contrast and differential interference contrast (dic), which enable the visualization of whole cellular morphologies and dynamics. each of these modalities provides its own outcome measures, which can be viewed as static snapshots or dynamic variations within the four-dimensional space of x, y, z and time. however, compared to genotyping, its synergistic partner technique, live cell phenotyping remains a far more subjective science. the generation of genomic sequencing data and its analysis can now be achieved autonomously by employing a combination of robotics and microfluidics for sample preparation and machine learning algorithms for data collection and interpretation. in contrast, the extraction of quantitative information from live cell imagery by manual means is still commonplace in live cell microscopy, a fact which speaks to the human visual system’s adeptness at detecting small changes and low contrast features with high fidelity.
but with automated live cell microscopes now able to collect high resolution imagery for days on end, the resulting data files can quickly grow to tens of gigabytes, leaving the analyst with an overwhelming amount of imagery to work through. furthermore, if the analyst is not blinded to the experimental design, unconscious bias can creep into the data extraction process.

enter computational algorithms capable of extracting the relevant outcome variables from the imagery in an automated fashion. broadly speaking, the algorithms are often classified as model based approaches (e.g. cell profiler) and machine learning algorithms (e.g. u-net, ilastik). neither approach is completely autonomous when it comes to cell segmentation: model-based approaches require the manual tuning of multiple parameters, while machine learning requires that the user provide curated data from which the algorithm is trained. once tuned or trained, the software is able to process far more imagery than could be achieved manually, but there is still a human-in-the-loop. the manual contribution has simply been moved to the front end for training purposes and is then continuously reapplied by the algorithm.

algorithms that are tuned or trained at the outset can miss relevant features as the cellular phenotypes or background characteristics evolve, inadvertently skewing the analysis. for instance, variations in label intensity (e.g. photobleaching, quenching) or new morphological features that were not present during the initial training (e.g. differentiation, mitosis, blebbing) can go undetected unless the algorithm is retrained with a freshly curated data set or retuned with parameters that capture the offending features. in the same way, temporal variations in the background illumination intensity or homogeneity can also result in improper cell segmentation.
especially concerning is that the user-supervised training process is inherently subjective in nature and can cause unconscious biases to be effectively baked into the extracted data by the training process. to optimize objectivity and efficiency, an essential goal is to develop software that can accept imagery from any optical modality, labeled or unlabeled, and extract the cellular features of interest without input from the user.

as participants in a synthetic biology real-time reproducibility project administered by the u.s. defense advanced research projects agency (darpa), referred to as independent verification & validation (iv&v), we have recently experienced all of these algorithmic limitations and how they can result in large amounts of data either being incorrectly segmented, subjectively segmented, or left unanalyzed due to time constraints. the program involves a wide range of cell types (amoeboid to eukaryotic) from multiple cell biology laboratories; multiple imaging modalities, both fluorescent and label free; and objective magnifications ranging from x to x. the cumbersome process of retraining supervised machine learning software to match this variety of conditions proved impractical and a human-in-the-loop training step was deemed too subjective.

the challenge then was to develop a completely automated segmentation algorithm for live cell microscopy applications. in particular, the image analysis software should be ‘self-supervised’, meaning it trains itself to classify cells versus background and then regularly updates this training so that it can adapt to evolving intensities and morphologies. the software was required to segment a variety of cell types from live cell imagery given the most common imaging modalities as inputs (phase contrast, transmitted light, dic, fluorescence and interference reflection microscopy (irm)) and to do so without user-adjustable parameters or user-selected training imagery.
it was additionally required that the generated models adapt to changing cell phenotypes and lighting conditions for long-term imaging applications (hours to days).
and is also made available for use under a cc license. (which was not certified by peer review) is the author/funder. this article is a us government work. it is not subject to copyright under usc the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . .
methods
to replace more manual model-based and machine learning training approaches for segmenting cells with an automated, self-supervised algorithm, we took advantage of the one phenotypic feature which is present in live cell microscopy no matter what the modality: motion. from the nanoscale diffusion of proteins and vesicles to the migration of cells that are tens of microns in length, the ever present dynamics captured by live cell microscopy make it ideal for applying optical flow (of) algorithms designed to identify not just spatial intensity features in a given frame but also the variation or ‘flow’ of those features from frame to frame. the central assumption in optical flow algorithms is that the overall image intensity will remain constant if the time difference between frames is reasonably small. this leads to the following time-derivative constraint equation:
\[
\frac{dI(x, y, t)}{dt} = 0 \;\rightarrow\; \frac{\partial I}{\partial x}\frac{dx}{dt} + \frac{\partial I}{\partial y}\frac{dy}{dt} + \frac{\partial I}{\partial t} = 0, \qquad u = \frac{dx}{dt}, \quad v = \frac{dy}{dt}
\]
where I(x, y, t) is the in-plane image intensity at time t, and u and v are the optical flow in the x and y directions, respectively. the methods used to solve this constraint equation are matched with the imaging goal, such as reducing jitter in imagery taken from helicopters, aligning medical imagery or, in the case of this study, cell motion segmentation.
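as a small numerical illustration of the constraint equation above, the following python sketch checks that the residual I_x u + I_y v + I_t vanishes for a pattern that translates rigidly with a known velocity. the linear intensity ramp, the function names and the finite-difference step are all illustrative assumptions and are not part of the matlab implementation described here.

```python
def intensity(x, y, t, u, v):
    """Hypothetical test pattern: a linear ramp translating with velocity (u, v)."""
    return 2.0 * (x - u * t) + 3.0 * (y - v * t)


def constraint_residual(x, y, t, u_true, v_true, u, v, h=1.0):
    """Evaluate I_x*u + I_y*v + I_t using central finite differences.

    (u_true, v_true) is the velocity that generated the pattern;
    (u, v) is the candidate optical flow being tested.
    """
    def img(xx, yy, tt):
        return intensity(xx, yy, tt, u_true, v_true)

    i_x = (img(x + h, y, t) - img(x - h, y, t)) / (2 * h)
    i_y = (img(x, y + h, t) - img(x, y - h, t)) / (2 * h)
    i_t = (img(x, y, t + h) - img(x, y, t - h)) / (2 * h)
    return i_x * u + i_y * v + i_t


# The true flow satisfies the constraint; a zero-flow guess does not.
residual_true = constraint_residual(5.0, 7.0, 2.0, 1.5, -0.5, 1.5, -0.5)
residual_zero = constraint_residual(5.0, 7.0, 2.0, 1.5, -0.5, 0.0, 0.0)
```

a single constraint equation per pixel cannot determine both u and v, which is why practical methods such as farnebäck's add neighbourhood assumptions; the sketch only verifies the constraint for a known flow.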
in testing a range of optical flow algorithms for cell segmentation, we found the farnebäck method to be the most robust due to its sensitivity to object deformation – a natural fit for cells which are morphologically variable. optical flow assumptions may or may not be met for fluorescence time-lapse imagery applications in which extended time intervals are sometimes employed to avoid phototoxicity or photobleaching. for this reason, it was important that our technique be co-validated with label-free techniques such as transmitted light and phase contrast, which are minimally invasive. overlaying less frequently accumulated fluorescence imagery on cells segmented using a label-free imaging channel is then straightforward. furthermore, there has been an increased appreciation for the morphological information label-free approaches can provide as a result of algorithmic-based phenotyping. our approach to self-supervised learning and automated model generation begins with utilizing the farnebäck of method as a means of classification bootstrapping (fig ). typical segmentation strategies involve utilizing static information in a single image at time frame (t), which can have difficulty distinguishing ‘cell’ from ‘background’ pixels in a generalizable manner (fig a). in contrast, our approach begins with an of calculation based on images from consecutive time frames (t-1, t). this enables us to leverage the ubiquitous nature of intracellular motion and build a dynamics-based feature vector: pixels with the highest flow are automatically labeled as ‘cell’ pixels, those with the lowest flow are automatically labeled as ‘background’ pixels, and those that do not fit either category remain unlabeled (fig b,c). we note that this automatic self-labeling is broadly applicable in that it is not dependent on principles of any specific optical modality, cell type, or phenotype.
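the self-labeling rule described above can be sketched in a few lines of python. this is a minimal illustration only: the flow field is assumed to be already computed (for example by a farnebäck implementation), and the two threshold values, the label codes and the function name are hypothetical choices, since the text does not state how the thresholds are set.

```python
import math

# Hypothetical label codes for this sketch.
CELL, BACKGROUND, UNLABELED = 1, 0, -1


def self_label(flow_u, flow_v, high_thresh, low_thresh):
    """Label each pixel from its optical flow magnitude.

    flow_u, flow_v: 2-D lists holding the x and y flow components per pixel.
    Pixels at or above high_thresh become CELL, pixels at or below
    low_thresh become BACKGROUND, and everything in between stays UNLABELED.
    """
    labels = []
    for row_u, row_v in zip(flow_u, flow_v):
        row = []
        for u, v in zip(row_u, row_v):
            magnitude = math.hypot(u, v)
            if magnitude >= high_thresh:
                row.append(CELL)
            elif magnitude <= low_thresh:
                row.append(BACKGROUND)
            else:
                row.append(UNLABELED)
        labels.append(row)
    return labels


# Toy 2x3 flow field between frames (t-1, t).
flow_u = [[0.0, 0.3, 3.0], [0.05, 1.0, 0.0]]
flow_v = [[0.0, 0.4, 4.0], [0.0, 1.0, 2.5]]
labels = self_label(flow_u, flow_v, high_thresh=2.0, low_thresh=0.1)
```

only the pixels labeled CELL or BACKGROUND would go on to form the training set; the unlabeled pixels are left for the classifier to decide.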
the of-based self-labeling approach outputs a set of ‘cell’ and ‘background’ labeled pixels which are then used to generate additional entropy and gradient feature vectors at each time point. these static feature vectors are used to train and generate a classifier model which, in the final step, is applied to all pixels in the image for cell segmentation. the algorithm is written as a stand-alone matlab script and utilizes functions from the image processing, statistics and machine learning, and computer vision toolboxes.
fig. overview of the optical flow self-labeling strategy. (a) the vast majority of cell segmentation techniques utilize single image frames and the static information contained within as means to distinguish ‘cell’ from ‘background’, oftentimes represented in a histogram. the self-supervised algorithm utilizes optical flow as a means to self-label pixels in an automated fashion. (b) due to the prevalence of intracellular dynamics in time-lapse live cell imagery, optical flow can be calculated for each pair of consecutive images (t-1, t). the optical flow can then be represented as vectors associated with each pixel (right). (c) the magnitude of the optical flow then offers a means to distinguish cells from their background, as shown in the bivariate histogram which co-plots the pixel intensity of a single image at t to the optical flow vector magnitudes calculated between consecutive images (t-1, t). pixels with the highest flow can be automatically labeled ‘cell’ (left of the green dashed line) and those with the lowest flow can be labeled ‘background’ (right of the yellow dashed line).
pixels that do not meet either criterion remain unlabeled, while the self-labeled pixels are used to create a training data set for classification. time increment: sec, scale bar = µm.
the self-supervised training approach is illustrated in fig using time lapse dic imagery of multiple (top) and a single highlighted (bottom) mda-mb- cell. from the raw imagery (fig a,b), many portions of individual cells appear to blend in with the background. however, when the of self-labeling strategy is applied, the algorithm automatically identifies pixels with high flow magnitude, highlighted as green pixels (fig c,d), which are selected as having the highest probability of correctly being labeled ‘cell’. to automatically label the background, the algorithm over-segments, that is, a liberal (low) of threshold is employed which captures motion from not only the cell but also from nearby background pixels as well. the algorithm sets these pixel values to zero and labels the pixels in which no significant motion was detected as ‘background’ (fig c,d yellow pixels). once labeled ‘cell’ or ‘background’ in this unsupervised manner by of (dynamic features from image pair (t-1, t)), entropy and gradient feature vectors (static features from image at t) are generated for each of these training pixels using their local neighborhood of pixels (s.i., fig s ). these additional feature vectors are then used to train and generate a naïve bayesian classifier model which is applied to the entire image in a pixel-wise fashion. the information gained from
the entropy and gradient feature vectors enables pixels which were left unlabeled in the of training steps (fig c,d grey pixels) to be classified. the contrast enhanced image (fig b) and model-generated segmentation (fig f, teal pixels) show that the algorithm is able to segment the cell with high fidelity (dic image/segmented boundary overlay, fig g). importantly, this labeling, training and classifying procedure occurs recursively on each successive pair of (t-1, t) images, enabling the classifier model to adapt to changing backgrounds and phenotypes. by using optical flow to label the highest flow pixels as ‘cells’ and lowest flow pixels as ‘background’, the labeling process has become automated (or ‘self-supervised’) and no manual inputs or training images are needed. for extremely low contrast imagery there can be too few training pixels labeled ‘cell’ for robust segmentation to occur given the initial of threshold setting. in such cases, the algorithm calculates the entropy associated with ‘cell’ pixels and iteratively reduces the of threshold until the associated ‘cell’ entropy feature vector is well distinguished from that of the ‘background’ entropy feature vector.
fig. overview of the automated self-supervised learning algorithm. a. the contrast enhanced dic image of several and b. a single highlighted mda-mb- cell illustrates the range of intensities inherent within the cells ( x objective). c. & d. unsupervised learning via of: high threshold of is used to select only those pixels exhibiting the highest flow magnitudes and labels them as ‘cell’ (green pixels). similarly, low threshold of is used to identify pixels with a much wider range of flow magnitudes than the high flow regime. the lowest flow magnitude pixels are labelled ‘background’ (yellow pixels). pixels that exhibit of in between these regimes remain unlabeled (gray pixels). e. & f. supervised learning via self-labeled training data.
the self-labeled pixels (green and yellow) are then used to generate static feature vectors, which are in turn used to train the classifier model. g. the blue outline is the resulting segmentation which outlines all pixels classified by the of trained model as ‘cell’ and is also overlaid on the image in b. this process is repeated at every time step, thereby using the most recent imagery to update the training data. scale bar: µm ( x objective, time increment: sec).
results
the fig imagery shows the generality of this approach and also demonstrates how the self-supervised algorithm additionally automates commonly required manual inputs such as size filtering and hole filling. the segmented cells were processed from imagery acquired from a range of cell types, imaging modalities, magnifications and time increments (s.i. table s ). the of algorithm enabled a straightforward approach to automated size filtering which is a common user adjustable parameter in supervised machine learning approaches. to accomplish this, a stand-alone application of of was applied to the imagery which lacked the added steps of self-tuning and model building described above. while some cell features are missed, this simpler, faster approach was found to be more than precise enough to estimate average cell size and to exclude much smaller objects, thus automating the size filtering process. because extraneous debris often lacked the motion of the live cells, this debris was also automatically labeled as background by the of algorithm.
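the automated size filtering described above can be sketched as follows in python. this is an illustration under stated assumptions, not the authors' matlab code: objects are taken to be 4-connected components of a binary mask, the typical cell area is estimated as the mean component area of a coarse flow-only mask, and the cut-off fraction `min_fraction` is a hypothetical parameter.

```python
from collections import deque


def connected_components(mask):
    """Return the 4-connected components of a binary mask as pixel lists."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def size_filter(mask, typical_area, min_fraction=0.25):
    """Keep only objects whose area is at least min_fraction * typical_area."""
    height, width = len(mask), len(mask[0])
    filtered = [[0] * width for _ in range(height)]
    for pixels in connected_components(mask):
        if len(pixels) >= min_fraction * typical_area:
            for y, x in pixels:
                filtered[y][x] = 1
    return filtered


# Coarse flow-only mask: used here solely to estimate the typical cell area.
coarse_flow_mask = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
areas = [len(p) for p in connected_components(coarse_flow_mask)]
typical_area = sum(areas) / len(areas)

# Segmentation containing one cell-sized object and one single-pixel speck.
segmentation = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
filtered = size_filter(segmentation, typical_area)
```

because the size estimate comes from the imagery itself, no user-supplied minimum object size is needed, which is the point made in the text.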
fig a and b demonstrate the self-supervised code’s ability to size filter, while also adapting to cell types of differing sizes, by comparing the segmentation of human fibroblasts ( x, phase contrast) to that of the much smaller dictyostelium amoeboid cells ( x, transmitted light), respectively. extraneous debris features in the hs imagery (fig a, white arrows) are correctly identified as ‘background’, even though similar in size and intensity to the dictyostelium cells of fig b. the background inhomogeneities observed in fig a and b, which could potentially be mislabeled as ‘cell’, are correctly identified because they remain relatively constant from frame t-1 to frame t. the segmentation results of the mda-mb- cells ( x, phase contrast) in fig c illustrate the algorithm’s ability to adapt to a wide range of phenotypes, from rounded fig c(i) to spread fig c(ii), which is enabled without need for user input by continuously retraining the model on consecutive image pairs. the current instantiation of the software does not attempt to separate cells that are touching or close enough to be segmented as a single object. well-developed approaches such as watershed transforms and level-set methods can be employed for such purposes. the algorithm works robustly for a range of optical modalities and magnifications as shown in figs d-f. figs d and e are segmentation results from irm imagery ( x, hs cell) and dic imagery ( x, mda-mb- ). as a fluorescence imaging example, a self-supervised segmentation of a gfp-actin labeled a cell at x magnification is shown in fig f. as an additional option, of can be applied not only as an algorithm labeling element, but also as a measurement tool, as shown in the fig f vector plot. the plotted of vectors (blue) display the magnitude and direction of the measured gfp labelled actin flow between frames. such measurements have been shown to be useful for quantifying intracellular protein and calcium signaling dynamics.
fig. self-supervised segmentation for a range of cell types, microscope modalities, time resolutions and magnifications. a. phase contrast of hs fibroblasts ( x objective, time increment: sec) b. transmitted light of dictyostelium ( x objective, time increment: sec) c. phase contrast of mda-mb- ( x objective, time increment: sec) d. irm image of a single hs cell ( x objective, time increment: sec) e. dic image of mda-mb- cells ( x objective, time increment: sec) f. fluorescence image of a single lifeact (gfp-actin conjugate) transfected a cell (pseudo-colored) with the associated optical flow vector plot ( x objective, time increment: sec). insets i, ii, iii highlight boxed image regions. white arrows point to examples of debris that was correctly labelled ‘background’ due either to lack of motion or automated size filtering. images have been contrast enhanced to highlight low contrast features and background inhomogeneities. dic image (e) was additionally enhanced with a sharpen filter to highlight interference induced shadowing of cell features. scale bars: a, b, c: µm; d, e: µm; f: µm.
hole filling, another often required manual input for model-based and machine learning algorithms, has also been automated by this approach.
common examples of when hole filling input is required include fluorescent labels that do not penetrate the nucleus or, for label-free microscopy modes such as phase contrast, large spread cells in which the algorithm has a difficult time associating the interference enhanced cell edges with the enclosed lamellipodia. we found that motion within cells was ubiquitously detected by of, regardless of imaging modality or whether imaging the cell membrane, nucleus or cytoplasm. because motion detection was far more common than not for a given pixel within an area labeled ‘cell’, a fixed morphological blurring tool (circular with a radius of pixels) was found to robustly hole fill regardless of cell type or microscope configuration. the calculated cell area was found to be invariant for a range of blurring tool radii (fig s ). in all cases, the use of optical flow to identify motion and the pixel radius blurring tool was sufficient to correctly fill in the cell. by re-training on every pair of consecutive images the self-supervised algorithm remains accurate throughout long-term imaging applications, despite changes in background or cell phenotypes. this allows for a rich behavior of dynamic morphology and migration to readily be collected and analyzed – a key point given the known inter-relationship between cellular shape and function. furthermore, the emerging role that not just cell shape, but cell shape dynamics play in fundamental biological processes is becoming increasingly clear. fig demonstrates how such quantitative morphological information is readily mined in a long-term imaging application. fig a-c shows the tracking of several mda-mb- cells segmented via the self-supervised approach under x phase contrast microscopy on crgd functionalized gold coverslips. fig a shows the labeled tracks of the cells’ centroids over the course of minutes, with the corresponding initial and final image shown in figs b,c.
the cell associated with track undergoes mitosis at approximately minutes, creating two new tracks ( and ) for the daughter cells. because the self-supervised approach automatically re-trains continuously on consecutive frame pairs, the morphological changes from fig b to fig c are quantified with high fidelity, as can be seen by plotting the segmented boundaries as a function of time (fig d).
fig. tracking of mda-mb- cells under x phase contrast microscopy and time evolution of cell morphology through mitosis. a. the resulting tracks of multiple segmented cells from a single field of view over the course of minutes b. corresponding images at times t = min and c. min. track undergoes mitosis resulting in tracks and of the daughter cells (blue line). d. (left) time evolution of segmented morphology of track (black) with the centroid of each shape denoted by an open circle until mitosis, after which the track splits into (green) and (blue), with the cell separation event denoted by a single red open circle. d. (right) selected images showing raw data overlapped with the self-supervised segmentation throughout mitosis event. ( x objective, time increment: sec) scale bar: μm.
discussion & conclusions
there are numerous advantages to this self-supervised machine learning approach.
the most obvious is that because the training data is generated by tracking motion, the approach can be used with any live cell imaging microscopy technique, whether labeled or label-free. also unique is the use of the optical flow labeled pixels to self-supervise the building of a classifier model, which in turn is modular with regard to the incorporated feature vectors. while we have employed only two feature vectors in this current instantiation of the classification code (gradient and entropy), there are many additional image features that can be added based on the application. we have also shown that the incorporation of of enables the straightforward automation of morphological operations such as size filtering and hole filling, eliminating the need for manually tuning these parameters. the automation described here is markedly different from machine learning approaches that require user assisted training. the most time consuming aspect of model-based tuning and machine learning approaches is the training process. the process is one of trial and error, requiring retraining if the model’s performance is not deemed adequate. the complete automation of both the training and segmentation algorithms not only saves time but also prevents unconscious bias from entering the training process. because the training is conducted recursively with each new image, evolutions in phenotype and background structure over extended time periods are accounted for without the need for preprocessing. the sum of all these advantages is segmentation under a wide range of magnifications, time resolutions, cell types and optical modalities that is both automated and robust. this results in the ability to track cells for hours or days and quantify a range of morphological and phenotypic features without the need for user input, thus having broad applicability throughout live cell microscopy.
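to make the modularity of the classifier concrete, the following python sketch trains a naïve bayes model on self-labeled feature vectors of arbitrary length. it is an illustration only: the gaussian likelihood, the variance floor, the toy [entropy, gradient] values and the function names are assumptions, as the text specifies a naïve bayesian classifier but not its details.

```python
import math


def fit_gaussian_nb(features, labels):
    """Fit per-class priors, feature means and variances (Gaussian assumption)."""
    model = {}
    n_total = len(labels)
    for cls in sorted(set(labels)):
        rows = [f for f, lab in zip(features, labels) if lab == cls]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor keeps the variance strictly positive.
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[cls] = (math.log(n / n_total), means, variances)
    return model


def predict(model, x):
    """Return the class with the highest log posterior for feature vector x."""
    best_cls, best_score = None, -math.inf
    for cls, (log_prior, means, variances) in model.items():
        score = log_prior
        for xi, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (xi - m) ** 2 / (2 * var)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls


# Toy self-labeled training pixels: [entropy, gradient] per pixel.
train_features = [
    [4.0, 9.0], [4.2, 8.5], [3.8, 9.5],   # self-labeled 'cell'
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # self-labeled 'background'
]
train_labels = ["cell", "cell", "cell",
                "background", "background", "background"]
model = fit_gaussian_nb(train_features, train_labels)

# Classify pixels that the optical flow step left unlabeled.
pred_a = predict(model, [3.9, 9.1])
pred_b = predict(model, [1.1, 1.0])
```

adding a further image feature in this sketch only means appending one more value to each feature vector; neither function needs to change.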
the crux of the introduced self-supervised approach relies upon using the dynamic information embedded in each pixel – motion characterized via optical flow – as an elegant means to self-label cells versus background in time-lapse imagery. while cellular dynamics has long been appreciated as information rich with regards to understanding cell function, our approach demonstrates that it also provides the means for robust segmentation – a foundational step for achieving quantitative and objective live cell analysis.
acknowledgements
the authors gratefully acknowledge the devreotes laboratory of johns hopkins university for the dictyostelium discoideum cell line. m.c.r. gratefully acknowledges support from the national research council research associateship program and the jerome and isabella karle distinguished scholar fellowship program. funding for this project was provided by the office of naval research through the naval research laboratory’s basic research program and by the biological technology office of the defense advanced research projects agency.
author contributions
michael c. robitaille: conceptualization, methodology, investigation, data curation, software, visualization, and writing. jeff m. byers: conceptualization, methodology, formal analysis, and software. joseph a. christodoulides: resources, validation, and writing. marc p. raphael: conceptualization, funding acquisition, methodology, investigation, software, visualization, and writing.
financial conflicts of interest
the authors do not have any conflicts of interest with this work.
references
caicedo, j. c., singh, s. & carpenter, a. e.
applications in image-based profiling of perturbations. current opinion in biotechnology , - , doi: . /j.copbio. . . ( ). cadart, c., zlotek-zlotkiewicz, e., le berre, m., piel, m. & matthews, h. k. exploring the function of cell shape and size during mitosis. developmental cell , - , doi: . /j.devcel. . . ( ). zhou, x. b. & wong, s. t. c. high content cellular imaging for drug development. ieee signal processing magazine , - , doi: . /msp. . ( ). zhong, j. et al. persistent hepatitis c virus infection in vitro: coevolution of virus and host. journal of virology , - , doi: . /jvi. - ( ). zhu, n. et al. morphogenesis and cytopathic effect of sars-cov- infection in human airway epithelial cells. nature communications , doi: . /s - - -z ( ). skylaki, s., hilsenbeck, o. & schroeder, t. challenges in long-term imaging and quantification of single-cell dynamics. nature biotechnology , - , doi: . /nbt. ( ). caicedo, j. c. et al. data-analysis strategies for image-based cell profiling. nature methods , - , doi: . /nmeth. ( ). deep learning gets scope time. nature methods , - , doi: . /s - - -x ( ). grys, b. t. et al. machine learning and computer vision approaches for phenotypic profiling. journal of cell biology , - , doi: . /jcb. ( ). moen, e. et al. deep learning for cellular image analysis. nature methods , - , doi: . /s - - - ( ). carpenter, a. e. et al. cellprofiler: image analysis software for identifying and quantifying cell phenotypes. genome biology , doi: . /gb- - - -r ( ). al-kofahi, y., zaltsman, a., graves, r., marshall, w. & rusu, m. a deep learning-based algorithm for -d cell segmentation in microscopy images. bmc bioinformatics , doi: . /s - - -z ( ). falk, t. et al. u-net: deep learning for cell counting, detection, and morphometry (vol , pg , ). nature methods , - , doi: . /s - - - ( ). sommer, c., straehle, c., kothe, u., hamprecht, f. a. & ieee. 
in th ieee international symposium on biomedical imaging: from nano to macro ieee international symposium on biomedical imaging - ( ). raphael, m. p., sheehan, p. e. & vora, g. j. a controlled trial for reproducibility. nature , - , doi: . /d - - - ( ). beauchemin, s. s. & barron, j. l. the computation of optical flow. acm comput. surv. , - , doi: . / . ( ). farneback, g. in image analysis, proceedings vol. lecture notes in computer science (eds j. bigun & t. gustavsson) - ( ). robitaille, m. c., byers, j. m., christodoulides, j. a. & raphael, m. p. robust optical flow algorithm for general, label-free cell segmentation. biorxiv, . . . , doi: . / . . . ( ). schroeder, t. long-term single-cell imaging of mammalian stem cells. nature methods , s -s , doi: . /nmeth. ( ). jaccard, n. et al. automated method for the rapid and precise estimation of adherent cell culture characteristics from phase contrast microscopy images. biotechnol. bioeng. , - , doi: . /bit. ( ). ounkomol, c., seshamani, s., maleckar, m. m., collman, f. & johnson, g. r. label-free prediction of three-dimensional fluorescence images from transmitted-light microscopy. nature methods , -+, doi: . /s - - - ( ). vicar, t. et al. cell segmentation methods for label-free contrast microscopy: review and comprehensive comparison. bmc bioinformatics , , doi: . /s - - - ( ). wang, m. et al. novel cell segmentation and online svm for cell cycle phase identification in automated microscopy. bioinformatics , - , doi: . /bioinformatics/btm ( ). nath, s. k., palaniappan, k. & bunyak, f. in medical image computing and computer-assisted intervention - miccai , pt vol.
lecture notes in computer science (eds r. larsen, m. nielsen, & j. sporring) - ( ). buibas, m., yu, d., nizar, k. & silva, g. a. mapping the spatiotemporal dynamics of calcium signaling in cellular neural networks using optical flow. annals of biomedical engineering , - , doi: . /s - - - ( ). delpiano, j. et al. performance of optical flow techniques for motion analysis of fluorescent point signals in confocal microscopy. machine vision and applications , - , doi: . /s - - - ( ). lee, r. m. et al. quantifying topography-guided actin dynamics across scales using optical flow. mol. biol. cell , - , doi: . /mbc.e - - ( ). meyers, j., craig, j. & odde, d. j. potential for control of signaling pathways via cell size and shape. current biology , - , doi: . /j.cub. . . ( ). rangamani, p. et al. decoding information in cell shape. cell , - , doi: . /j.cell. . . ( ). akanuma, t., chen, c., sato, t., merks, r. m. h. & sato, t. n. memory of cell shape biases stochastic fate decision-making despite mitotic rounding. nature communications , doi: . /ncomms ( ). robitaille, m. c. et al. problem of diminished crgd surface activity and what can be done about it. acs applied materials & interfaces , - , doi: . /acsami. c ( ).
dynugene: an r package for uncertainty-aware gene regulatory network inference, simulation, and visualization
tianyu lu and anjali silva
department of computer science, university of toronto, toronto, canada; department of cell and systems biology, university of toronto, toronto, canada; princess margaret cancer centre, university health network, toronto, canada; vector institute, toronto, canada
methods for gene regulatory network inference focus on network architecture identification but neglect model selection and simulation. we implement an extension to the dyngenie algorithm that accounts for model uncertainty as an r package, providing users with an easy to use interface for model selection and gene expression profile simulation. source code is available at https://github.com/tianyu-lu/dynugene with a detailed user guide. a webserver with interactive controls is available at https://tianyulu.shinyapps.io/dynugene/.
gene regulatory network | network inference
correspondence: tianyu.lu @mail.utoronto.ca
introduction
complex phenomena such as cell development and apoptosis emerge from coordinated dynamics of gene regulatory networks (grn). inferring network structure from data can be used for hypothesis generation, revealing mechanisms in cell development and disease (huang et al., ), and modelling network evolution (crombach and hogeweg, ). accurate dynamical models allow us to predict the effects of network perturbations on biological function, for example to push cells out of a disease state (karlebach and shamir, ), or to design synthetic grns given the desired dynamics of a network (hiscock, ). the ideal model should be flexible enough to capture highly nonlinear interactions while not sacrificing model interpretability and computation time.
we present dynugene (dynamical uncertainty-aware gene network inference), an r package that extends the functionality of dyngenie , a state-of-the-art method for grn inference (geurts et al., ). we build on dyngenie because it satisfies all three of our model desiderata. existing extensions include timeor and benin which both incorporate heterogeneous data to improve network inference accuracy (wonkap and butler, ; conard et al., ). here, we take a different approach and instead account for uncertainty in dyngenie , allowing for stochastic gene expression simulations and parsimonious model selection. our extension is available as an easy to use r package and also as an interactive web server.
package design
dyngenie background. dyngenie poses grn inference as a feature selection problem. it first trains random forests to predict the change in concentration of each species given the current concentrations of all species. each interaction from species xi to species xj is associated with an importance score, calculated by the reduction in variance from using xi to predict the change in xj. the importance score for an interaction, when normalized, is interpreted as the probability that the interaction exists. for a detailed treatment, see the vignette and (geurts et al., ).
model selection. the inferred network can be visualized as a p×p matrix where the entry [xi,xj] is the importance score of xi for inferring xj (fig. ). however, real grns are often not fully connected and the presence of an interaction is binary (mangan et al., ). to address this, dynugene includes a function for model selection based on visualizing the pareto front (mangan et al., ). however, we note that the model at the sharp drop in the pareto front is not always the best model (supplementary fig. s ). we include an additional function on the web server where users can choose which interactions to mask.
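the idea of normalizing importance scores and masking interactions can be illustrated with a short python sketch. this is not the dynugene api: the function names are hypothetical, normalizing over the whole matrix is an assumption (the package may normalize differently), and keeping the `keep` strongest interactions is just one possible masking rule.

```python
def normalize_scores(scores):
    """Scale a p x p importance matrix so that all entries sum to one."""
    total = sum(sum(row) for row in scores)
    return [[s / total for s in row] for row in scores]


def mask_weakest(scores, keep):
    """Zero out every interaction except the `keep` highest-scoring ones."""
    ranked = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)),
        reverse=True,
    )
    kept = {(i, j) for _, i, j in ranked[:keep]}
    return [
        [s if (i, j) in kept else 0.0 for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]


# Toy importance matrix: entry [i][j] scores species i as a regulator of j.
raw_scores = [
    [0.0, 6.0, 1.0],
    [1.0, 0.0, 8.0],
    [4.0, 0.0, 0.0],
]
probabilities = normalize_scores(raw_scores)
masked = mask_weakest(probabilities, keep=3)
n_edges = sum(1 for row in masked for s in row if s > 0)
```

in this toy example the three retained edges form a cycle (0 to 1, 1 to 2, 2 to 0), which is the kind of sparse structure a user might select before simulating.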
the masked networks can then be simulated, allowing for application-specific tuning of model complexity.

model simulation. the inferred networks and masked networks can be used to simulate gene expression profiles by numerically solving the system of ordinary differential equations learned by the random forests. in addition to deterministic simulations, we provide an option for stochastic simulations that accounts for the uncertainty in the random forests' predictions. for stochastic simulations, instead of taking only the mean of a random forest's predictions, we sample from the gaussian n(µ, σ²), where µ is the mean and σ² is the variance of the random forest's predictions.

provided datasets. the dynugene package provides four example time-series datasets: repressilator, stochastic repressilator, hodgkin-huxley, and stochastic hodgkin-huxley (elowitz and leibler, ; hodgkin and huxley, ). these datasets were generated from systems of ordinary or stochastic differential equations. details are provided in the vignette. the package also includes one steady state dataset, syntren , taken from grndata (bellot et al., ). users can provide their own data as input following the format specified in ?infernetwork.

fig. : bottom: inferred importance scores on the repressilator dataset for the th network in the step-wise column masks plot (supplementary fig. s ). top: simulated trajectory using the inferred network.

lu et al. | biorxiv | january. biorxiv preprint; this version posted january. the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by international license.
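the stochastic simulation option described under model simulation can be illustrated with a minimal python sketch. it is not the package's r implementation: the forward-euler integrator, the names `stochastic_euler` and `toy_forest`, and the use of the spread across per-tree predictions as σ are all assumptions made for illustration.

```python
import random
import statistics


def toy_forest(x):
    """Hypothetical stand-in for trained random forests.

    Returns, for each species, a list of per-tree predictions of dx/dt.
    Here every "tree" predicts simple decay (-x) plus a small fixed offset.
    """
    offsets = (-0.05, 0.0, 0.05)
    return [[-xi + off for off in offsets] for xi in x]


def stochastic_euler(x0, tree_predictions, dt, n_steps, rng):
    """Forward-Euler simulation with rates drawn from N(mu, sigma^2).

    mu and sigma are the mean and standard deviation of the per-tree
    predictions (an assumption about how the uncertainty is measured).
    """
    x = list(x0)
    trajectory = [list(x)]
    for _ in range(n_steps):
        per_species = tree_predictions(x)
        new_x = []
        for xi, preds in zip(x, per_species):
            mu = statistics.fmean(preds)
            sigma = statistics.pstdev(preds)
            rate = rng.gauss(mu, sigma)  # sigma == 0 gives the deterministic rate
            new_x.append(xi + dt * rate)
        x = new_x
        trajectory.append(list(x))
    return trajectory


# One stochastic run for two species, seeded for reproducibility.
noisy = stochastic_euler([1.0, 2.0], toy_forest, dt=0.01, n_steps=10,
                         rng=random.Random(1))
```

when every tree agrees, σ is zero and the sketch reduces to a deterministic euler simulation, which mirrors the deterministic option described in the text.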
discussion
a requirement for dyngenie and dynugene is that all species must be tracked through time. this requirement is difficult to satisfy in practice as there are often unknown species in a biological process of interest. methods that can identify or approximate latent structure in partially-observed systems are more appropriate here (hiscock, ). an omics treatment such as rna-seq can cover breadth, but current sequencing techniques require cells to be destroyed, thus making time series data collection difficult. non-destructive sequencing techniques could address this issue.

the implementation of an inferred network as a gene circuit will require more thought. even for networks with sparse interactions, the likelihood of finding a set of genes and proteins that satisfy the interaction strengths and activation or inhibitory effects is unknown. in fact, whether a species is an activator or inhibitor is not explicitly given in the interaction matrix. we can address this by posing dynugene as a constrained optimization problem where it is limited to using only a given set of parts (genes, promoters, ribosome binding sites, proteins, etc.), thus relating the importance scores to biological interaction strengths. we leave this for future work.

data and code availability
source code is available at https://github.com/tianyu-lu/dynugene with a detailed user guide. a webserver with interactive controls is available at https://tianyulu.shinyapps.io/dynugene/.

acknowledgements
the authors thank the authors of dyngenie for their work and alan moses for guidance.

funding
this work was supported by a postdoctoral fellowship from canadian institutes of health research.

bibliography
sui huang, ingemar ernberg, and stuart kauffman. cancer attractors: a systems view of tumors from a gene network dynamics and developmental perspective. in seminars in cell & developmental biology, volume , pages – . elsevier, .
anton crombach and paulien hogeweg. evolution of evolvability in gene regulatory networks. plos computational biology, ( ):e , .
guy karlebach and ron shamir. minimally perturbing a gene regulatory network to avoid a disease phenotype: the glioma network as a test case. bmc systems biology, ( ): , .
tom w hiscock. adapting machine-learning algorithms to design gene circuits. bmc bioinformatics, ( ): – , .
pierre geurts et al. dyngenie : dynamical genie for the inference of gene networks from time series expression data. scientific reports, ( ): – , .
stephanie kamgnia wonkap and gregory butler. benin: biologically enhanced network inference. journal of bioinformatics and computational biology, ( ): , .
ashley mae conard, nathaniel goodman, yanhui hu, norbert perrimon, ritambhara singh, charles lawrence, and erica larschan. timeor: a web-based tool to uncover temporal regulatory mechanisms from multi-omics data. biorxiv, .
niall m mangan, steven l brunton, joshua l proctor, and j nathan kutz. inferring biological networks by sparse identification of nonlinear dynamics. ieee transactions on molecular, biological and multi-scale communications, ( ): – , .
michael b elowitz and stanislas leibler. a synthetic oscillatory network of transcriptional regulators. nature, ( ): – , .
alan l hodgkin and andrew f huxley. a quantitative description of membrane current and its application to conduction and excitation in nerve. the journal of physiology, ( ): , .
pau bellot, catharina olsen, and patrick e meyer. grndata: synthetic expression data for gene regulatory network inference, . r package version . . .
carl ganz. rintrojs: a wrapper for the intro.js library. journal of open source software, ( ): , .
gregory r. warnes, ben bolker, lodewijk bonebakker, robert gentleman, wolfgang huber, andy liaw, thomas lumley, martin maechler, arni magnusson, steffen moeller, marc schwartz, and bill venables. gplots: various r programming tools for plotting data, . r package version . . .
hadley wickham. ggplot : elegant graphics for data analysis. springer, .
christopher rackauckas and qing nie. adaptive methods for stochastic differential equations via natural embeddings and rejection sampling with memory. discrete and continuous dynamical systems. series b, ( ): , a.
christopher rackauckas and qing nie. differentialequations.jl: a performant and feature-rich ecosystem for solving differential equations in julia. journal of open research software, ( ), b.
r core team. r: a language and environment for statistical computing. r foundation for statistical computing, vienna, austria, .
computing the riemannian curvature of image patch and single-cell rna sequencing data manifolds using extrinsic differential geometry

duluxan sritharan∗, shu wang∗, and sahand hormoz†
harvard graduate program in biophysics, harvard university, cambridge, ma, usa; department of data sciences, dana-farber cancer institute, boston, ma, usa; laboratory of systems pharmacology, harvard medical school, boston, ma, usa; department of systems biology, harvard medical school, boston, ma, usa; broad institute of mit and harvard, cambridge, ma, usa

abstract
most high-dimensional datasets are thought to be inherently low-dimensional, that is, datapoints are constrained to lie on a low-dimensional manifold embedded in a high-dimensional ambient space. here we study the viability of two approaches from differential geometry to estimate the riemannian curvature of these low-dimensional manifolds. the intrinsic approach relates curvature to the laplace-beltrami operator using the heat-trace expansion, and is agnostic to how a manifold is embedded in a high-dimensional space. the extrinsic approach relates the ambient coordinates of a manifold's embedding to its curvature using the second fundamental form and the gauss-codazzi equation. keeping in mind practical constraints of real-world datasets, like small sample sizes and measurement noise, we found that even for simple, low-dimensional toy manifolds, estimating curvature is feasible only when the extrinsic approach is used. to test the applicability of the extrinsic approach to real-world data, we computed the curvature of a well-studied manifold of image patches, and recapitulated its topological classification as a klein bottle.
lastly, we applied the approach to study single-cell transcriptomic sequencing (scrnaseq) datasets of blood, gastrulation, and brain cells, revealing for the first time the intrinsic curvature of scrnaseq manifolds.

∗equal contribution
†to whom correspondence should be addressed (sahand hormoz@hms.harvard.edu)

biorxiv preprint; this version posted january. the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd international license.

introduction
high-dimensional biological datasets have become prevalent in recent decades because of new technologies such as high-throughput scrnaseq [ , , ], mass cytometry [ , ] and multiplex imaging [ , ]. interpretation and visualization of such high-dimensional datasets have been challenging, however, prompting the development of tools for non-linear projection of datapoints onto 2 or 3 dimensions [ ]. these tools, such as isomap [ ], t-sne [ ] and umap [ ], appeal to the ansatz that datapoints in a high-dimensional ambient space are constrained to lie on a low-dimensional manifold. unfortunately, determining the geometry of a low-dimensional manifold from these visualizations is difficult, since many geometric properties are lost after projecting onto 2 or 3 dimensions. for example, the cartographic projections used in an atlas to flatten earth's curved surface tear apart continuous neighborhoods and non-uniformly stretch distances.

fortunately, topology and differential geometry provide a wealth of concepts to characterize a manifold's shape directly without confounding projections.
in particular, homology [ , ] categorizes a manifold according to the number of holes it contains, and the dimensionality of each hole (whereas, for example, the hole in a hollow sphere does not survive projection onto a 2-dimensional plane). similarly, metrics [ ] and geodesics [ ] determine shortest-distance paths between pairs of points on a manifold without any distortion from a projection (whereas, for example, most atlases exaggerate distances at the poles). curvature [ ] is a local manifold property that quantifies the extent to which a manifold deviates from the tangent plane at each point p. projecting a manifold onto a plane for visualization destroys this property by definition.

recent methods have emerged for estimating homology [ , ], metrics [ ] and geodesics [ ] from noisy, sampled data, with accompanying statistical guarantees [ , , ]. these methods have been applied to analyze images [ , ] and biological datasets [ , ]. however, estimating curvature has received less attention although it is fundamental to quantifying geometry.

curvature arises from two sources. on the one hand, a manifold itself can be curved, resulting in riemannian or intrinsic curvature. a sphere has intrinsic curvature because it cannot be flattened so that all geodesics on its surface correspond to straight lines on a euclidean plane (see figure a). on the other hand, the embedding of a manifold in an ambient space can give rise to extrinsic curvature, a property that is not inherent to the manifold itself. for example, a scroll has extrinsic curvature because it is formed by rolling a piece of parchment, but the parchment itself is not inherently curved (see figure b). it is important to note that both types of curvature scale inversely with the global length scale (l) associated with a manifold. it is for this reason that a marble (l ≈ cm) is visibly round, but the earth (l ≈ , km) is still mistaken by some to be flat. since intrinsic curvature is an inherent property of a manifold, while extrinsic curvature is incidental to an embedding, we will restrict our attention to the former.

[figure panels: (a) intrinsic (riemannian) curvature; (b) extrinsic curvature; (c) intrinsic differential geometry; (d) extrinsic differential geometry, z = ±√(1 − x² − y²)]

figure : riemannian curvature is an intrinsic property of a manifold while extrinsic curvature depends on the embedding. (a) (left) n = points uniformly sampled from the 2-dimensional hollow unit sphere, s², embedded in the 3-dimensional ambient space r³, colored according to the z-coordinate. s² has riemannian or intrinsic curvature because there is no projection onto 2-dimensional euclidean space that preserves geodesic (shortest-path) distances. (right) for example, a stereographic projection using the point z = ( , , ) and the plane z = introduces distortions since the geodesic distance between any pair of points in the lower hemisphere is (non-uniformly) larger than the euclidean distance in this projection. (b) (left) n = points uniformly sampled from a scroll, which is also a 2-dimensional manifold embedded in r³. the scroll has extrinsic curvature because it curls away from the tangent plane at any point. (right) however, it does not have intrinsic curvature, because it can be projected onto 2-dimensional euclidean space in a way that preserves geodesic distances, by unfurling. (c) intrinsic differential geometry treats manifolds as self-contained objects that can be described using only intrinsic coordinates, which do not depend on any embedding or ambient space. one possible set of intrinsic coordinates for s² are polar coordinates, where θ_1 and θ_2 are the azimuthal and elevation angles respectively. while this representation superficially resembles the unfurled scroll in (b), distances in this plane are non-euclidean. any line segment along θ_2 = ±π/2 has zero length, for example. (d) extrinsic differential geometry defines manifolds in the coordinate system of the ambient space, which requires a privileged vantage point off the manifold itself. both intrinsic and extrinsic differential geometry can be used to compute intrinsic curvature, whereas only extrinsic differential geometry can be used to compute extrinsic curvature (as indicated by the black arrows).

a precise description of intrinsic curvature is provided by the riemannian curvature tensor, r^l_{kij}(p). for a given basis {v}, this tensor quantifies how much a vector initially pointing in direction v_k is displaced in direction v_l after parallel transport around an infinitesimal parallelogram defined by directions v_i and v_j. the simplest intrinsic curvature descriptor is scalar curvature, s(p), which is formed by contracting r^l_{kij}(p) to a scalar quantity, as its name suggests. when s(p) is greater (less) than 0, the sum of the angles of a triangle formed by connecting three points near p by geodesics is greater (less) than π. likewise, when s(p) is greater (less) than 0, a small ball centred at p has a smaller (larger) volume than a ball of the same radius in euclidean space.
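the two statements about triangles and balls can be made quantitative with standard results from riemannian geometry. they are quoted here as a worked illustration and are not taken from the source; the sphere check assumes the usual convention in which the unit 2-sphere has s = 2.

```latex
% Volume of a small geodesic ball of radius r around p in a d-dimensional
% manifold, relative to the Euclidean ball of the same radius:
\frac{\operatorname{vol} B_r(p)}{\omega_d\, r^d}
  = 1 - \frac{s(p)}{6(d+2)}\, r^2 + O(r^4),
\qquad \omega_d = \text{volume of the unit ball in } \mathbb{R}^d .

% Check on the unit 2-sphere (d = 2, s = 2): a geodesic disk has area
2\pi(1 - \cos r) = \pi r^2 \left(1 - \frac{r^2}{12} + O(r^4)\right),
\qquad \frac{s}{6(d+2)} = \frac{2}{24} = \frac{1}{12}.

% Angle sum of a geodesic triangle T on a 2-dimensional manifold
% (local Gauss-Bonnet, with Gaussian curvature s/2):
\alpha + \beta + \gamma - \pi = \int_T \frac{s}{2}\, dA .

% Check: the octant triangle on the unit sphere has three right angles,
% so the excess is \pi/2, which equals its area 4\pi/8 times s/2 = 1.
```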
we furnish toy examples in the main text to provide stronger intuition for this quantity.

in theory, intrinsic curvature can be equivalently computed using tools from either one of the two branches of differential geometry. intrinsic differential geometry makes no recourse to an external vantage point off a manifold, just as the polygonal characters in edwin abbott's classic flatland [ ] were confined to traversing in r², and found the notion of r³ unfathomable. in this branch, a manifold is therefore represented in intrinsic coordinates, which are agnostic to any ambient space or embedding. a hollow sphere represented in polar coordinates and k-nearest neighbor (knn) graph representations of a dataset, for instance, are in this spirit (see figure c). conversely, in extrinsic differential geometry, a manifold is treated as a surface embedded in an ambient space, and is represented in ambient coordinates (see figure d). the surface of an organ is parameterized this way, for example, in a surgical robot suturing an incision.

in this work, we explore two approaches for estimating intrinsic curvature based on these twin views, keeping in mind practical limitations of real-world datasets, which may comprise a relatively small number of noisy measurements. the first approach uses the laplace-beltrami operator, which is well-studied in previous applications of differential geometry to data analysis [ , , , , ], and is theoretically appealing as an intrinsic quantity that is embedding-invariant. however, we find that this approach cannot accurately estimate even average scalar curvature on the simplest of low-dimensional toy manifolds for small sample sizes, despite the history and ubiquity of the laplace-beltrami operator in geometric data analysis. meanwhile, the second approach uses the second fundamental form and the gauss-codazzi equation [ ], identities that rely on information from the ambient space.
we find that this extrinsic approach is not only more robust to small sample sizes and noise, but permits computation of the full riemannian curvature tensor, though we focus on the scalar curvature for simplicity. using these insights, we developed a software package to compute the scalar curvature (and associated uncertainty) at each sampled point on a manifold, and applied this tool to investigate the curvature of image and scrnaseq datasets.

results

estimators of the laplace-beltrami operator yield inaccurate scalar curvatures
intrinsic differential geometry treats a d-dimensional manifold, m, as a self-contained object and is agnostic to how m may be represented in ambient coordinates due to any particular embedding (see figure c). conceptually, this is accomplished by only considering m as a collection of local, overlapping neighborhoods. the geometry of these neighborhoods is encoded using tools such as the laplace-beltrami operator, ∆m, which captures diffusion dynamics across neighborhoods. for most practical applications, we do not have direct access to m but instead to a finite number (n) of points sampled from m. for these cases, estimators of ∆m are used instead. these estimators are well-studied [ , , , , ], and the convergence rates of some have been characterized [ ].
the scalar curvature averaged across m has a well-known connection to ∆m via the heat-trace expansion [ , ], which relates the eigenvalues, λ_k, of ∆m to the geometry of m:

$$z(t) \equiv \sum_{k=1}^{\infty} e^{-\lambda_k t} = (4\pi t)^{-d/2}\left(\sum_{i=0}^{n} c_i\, t^{i/2} + O\!\left(t^{(n+1)/2}\right)\right), \qquad \lambda_k \le \lambda_{k+1} \tag{1}$$

the first few coefficients, c_i, are given by [ ]:

c = ∫ m dm, c = − √ π ∫ ∂m d(∂m), c = ∫ m s dm − ∫ ∂m j d(∂m) (2)

where ∂m is the boundary of the manifold and j is the mean curvature on ∂m. recall that s is the point-wise scalar curvature. by inspection, c_0 is the volume, c_1 is proportional to the area of the boundary, and c_2 is directly related to the average scalar curvature.

we reasoned that if the average scalar curvature cannot be accurately computed for a manifold with constant scalar curvature using these relations, then computing the point-wise scalar curvature for more complex manifolds is intractable. to investigate this, we considered the 2-dimensional hollow unit sphere, s², for which the true scalar curvature is s(p) = 2 ∀p ∈ m, and uniformly sampled n = points to mirror the typical size of current scrnaseq datasets (see figure a; methods section . . . ).

since common estimators of ∆m only yield as many eigenvalues as datapoints (n), we cannot compute the infinite set of eigenvalues needed in equation 1.
therefore, we introduced a truncated series with m eigenvalues, z_m(x), where we have substituted x = √t and divided through by the prefactor on the right-hand side of equation 1 to isolate the c_i, following the approach in [ ]:

$$z_m(x) = (4\pi)^{d/2}\, x^{d} \sum_{k=1}^{m} e^{-\lambda_k x^2} \tag{3}$$

the scalar curvature can then be approximated by fitting the truncated series, z_m(x), to a second-order polynomial, p(x), over intervals of small x:

$$z_m(x) \approx p(x), \qquad p(x) = c_0 + c_1 x + c_2 x^2 \tag{4}$$

we estimated ∆m using the n sampled points (see methods section . . ), substituted the eigenvalues of the estimate into equation 3, and numerically fit z_m(x) to p(x) (see figure s a-g; methods section . . ). we obtained the scalar curvature by inspecting the resulting c_2 coefficient, and compared the result to the true value of 2. we found that the scalar curvature was always over-estimated (s > 2) regardless of m, the number of eigenvalues used in the truncated series (see methods section . . ), or the choice of estimator for ∆m (see methods section . . ). we identified the poor convergence of the estimated eigenvalues of ∆m as the source of error (see methods section . . ) and found that at least n ≈ points are required to reduce the error to ± . , so that s ≈ . (see figure s h).

therefore, despite the prevalence of the laplace-beltrami operator in geometric data analysis, our example shows that an intrinsic approach relying on the operator is not practical for computing scalar curvatures. even for noise-free datapoints uniformly sampled from s², the sample size needed to compute average scalar curvature accurate to ± . is several orders of magnitude greater than what is typically feasible in current scrnaseq experiments. noise and non-uniform sampling would confound the issue further. most importantly, we would eventually like to compute local values of s(p) ∀p ∈ m, but this approach failed to correctly recover even average scalar curvature, which one might have expected to be feasible.
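as a sanity check on the fitting idea itself, the following sketch evaluates the rescaled heat trace using the exact spectrum of the unit 2-sphere (eigenvalues l(l+1) with multiplicity 2l+1) rather than eigenvalues estimated from sampled points. the function names are hypothetical, and the final conversion assumes the common closed-manifold convention c_2 = (1/6)∫ s dm, which may differ from the normalization used in the source. with exact eigenvalues the expected value s = 2 is recovered, which is consistent with the text attributing the failure to poorly converged eigenvalue estimates rather than to the expansion.

```python
import math


def heat_trace_sphere(t, l_max=400):
    """Z(t) = sum_l (2l + 1) * exp(-l(l + 1) t) for the unit 2-sphere."""
    return sum((2 * l + 1) * math.exp(-l * (l + 1) * t)
               for l in range(l_max + 1))


def z_rescaled(x, d=2):
    """(4 pi)^(d/2) * x^d * Z(x^2): the quantity fitted to c0 + c1 x + c2 x^2."""
    return (4 * math.pi) ** (d / 2) * x ** d * heat_trace_sphere(x * x)


def estimate_c0_c2(x_small, x_large):
    """Solve for c0 and c2 from two evaluations, taking c1 = 0.

    The sphere has no boundary, so the boundary term c1 vanishes and two
    points are enough to determine the remaining two coefficients.
    """
    za, zb = z_rescaled(x_small), z_rescaled(x_large)
    c2 = (zb - za) / (x_large ** 2 - x_small ** 2)
    c0 = za - c2 * x_small ** 2
    return c0, c2


c0, c2 = estimate_c0_c2(0.05, 0.1)

# Assumed convention: c0 = volume and c2 = (1/6) * integral of s over the manifold,
# so the average scalar curvature is 6 * c2 / c0.
avg_scalar_curvature = 6 * c2 / c0
print(c0, avg_scalar_curvature)
```

the two-point solve stands in for the numerical polynomial fit described above; a least-squares fit over an interval of small x would serve the same purpose.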
to find an alternative approach, we next considered tools from extrinsic differential geometry.

curvature can be computed accurately using the second fundamental form
in extrinsic differential geometry, a manifold is described in the coordinates of the ambient space in which it is embedded, usually r^n (see figure d). since the shape of the sphere in figure a is visually unambiguous to the eye (thanks to its extrinsic view from a vantage point off the manifold), we reasoned that an extrinsic approach would be more fruitful.

a d-dimensional manifold, m, embedded in r^n can be described at each point p in terms of a d-dimensional tangent space, t_m(p), and an (n − d)-dimensional normal space, n_m(p), as shown in figure a. given orthonormal bases for t_m(p) and n_m(p), points in the neighborhood of p can be expressed as y = [t_1, ..., t_d, n_1, ..., n_{n−d}], where t_i is y's coordinate along the i-th basis vector of t_m(p) and n_k is y's coordinate along the k-th basis vector of n_m(p). the n_k can then be locally approximated as functions of the t_i, i.e. n_k ≈ f_k(t_1, ..., t_d), as shown in figure b. the riemannian curvature of m is related to the quadratic terms in the taylor expansion of each f_k with respect to the t_i.
specifically, the second fundamental form of m, h^k_{ij}, gives the second-order coefficient relating each f_k to the quadratic term t_i t_j [ ]:

$$h^{k}_{ij}(p) = \left.\frac{\partial^2 f_k}{\partial t_i\, \partial t_j}\right|_{p} \tag{5}$$

the riemannian curvature tensor is related to the second fundamental form according to the gauss-codazzi equation [ ]:

rijkl = (h α jk h β il − h β ji h α kl) gαβ (6)

where g_{αβ} is the metric of the ambient space, which we take to be the usual euclidean metric δ_{α,β} going forward. the scalar curvature can be obtained by contracting the riemannian curvature tensor:

s = ∑ i,j rijij (7)

this suggests a conceptually simple procedure to estimate the scalar curvature of a data manifold at each point p: (i) estimate t_m(p) and n_m(p), (ii) determine h^k_{ij}(p) in local coordinates, (iii) compute s using equations 6 and 7.

we developed a computational tool that provides an implementation of this procedure. briefly, given a set of datapoints {x} ∈ r^n and manifold dimension d, a neighborhood around each point p is selected to be the n-dimensional ball centred on p of radius r encompassing n_p(r) points (see methods section . . ). for each point p, principal component analysis (pca) [ ] is performed on the n_p(r) points in its neighborhood, and the first d (last n − d) principal components (pcs) accounting for the most (least) variance are taken as an orthonormal basis for t_m(p) (n_m(p)). the normal coordinates, n_k, of the n_p(r) points in each neighborhood are fit by regression to a quadratic model in terms of the tangent coordinates, t_i, to obtain h^k_{ij}(p) with associated uncertainties (see figure b; methods section . . ).
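steps (ii) and (iii) of this procedure can be sketched in plain python for a single neighborhood. this is an illustration under stated assumptions, not the authors' tool: the pca step is skipped by using the known tangent plane at the north pole of the unit sphere, the regression is a purely quadratic least-squares fit without uncertainty estimates, and the scalar curvature uses the standard contracted gauss equation for a submanifold of euclidean space, s = Σ_k [(tr h^k)² − ‖h^k‖²], rather than the index convention of the source. all function names are hypothetical.

```python
import math


def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def fit_second_fundamental_form(points):
    """Least-squares fit of n ~ 0.5*h11*t1^2 + h12*t1*t2 + 0.5*h22*t2^2.

    points: iterable of (t1, t2, n) local tangent/normal coordinates.
    Returns the symmetric 2x2 matrix [[h11, h12], [h12, h22]].
    """
    ata = [[0.0] * 3 for _ in range(3)]
    atb = [0.0] * 3
    for t1, t2, n in points:
        row = (0.5 * t1 * t1, t1 * t2, 0.5 * t2 * t2)
        for i in range(3):
            atb[i] += row[i] * n
            for j in range(3):
                ata[i][j] += row[i] * row[j]
    h11, h12, h22 = solve_linear(ata, atb)
    return [[h11, h12], [h12, h22]]


def scalar_curvature(h_list):
    """Contracted Gauss equation: sum over normal directions of tr(h)^2 - |h|_F^2."""
    total = 0.0
    for h in h_list:
        d = len(h)
        trace = sum(h[i][i] for i in range(d))
        frob = sum(h[i][j] ** 2 for i in range(d) for j in range(d))
        total += trace ** 2 - frob
    return total


# Neighborhood of the north pole of the unit sphere on a small regular grid:
# tangent coordinates (x, y), normal coordinate z - 1.
grid = [0.025 * i for i in range(-4, 5)]
neighborhood = [(x, y, math.sqrt(1.0 - x * x - y * y) - 1.0)
                for x in grid for y in grid]

h = fit_second_fundamental_form(neighborhood)
s_est = scalar_curvature([h])
print(s_est)  # close to the true value 2 for the unit sphere
```

the same `scalar_curvature` function returns −2 for the saddle with h = [[1, 0], [0, −1]] and 0 for a cylinder with h = [[1, 0], [0, 0]], matching the dome, saddle and flat cases discussed for the torus later in the section.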
the choice of r(p) is an important one since it sets the length scale at which curvature is computed for point p (see methods section . . ). our tool allows interrogation of curvature at any length scale of interest by allowing the user to manually set r(p), a feature we use to inspect real-world datasets later in the paper. however, since the local geometry of the manifold may be non-trivial and unknown a priori, we also provide the ability to set r(p) according to statistical rather than geometric principles. specifically, our tool algorithmically chooses r at each p so that the uncertainty in h^k_{ij}(p) from regression is less than a user-specified global parameter, σ_h (see methods section . . ). since a larger number of points reduces the uncertainty in regression, a smaller σ_h requires a larger r(p) ∀p ∈ m. this strategy of setting σ_h therefore allows neighborhood sizes to dynamically vary over the manifold based on the local density of the data, which means the algorithm can gracefully handle non-uniform sampling of the manifold. the choice of σ_h will depend on the global length scale, l, of the datapoints (see methods section . . ), the average density of sampled points, and of course, the desired uncertainty in the estimates of h^k_{ij}. these uncertainties are in turn used to compute a standard error, σ_s, accompanying the scalar curvature estimate at each point, using standard error propagation formulas (see methods section . . ). we specify σ_h instead of σ_s as the global parameter for choosing neighborhood sizes, since the latter depends non-linearly on the values of h^k_{ij}(p), which makes determining r(p) more difficult.

our algorithm also computes a goodness-of-fit (gof) p-value at each p by comparing residuals from regression against a normal distribution to quantify how well the normal coordinates are fit by a quadratic function (see methods section . . ). we tested this p-value at significance level α = . , declaring fits to be poor when the residuals are significantly non-gaussian. the p-value can be disregarded if the neighborhood size is manually specified to be larger than a length scale for which a quadratic fit is appropriate. however, when σ_h is specified instead, a uniform distribution of these p-values over m indicates that the desired uncertainty results in neighborhoods that are well-approximated using quadratic regression. we adopted this heuristic when choosing σ_h for the datasets studied in this paper (see methods section . . , . . and . . ). the software is available at https://gitlab.com/hormozlab/manifoldcurvature.

figure : scalar curvature is accurately estimated using the second fundamental form and the gauss-codazzi equation. (a) a hypothetical manifold (shown in grey) from which datapoints are sampled (shown as colored dots). the manifold at any given point p (shown in red) can be decomposed into a tangent space t_m(p) (the cyan plane) and a normal space n_m(p) (the cyan line). points in the neighborhood around p (shown in green) can be expressed in terms of orthonormal bases for t_m(p) and n_m(p) (see (b) below). (b) the set of points in the neighborhood of p (shown as green dots in (a)) are represented here in local tangent (t_1, t_2) and normal (n_1) coordinates, corresponding to orthonormal bases for t_m(p) and n_m(p) respectively. coloring corresponds to magnitude in the normal direction. the normal coordinates (n_1) can be locally approximated as a quadratic function (the translucent surface) of the tangent coordinates (t_1, t_2), according to the second fundamental form, h^k_{ij}. (c) scalar curvatures computed using the extrinsic approach for n = points uniformly sampled from the 2-dimensional hollow unit sphere, s². the true value is 2 at all points on the manifold. see methods section . . . . (d) scalar curvatures (s) computed in (c) are plotted against their associated standard errors (σ_s). points enclosed by the red lines have a % confidence interval (ci), computed as s ± σ_s, containing the true value of 2. (e) as in (c) but for n = points uniformly sampled from a one-sheet hyperboloid, h², which is also a 2-dimensional manifold. due to the radial symmetry of the manifold, scalar curvature varies only along the z-direction. see methods section . . . . (f) scalar curvatures (black) computed in (e) with their associated % cis (shown in grey) plotted as a function of the z-coordinates of the datapoints. the true value is shown as a dashed red line. (g) as in (c) but for n = points uniformly sampled from a 2-dimensional ring torus, t². t² is constructed by revolving a circle parameterized by θ, oriented perpendicular to the xy-plane, through an angle φ around the z-axis. the scalar curvature only depends on the value of θ. see methods section . . . . (h) scalar curvatures computed in (g) with their associated % cis plotted as a function of the θ values of the datapoints. colors as in (f). (i) distribution of computed scalar curvatures for n = points uniformly sampled from the d-dimensional unit hypersphere, s^d, for d = , , , . as with s², these manifolds are isotropic and have constant scalar curvature. the true values are shown as dashed red lines. see methods section . . . .

we first applied our algorithm to compute scalar curvatures for the same n = points uniformly sampled from s² for which the intrinsic approach failed (see figure c; methods section . . . ). the algorithm yielded scalar curvature estimates at each point with mean error − . (computed by averaging the difference between the point-wise scalar curvature estimates and the ground truth value of 2 across all points) using neighborhoods that only contained n_p(r) ≈ points. this is already superior to the intrinsic approach, which failed to compute even average scalar curvature accurate to ± for the same sample size. the non-zero value of the mean error indicates that our estimator is biased. the values of h^k_{ij} are not biased because they are obtained using regression. even so, the components of the riemannian curvature tensor, r_{ijkl}, may still be biased because they are non-linear functions of h^k_{ij}. note that for s², this bias is the same across all datapoints (because of the isotropic nature of the manifold) and therefore results in a systematic under-estimation of scalar curvature (see figure c; methods sections . . ). we also computed % confidence intervals (ci) for our estimates as s ± σ_s, and despite the mean error, % of points still reported a % ci containing the true value of 2 (see figure d).

we next tested our algorithm on a 2-dimensional manifold with negative scalar curvature, by uniformly sampling n = points from the one-sheet hyperboloid, h² (see figure e; methods section . . . ). here, % of points reported a % ci containing the true scalar curvature (see figure f). lastly, we considered the 2-dimensional ring torus, t² (see figure g; methods section . . . ).
as a manifold with regions of positive, zero, and negative scalar curvature, t is a useful toy model for understanding more complex -dimensional manifolds and gaining intuition for higher-dimensional manifolds. in dimensions, regions of a manifold with positive scalar curvature (θ = , π in figure h) are dome-shaped, regions with zero scalar curvature (θ = π , π in figure h) are planar, and regions with negative scalar curvature (θ = π in figure h) are saddle-shaped. we applied our tool to n = points uniformly sampled from t and found that % of points reported a % ci containing the true scalar curvature (see figure h).

to test the applicability of our algorithm to higher-dimensional manifolds, we uniformly sampled n = points from unit hyperspheres, sd, and found that %, % and % of points reported a % ci containing the true scalar curvature for d = , and respectively (see figure i; methods section . . . ). the number of terms, hkij, in the second fundamental form grows as d . for larger d, a greater number of datapoints and hence larger neighborhoods are needed for regression, but these are no longer well-approximated by quadratic fits according to our gof measure. more generally, higher-dimensional manifolds require a higher density of data to estimate scalar curvatures accurately.

we additionally characterized how our algorithm performed when datapoints were non-uniformly sampled (see figure s a; methods section . . . ) or corrupted by observational noise (see figure s b; methods section . . . ), when the dimension of the ambient space was large (see figure s c; methods section . . . ), and when the specified manifold dimension differed from the ground truth (see figure s d; methods section . . . ). we found that the algorithm is robust to non-uniform sampling, large ambient dimension and small observational noise, and provides signatures indicating when the manifold dimension may be mis-specified.
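the extrinsic procedure evaluated above (local tangent/normal decomposition, quadratic regression for hkij, then the gauss equation) can be illustrated with a minimal python sketch. this is not the released manifoldcurvature implementation: the function names `sample_unit_sphere` and `scalar_curvature_at`, the fixed neighborhood radius, and the use of plain least squares without standard errors or goodness-of-fit testing are all assumptions made for illustration. in orthonormal tangent coordinates the scalar curvature is taken as the sum over normal directions of (tr h)^2 − ||h||^2 (frobenius norm), which assumes the fitted linear terms are negligible. each normal direction needs d(d+1)/2 quadratic coefficients, which is why the regression becomes more demanding as d grows.

```python
import numpy as np


def sample_unit_sphere(n, rng):
    """Uniform points on the unit sphere in R^3 (normalized Gaussian vectors)."""
    x = rng.normal(size=(n, 3))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def scalar_curvature_at(points, p, radius, d):
    """Hypothetical helper: scalar curvature at p from a local quadratic fit."""
    nbrs = points[np.linalg.norm(points - p, axis=1) <= radius]
    n_params = 1 + d + d * (d + 1) // 2
    if len(nbrs) < n_params:
        raise ValueError("not enough neighbors for quadratic regression")

    # Local PCA: the d leading directions approximate the tangent space,
    # the remaining directions approximate the normal space.
    centered = nbrs - nbrs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    tangent, normal = vt[:d], vt[d:]

    # Tangent and normal coordinates of the neighbors relative to p.
    t = (nbrs - p) @ tangent.T
    nrm = (nbrs - p) @ normal.T

    # Design matrix: intercept, linear terms, and products t_i * t_j (i <= j).
    iu = np.triu_indices(d)
    quad = t[:, iu[0]] * t[:, iu[1]]
    design = np.hstack([np.ones((len(nbrs), 1)), t, quad])
    coef, *_ = np.linalg.lstsq(design, nrm, rcond=None)
    quad_coef = coef[1 + d:]

    s = 0.0
    for k in range(nrm.shape[1]):
        h = np.zeros((d, d))
        h[iu] = quad_coef[:, k]
        h = h + h.T  # diagonal: 2 * coef = h_ii; off-diagonal: coef = h_ij
        # Gauss equation in orthonormal tangent coordinates.
        s += np.trace(h) ** 2 - np.sum(h ** 2)
    return s


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = sample_unit_sphere(4000, rng)
    estimates = [scalar_curvature_at(pts, p, 0.3, 2) for p in pts[:50]]
    print("mean estimated scalar curvature:", np.mean(estimates))
```

on noise-free points from the unit 2-sphere, whose scalar curvature is 2, the estimates from this sketch should cluster near that value, with a small bias from fitting a quadratic over a finite neighborhood.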
however, when the noise scale is large, the resulting manifold is no longer trivially related to the noise-free manifold, consistent with existing literature [ , , , ], so that scalar curvature cannot be accurately computed. lastly, we note that since the full riemannian curvature tensor is computed as an intermediate step in our algorithm, more intricate geometric features in the data can also be analyzed using our tool, though we defer such investigation to future studies.

taken together, these examples demonstrate the utility of the algorithm in recovering curvature with specified uncertainties for manifolds with positive and/or negative scalar curvature. next, we tested our algorithm on real-world data.

. curvature of image patch manifold is consistent with a noisy klein bottle

pixel intensity values in images of natural scenes are not independently or uniformly distributed. understanding the statistics of such images is important for designing compression algorithms [ ] and for addressing challenges in the field of computer vision such as segmentation [ ]. lee et al. discovered that x -pixel patches extracted from greyscale images of natural scenes, whose pixels have high contrast (i.e. the differences between the intensity values of adjacent pixels in a patch are large), are not uniformly distributed in r , but are instead concentrated on a low-dimensional manifold [ ]. this is because high-contrast regions in a natural scene usually correspond to the edges of objects in the scene.
high-contrast image patches consequently tend to be comprised of gradients and not simply random speckle. subsequent work using topological data analysis revealed that after appropriate normalization (which takes image patches from r to s ∈ r , so that the global length scale is l = ; see methods section . . ), dense regions of high-contrast image patches have the same homology as a -dimensional manifold called a klein bottle [ ].

a klein bottle, k , is a canonical manifold typically introduced in the context of orientability, where it is often visualized in r (as shown in figure a) to highlight that it is non-orientable. from a topological perspective, k is a manifold parameterized by θ,φ ∈ [ , π] as shown in figure b in which vertical edges are defined to be θ = and θ = π, and horizontal edges are defined to be φ = and φ = π. to make a closed surface, the vertical (horizontal) edges are glued together according to the red (blue) arrows in figure b. k is therefore π-periodic in φ, since a point corresponding to θ on the bottom horizontal edge (φ = ) is the same as the point corresponding to θ on the top horizontal edge (φ = π). similarly, a point corresponding to φ on the left vertical edge (θ = ) is the same as the point corresponding to π − φ on the right vertical edge (θ = π). in short, points on k obey the similarity relation (θ,φ) ∼ (θ + π, π − φ). k captures the dominant features in high-contrast image patches because θ can be treated as a parameter controlling rotation and φ as a parameter controlling the relative contribution of linear vs. quadratic gradients (see figure b).

an embedding of k into r with an analytical form, k , was proposed by carlsson et al. in [ ] to model image patches (see equation in methods section . . ). this embedding takes points from (θ,φ) into image patches in r as shown in figure b. for example, θ = (θ = π ) corresponds to patches with vertical (horizontal) stripes and φ = π , π (φ = ,π) corresponds to patches with linear (quadratic) gradients. as θ increases, stripes in the image patches are rotated clockwise. as φ increases, image patches oscillate between having quadratic and linear gradients. importantly, the image patches constructed by this embedding obey the same similarity relation (θ,φ) ∼ (θ + π, π − φ) topologically required of a klein bottle.

whereas carlsson et al. studied the global topology of image patches using this embedding, here we study their local geometry instead. first, we analytically calculated the scalar curvature of k as a function of (θ,φ) as shown in figure c (see methods section . ). next, we used our algorithm to compute the scalar curvature on a data manifold of n ≈ . × high-contrast x -pixel image patches randomly sampled from the same van hateren dataset used to propose k (see methods section . . ). we picked σh so that the distribution of gof p-values was flat, and fixed this value for all subsequent simulations (see methods section . . ). to visualize the results, we associated each image patch with its closest point on k (see methods section . . ), and plotted the scalar curvatures on the resulting (θ ,φ ) coordinates (see figure d). most image patches map to φ = π , π or θ = , π because linear gradients (of any orientation) and quadratic gradients that are vertically or horizontally oriented are the dominant features in the data as previously reported [ , ].
the scalar curvatures computed for the image patches did not match the analytical scalar curvature of k (cf. figures c and d). to identify the cause of this discrepancy, we first validated our algorithm by computing scalar curvatures on the set of n ≈ . × (θ ,φ ) points on k associated with the image patches (see figure e); we found close agreement with the analytical calculation ( % of points reported a % ci containing the true scalar curvature). next, observing that the neighborhood sizes used for computing the scalar curvature of image patches were larger than those used for computing the scalar curvature of the associated (θ ,φ ) points (cf. figures s a and s b), we recomputed the scalar curvatures of these (θ ,φ ) points, but now with the same neighborhood sizes used for the image patches. the results agreed with the analytical calculation, but still did not match the scalar curvatures computed for the image patches (see figure s c). having ruled out these two possibilities, we hypothesized that the discrepancy was caused by fluctuations in the positions of the image patches with respect to the (θ ,φ ) points on the k manifold (real image patches are noisy and the klein bottle embedding is only an idealization). we found that adding isotropic gaussian noise of increasing magnitude in r to the set of (θ ,φ ) points on k indeed resulted in scalar curvatures that resemble the data (see figure f; methods section . . ).

figure : scalar curvature computed for image patches is consistent with that of a klein bottle with added isotropic gaussian noise. (a) the klein bottle, k , is a -dimensional manifold shown here in r . (b) k is an analytical embedding given by carlsson et al. in [ ] relating parameter values θ,φ ∈ [ , π] to x -pixel patches of greyscale images (see equation in methods section . . ). θ controls the rotation of stripes in the image patches and φ determines the relative contribution of linear vs. quadratic gradients. importantly, as shown in the figure, this embedding has boundary conditions consistent with the topology of a klein bottle (depicted by the blue/red arrows). in particular, the embedding produces image patches that obey the similarity relation (θ,φ) ∼ (θ + π, π − φ). adapted from figure of [ ]. (c) the analytical scalar curvature of k (derived as described in methods section . ). (d) scalar curvatures computed for n ≈ . × high-contrast x -pixel patches sampled from the greyscale images in the van hateren dataset [ ] are plotted here as a function of (θ ,φ ), the parameter values of the closest point on k associated with each image patch (see methods section . . ). (e) scalar curvatures computed for the set of n ≈ . × closest points on k associated with the image patches. note the close correspondence with figure c, indicating that our algorithm correctly recapitulates the analytical scalar curvature. (f) as in (e), but after adding isotropic gaussian noise in r to the set of closest points on k (see methods section . . ). left to right corresponds to increasing levels of noise, σ = . , . , . . (g) the distribution of euclidean distances in r between each image patch and its closest point on k is shown in blue. the distribution of distances to k after adding gaussian noise to these closest points on k is also shown. (h) k is the analytical embedding from θ,φ ∈ [ , π] to r that minimizes the sum of euclidean distances from the image patches to the closest point on the embedding (see methods section . . ). each of the n ≈ . × image patches was associated with its closest point on k , given by parameter values (θ ,φ ) (see methods section . . ). scalar curvatures computed on this set of n ≈ . × points on k are shown. (i) the same scalar curvatures computed for the image patches and visualized on (θ ,φ ) coordinates in (d) are shown here plotted on (θ ,φ ) coordinates. (j) scalar curvatures computed for a densely sampled manifold comprised of the full set of n ≈ . × high-contrast x -pixel image patches in the van hateren image dataset (see methods section . . ), visualized on (θ ,φ ) coordinates.

the best agreement between the scalar curvatures of the image patches and the noisy (θ ,φ ) points was achieved when the magnitude of noise was σ = . . notably, in this case, the median euclidean distance of the noisy (θ ,φ ) points to k was . , which is comparable to . , the median euclidean distance of the image patches to k (see figure g). furthermore, the neighborhood sizes chosen by our algorithm when σ = . (see figure s a) matched those chosen for the image patches (see figure s b).

to find an embedding of the klein bottle that might better explain the scalar curvature of the image patches without needing to add noise, we incorporated higher-order terms into k (see methods section . . ). the coefficients for the higher-order terms were determined by fitting the data, resulting in a new embedding, which we refer to as k (see methods section . . ). the median euclidean distance of the image patches to k was . versus . to k .
as was done for k , we associated each image patch with its closest point (θ ,φ ) on k , and used our algorithm to compute the scalar curvature of these (θ ,φ ) points (see figure h). despite the reduction in the median euclidean distance of image patches to the embedding, the scalar curvature of k was even less similar to that of the image patches (visualized in figure i on these new (θ ,φ ) coordinates for k ) than was the scalar curvature of k ; the range of scalar curvature values for k was much larger than for either the image patches or k , and the scalar curvature fluctuated on smaller length scales.

lastly, we reasoned that there might be fine-scale scalar curvature fluctuations in the image patches that are masked by the larger neighborhood sizes used to compute scalar curvature for the image patches (see figure s b) relative to k (see figure s d). to decrease the neighborhood sizes chosen by the algorithm for the same σh, we augmented the image patch dataset using the full set of n ≈ . × datapoints from the van hateren dataset (see methods section . . ). this resulted in neighborhood sizes comparable to those determined for k (cf. figures s d and s e), but failed to recapitulate the fine-scale scalar curvature fluctuations observed in k (see figure j). as a sanity check, we confirmed that the scalar curvature of the augmented image patch dataset matched that of the original image patch dataset, when computed using the same neighborhood sizes as the latter (see figure s f). therefore, including higher-order terms in the embedding does not yield scalar curvatures that better agree with the data. taken together, our analysis of curvature suggests that the image patch dataset can be best modelled by adding noise to the simplest embedding, k .
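the noise comparison used in this section (perturb points on an idealized embedding with isotropic gaussian noise, then compare the median euclidean distance back to the embedding) can be sketched as follows. this is only an illustration under stated assumptions: a densely sampled unit circle stands in for the klein bottle embedding, whose analytical form is not reproduced here, the closest point is found by brute-force search over that dense sample, and the helper names `add_isotropic_noise` and `median_distance_to_reference` are hypothetical.

```python
import numpy as np


def add_isotropic_noise(points, sigma, rng):
    """Add independent Gaussian noise of scale sigma to every coordinate."""
    return points + rng.normal(scale=sigma, size=points.shape)


def median_distance_to_reference(points, reference):
    """Median Euclidean distance from each point to its closest reference point."""
    diff = points[:, None, :] - reference[None, :, :]
    closest = np.sqrt((diff ** 2).sum(axis=2).min(axis=1))
    return float(np.median(closest))


def circle(n):
    """Stand-in manifold (assumption): n evenly spaced points on the unit circle."""
    angle = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return np.column_stack([np.cos(angle), np.sin(angle)])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = circle(2000)   # dense sample of the idealized embedding
    clean = circle(200)        # points lying exactly on the embedding
    for sigma in (0.05, 0.1, 0.2):
        noisy = add_isotropic_noise(clean, sigma, rng)
        print(sigma, median_distance_to_reference(noisy, reference))
```

matching the median distance of the noisy synthetic points to that of the real data is one simple way to choose a plausible noise scale, in the spirit of the comparison reported in figure g.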
having applied our algorithm on real-world manifold-valued data that is well-modelled by an analytical embedding, we next turned our attention to scrnaseq datasets, which are generally regarded as low-dimensional manifolds and have no known analytical form.

. scrnaseq datasets have non-trivial intrinsic curvature

in scrnaseq datasets, each datapoint corresponds to a cell, and each coordinate to the abundance of a different gene. here we consider the data manifold after basic preprocessing and linear dimensionality reduction using pca (see methods section . . ). since many common analyses in the field such as clustering, visualization, and inference of cell differentiation trajectories are performed in this reduced space, it is natural to compute curvature in this space as well. we set the ambient dimension, n, to be the number of pcs needed to explain % of the variance. the manifold dimension, d, for scrnaseq datasets is not well-defined and needs to be chosen heuristically. as a simple heuristic, we specified d as the number of pcs needed to explain % of the variance in the ambient space i.e. % of the original variance (we show later that our computations are relatively insensitive to the choice of d).

we considered three datasets. the first consists of n ≈ peripheral blood mononuclear cells (pbmcs) collected from a healthy human donor [ ]. the second is a gastrulation dataset comprised of n ≈ . × cells pooled from embryonic mice sacrificed at -hour intervals from embryonic day . to . [ ]. the final dataset is a benchmark in the field consisting of n ≈ .
× brain cells pooled from embryonic mice sacrificed at embryonic day [ ]. refer to figures s a, s a and s a for cell type annotations for the three datasets. the pbmc dataset is characteristic of the sample size of current scrnaseq data. the other two are larger than most scrnaseq datasets, and we included these to verify whether geometric features seen in the first dataset can be reproduced for more densely sampled manifolds. for the pbmc, gastrulation and brain datasets, the ambient (manifold) dimensions were determined to be , and ( , and ) respectively, according to the aforementioned heuristic (see methods section . . ). for all three datasets, the global length scale happened to be l ≈ (see methods sections . . ). as before, we picked σh for each dataset according to the distribution of gof p-values (see figures s b, s b and s b; methods section . . ).

we visualized the computed scalar curvatures on standard plots employed in the field (umap and t-sne; shown in figure a,d,g) and observed non-trivial scalar curvature for all three datasets. we found statistically significant correlations between the scalar curvature reported by each point and its knn for k ≤ (ρpearson = . , . and . for the pbmc, gastrulation and brain datasets respectively at k = , p < − ; see figures s c, s c and s c), indicating that our algorithm yields scalar curvatures that vary continuously over the data manifolds. by plotting scalar curvatures against their standard errors, σs, we verified that regions with non-zero scalar curvature are statistically significant (see figure b,e,h). as a consistency check, we confirmed that the percentage of points with % cis containing the scalar curvatures reported by their respective knns (i) decayed with increasing k for k ≤ , and (ii) was significantly larger than expected by chance ( %, % and % for the pbmc, gastrulation and brain datasets respectively at k = , p < . ; see figures s d, s d and s d; methods section . . . ).

to rule out the possibility that localization of non-zero scalar curvature in certain regions of the umap/t-sne plots is an artifact caused by other features of the data that are also localized, we considered several factors. first, we plotted the gof p-value at each point on umap/t-sne coordinates and noted that poor gofs were not localized on the data manifolds, let alone to regions of non-zero scalar curvature (see figures s b, s b and s b). therefore, the computed scalar curvatures are not due to poor fits. next, we plotted the neighborhood size, r(p), used for fitting and observed that in some regions, non-zero scalar curvatures seemed to correspond to small r (see figures s e, s e and s e). since σh is fixed, these regions necessarily have a larger number of neighbors np(r) and are hence more dense (see figures s f, s f and s f). to rule out the possibility that the non-zero scalar curvatures were an artifact of smaller neighborhood size, we recomputed the scalar curvature at three fixed neighborhood sizes (see figure c,f,i), corresponding to the , , and %-ile values of r(p) which arose from setting σh (see figures s e, s e and s e). in general, the scalar curvatures decreased in magnitude when neighborhood sizes increased. however, regions which had statistically significant non-zero scalar curvatures (zero falls outside of the % ci) using variable neighborhood sizes also had non-zero scalar curvatures for all three fixed neighborhood sizes.
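the significance criterion used here (a point has statistically significant non-zero scalar curvature when zero falls outside its ci) and the coverage percentages quoted throughout can be expressed in a few lines. this is a minimal sketch: the function names are hypothetical, and the multiplier `z` on the standard error is an assumption, since the exact ci construction is not recoverable from this text; `z = 1.96` corresponds to a 95% interval for a normally distributed estimate.

```python
import numpy as np


def ci_contains(s, sigma, target, z=1.96):
    """True where the interval s +/- z * sigma contains the target value."""
    s = np.asarray(s, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return np.abs(s - target) <= z * sigma


def coverage(s, sigma, target, z=1.96):
    """Fraction of points whose interval contains the target value(s)."""
    return float(np.mean(ci_contains(s, sigma, target, z)))


def significantly_nonzero(s, sigma, z=1.96):
    """True where zero falls outside the interval s +/- z * sigma."""
    return ~ci_contains(s, sigma, 0.0, z)


if __name__ == "__main__":
    s = np.array([0.1, 2.0, -3.0])       # illustrative curvature estimates
    sigma = np.array([0.2, 0.5, 1.0])    # illustrative standard errors
    print(significantly_nonzero(s, sigma))
    print(coverage(s, sigma, target=0.0))
```

because `target` may be an array, the same `coverage` helper applies to comparisons against values reported by neighbors or by the original dataset, not only against a known ground truth.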
additionally, statistically significant non-zero scalar curvature also emerged on other parts of the manifolds when using small fixed neighborhood sizes. these regions are therefore curved at small length scales but do not have a sufficient density of points to resolve curvature to the desired uncertainty σh (see methods section . . ). this is analogous to the image patch dataset for which we could resolve scalar curvatures of larger magnitude at a smaller length scale when the dataset was augmented with enough points to attain smaller neighborhood sizes for a fixed σh. we also checked how computed scalar curvatures changed with density in a toy model with zero scalar curvature. importantly, we did not observe the artifactual appearance of statistically significant non-zero scalar curvature, for either variable neighborhood sizes chosen by the algorithm to achieve σh, or for fixed neighborhood sizes (see figure s a; methods section . . . ). taken together, although higher density allows us to resolve statistically significant non-zero scalar curvatures in scrnaseq data, these computed scalar curvatures are not an artifact of the smaller neighborhood sizes used in regions with higher density.

to ensure that the computed scalar curvatures were not sensitively dependent on the heuristically chosen manifold dimension, d, we also recomputed scalar curvatures for d − and d + and observed similar qualitative results (see figures s g, s g and s g). lastly, we verified that the computed scalar curvatures were not correlated with the number of transcripts in each cell (see figures s h, s h and s h).

figure : scrnaseq datasets have localized regions of non-zero scalar curvature. (a) scalar curvatures were computed for a scrnaseq dataset with n ≈ peripheral blood mononuclear cells (pbmcs) collected from a healthy human donor. the ambient (n) and manifold (d) dimensions were specified to be and respectively and variable neighborhood sizes were chosen by setting σh (see methods section . . ). the scalar curvatures are shown here overlaid onto umap coordinates, after smoothing the values over k = nearest neighbors in the ambient space. (b) scatter plot of (unsmoothed) scalar curvatures, s, and associated standard errors, σs, for each datapoint in the pbmc dataset. points enclosed by the red lines reported a % ci (s ± σs) including . (c) as in (a) but with scalar curvatures computed using a fixed neighborhood size, r, for all datapoints. the value of r was set to be the , , and %-ile values (left to right) of the neighborhood sizes used in (a) (see figure s e). points for which a neighborhood of size r does not include enough neighbors for regression are not shown. (d-f) as in (a-c) for a mouse gastrulation dataset with n ≈ . × , d = and n = . (g-i) as in (a-c) for a mouse brain dataset with n ≈ . × , d = and n = , plotted on t-sne coordinates.

to confirm the robustness of our results to sampling, we randomly discarded f% of points in the ambient space determined for each dataset, and recomputed scalar curvatures using the same values of n, d and r(p) used for the original dataset.
we found that a statistically significant percentage of downsampled points ( % for the pbmc dataset with f = , % for the gastrulation dataset with f = , and % for the brain dataset with f = ; p < . ) had a % ci containing the scalar curvature reported by the same point for the original dataset (see figures s i, s i and s i; methods section . . . ). this suggests that if the datasets were more highly sampled, and scalar curvatures were recomputed using the same neighborhood sizes, they would be reliably contained within the currently reported % cis. unlike the two other datasets, the brain dataset could not be downsampled to f = while still having at least % of points report % cis containing the originally reported scalar curvatures, despite having the most points. this might be because the brain dataset has a larger manifold dimension according to our heuristic and therefore requires a greater number of terms, hkij, to be estimated in the second fundamental form.

for the pbmc dataset, we additionally downsampled the single-cell count matrix by discarding f% of transcripts at random and preprocessing the same way. we recomputed scalar curvatures for this downsampled dataset with the same n, d and r(p) values used for the original dataset. here too, we found that when f = (f = ), % ( %) of the downsampled points had a % ci containing the originally reported scalar curvature (p < . , see figure s j; methods section . . . ). therefore, the computed scalar curvature is robust to changes in capture efficiency and sequencing depth. taken together, our computational analysis reveals non-trivial intrinsic geometry in scrnaseq data.

discussion

in this study, we explored two approaches to computing the curvature of data manifolds using tools from twin branches of differential geometry.
despite the prevalence of the laplace-beltrami operator in geometric data analysis [ , , , , ], an intrinsic approach to computing scalar curvature relying on this operator’s eigenvalues was determined to be infeasible for sample sizes of n ≈ typical of current scrnaseq datasets. although methods such as magic [ ] and diffusion pseudotime [ ] apply the laplace-beltrami operator to smooth scrnaseq data and infer cell differentiation trajectories respectively, using information intrinsic to the manifold, our results suggest that the embedding of the manifold in the ambient space provides valuable information necessary for estimating the intrinsic curvature. this observation is perhaps implicit in recent tools for estimating the laplace-beltrami operator, which first use moving local least-squares to approximate a surface, thereby incorporating information from the ambient space [ ].

certainly, we found that an extrinsic approach in which the embedding is retained, and curvature is determined by local quadratic fitting of datapoints in ambient coordinates, is feasible given the sample size and degree of noise in real-world datasets. to obtain the scalar curvature of data manifolds, our algorithm first computes the full riemannian curvature tensor. for other applications, this tensor can be used to compute other geometric quantities, such as ricci curvature, or may itself be of interest. more generally, we focused on intrinsic curvature because we were interested in geometric properties of the manifolds independent of their embeddings.
however, the second fundamental form used in our approach to compute the intrinsic curvature can be used to obtain all the information about the extrinsic curvature as well. indeed, hkij(p) exactly quantifies the extent to which the manifold deviates in the kth normal direction from the ij-tangent plane at point p. a key limitation of our algorithm is that the manifold dimension must be specified by the user. we also assumed that the manifold dimension is the same at every point in a dataset. extending the algorithm to determine the manifold dimension from the data itself, potentially in a position-dependent manner, may prove useful. in addition, there is no inherently correct length scale over which curvature should be computed for a data manifold. our algorithm chooses a length scale that varies from one part of the data manifold to another according to the density of points, and is tuned to achieve a user-specified level of uncertainty in the computed curvature. for some applications, it might be more sensible to fix a desired length scale for computing the curvature. as a demonstration of our algorithm, we computed the scalar curvature of image patches, and found that it was consistent with that of a klein bottle. this observation further validates the claim by carlsson et al. who showed that image patches have the topology of a klein bottle [ ]. unlike the klein bottle parameterization of image patches however, no definitive analytical form has been established for scrnaseq datasets. recent work has suggested the use of hyperbolic geometry to model branching cell differentiation trajectories [ ] and specific manifolds have been proposed to model reaction networks [ ], which may be applicable to scrnaseq data. these proposed manifolds can be validated or improved using knowledge of the intrinsic geometry of scrnaseq datasets. 
finally, incorporating information about curvature may provide a more principled approach for developing dimensionality reduction and visualization tools.

methods

differential geometry of theoretical manifolds

here we briefly discuss how to compute the scalar curvature of, and sample from, theoretical manifolds given a parameterization. for a d-dimensional manifold, m, with intrinsic coordinates $\{x_1, \dots, x_d\}$ and embedding in $\mathbb{R}^n$ given by $f(x_1, \dots, x_d)$, the metric is:

$$g_{ij} = \frac{\partial f^{T}}{\partial x_i} \frac{\partial f}{\partial x_j}$$

the scalar curvature of m can then be derived analytically in intrinsic coordinates in terms of the metric as

$$s = g^{ij}\left(\Gamma^{k}_{ij,k} - \Gamma^{k}_{ik,j} + \Gamma^{l}_{ij}\Gamma^{k}_{kl} - \Gamma^{l}_{ik}\Gamma^{k}_{jl}\right)$$

where $g^{ij}$ is the inverse of the metric, repeated indices are summed, and the $\Gamma^{i}_{jk}$s are christoffel symbols given by

$$\Gamma^{i}_{jk} = \frac{1}{2}\, g^{il}\left(\frac{\partial g_{lj}}{\partial x_k} + \frac{\partial g_{lk}}{\partial x_j} - \frac{\partial g_{jk}}{\partial x_l}\right)$$

and $\Gamma^{i}_{jk,l} = \partial \Gamma^{i}_{jk} / \partial x_l$.

to draw points from m with $a_i \le x_i \le b_i$ so that the embedded manifold is uniformly sampled in $\mathbb{R}^n$, we use rejection sampling. for paired random variables x ∼ uniform(a, b) and y ∼ uniform(0, max √det g), we retain x as a sample point if y ≤ √det g evaluated at x, so that points are accepted with probability proportional to the local volume element.

details of intrinsic approach to curvature estimation

here we explain how we used equations - on the simplest of toy manifolds, the noise-free -dimensional hollow unit sphere, s , to obtain an estimate of the average scalar curvature. the true scalar curvature is s(p) = ∀p ∈ m. for the remainder of this section, we adopt the convention that symbols with overbars are estimates of the corresponding unaccented quantities.
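the rejection-sampling step described for theoretical manifolds can be sketched as follows. this is a minimal illustration rather than the authors' code: it assumes the unit 2-sphere with a longitude/latitude parameterization (our choice of example), for which the metric is diag(cos² φ, 1) and √det g = |cos φ|. the function names are hypothetical.

```python
import math
import random


def sqrt_det_g(theta, phi):
    """Volume element for the assumed longitude/latitude chart of the unit sphere.

    The embedding f = (cos(theta)cos(phi), sin(theta)cos(phi), sin(phi)) has
    metric g = diag(cos(phi)**2, 1), hence sqrt(det g) = |cos(phi)|.
    """
    return abs(math.cos(phi))


def embed(theta, phi):
    """Map intrinsic coordinates to ambient coordinates in R^3."""
    return (math.cos(theta) * math.cos(phi),
            math.sin(theta) * math.cos(phi),
            math.sin(phi))


def sample_sphere(n_points, rng):
    """Rejection sampling: keep x when y <= sqrt(det g) evaluated at x."""
    max_density = 1.0  # maximum of |cos(phi)| on [-pi/2, pi/2]
    points = []
    while len(points) < n_points:
        theta = rng.uniform(0.0, 2.0 * math.pi)
        phi = rng.uniform(-math.pi / 2.0, math.pi / 2.0)
        y = rng.uniform(0.0, max_density)
        if y <= sqrt_det_g(theta, phi):
            points.append(embed(theta, phi))
    return points


points = sample_sphere(20000, random.Random(0))
# For a uniform distribution on the unit sphere, the mean of z**2 is 1/3.
mean_z_sq = sum(p[2] ** 2 for p in points) / len(points)
```

sampling φ uniformly without the rejection step would over-represent the poles; the mean of z² would then be close to 1/2 instead of 1/3.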
approach for s

our approach mirrors the treatment in [ ], in which heat-traces are fit over various intervals [x ,x ] with x ≥ , to quadratic polynomials p (x) = c + c x + c x to estimate the geometric quantities in equation . here, we constrained the form of p (x) for fitting by assuming that (i) the manifold is boundary-less (so that c = c = and the second boundary term for c vanishes), (ii) the volume is known (so that c = c = π), and (iii) the scalar curvature is constant (so that c = π s), yielding p (x) = π + π sx . these are strong assumptions that will not hold for an arbitrary manifold, which already precludes this as a generic procedure. nonetheless, we proceeded for s to see if even with this privileged information, the scalar curvature could be estimated accurately. we declared an estimate to be accurate on the interval [x ,x ] if s has error within ± , i.e. s ∈ [ . , . ]. all quadratic fits were performed in matlab using the lsqnonlin function (‘steptolerance’= e- , ‘functiontolerance’= e- ).

first, we evaluated zm(x) using analytical eigenvalues for s given by λ(ℓ− ) + , ...,λℓ = ℓ(ℓ− ), ℓ > , and let dm be the collection of all intervals for which fits to p (x) yielded accurate s. dm corresponds to intervals where equation is accurate to our desired tolerance when the eigenvalues are known exactly. next, we uniformly sampled n = points from s (see figure a; methods section . . . ), estimated ∆m using the random walk graph laplacian with gaussian kernel (see equation in methods section . . ), and computed empirical eigenvalues, λ̄k, from ∆̄m.
we selected n = as it is the same order of magnitude as the sample size of current scrnaseq experiments, and is sufficient to identify m as s by eye (see figure a). we verified whether estimates z̄m(x), obtained by evaluating equation using λ̄k, when fit as described above to p (x) over intervals in dm, recapitulated the accurate s obtained using zm(x). we restricted our attention to dm for calculations using empirical eigenvalues, since it is only over intervals in dm that it is even theoretically possible to compute scalar curvature to the desired accuracy. below, we report our findings for different m.

infinite series

we first applied this approach to the ideal case in equation , where infinite analytical eigenvalues are available. we computed z∞(x) (shown as a black line in figure s a) and obtained s by fitting p (x) over various intervals as described above. figure s b shows that d∞ is comprised of intervals with ≤ x < x ≲ . . for x ≳ . , errors from neglecting higher-order terms o(x ) in equation dominate. since zm(x) converges from ∞, x ≲ . necessarily holds for any interval in dm ∀m.

truncated series

we next considered zm(x) for m < n, since in practice, we will only have access to as many eigenvalues as datapoints (n). we computed z (x) using equation (shown as a solid blue line in figure s a), and obtained s by fitting p (x) (see figure s c). intervals in d roughly satisfy . ≲ x < x ≲ . . however, we found that z̄ (x) (shown as a dashed blue line in figure s a) deviated markedly from z (x) in the rough interval [ . , . ], which has significant overlap with d . consequently, when we fit p (x) to z̄ (x) on d , the resulting s was not accurate for any interval in d (see figure s d).

note that this inaccuracy was not a consequence of not using all n available eigenvalues. while picking m = n would reduce the lower bound on valid intervals in dm (since zm(x) converges from ∞), it is exactly for small x that s obtained from z̄ (x) is already over-estimated as shown in figure s d. since zm (x) > zm (x) ∀x,m > m , using a truncated series with a larger m would simply exaggerate the difference between zm(x) and z̄m(x) for small x and cause scalar curvatures estimated using the latter to be further over-estimated.

following this line of thought, we reasoned that picking fewer eigenvalues may ameliorate the issue. we selected m = (instead of a round number like m = so that all eigenvalues of a given multiplicity are included) and repeated this analysis for the same set of n = points. z (x) is shown as a solid red line in figure s a and the intervals over which fits to p (x) yield accurate s, d , are shown in figure s e. while z̄ (x) (shown as a dashed red line in figure s a) has a much smaller deviation from z (x) than z̄ (x) did from z (x), once again no estimate of s obtained from fits of z̄ (x) to p (x) on d was sufficiently accurate (see figure s f).

eigenvalue convergence

we refrained from reducing m further to improve agreement between z̄m(x) and zm(x) after noting that the size of the intervals in dm shrinks with m. though we may have a better chance of computing accurate s with z̄m(x) on dm for smaller m, recall that in practice we will not have dm available to us since the analytical eigenvalues will be unknown. therefore, we simply shift the problem to one of choosing an interval that will yield an accurate s, from a shrinking pool of intervals that could even theoretically yield an accurate estimate.
instead, we compared the estimated λ̄ks with their true values, λk, and observed that the former consistently under-estimate the latter (see figure s h). furthermore, we found that the fractional error grows with k, exceeding % for k = , ..., . therefore, z̄ (x) will only be accurate if n is large enough to limit the fractional error.

to determine the required tolerance on the fractional error, we constructed a truncated series analogous to equation , but with eigenvalues interpolated between the analytical eigenvalues and the empirical eigenvalues determined for n = , according to a parameter f:

$$\tilde{z}_m(x; f) = (4\pi)^{d/2}\, x^{d} \sum_{k=1}^{m} e^{-\tilde{\lambda}_k(f)\, x^{2}}, \qquad \tilde{\lambda}_k(f) = \lambda_k + f\left(\bar{\lambda}_k - \lambda_k\right)$$

f signifies that the fractional error of the interpolated eigenvalues is reduced by 1 − f relative to the empirical eigenvalues determined for n = . we found that f ≤ . is needed so that z̃ (x; f) (shown as a green line in figure s a) fit to p (x) yields accurate s on half the intervals in d (see figure s g).

given that the fractional error in estimating λ , ...,λ by λ̄ , ...,λ̄ is % when n = , how large does n have to be to reduce this fractional error to % × . ≈ %? a convergence rate for the fractional error is given in theorem of [ ]. for -dimensional manifolds:

∣λ̄k − λk∣ / λk = O( (log n) n )

assuming that the big-o bound is sharp at n = for k = , ..., (i.e. the prefactor is given by . log( ) ≈ . ), we extrapolated that at least n = datapoints are needed to reduce the fractional error to % (see figure s h).
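the interpolation between analytical and empirical eigenvalues can be illustrated with a short sketch. everything specific here is an assumption made for illustration: we take the unit 2-sphere, whose laplace-beltrami eigenvalues are ℓ(ℓ+1) with multiplicity 2ℓ+1 for ℓ ≥ 0, we normalize the truncated heat trace as (4π)^{d/2} x^d Σ exp(−λ x²), and we mimic under-estimated empirical eigenvalues by scaling the analytical ones by a hypothetical factor of 0.9. the comparison polynomial 4π + (2π/3)·s·x² is the standard small-x expansion for a closed surface of constant scalar curvature s = 2.

```python
import math


def sphere_eigenvalues(l_max):
    """Analytical eigenvalues l(l+1) of the unit 2-sphere, with multiplicity 2l+1."""
    eigs = []
    for l in range(l_max + 1):
        eigs.extend([float(l * (l + 1))] * (2 * l + 1))
    return eigs


def interpolate(analytical, empirical, f):
    """lambda_tilde_k(f) = lambda_k + f * (lambda_bar_k - lambda_k)."""
    return [lam + f * (lam_bar - lam) for lam, lam_bar in zip(analytical, empirical)]


def heat_trace(eigs, x, d=2):
    """Truncated, normalized heat trace (4 pi)^(d/2) x^d sum_k exp(-lambda_k x^2)."""
    return (4.0 * math.pi) ** (d / 2.0) * x ** d * sum(math.exp(-lam * x * x) for lam in eigs)


analytical = sphere_eigenvalues(60)
# Hypothetical stand-in for empirical eigenvalues that under-estimate the truth by 10%.
empirical = [0.9 * lam for lam in analytical]

x = 0.2
z_exact = heat_trace(analytical, x)
# Quadratic model with known volume 4*pi and constant scalar curvature s = 2.
p_model = 4.0 * math.pi + (2.0 * math.pi / 3.0) * 2.0 * x ** 2
z_half = heat_trace(interpolate(analytical, empirical, 0.5), x)
z_full = heat_trace(interpolate(analytical, empirical, 1.0), x)
```

because under-estimated eigenvalues inflate every term of the sum, the trace grows with f, which is the mechanism behind the over-estimated curvatures described above.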
equation also applies to empirical eigenvalues of ∆̄m constructed from weighted knn and r-neighborhood kernels instead of gaussian kernels (see methods section . . ). however, the prefactor in equation is actually worse for these estimators since their empirical eigenvalues have larger fractional errors at n = (see figure s h), so that even larger n would be required to attain the desired fractional error. lastly, note that while we had analytical eigenvalues available with which to ascertain m = as suitable, the naive approach of simply using all eigenvalues available (m = n), would require sample sizes that are even larger by several more orders of magnitude.

estimating the laplace-beltrami operator from data

for n points, $\{x_i\} \in \mathbb{R}^n$, sampled from m, we estimated ∆m by normalizing the weight matrix $W$ (see below) using the random walk normalization [ , ]. the estimate ∆̄m constructed using this normalization converges to ∆m when samples are drawn uniformly from the embedding of m in $\mathbb{R}^n$, as was done in our analysis.

∆̄m = ( /ε) ($I_n$ − $D^{-1}W$), $D = \operatorname{diag}\{W\mathbf{1}\}$

$I_n$ is the n × n identity matrix, $\mathbf{1} \in \mathbb{R}^n$ is a vector of ones and the kernel width, ε, is set to match that used in theorem of [ ]:

ε = (log n) n

throughout our analysis, we used $W = W_g$, the weight matrix with entries given by a gaussian kernel:

$$[W_g]_{i,j} = \exp\left(-\lVert x_i - x_j \rVert^{2} / \epsilon\right) - \delta_{i,j}$$

to check whether other estimators had more benign prefactors for eigenvalue convergence (see figure s h), we also considered the weighted knn kernel, $W_{knn}$, and the r-neighborhood kernel, $W_r$, with r set in terms of ε [ ]:

$$[W_{knn}]_{i,j} = [W_g]_{i,j}\left[\mathbf{1}_{knn(j)}(i) \ \text{or}\ \mathbf{1}_{knn(i)}(j)\right], \qquad [W_r]_{i,j} = \mathbf{1}_{B_{x_i}(r)}(x_j) - \delta_{i,j}$$

knn(i) is the set of indices of the k-nearest neighbors of point i in $\mathbb{R}^n$, $B_{x_i}(r)$ is the n-dimensional ball of radius r centred at $x_i$, and $\mathbf{1}_A(x)$ is the indicator function for x ∈ A.

details of extrinsic approach to curvature estimation

quadratic regression on local neighborhoods of data

here we describe the regression model for computing the coefficients of the second fundamental form, $h^k_{ij}$, at a particular point p. as described in the main text, after performing pca on a neighborhood of $n_p$ points around p in $\mathbb{R}^n$, each point in the neighborhood can be described in terms of d tangent coordinates, $t_i$, and n − d normal coordinates, $n_k$. we defer discussion of how the neighborhood is selected to methods section . . . the $n_k$s are treated as dependent variables that can be modelled as quadratic functions of the $t_i$s, which are taken to be independent variables. see equation below. linear terms are excluded since they ought to have zero coefficients in the tangent basis. constant terms, $c_k$, are included to account for affine shifts. since $h^k_{ij} = h^k_{ji}$ according to equation , in practice we only consider $t_i t_j$ and $h^k_{ij}$ for j ≥ i so that $T$ and $H$ in equation have linearly independent columns, though we write the full form here for simplicity.

$$N = TH + E$$

$$N = \begin{bmatrix} n^{(1)}_{1} & \cdots & n^{(1)}_{n-d} \\ \vdots & \ddots & \vdots \\ n^{(n_p)}_{1} & \cdots & n^{(n_p)}_{n-d} \end{bmatrix}, \qquad E = \begin{bmatrix} \varepsilon^{(1)}_{1} & \cdots & \varepsilon^{(1)}_{n-d} \\ \vdots & \ddots & \vdots \\ \varepsilon^{(n_p)}_{1} & \cdots & \varepsilon^{(n_p)}_{n-d} \end{bmatrix} = \begin{bmatrix} \varepsilon^{(1)T} \\ \vdots \\ \varepsilon^{(n_p)T} \end{bmatrix}$$

$$T = \begin{bmatrix} 1 & t^{(1)}_{1} t^{(1)}_{1} & \cdots & t^{(1)}_{1} t^{(1)}_{d} & t^{(1)}_{2} t^{(1)}_{1} & \cdots & t^{(1)}_{d} t^{(1)}_{d} \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ 1 & t^{(n_p)}_{1} t^{(n_p)}_{1} & \cdots & t^{(n_p)}_{1} t^{(n_p)}_{d} & t^{(n_p)}_{2} t^{(n_p)}_{1} & \cdots & t^{(n_p)}_{d} t^{(n_p)}_{d} \end{bmatrix}$$

$$H = \begin{bmatrix} c_{1} & h^{1}_{1,1} & \cdots & h^{1}_{1,d} & h^{1}_{2,1} & \cdots & h^{1}_{d,d} \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ c_{n-d} & h^{n-d}_{1,1} & \cdots & h^{n-d}_{1,d} & h^{n-d}_{2,1} & \cdots & h^{n-d}_{d,d} \end{bmatrix}^{T}$$

regression yields the following least-squares solution:

$$\hat{H} = (T^{T}T)^{-1}T^{T}N, \qquad \Sigma_{\varepsilon} = \frac{(N - T\hat{H})^{T}(N - T\hat{H})}{n_p}, \qquad \Sigma_{h} = \Sigma_{\varepsilon} \otimes (T^{T}T)^{-1}$$

where $\hat{H}$ is the matrix of estimates of the second fundamental form, $\Sigma_{\varepsilon}$ is the estimated covariance structure of the residuals so that $\varepsilon^{(i)} \sim \mathcal{N}(0, \Sigma_{\varepsilon})$, and $\Sigma_{h}$ is the covariance matrix for $\hat{H}$. ⊗ denotes the kronecker product. we used the mvregress function in matlab to perform this regression in our code.

when datapoints are sampled exactly from an analytical manifold, $\Sigma_{\varepsilon}$ measures the contribution of higher-order terms. in the limit of infinite sampling and infinitesimally small neighborhoods, $\Sigma_{\varepsilon} \to 0$. when observational noise is present (discussed in methods section . . . ), $\Sigma_{\varepsilon}$ also depends on the magnitude of the noise (σ in equation ).
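the least-squares solution for the quadratic regression can be sketched in a few lines. this is an illustrative numpy version, not the authors' matlab code (which uses mvregress); the function names and the synthetic test data are hypothetical. the design matrix keeps only the products with j ≥ i, so each off-diagonal column carries the combined contribution of the symmetric pair, and how that coefficient maps back to an individual entry of the second fundamental form depends on the convention adopted.

```python
import numpy as np


def design_matrix(t):
    """Columns: a constant, then t_i * t_j for j >= i (linearly independent terms only)."""
    n_p, d = t.shape
    cols = [np.ones(n_p)]
    for i in range(d):
        for j in range(i, d):
            cols.append(t[:, i] * t[:, j])
    return np.column_stack(cols)


def fit_quadratic(t, normals):
    """Ordinary least squares for N = T H + E with residual and coefficient covariances."""
    T = design_matrix(t)
    TtT_inv = np.linalg.inv(T.T @ T)
    H_hat = TtT_inv @ T.T @ normals            # (T^T T)^{-1} T^T N
    resid = normals - T @ H_hat
    sigma_eps = resid.T @ resid / t.shape[0]   # residual covariance, divided by n_p
    sigma_h = np.kron(sigma_eps, TtT_inv)      # covariance of the stacked columns of H_hat
    return H_hat, sigma_eps, sigma_h


# Synthetic neighborhood: d = 2 tangent coordinates, one normal coordinate.
rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=(500, 2))
true_coeffs = np.array([0.3, 0.5, -0.2, 0.1])  # constant, t1*t1, t1*t2, t2*t2
noise = 1e-3 * rng.standard_normal((500, 1))
normals = design_matrix(t) @ true_coeffs[:, None] + noise

H_hat, sigma_eps, sigma_h = fit_quadratic(t, normals)
```

the square roots of the diagonal of sigma_h are the standard errors of the fitted coefficients, which is the quantity the neighborhood-selection procedure in the next section tries to control.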
selecting local neighborhoods for regression

here we describe the procedure for selecting a neighborhood around each point p for computing the second fundamental form. we adopt the simplest approach of selecting the neighborhood to be a ball of radius r centred at p, $B_p(r)$.

if r(p) is not specified, we set it according to statistical rather than geometric principles, since the geometry of the manifold may be non-trivial and unknown a priori. specifically, we set r(p) so that the elements in the covariance matrix, $\Sigma_h$, are upper-bounded by $\sigma_h^{2}$, the square of the specified target uncertainty. the largest elements in $\Sigma_h$ are the variance terms on the main diagonal, corresponding to the squares of the standard errors, $\sigma_{h^k_{ij}}$, for the coefficients $h^k_{ij}$. by inspection of equation :

$$\sigma^{2}_{h^k_{ij}} = \left[\operatorname{diag} \Sigma_{\varepsilon}\right]_{k} \left[\operatorname{diag}\, (T^{T}T)^{-1}\right]_{(ij)}$$

where $[\operatorname{diag} \Sigma_{\varepsilon}]_k$ is the diagonal entry of $\Sigma_{\varepsilon}$ corresponding to the $k$th normal direction and $[\operatorname{diag}\,(T^{T}T)^{-1}]_{(ij)}$ is the diagonal entry in $(T^{T}T)^{-1}$ for which the corresponding entry in $T^{T}T$ is $\sim \sum_{l} \left(t^{(l)}_{i} t^{(l)}_{j}\right)^{2}$. increasing r(p) monotonically increases both $n_p(r)$, the number of points in $B_p(r)$, and the average magnitude of elements in $T$, both of which reduce $\sigma_{h^k_{ij}}$. to avoid sweeping r(p) to find the minimum value such that max $\sigma_{h^k_{ij}}$ < $\sigma_h$, which is computationally expensive, for each point we instead model the dependence of $n_p(r)$ on r as

$$n_p(r) \sim r^{d'}$$

so that, since each of the $n_p(r)$ summands in the corresponding entry of $T^{T}T$ scales as $r^{4}$,

$$\frac{1}{\sigma^{2}_{h^k_{ij}}} \sim r^{d'+4}$$

to determine d′, $n_p(r)$ is counted at log-spaced distances, $r_i$, and a line is fit to the (log $r_i$, log $n_p(r_i)$) pairs for i ∈ { , ..., }. the smallest distance, r , is set to be the distance from p to the (d(d+1)/2 + 1)-closest point to p (the minimum number of points needed for regression). the largest distance, r , is set to be the distance from p to the furthest point from p.
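the log-log fit used to estimate d′ can be sketched as below. this is a minimal illustration under stated assumptions: the number of log-spaced radii (n_radii = 10) is a hypothetical choice because the value used in the paper is not given here, the function name is ours, and the toy data are points drawn uniformly from a flat unit disk around p, for which the expected exponent is 2.

```python
import bisect
import math
import random


def estimate_growth_exponent(distances, d, n_radii=10):
    """Fit log n_p(r) against log r to estimate d' in n_p(r) ~ r**d'.

    `distances` are distances from p to every other point. The smallest radius
    is the distance to the (d(d+1)/2 + 1)-th closest point and the largest is
    the distance to the furthest point, as described in the text.
    """
    dists = sorted(distances)
    k_min = d * (d + 1) // 2 + 1
    r_min, r_max = dists[k_min - 1], dists[-1]
    log_r, log_n = [], []
    for i in range(n_radii):
        r = r_min * (r_max / r_min) ** (i / (n_radii - 1))
        count = bisect.bisect_right(dists, r)  # number of points with distance <= r
        log_r.append(math.log(r))
        log_n.append(math.log(count))
    mean_r = sum(log_r) / n_radii
    mean_n = sum(log_n) / n_radii
    num = sum((a - mean_r) * (b - mean_n) for a, b in zip(log_r, log_n))
    den = sum((a - mean_r) ** 2 for a in log_r)
    return num / den  # slope of the fitted line


rng = random.Random(0)
distances = []
for _ in range(20000):
    radius = math.sqrt(rng.random())           # uniform density inside the unit disk
    angle = rng.uniform(0.0, 2.0 * math.pi)
    distances.append(math.hypot(radius * math.cos(angle), radius * math.sin(angle)))

d_prime = estimate_growth_exponent(distances, d=2)
```

on real data d′ need not equal the manifold dimension, since the local density of points can vary with distance from p; the fit only needs to capture how quickly the neighborhood fills up.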
to solve for r, we first guess rg = r , perform regression on the set of points in $B_p(r_g)$ and assign $\sigma_g^{2}$ to be the largest diagonal entry in $\Sigma_h$. if $\left|\sigma_g / \sigma_h - 1\right|$ is within a desired tolerance, we set r = rg, or else we update rg as shown and iterate to convergence.

$$r_g \leftarrow r_g \left(\frac{\sigma_g}{\sigma_h}\right)^{\frac{2}{d'+4}}$$

for large datasets, we speed up computation by only selecting r in this manner for a subset of ncalib ≤ n randomly selected calibration points. all datapoints in the voronoi cell of each calibration point are then assigned the same r as the calibration point. unless otherwise specified ncalib = n.

goodness-of-fit test for quadratic regression

for a fixed density of points, there is a fundamental trade-off between reducing uncertainty in the $h^k_{ij}$s and the validity of approximating local neighborhoods with quadratic fits. to reduce $\sigma_h$, more points must be included in the fit, but a larger neighborhood may not be well-modelled by only quadratic terms. conversely, d(d+1)/2 + 1 points are sufficient to perform the regression, but there is then large uncertainty in the estimate of $h^k_{ij}$. since our approach is to choose a neighborhood size to achieve a target $\sigma_h$, we include a companion goodness-of-fit (gof) statistic measuring how well the neighborhood is fit by a quadratic. namely, we use mardia’s test on the residuals from regression ($\varepsilon^{(i)}$ in equation ), which yields a p-value for the null hypothesis that the residuals are normally distributed [ ]. when the p-values are small, the quadratic regression model is unlikely to be valid.
in this case, curvatures computed using the resulting $h^k_{ij}$ may be suspect regardless of the tightness of the errorbars, and the user may want to consider increasing $\sigma_h$ to reduce the neighborhood size. however, the poor gof may not be of concern if the length scale of interest is larger than the fluctuations in the manifold which give rise to the non-gaussian residuals (see methods section . . ). note that mardia’s test is relatively weak since it may yield false negatives for heteroskedastic residuals. this gof measure is therefore only provided as a computationally cheap consistency check. ideally, the density of sampled points is sufficiently high to (i) permit small $\sigma_h$ and (ii) produce gof p-values that are uniformly distributed (consistent with the null model) and spatially uncorrelated.

standard error and bias of scalar curvature estimate

here we discuss how we compute the standard error, $\sigma_s$, of the estimate for s and note sources of estimator bias. since the riemannian curvature tensor in equation is a bilinear form and the tensor contraction in equation is a straightforward sum, $\sigma_s$ can be computed using simple error propagation formulas in terms of the uncertainties from regression. specifically, the standard error we report is the first-order approximation to the second moment of a function of random variables:

$$\sigma_s = \sqrt{J^{T} \Sigma_h J}$$

where $J = \left.\dfrac{\partial s}{\partial h^k_{ij}}\right|_{\hat{H}}$.

it is important to note that our estimate for s is biased and not normally distributed.
first, the $h^k_{ij}$s are only normally distributed when the residuals ($\varepsilon^{(i)}$ in equation ) themselves are normally distributed. second, even when the $h^k_{ij}$s are normally distributed, our estimate of s will not be, due to its bilinear dependence on $h^k_{ij}$. lastly, estimates for s can be biased in a manifold-dependent and even position-dependent way. for instance, the analytical scalar curvature of s embedded in r is given by s = (h h − h h ), with h = h = / and h = h = . numerically however, the symmetric off-diagonal terms will never be exactly so s will be systematically under-estimated. this is apparent in the left tail of the blue histogram in figure i. in our experience, adding isotropic noise of small magnitude tends to remove the skew, presumably because then the residuals more closely match the regression assumptions (see for example figure s b, where the left tail disappears for σ = . ). furthermore, in our examples, we observed that computed scalar curvatures were less biased when the ambient and/or manifold dimensions were large. we speculate that this is because the increased number of terms (with alternating signs) in equations and leads to cancellation of errors, which is likely why the accuracy of computed scalar curvatures was higher for s , s and s than s , and the distribution of scalar curvatures less skewed (see figure i).

note on length scales

here we make three remarks regarding length scales relevant both for considering curvature theoretically and for applying our algorithm.

first, note that scalar curvature has units of inverse length squared. therefore, scaling all the coordinates of the points on a manifold by a factor l changes the scalar curvature at all points by a factor of l⁻². thus, it is always important to contextualize the scalar curvature in terms of the global length scale associated with the manifold. for example, the scalar curvature of sd with radius r is sd(p) = d(d−1)/r² ∀p ∈ m (here l = r).
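the first-order error propagation for the scalar curvature, and its behaviour under a rescaling of coordinates, can be checked numerically with a short sketch. the curvature function below is a hypothetical bilinear stand-in chosen only for illustration (the paper's actual contraction and its prefactors are not reproduced here), and the jacobian is approximated by central finite differences rather than derived analytically; all names are ours.

```python
import math

import numpy as np


def illustrative_curvature(h):
    """Hypothetical bilinear function of (h11, h12, h22); not the paper's formula."""
    h11, h12, h22 = h
    return 2.0 * (h11 * h22 - h12 * h12)


def propagate_std(func, h_hat, cov_h, step=1e-6):
    """First-order standard error sqrt(J^T Sigma_h J), with J from central differences."""
    h_hat = np.asarray(h_hat, dtype=float)
    jac = np.empty_like(h_hat)
    for a in range(h_hat.size):
        dh = np.zeros_like(h_hat)
        dh[a] = step
        jac[a] = (func(h_hat + dh) - func(h_hat - dh)) / (2.0 * step)
    return float(np.sqrt(jac @ cov_h @ jac))


h_hat = np.array([0.5, 0.0, 0.5])
cov_h = np.diag([1e-4, 1e-4, 1e-4])        # standard error 0.01 on each coefficient
s_hat = illustrative_curvature(h_hat)
sigma_s = propagate_std(illustrative_curvature, h_hat, cov_h)

# Scaling all coordinates by L scales h by 1/L and its covariance by 1/L**2.
L = 3.0
s_scaled = illustrative_curvature(h_hat / L)
sigma_s_scaled = propagate_std(illustrative_curvature, h_hat / L, cov_h / L ** 2)
```

both the curvature and its standard error pick up the same factor of l⁻², so their ratio is unchanged once the coefficient uncertainty is rescaled along with the coordinates.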
in the case of the toy models shown in figure , the global length scale is l ≈ (see methods section . . ). for the image patch dataset, a normalization is applied which places all patches on s (see methods section . . ), so that the global scale is again l = . for scrnaseq data, we computed scalar curvature on the datapoints after preprocessing (see methods section . . ), without imposing any additional scaling correction to achieve a standardized global length scale. since other custom analyses also use these same boilerplate preprocessing steps, computing scalar curvatures in the context of the global length scale of the preprocessed data is sensible. for all three scrnaseq datasets, the global length scale happened to be l ≈ (see methods section . . ).

second, since $h^k_{ij}$ is a dimension-ful quantity (which scales as l⁻¹), to keep the ratio of $\sigma_s$ to s fixed when all coordinates are scaled by l, $\sigma_h$ needs to be scaled by l⁻¹.

lastly, we note that our choice of $\sigma_h$ sets local length scales that are statistically rather than geometrically informed: neighborhoods are chosen to upper bound the uncertainty in estimates obtained from regression. this length scale can also be understood in terms of a bias-variance trade-off. large length scales reduce variance but may introduce a bias if the resulting neighborhoods are larger than features on the manifold. this manifests as poor gofs and can be corrected by finer sampling.
however, for manifolds with features at different length scales (such as a golf ball, which can be treated as dimples superimposed on s ), neighborhoods chosen by this heuristic can also be much smaller than the feature of interest, so that fine-scale curvature fluctuations are detected (dimples) while coarser features are neglected (s ). regardless, we default to this statistical approach because in general, the length scale of relevant features on a data manifold will not be uniform across the manifold or known a priori. however, we also provide the ability to manually set position-dependent r(p) in the software to facilitate ad hoc computation of curvatures at any length scale of interest.

details of toy manifold curvature computations

analytical forms

here we provide analytical forms for the toy manifolds shown in figures and s .

hypersphere

the d-dimensional unit hypersphere, sd, has intrinsic coordinates $\theta_1 \in [0, 2\pi]$, $\theta_2, \dots, \theta_d \in [-\pi/2, \pi/2]$ and ambient coordinates in $\mathbb{R}^{d+1}$ given by:

$$x_i = \begin{cases} \prod_{j=1}^{d} \cos\theta_j, & i = 1 \\ \sin\theta_{i-1} \prod_{j=i}^{d} \cos\theta_j, & 1 < i \le d+1 \end{cases}$$

using the relations in methods sections . , the scalar curvature is given by sd(p) = d(d − 1) ∀p ∈ m. to draw uniform samples from sd, instead of applying rejection sampling on these intrinsic coordinates as described in methods section . , it is more straightforward to let $x_i \sim \mathcal{N}(0, 1)$ and scale the resulting vector $(x_1, \dots, x_{d+1})$ to have unit norm.
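the gaussian-normalization shortcut for sampling the hypersphere can be written in a few lines. this is a minimal sketch with hypothetical names; it relies only on the rotational symmetry of the standard normal distribution, which makes the direction of a gaussian vector uniform on the sphere.

```python
import math
import random


def sample_hypersphere(n_points, d, rng):
    """Draw points uniformly from the unit d-sphere embedded in R^(d+1)."""
    points = []
    for _ in range(n_points):
        v = [rng.gauss(0.0, 1.0) for _ in range(d + 1)]   # isotropic gaussian vector
        norm = math.sqrt(sum(c * c for c in v))
        points.append([c / norm for c in v])              # rescale to unit norm
    return points


samples = sample_hypersphere(5000, d=3, rng=random.Random(0))
# By symmetry, the mean of every ambient coordinate should be close to zero.
coordinate_means = [sum(p[i] for p in samples) / len(samples) for i in range(4)]
```

unlike rejection sampling in intrinsic coordinates, no draws are discarded, which matters as d grows and the acceptance rate of the rejection step falls.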
one-sheet hyperboloid

the one-sheet hyperboloid, h , has intrinsic coordinates θ ∈ [0, 2π], u ∈ ℝ and ambient coordinates in ℝ³ given by:

$$x = a \cos\theta \sqrt{u^{2} + 1}, \qquad y = b \sin\theta \sqrt{u^{2} + 1}, \qquad z = c\,u$$

for figure e,f, we used a = b = and c = . using the relations in methods sections . , the scalar curvature is given by s(z) = − ( z + ) . to avoid edge effects in the z-direction, we constrained u ∈ [− , ], and sampled points as described in methods section . until a subset of n = had u ∈ [− , ]. scalar curvature was computed and visualized for these n = points.

ring torus

the -dimensional ring torus, t , has intrinsic coordinates θ, φ ∈ [0, 2π] and ambient coordinates in ℝ³ given by:

$$x = (R + r\cos\theta)\cos\varphi, \qquad y = (R + r\cos\theta)\sin\varphi, \qquad z = r\sin\theta$$

where $R$ and $r$ are the major and minor radii. for figure g,h, we used r = . and r = . . using the relations in methods sections . , the scalar curvature is given by s(θ) = cos(θ) +cos(θ) .

hypercube

the m-dimensional cube of side length r, dmr , has intrinsic coordinates $z_1, \dots, z_m \in [-r/2, r/2]$, and ambient coordinates in $\mathbb{R}^n$ for n ≥ m given by:

$$x_i = \begin{cases} z_i, & 1 \le i \le m \\ 0, & m < i \le n \end{cases}$$

using the relations in methods sections . , the scalar curvature is given by s(p) = 0 ∀p ∈ m.

practical issues for curvature estimation on real-world datasets

for real-world data, small sample size is only one of the potential confounders for accurately estimating curvature. here, we report how our algorithm fares when four other real-world confounders are applied to toy manifolds: non-uniform sampling, observational noise, large ambient dimension n, and uncertainty in the manifold dimension d.
non-uniform sampling

we expect our approach to handle non-uniform sampling of the manifold gracefully: smaller (larger) neighborhoods will be used on densely (sparsely) sampled portions of the manifold to encapsulate the number of points needed to achieve $\sigma_h$. to computationally verify the robustness of our tool to non-uniform sampling, we constructed a toy model to roughly match the (n, d, l) parameters for the scrnaseq datasets explored in the paper, for which non-zero scalar curvatures seemed to appear at smaller length scales/higher densities. specifically, we wanted to verify that non-zero scalar curvatures do not appear artifactually at specific length scales due to sharp changes in the local density of points sampled from a flat manifold. to this end, we formed a dataset with a sparse periphery and dense core by uniformly sampling n = points from d to establish a background density equal to points per unit volume, and n = points from d to create a core density roughly equal to points per unit volume (see methods section . . . ). we embedded these points in r by adding isotropic gaussian noise with σ = . to the eight normal directions, for all datapoints. we computed scalar curvature on this dataset for a fixed $\sigma_h$ (see methods section . . ) and found no significant deviation from the true value of zero in either the sparse or dense regions (see figure s a). we next computed scalar curvatures at three fixed length scales corresponding to the , , and %-ile r values obtained using the specified $\sigma_h$ (r = . , . and . respectively) and again saw no deviation from zero scalar curvature for points in either the sparse or dense region (see figure s a). we repeated this analysis for n = and again saw no deviation from zero scalar curvature, regardless of whether variable neighborhood sizes or fixed length scales (r = . , . and . corresponding to the same percentiles) were used (see figure s a).
observational noise

every ambient coordinate can be considered a measured observable with its own observational noise. assuming each observable is distorted by independent, isotropic gaussian noise with variance σ² (sometimes referred to as convolutional noise [ ]), datapoints $\tilde{x} \in \mathbb{R}^n$ sampled from an embedded manifold m are modelled by:

$$\tilde{x} = x + \mathcal{N}(0, \sigma^{2} I), \qquad x \in m$$

to study the sensitivity of our algorithm to noise, we uniformly sampled n = datapoints from s ∈ r , added convolutional noise with σ ranging over several orders of magnitude, and estimated scalar curvatures using a fixed $\sigma_h$ (see methods section . . ). for small σ, the distribution of scalar curvatures was centred on the true value of , but once σ became large (≈ % of s ’s radius), the estimated scalar curvatures approached (see figure s b). noise in the regression context does not change the expectation value of any estimated parameter. the apparent flattening that is observed therefore indicates that x (obtained from convoluting m), has a geometry that is not trivially related to m. certainly for σ ≈ , x does not even preserve the topology of m as s . from a practical perspective, it suffices to say that small convolutional noise can be handled by simple quadratic regression, while large convolutional noise obfuscates the original manifold.

these observations are consistent with literature defining a manifold’s reach [ , ], a noise scale beyond which noisy samples cannot be uniquely associated to a point on the noise-free manifold.
When σ exceeds the manifold’s reach, the relationship between the empirical density of sampled points and the original manifold is non-trivial even for a relatively forgiving model of manifold-orthogonal noise. The ridge manifold [ , ] of an empirical density has also been defined as an alternative to the unwieldy task of deconvolving noisy samples to recover a noise-free manifold. This definition avoids the notion of a noise-free manifold altogether and instead defines manifolds as ridges, contours along which the empirical density of points is maximized.

Large ambient dimension

A high-dimensional dataset may have an ambient space composed of tens of thousands of observables, i.e. n is very large. Meanwhile, the underlying manifold dimension, d, may be small. Since convolutional noise occurs in n dimensions, will a low-dimensional manifold still be discernible? To explore this, we uniformly sampled n = datapoints from S ⊂ R , embedded these points in R^n for a range of n up to , and added convolutional noise of magnitude σ = . , . , and . in the n-dimensional ambient space. We computed curvatures for all combinations of n and σ using a fixed σh (see Methods section . . ). As n or σ increased, the algorithmically chosen neighborhood sizes, r(p), expanded to include enough datapoints to maintain the desired σh. The distribution of estimated scalar curvatures (shown in Figure S c) is centred on the true value of for n < and σ ≤ . . However, we observed that r was far less sensitive to changes in n than changes in σ. For example, increasing n from to at σ = . and tripling σ from . to . at n = required a comparable increase in r (see Figure S c). Therefore, consistent with the results of Methods section . . . , as long as the noise scale σ is small, a large ambient dimension n is not a confounder.
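The sampling and noise model used in the two experiments above can be sketched as follows. This is an illustration under stated assumptions: the sphere is taken to have unit radius, the embedding into R^n is assumed to be plain zero-padding, and the function names and all numerical values are hypothetical.

```python
import numpy as np


def sample_sphere(n_points, d, rng):
    """Uniform samples from the unit sphere S^d in R^(d+1) via normalised Gaussians."""
    g = rng.normal(size=(n_points, d + 1))
    return g / np.linalg.norm(g, axis=1, keepdims=True)


def embed_with_noise(points, n_ambient, sigma, rng):
    """Zero-pad points into R^n_ambient, then add N(0, sigma^2 I) noise to every coordinate."""
    n_points, k = points.shape
    padded = np.zeros((n_points, n_ambient))
    padded[:, :k] = points
    return padded + rng.normal(0.0, sigma, size=padded.shape)


rng = np.random.default_rng(0)
sphere = sample_sphere(1000, 2, rng)                     # hypothetical sample size and dimension
clean = embed_with_noise(sphere, 50, 0.0, rng)           # noise-free embedding
noisy = embed_with_noise(sphere, 50, 0.01, rng)          # convolutional noise in all 50 coordinates
mean_sq_displacement = np.mean(np.sum((noisy - clean) ** 2, axis=1))
```

Under this model the expected squared displacement of a point is nσ², so both a larger n and a larger σ push points further from the noise-free manifold, with σ entering quadratically.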
In practice, however, to reduce computational overhead and avoid the large-n-and-σ case, it is still helpful to reduce the ambient dimension by projecting datapoints to an affine subspace containing the manifold (e.g. by PCA). Such a transformation does not change the intrinsic curvature.

Choice of manifold dimension

The last practical consideration is accurate selection of the manifold dimension, d, which we have so far assumed to be known. There is no consensus on the definition of d for a dataset, so various disciplines have devised different heuristics to determine d in a data-driven fashion [ ]. From the regression perspective, any d > corresponds to a well-defined regression problem. The choice of d merely determines how local coordinates are partitioned into independent (tangent) and dependent (normal) variables. However, in our algorithm we noticed that some choices of d result in excessively large r(p) for a fixed σh. We explored this further using two toy manifolds and discovered a signature indicating that the specified manifold dimension may be incorrect.

The manifolds considered were S ⊂ R convolved with isotropic Gaussian noise with σ = . and S × S ⊂ R , for which d∗, the true manifold dimension, is d∗ = and d∗ = respectively. We uniformly sampled n = points from each manifold and estimated scalar curvatures by holding σh fixed for different d (see Methods section . . ). For both manifolds, the average neighborhood size, r, was much larger for d > d∗ and d < d∗ than for d = d∗ (see Figure S d).
In the case of S , for d < d∗, the average neighborhood size was even larger than the global length scale, l, of the manifold. Since neighborhood sizes are chosen to achieve a target σh, manually decreasing r(p) is counter-productive and simply increases the uncertainty from regression above σh.

The large neighborhood sizes that emerged for both d > d∗ and d < d∗ can be understood in terms of the mis-assignment of normal vectors to the tangent space, or vice versa. According to equation , $\sigma_{h^k_{ij}}$ increases with large variation in the normal direction ([diag Σε]k), or with small variation in the tangent direction ([diag (T′T)⁻¹](ij)). When we choose d > d∗, we mis-attribute a normal direction with small variation [diag Σε]k as an independent variable, whereas variation along the true tangent space is much larger than [diag Σε]k. r must therefore be increased to compensate for the lack of variation along this direction mis-classified as tangent. When d < d∗, we have spuriously assigned a tangent direction with large variation to be a normal direction. Since this spurious normal coordinate cannot be well-approximated as a function of tangent coordinates from which it is linearly independent, the perceived noise scale ([diag Σε]k) is exaggerated so that a larger neighborhood is needed to attain σh.

This suggests a crude, operational definition of what constitutes an incorrect choice of d. When $\sigma_{h^k_{ij}}$ is large relative to the uncertainty in other coefficients, there is either too little variation along the ith and jth tangent directions, or too much variation along the kth normal direction. In the former case, the ith or jth tangent direction might be more appropriately classified as a normal direction (d is too large and should be decreased), while in the latter case, the kth normal direction might be more appropriately classified as a tangent direction (d is too small and should be increased). When this criterion is applied point-wise,
there may be a different acceptable choice of d for different parts of the manifold. When this criterion is generalized over the entire manifold, a σh yielding a flat distribution of GOF p-values when the manifold dimension is specified to be d will also yield a flat distribution for d + but not necessarily for d − : if residuals in n−d dimensions are well-modelled by a multivariate Gaussian, so too will be residuals in n−d− dimensions, but not necessarily residuals in n − d + dimensions (see Figure S d). Our observations are consistent with manifolds in the literature that admit multiple possible manifold dimensions (like the helix manifold in [ ]), which could generally arise from non-isotropic noise or non-uniform sampling.

Parameters for curvature estimation

For each manifold in Figure , we chose σh so that the fraction of points with GOF p-value ≤ α = . most closely matched the null model of normally distributed residuals consistent with neighborhood sizes well-approximated by quadratic regression (see Section . . ). σh = ( . , . , . , . , . , . ) for (S , S , S , S , H , T ) resulted in ( . , . , . , . , . , . )% of points having GOF p-values ≤ α = . . Theoretically, max |hkij| = ( . , , . ) for (S d, H , T ), so our choices for σh result in small fractional errors in all cases. For Figure S a, we set σh = ( . , . ) for n = ( , ) respectively, which resulted in ( . , . )% of points having GOF p-values ≤ α = . .
For all other panels in Figure S , where we were interested in ascertaining the sensitivity to different confounders, instead of minimizing uncertainty per se, we used a fixed value of σh = . . This choice resulted in neighborhoods small enough to be well-approximated by quadratic regression, manifesting as a roughly uniform distribution of GOF p-values in all cases.

Details of image patch dataset and Klein bottle manifolds

Notation and preliminaries

First we introduce some notation needed to describe the image patch dataset. We refer readers to [ , ] for a more detailed exposition. Let P be the space of all bivariate polynomials p : R × R → R with p ∈ P, h : P → R the vectorization operator given by

h(p) = [p(− , ), p(− , ), p(− ,− ), p( , ), p( , ), p( ,− ), p( , ), p( , ), p( ,− )]ᵀ,

u : R^m → S^(m−1) the normalization operator given by u(v) = v/‖v‖, and c : R → R the projection operator given by c(y) = ΛAᵀy, where A = [e . . . e ], Λ = diag{ ‖e ‖ , ..., ‖e ‖ }, and {ei}
are vectorized basis vectors for the -dimensional discrete cosine transform (DCT) applied to x patches:

e = [ , ,− , , ,− , , ,− ]ᵀ/√
e = [ , , , , , ,− ,− ,− ]ᵀ/√
e = [ ,− , , ,− , , ,− , ]ᵀ/√
e = [ , , ,− ,− ,− , , , ]ᵀ/√
e = [ , ,− , , , ,− , , ]ᵀ/√
e = [ , ,− ,− , , , , ,− ]ᵀ/√
e = [ ,− , , , , ,− , ,− ]ᵀ/√
e = [ ,− , ,− , ,− , ,− , ]ᵀ/√

By inspection, e is the basis vector for patches with horizontal stripes and linear gradients, e for patches with vertical stripes and linear gradients, e for patches with horizontal stripes and quadratic gradients, e for patches with vertical stripes and quadratic gradients, and e for diagonally-oriented patches with quadratic gradients. All the patches produced by the embedding K in equation below and visualized in Figure b can be written as a linear combination of these basis vectors. Next, note that the components in each ei sum to , so that the projection operator, c, additionally serves to remove the mean. Finally, observe that the vector norm formed under D = AΛ Aᵀ (referred to hereafter as the D-norm following [ ]) measures the contrast in a x patch since

$$\|v\|_D = \sqrt{v^T D v} = \sqrt{\sum_i \sum_{j \sim i} (v_i - v_j)^2}$$

where j ∼ i refers to all vertical and horizontal neighbors, j, of a pixel i in the preimage of v under h. The ei are normalized so that ‖ei‖D = .

Image dataset

We used the same van Hateren IML dataset [ ] consisting of greyscale images of size x pixels studied by Carlsson et al. in [ ] and followed the same preprocessing steps used there. In short, we applied a log p transformation to all pixel values and randomly sampled × (possibly overlapping) x patches from each image. We indexed the pixels in each patch using standard Cartesian coordinates with the middle pixel as the origin, so that log-transformed pixel values are given by p(x,y), x ∈ {− , , }, y ∈ {− , , }. We
then applied h to vectorize each patch p, and retained the high-contrast patches comprising the top quintile of D-norms for each image, resulting in n ≈ . × datapoints. Next, we normalized these high-contrast vectorized patches using the composition u∘c, resulting in a set of datapoints on S ⊂ R . We determined the density of these datapoints in R using the kNN density estimator with k = , and retained the densest decile, which yielded n ≈ . × datapoints. This dense subset of high-contrast normalized patches was found using topological data analysis in [ ] to be a Klein bottle, K ⊂ S , and is studied in Figures d,i and S b.

To generate the augmented image patch dataset used in Figures j and S e,f, we first considered all n ≈ . × vectorized high-contrast patches in the van Hateren IML dataset using the same procedure described above (each of the images yields × patches, of which the top % by D-norm are retained per image). These were normalized by u∘c as before to place them on S ⊂ R . We again wanted to retain the densest decile of points, since only these have the topology of a Klein bottle. Mirroring the approach in [ ] where the k used in the kNN estimator was scaled with sample size, k = used for n ≈ . × corresponds to k = × . × . × ≈ × for n ≈ . × . Computing k ≈ × neighbors for all n ≈ . × points is prohibitive, however. To determine a reasonable smaller value of k, we randomly selected × points from the set of n ≈ . × on which to compare estimators and found that % of points in the densest decile as computed with k = × also appeared in the densest decile computed using k = × .
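The decile comparison described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it uses SciPy's `cKDTree` for the neighbor search, the function names are hypothetical, and the toy data and values of k stand in for the real patch vectors and the values of k that are not legible in this text.

```python
import numpy as np
from scipy.spatial import cKDTree


def knn_radius(points, k):
    """Distance from each point to its k-th nearest neighbor (excluding itself)."""
    tree = cKDTree(points)
    # Query k + 1 neighbors because each point is returned as its own nearest neighbor.
    dist, _ = tree.query(points, k=k + 1)
    return dist[:, -1]


def densest_fraction(points, k, fraction=0.1):
    """Indices of the `fraction` of points with the highest kNN density."""
    # The kNN density estimate decreases monotonically with the k-th neighbor
    # distance, so ranking by that distance is equivalent to ranking by density.
    radius = knn_radius(points, k)
    n_keep = int(fraction * len(points))
    return np.argsort(radius)[:n_keep]


def decile_overlap(points, k_a, k_b):
    """Fraction of the densest decile under k_a that is also in the densest decile under k_b."""
    a = set(densest_fraction(points, k_a).tolist())
    b = set(densest_fraction(points, k_b).tolist())
    return len(a & b) / len(a)


# Toy data: a tight cluster of 100 points inside a diffuse uniform background.
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0.0, 0.05, size=(100, 3)),
                 rng.uniform(-5.0, 5.0, size=(900, 3))])
overlap = decile_overlap(toy, k_a=20, k_b=10)
```

A high overlap between the two deciles is the justification for substituting the cheaper estimator with the smaller k.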
We therefore used the latter value for density estimation and retained the n ≈ . × datapoints comprising the densest decile.

Parametric family of Klein bottle embeddings

Let θ,φ ∈ [ , π]. Bivariate polynomials parameterized by (θ,φ), kθ,φ ∈ Kθ,φ ⊂ P, that satisfy kθ,φ = kθ+π, π−φ form a Klein bottle, K : the (θ,φ) ∼ (θ + π, π − φ) similarity relation results in edges being glued together in the manner definitional of a Klein bottle’s topology (shown in Figure b). The candidate Klein bottle embedding supplied in [ ] to model image patch data satisfies the similarity relation ∀x,y:

$$K \equiv k_{\theta,\phi}(x,y) = \cos\phi\,[x\cos\theta + y\sin\theta]^2 + \sin\phi\,[x\cos\theta + y\sin\theta]$$

Note that any kθ,φ ∈ Kθ,φ can be decomposed as:

$$k_{\theta,\phi} = c + \kappa_\theta + \kappa_\phi + \kappa_{\theta,\phi}$$

where κθ = κθ+π, κφ = κ π−φ and κθ,φ = κθ+π, π−φ. The first three terms can be understood as constant, θ-dependent and φ-dependent phases respectively. We sought an embedding of the Klein bottle for which the sum of Euclidean distances from each image patch to its closest point on the embedding is minimized. To accomplish this, we constructed a parametric family of models for each of the four terms in equation . The first three of these are most conveniently expressed directly in the DCT basis:

$$(c \circ h)(c) = n_c \sum_{i} \mu_i e_i$$

$$(c \circ h)(\kappa_\theta) = \sum_{i} \Big[ \sum_{j\ \mathrm{even}}^{n_\theta} \beta_{i,j}\cos(j\theta) + \gamma_{i,j}\sin(j\theta) \Big] e_i$$

$$(c \circ h)(\kappa_\phi) = \sum_{i} \Big[ \sum_{j}^{n_\phi} \zeta_{i,j}\cos(j\phi) \Big] e_i$$

nc is a boolean variable, and nθ and nφ control the number of terms in the inner sum for (c∘h)(κθ) and (c∘h)(κφ) respectively.
The expression for (c∘h)(κθ) only includes even coefficients for θ so that the similarity relation (θ) ∼ (θ + π) is satisfied. The expression for (c∘h)(κφ) only includes cosine terms so that the similarity relation (φ) ∼ ( π − φ) is satisfied. For κθ,φ, we refrained from writing a Fourier series-like expansion because we wanted to preserve the interpretation of θ and φ as parameters controlling the orientation and gradient respectively [ ]. Instead, we devised the following form, which we motivate further below:

κθ,φ(x,y) = mφ∑ l= cosl(φ) s+t≤mθ∑ ≤s,t≤mθ even and t odd − √ e − √ e , if t > even and s odd √ (e + e + e ) , if s > even and t > even ( )

Note that the first inner sum in equation is a linear combination of basis vectors encoding purely quadratic gradients (e , e , e and e ), weighted by even trigonometric functions of θ. The prefactors on this inner sum are functions that are even in φ. This inner sum and its prefactor therefore jointly satisfy the similarity relation (θ,φ) ∼ (θ + π, π − φ) by independently satisfying (θ) ∼ (θ + π) and (φ) ∼ ( π − φ). Meanwhile, the second inner sum in equation is a linear combination of basis vectors containing linear gradients (e , e , e and e ), weighted by odd trigonometric functions of θ. The prefactors on this inner sum are functions that are odd in φ. This inner sum and its prefactor therefore jointly satisfy the similarity relation (θ,φ) ∼ (θ + π, π − φ), by independently satisfying (θ) ∼ −(θ + π) and (φ) ∼ −( π − φ).

Since the trigonometric functions of θ are coupled to (x,y), θ controls the rotation of stripes in the image patches, just as in K . Similarly, since the prefactors on the inner sums are functions of φ, φ controls the relative contribution of quadratic gradients (e , e , e and e in the first inner sum) and linear gradients (e , e , e and e in the second inner sum).
Lastly, the boundary conditions for θ and φ in this parameterization of κθ,φ yield patches with vertical (horizontal) stripes when θ = (θ = π ), and linear (quadratic) gradients when φ = π , π (φ = ,π), just as in K . A Klein bottle embedding belonging to this parametric family, kαθ,φ ∈ Kθ,φ, can therefore be specified in terms of a vector f = [nc,nθ,nφ,mθ,mφ] defining its functional form, and a corresponding coefficient vector α = [µi, ...,βi, ...,γi, ...,ζi, ...,ηi, ...,ϑi]. In this parametric family of Klein bottle embeddings, K corresponds to f = [ , , , , ] with α = [η , , ,η , , ,η , , ,ϑ , , ,ϑ , , ] = [ , , , , ]. Note that since curvatures are only computed on the embedding after normalization, α is only meaningfully defined up to a multiplicative constant.

Associating image patches to a Klein bottle embedding

For a given Klein bottle embedding, kαθ,φ ∈ Kθ,φ, we associated each datapoint vi (already vectorized and normalized by u∘c∘h) to the closest point on kαθ,φ by minimizing the Euclidean distance in R :

$$(\hat\theta_i, \hat\phi_i) = \arg\min_{\theta,\phi} \left\| (u \circ c \circ h)\big(k^{\alpha}_{\theta,\phi}\big) - v_i \right\|$$

We solved this minimization using the lsqnonlin function (‘StepTolerance’= e- , ‘FunctionTolerance’= e- ) in MATLAB, supplying initial conditions corresponding to analytical values for a point on K :

θ̂i = arctan e tvi −e tvi ( vi∈(u◦c◦h)(k ) = arctan sin φ̂i sin θ̂i sin φ̂i cos θ̂i )
φ̂i = arctan √ (e tvi) + (e tvi) (e tvi) + e tvi ( vi∈(u◦c◦h)(k ) = arctan √ sin φ̂i cos φ̂i ) ( )

We constrained solutions to θ̂i ∈ [ ,π] and φ̂i ∈ [ , π] according to the (θ,φ) similarity relation.
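The association step can be sketched in Python as follows. Several things here are assumptions rather than the authors' method: the entries of the DCT basis are not legible in this text, so the values below are the standard 3 x 3 DCT basis, chosen because its sign pattern matches the vectors listed earlier; the pixel ordering is assumed to be x = −1, 0, 1 with y = 1, 0, −1 within each x; each adjacent pixel pair is counted once in the D-norm; the candidate embedding is taken to be cos φ q² + sin φ q with q = x cos θ + y sin θ; and `scipy.optimize.least_squares` stands in for MATLAB's lsqnonlin. All function names are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

# Assumed 3x3 DCT basis (unnormalised); rows follow the pixel ordering described above.
E = np.array([
    [1, 0, -1, 1, 0, -1, 1, 0, -1],
    [1, 1, 1, 0, 0, 0, -1, -1, -1],
    [1, -2, 1, 1, -2, 1, 1, -2, 1],
    [1, 1, 1, -2, -2, -2, 1, 1, 1],
    [1, 0, -1, 0, 0, 0, -1, 0, 1],
    [1, 0, -1, -2, 0, 2, 1, 0, -1],
    [1, -2, 1, 0, 0, 0, -1, 2, -1],
    [1, -2, 1, -2, 4, -2, 1, -2, 1],
], dtype=float)

# Pixel coordinates in the same x-major order as the basis vectors.
XS, YS = np.meshgrid([-1, 0, 1], [1, 0, -1], indexing="ij")


def d_norm(v):
    """Contrast norm: root of the summed squared differences over adjacent pixel pairs."""
    g = np.asarray(v, dtype=float).reshape(3, 3)
    return np.sqrt((np.diff(g, axis=0) ** 2).sum() + (np.diff(g, axis=1) ** 2).sum())


# Rescale each basis vector to unit D-norm.
E = E / np.array([d_norm(e) for e in E])[:, None]


def k_patch(theta, phi):
    """Vectorised candidate patch cos(phi) q^2 + sin(phi) q, with q = x cos(theta) + y sin(theta)."""
    q = XS * np.cos(theta) + YS * np.sin(theta)
    return (np.cos(phi) * q ** 2 + np.sin(phi) * q).ravel()


def normalise(y):
    """Project onto the DCT basis (which removes the mean) and scale to unit Euclidean norm."""
    coeffs = E @ y / (E ** 2).sum(axis=1)
    return coeffs / np.linalg.norm(coeffs)


def associate(v, theta0, phi0):
    """Find (theta, phi) whose normalised patch is closest to the normalised datapoint v."""
    result = least_squares(lambda p: normalise(k_patch(p[0], p[1])) - v, x0=[theta0, phi0])
    return result.x


# Usage: recover the parameters of a synthetic, noise-free patch from a nearby starting guess.
target = normalise(k_patch(0.7, 1.1))
theta_hat, phi_hat = associate(target, theta0=0.6, phi0=1.0)
```

With this basis, the Euclidean norm of the DCT coefficients equals the D-norm of the patch, which is why normalising the coefficient vector amounts to contrast normalisation.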
Optimal Klein bottle embedding

Let kα̂θ,φ ∈ Kθ,φ be the Klein bottle embedding that minimizes the sum of Euclidean distances in R between each image patch and the closest point on the embedding. To determine kα̂θ,φ given a functional form f, we initialized the coefficient vector α̂ to have zero entries everywhere except for the values used in K . We then iterated between optimizing for (θ̂i, φ̂i) according to equation and for α̂ as shown below using least-squares, until convergence:

$$\hat\alpha = \arg\min_{\alpha} \sum_i \left\| (u \circ c \circ h)\big(k^{\alpha}_{\hat\theta_i,\hat\phi_i}\big) - v_i \right\|$$

K ≡ kα̂θ,φ is the optimized Klein bottle embedding corresponding to f = [ , , , , ], for which results are shown in Figures h and S d.

Noisy Klein bottle embeddings

The set of n ≈ . × image patches was associated to K according to the procedure described in Methods section . . , yielding (θ̂i, φ̂i) values. Isotropic Gaussian noise of magnitude sσ was added element-wise in R (prior to normalization by u∘c) to h(k θ̂i,φ̂i ), where s = mediani{‖h(k θ̂i,φ̂i )‖ } ≈ . . Figures f,g and S a correspond to noise with σ = . , . and . .

Parameters for curvature estimation

For all scalar curvature computations on image patch datasets and Klein bottle embeddings, we set d = and ncalib = . Unless the neighborhoods were manually specified, we used σh = . , which yielded a flat distribution of GOF p-values ( . % of points reported GOF p-values ≤ α = . ) for the set of n ≈ . × points on K closest to the image patches (shown in Figure e).
Details of scRNAseq datasets

The PBMC dataset provided by 10x Genomics comprises n = PBMCs collected from a healthy donor [ ]. The mouse gastrulation dataset consists of n = cells collected at nine -hour intervals from embryonic day . to . [ ]. The mouse brain dataset is a benchmark from 10x Genomics consisting of n = cells collected from the cortex, hippocampus and ventricular zone of two embryonic mice sacrificed at embryonic day [ ].

Preprocessing

For the PBMC dataset, we applied standard preprocessing steps using Seurat v . . [ ] with default function arguments, to extract PC projections and UMAP coordinates ourselves. Specifically, we removed cells where the percentage of transcripts corresponding to mitochondrial genes exceeded %, or which had fewer than transcripts. This reduced the number of cells from to . On this filtered set, we normalized the data (NormalizeData(normalization.method=‘LogNormalize’, scale.factor= )), retained the most variable genes (FindVariableFeatures(selection.method=‘vst’, nfeatures= )), and scaled the data (ScaleData). Next, we performed linear dimensionality reduction using PCA down to dimensions (RunPCA(npcs= )) and generated UMAP coordinates for visualization (RunUMAP(dims = : )). For the gastrulation (brain) dataset, we did not preprocess the data ourselves but instead directly used the ( ) PC projections and UMAP (t-SNE) visualization coordinates provided with the dataset. Please refer to [ , ] for additional details.

Cell type annotations

For the PBMC dataset, the AddModuleScore(ctrl= ) function was used to compute the per-cell average expression of marker genes corresponding to seven different cell types [ ]. To prepare Figure S a, each cell was assigned the cell type for which its average marker gene expression was the highest. Cell type annotations for the gastrulation dataset (see Figure S a) were sourced from Figure c of [ ]. Cell type
annotations for the brain dataset (see Figure S a) are predicted labels sourced from [ ].

Statistical tests

Here we describe the statistical tests applied to scalar curvatures computed for the scRNAseq datasets.

Spatial precision of errorbars

Let m be the fraction of datapoints with % CIs containing the scalar curvatures reported by their respective kNNs. To check whether m was significantly larger than chance, we used a permutation test. We randomly assigned the kNN of each datapoint to be one of the n datapoints in the dataset and computed m. We repeated the procedure t = times to generate an empirical distribution of m for the null model of random neighbors. The reported p-value for each k is the fraction of the t trials for which m was greater than the value computed for data. See Figures S d, S d and S d.

Sensitivity to cell downsampling

To check the sensitivity of the computed scalar curvatures to the average density of cells, we discarded f% of cells at random from the ambient space computed using the original set of n datapoints, and recomputed scalar curvatures using the same ambient dimension, manifold dimension and neighborhood sizes as for the original dataset (see Methods section . . ). Let m be the fraction of downsampled datapoints with % CIs containing the scalar curvatures originally reported. Since the CIs grow as f increases, we checked whether m was significantly larger than chance by using a permutation test.
We randomly paired each of the % CIs computed after downsampling to one of the scalar curvatures reported by the downsampled points for the original dataset, and computed m. We repeated the procedure t = times to generate an empirical distribution of m for the null model. The reported p-value for each f is the fraction of the t trials for which m was greater than the value computed for data. See Figures S i, S i and S i.

Sensitivity to transcript downsampling

To check the sensitivity of the computed scalar curvatures to the capture efficiency and sequencing depth of the data, we discarded f% of transcripts at random from the single-cell count matrix for the PBMC dataset, then performed the same preprocessing steps described in Methods section . . . We recomputed scalar curvatures using the same ambient dimension, manifold dimension and neighborhood sizes as for the original dataset (see Methods section . . ). Let m be the fraction of datapoints with % CIs containing the scalar curvatures originally reported. To check whether m was significantly larger than chance, we used a permutation test. We randomly paired each of the % CIs computed after downsampling transcripts to one of the scalar curvatures computed for the original dataset, and computed m. We repeated the procedure t = times to generate an empirical distribution of m for the null model. The reported p-value for each f is the fraction of the t trials for which m was greater than the value computed for data. See Figure S j.
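The permutation tests described in this section share one structure, which can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the number of trials is a placeholder for the value of t that is not legible in this text, and the random pairing is assumed to be drawn with replacement.

```python
import numpy as np


def coverage(ci_low, ci_high, values):
    """Fraction of intervals [ci_low, ci_high] that contain their paired value (the statistic m)."""
    return np.mean((ci_low <= values) & (values <= ci_high))


def permutation_pvalue(ci_low, ci_high, values, n_trials=1000, seed=0):
    """Compare observed coverage to coverage under random pairing of intervals and values."""
    rng = np.random.default_rng(seed)
    observed = coverage(ci_low, ci_high, values)
    null = np.empty(n_trials)
    for trial in range(n_trials):
        # Null model: each interval is paired with a value drawn at random from the dataset.
        shuffled = rng.choice(values, size=len(values), replace=True)
        null[trial] = coverage(ci_low, ci_high, shuffled)
    # p-value: fraction of trials in which the null coverage exceeds the observed coverage.
    return observed, np.mean(null > observed)


# Usage on synthetic data: narrow intervals centred on widely spread values.
rng = np.random.default_rng(1)
curvatures = rng.normal(0.0, 5.0, size=500)
m_observed, p_value = permutation_pvalue(curvatures - 0.1, curvatures + 0.1, curvatures)
```

A small p-value indicates that the confidence intervals contain their paired scalar curvatures more often than they would contain randomly assigned ones.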
Parameters for curvature estimation

Let the variance explained by the ith PC be given by σi² and the cumulative fractional variance of the first m PCs by

$$c_m = \frac{\sum_{i=1}^{m} \sigma_i^2}{\sum_{i} \sigma_i^2}$$

For each dataset, we selected the ambient dimension as n = argmaxm{cm|cm ≤ . }, the manifold dimension as d = argmaxm{cm|cm ≤ . }, and considered the global length scale to be l = σd. (n,d,l) = ( , , . ), ( , , . ) and ( , , . ) for the PBMC, gastrulation and brain datasets respectively. For the three datasets, we computed scalar curvatures for manifold dimensions d− , d and d + . It was not always possible to select σh for each dataset and manifold dimension so that the distribution of GOF p-values was flat, according to our usual heuristic. For consistency, we therefore picked σh so that / of points had GOF p-values ≤ α = . . For manifold dimension (d − ,d,d + ), σh = ( . , . , . ), ( . , . , . ) and ( . , . , . ) for the PBMC, gastrulation and brain datasets respectively.

Acknowledgements

DS was funded in part by the Natural Sciences and Engineering Research Council of Canada (NSERC PGSD - - ). SW was supported by NCI U -CA and NIH NIGMS T GM . DS and SH acknowledge funding from NIH NIGMS R GM , U systems immunology pilot project grant at Harvard University, and the Harvard University William F. Milton Fund. The authors would like to thank Peter Kharchenko and Allon Klein for helpful discussions. Portions of this research were conducted on the O high performance compute cluster, supported by the Research Computing Group, at Harvard Medical School. See http://rc.hms.harvard.edu for more information.

Data and code availability

The van Hateren IML dataset is available at http://bethgelab.org/datasets/vanhateren and was loaded according to the instructions there. The PBMC dataset is available at https://support. xgenomics.com/single-cell-gene-expression/datasets/ . . /parent_ngsc _di_pbmc.
The gastrulation dataset can be retrieved using instructions found at https://github.com/marionilab/embryotimecourse. The brain dataset is available at https://support. xgenomics.com/single-cell-gene-expression/datasets/ . . / m_neurons. The software package described here to compute scalar curvature is available at https://gitlab.com/hormozlab/manifoldcurvature. All code and instructions to reproduce the numerics and figures in this study will be made available upon publication.

Supplementary figures
Figure S : The scalar curvature of S is poorly estimated using the Laplace-Beltrami operator. (a) The heat-trace with m terms (zm(x) in equation ) is shown for m = ∞ (black), m = (solid blue) and m = (solid red), when evaluated with analytical eigenvalues for S . Empirical eigenvalues were obtained by uniformly sampling n = points from S (see Figure a; Methods section . . . ) and estimating the Laplace-Beltrami (LB) operator using equations - . The heat-trace evaluated using these empirical eigenvalues is shown for m = (dashed blue) and m = (dashed red). The heat-trace evaluated using eigenvalues obtained by interpolating between the analytical and empirical values (z̃m(x; f) in equation ) is shown for m = and f = . (solid green). f signifies that the fractional error of the interpolated eigenvalues is reduced by −f relative to the empirical eigenvalues. f = corresponds to the analytical eigenvalues while f = corresponds to the empirical eigenvalues. The white region bounded by [x ,x ] indicates a candidate interval over which to fit a heat-trace to a quadratic in order to extract an estimate for the scalar curvature (see equations - ; Methods section . . ). On the one hand, since the knee of zm(x) shifts to the left as m increases (i.e. zm(x) converges from ∞), larger m results in more intervals for which zm(x) well-approximates z∞(x) and will therefore yield accurate scalar curvature estimates. On the other hand, the empirical heat-trace becomes a worse estimator for zm(x) as m increases.
(b) Scalar curvatures estimated by fitting z∞(x) to a quadratic over different intervals [x ,x ] as defined in (a). Scalar curvatures are shown in color for intervals yielding accurate estimates (s ∈ [ . , . ]). This colored region corresponds to d∞. (c) As in (b) but with estimates obtained by fitting a quadratic to z (x). The colored region corresponds to d . By inspection, d ⊂ d∞. (d) Scalar curvatures estimated by fitting z (x) to a quadratic over each interval in d . Though d was constructed using only intervals which yielded an accurate scalar curvature estimate when analytical eigenvalues were used in the heat-trace, no interval in d yields an accurate scalar curvature estimate when the same number of empirical eigenvalues are used in the heat-trace instead. (e) As in (b) but with estimates obtained by fitting a quadratic to z (x). The colored region corresponds to d . By inspection, d ⊂ d . (f) As in (d) but with estimates obtained by fitting z (x) to a quadratic over each interval in d . No estimate is accurate, just as in (d). (g) As in (f) but with estimates obtained by fitting z̃ (x; f = . ) to a quadratic over each interval in d . f = . was chosen so that half the intervals in d yield an accurate scalar curvature estimate. (h) (Left) The fractional error in the first empirical eigenvalues of the LB estimator from (a) is shown in red. This operator was computed using the Gaussian kernel (wg in equation ). Eigenvalues - have a fractional error of %. The fractional error of the eigenvalues of LB estimators computed on the same n = points but using the weighted kNN and r-neighborhood kernels (wknn and wr respectively in equation ) is also plotted. Positive error indicates under-estimation. (Right) Projected fractional error for eigenvalues - of the LB estimator with Gaussian kernel computed using a larger sample size (n). The projection is based on the convergence rate given in Theorem of [ ], assuming that the big-O bound is sharp at n = for eigenvalues - .
the dashed green line corresponds to the % fractional error needed for scalar curvatures to be accurately estimated for half the intervals in d . this corresponds to f = . in (g) since % ×f = %. for the lb estimator computed using the gaussian kernel, achieving this fractional error requires n ≈ . since lb estimators computed using the other kernels have the same convergence rate but larger fractional error at n = , these estimators would require even larger n to achieve the desired % fractional error.

figure s : sensitivity of algorithm to real-world confounders. (a) (left) a dataset with a sparse periphery and a dense core was formed by uniformly sampling n = points from the -dimensional cube of side-length , d , and n = points from the -dimensional cube of side-length , d (see methods section . . . ). these points were embedded in r and padded with isotropic gaussian noise of magnitude σ = . in the normal directions. scalar curvatures (s) were computed on this dataset of n + n points by setting σh and are plotted against their standard errors (σs) in the leftmost panel. curvature computations were also performed at fixed length scales corresponding to the , and %-ile values for neighborhood size (left to right) used in the leftmost panel (r = . , . and . respectively). here, points for which the chosen r led to neighborhoods with insufficient points for regression are not shown. for large length scales, all points in the dense region are able to report curvatures but are crowded into the apex of the plots.
the n (n ) sparse (dense) points are shown in blue (green). points enclosed by the red lines have % cis including the true value of zero. the right four panels show analogous results when n = . here the , and %-ile values for neighborhood size are r = . , . and . respectively. see methods section . . . . (b) distribution of scalar curvatures computed for n = points uniformly sampled from s ⊂ r and convoluted with isotropic gaussian noise of magnitude σ in r . noise confounds accurate scalar curvature computation when σ is roughly % of the sphere’s radius. the deviation of the estimated scalar curvatures from the true value of (shown as a dashed red line) for σ ≥ . reflects the nontrivial geometry of a manifold convoluted by noise. see methods section . . . . (c) (left) n = points were uniformly sampled from s and embedded in rn. isotropic gaussian noise of magnitude σ was applied to each of the n ambient dimensions. scalar curvatures computed by keeping σh fixed for all n and σ, recapitulated the true value of (shown as dashed red lines) for n ≤ and σ ≤ . . (right) the neighborhood size (r) necessary to attain σh is less sensitive to changes in n than changes in σ. see methods section . . . . (d) n = points were uniformly sampled from (left) s ⊂ r convoluted with isotropic gaussian noise in the ambient space with σ = . and (right) s ×s ⊂ r . to investigate the effects of choosing the manifold dimension, d, differently than the true value, d∗, σh was kept fixed, and scalar curvatures were computed for d = d ∗− (cyan), d = d∗ + (magenta) and d = d∗ (green). the panels show the distribution of (left to right) scalar curvatures (s), standard errors (σs) and gof p-values. the true value of the scalar curvature (at d = d∗) is constant across both manifolds and shown as a dashed red line. the average neighborhood size (r averaged over all points) is much larger for both d = d∗ − and d = d∗ + than for d = d∗ as shown in the legend.
for the same σh, d = d ∗− also leads to a more skewed distribution of gof p-values relative to d = d∗, while the distribution for d = d∗ + is still flat. see methods section . . . .

figure s : additional details of the image patch dataset and klein bottle embeddings (related to figure ). (a) to compute scalar curvatures for figure e, each image patch was associated to the (θ ,φ ) coordinates of the closest point on k . here we select a handful of these associated points on k (shown in black) and visualize how neighborhoods chosen in r to compute scalar curvatures for figure e appear in (θ ,φ ) coordinates (shown in red). when noise of increasing magnitude, σ, is added to the set of closest points on k (see methods section . . ), the neighborhood size at each point grows until σh is attained. (b) as in (a), but showing neighborhoods used in computing the scalar curvatures in figure d for the image patch dataset. note the close correspondence in neighborhood size with σ = . in (a). (c) scalar curvatures computed for the set of closest points (θ ,φ ) on k as in figure e, but using the same neighborhood sizes determined for the image patch dataset shown in figure d, some of which are visualized in (b). (d) as in (a) but showing neighborhoods used in computing the scalar curvatures in figure h for the set of closest points on k . neighborhoods are visualized on (θ ,φ ) coordinates instead of (θ ,φ ) coordinates for ease of comparison. (e) as in (b) but showing neighborhoods used in computing the scalar curvatures in figure j for the augmented image patch dataset.
(f) scalar curvatures computed for the augmented image patch dataset with n ≈ . × points as in figure j, but using the same neighborhood sizes determined for the original image patch dataset with n ≈ . × shown in figure d and (b). note the close correspondence with figure d.

figure s : additional details of the pbmc scrnaseq dataset (related to figure ). (a) cell types overlaid onto umap coordinates and sorted in decreasing order of abundance in the legend. cells were annotated as described in methods section . . . (b) a goodness-of-fit p-value was computed for each point by applying mardia’s test to the residuals obtained from fitting the neighborhood around the point to a quadratic function (see methods section . . ). these p-values are visualized on umap coordinates corresponding to each point (left) and their empirical distribution is shown using a histogram (right). small p-values suggest that the residuals are non-normal so that approximating local neighborhoods as quadratic may not be valid. (c) pearson correlation between the scalar curvature reported by each point and its kth-nearest neighbor (knn) for different k (shown in blue).
the red bar shows the mean and standard deviation of the pearson correlation when neighbors are chosen randomly over trials (*p < − ). (d) the percentage of points with % cis containing the scalar curvatures reported by their respective knns (shown in blue). the red bar shows the mean and standard deviation of this percentage when neighbors are chosen randomly over trials (*p < . ; see methods section . . . ). (e) the neighborhood size (r) used for computing scalar curvature at each point, overlaid onto umap coordinates (left) and a corresponding histogram of the empirical distribution (right). the dashed red lines correspond to the , , and %-ile values of r(p) used for computing scalar curvatures at fixed neighborhood sizes for figure c. see methods section . . . (f) the number of points in each neighborhood (corresponding to the neighborhood sizes in (e)) overlaid onto umap coordinates (left) and a corresponding histogram of the empirical distribution (middle). (right) the set of neighbors used for computing scalar curvature (purple) is visualized on umap coordinates for a handful of points (black). (g) scalar curvatures were computed for manifold dimension d− (left) and d + (right). they are plotted here on umap coordinates after smoothing over the same set of k = neighbors used in figure a. see methods section . . . (h) the total number of transcripts observed in each cell overlaid onto umap coordinates. (i) scalar curvatures were computed after downsampling the number of cells in the ambient space by a factor of (left) and (middle), using the same ambient dimension, manifold dimension and neighborhood sizes determined for the original dataset. they are plotted here on umap coordinates after smoothing over the same set of neighbors (which survive downsampling) used in figure a.
(right) the percentage of points in the downsampled datasets with a % ci containing the originally reported scalar curvature (blue), and likewise for a negative control obtained by randomly pairing % cis and originally reported scalar curvatures for points in the downsampled dataset (red). errorbars for the negative control are the standard deviation of this percentage over trials with different random pairings (*p < . ; see methods section . . . ). (j) scalar curvatures were computed after downsampling the number of transcripts by a factor of (left) and (middle), using the same ambient dimension, manifold dimension and neighborhood sizes determined for the original dataset. they are plotted here on umap coordinates after smoothing over the same set of k = neighbors used in figure a. (right) the percentage of points in the downsampled datasets with a % ci containing the originally reported scalar curvature (blue), and likewise for a negative control obtained by randomly pairing % cis and originally reported scalar curvatures for points in the downsampled dataset (red). errorbars for the negative control are the standard deviation of this percentage over trials with different random pairings (*p < . ; see methods section . . . ).

figure s : additional details of the gastrulation scrnaseq dataset (related to figure ). panels as in figure s .
figure s : additional details of the brain scrnaseq dataset (related to figure ). panels as in figure s but with t-sne instead of umap plots.

references

[ ] a. m. klein, l. mazutis, i. akartuna, n. tallapragada, a. veres, v. li, l. peshkin, d. a. weitz, and m. w. kirschner. droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. cell, ( ): – , .
[ ] e. z. macosko, a. basu, r. satija, j. nemesh, k. shekhar, m. goldman, i. tirosh, a. r. bialas, n. kamitaki, e. m. martersteck, j. j. trombetta, d. a. weitz, j. r. sanes, a. k. shalek, a. regev, and s. a. mccarroll. highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. cell, ( ): – , .
[ ] g. x. y. zheng, j. m. terry, p. belgrader, p. ryvkin, z. w. bent, r. wilson, s. b. ziraldo, t. d. wheeler, g. p. mcdermott, j. zhu, m. t. gregory, j. shuga, l. montesclaros, j. g. underwood, d. a. masquelier, s. y. nishimura, m. schnall-levin, p. w. wyatt, c. m. hindson, r. bharadwaj, a. wong, k. d. ness, l. w. beppu, h. j. deeg, c. mcfarland, k. r. loeb, w. j. valente, n. g. ericson, e. a. stevens, j. p. radich, t. s. mikkelsen, b. j. hindson, and j. h. bielas. massively parallel digital transcriptional profiling of single cells. nature communications, ( ): – , .
[ ] d. r. bandura, v. i. baranov, o. i. ornatsky, a. antonov, r. kinach, x. lou, s.
pavlov, s. vorobiev, j. e. dick, and s. d. tanner. mass cytometry: technique for real time single cell multitarget immunoassay based on inductively coupled plasma time-of-flight mass spectrometry. analytical chemistry, ( ): – , .
[ ] c. giesen, h. a. o. wang, d. schapiro, n. zivanovic, a. jacobs, b. hattendorf, p. j. schüffler, d. grolimund, j. m. buhmann, s. brandt, z. varga, p. j. wild, d. günther, and b. bodenmiller. highly multiplexed imaging of tumor tissues with subcellular resolution by mass cytometry. nature methods, ( ): – , .
[ ] j-r. lin, m. fallahi-sichani, j-y. chen, and p. k. sorger. cyclic immunofluorescence (cycif), a highly multiplexed method for single-cell imaging. current protocols in chemical biology, ( ): – , .
[ ] j-r. lin, b. izar, s. wang, c. yapp, s. mei, p. m. shah, s. santagata, and p. k. sorger. highly multiplexed immunofluorescence imaging of human tissues and tumors using t-cycif and conventional optical microscopes. elife, , .
[ ] l. h. nguyen and s. holmes. ten quick tips for effective dimensionality reduction. plos computational biology, ( ):e , .
[ ] j. b. tenenbaum. a global geometric framework for nonlinear dimensionality reduction. science, ( ): – , .
[ ] l. van der maaten and g. hinton. visualizing data using t-sne. journal of machine learning research, (nov): – , .
[ ] e. becht, l. mcinnes, j. healy, c-a. dutertre, i. w. h. kwok, l. g. ng, f. ginhoux, and e. w. newell. dimensionality reduction for visualizing single-cell data using umap. nature biotechnology, ( ): – , .
[ ] a. hatcher. algebraic topology.
cambridge university press, .
[ ] r. ghrist. barcodes: the persistent topology of data. bulletin of the american mathematical society, ( ): – , .
[ ] d. perrault-joncas and m. meilâ. non-linear dimensionality reduction: riemannian metric estimation and the problem of geometric discovery. arxiv, .
[ ] j. m. lee. riemannian manifolds: an introduction to curvature (graduate texts in mathematics). springer, .
[ ] a. zomorodian and g. carlsson. computing persistent homology. discrete & computational geometry, ( ): – , .
[ ] g. carlsson. topology and data. bulletin of the american mathematical society, ( ): – , .
[ ] m. bernstein, v. de silva, j. c. langford, and j. b. tenenbaum. graph approximations to geodesics on embedded manifolds. technical report, department of psychology, stanford university, .
[ ] f. chazal, m. glisse, c. labruère, and b. michel. convergence rates for persistence diagram estimation in topological data analysis. journal of machine learning research, ( ): – , .
[ ] c. r. genovese, m. perone-pacifico, i. verdinelli, and l. wasserman. minimax manifold estimation. journal of machine learning research, ( ): – , .
[ ] g. carlsson, t. ishkhanov, v. de silva, and a. zomorodian. on the local behavior of spaces of natural images. international journal of computer vision, ( ): – , .
[ ] p. lawson, a. b. sholl, j. q. brown, b. t. fasy, and c. wenk. persistent homology for the quantitative evaluation of architectural features in prostate cancer histology. scientific reports, ( ): – , .
[ ] j. m. chan, g. carlsson, and r. rabadan. topology of viral evolution.
proceedings of the national academy of sciences, ( ): – , .
[ ] p. g. cámara, a. j. levine, and r. rabadán. inference of ancestral recombination graphs through topological data analysis. plos computational biology, ( ):e , .
[ ] e. abbott. flatland: a romance of many dimensions. princeton university press, .
[ ] m. belkin and p. niyogi. laplacian eigenmaps and spectral techniques for embedding and clustering. advances in neural information processing systems, : – , .
[ ] m. reuter, f-e. wolter, and n. peinecke. laplace–beltrami spectra as ‘shape-dna’ of surfaces and solids. computer-aided design, ( ): – , .
[ ] m. belkin, j. sun, and y. wang. constructing laplace operator from point clouds in rd. in proceedings of the twentieth annual acm-siam symposium on discrete algorithms, pages – , .
[ ] j. liang, r. lai, t. w. wong, and h. zhao. geometric understanding of point clouds using laplace-beltrami operator. in ieee conference on computer vision and pattern recognition, pages – , .
[ ] n. g. trillos, m. gerlach, m. hein, and d. slepčev. error estimates for spectral convergence of the graph laplacian on random geometric graphs toward the laplace–beltrami operator. foundations of computational mathematics, ( ): – , .
[ ] h. p. mckean jr. and i. m. singer. curvature and the eigenvalues of the laplacian. journal of differential geometry, ( - ): – , .
[ ] b. andrews. lectures on differential geometry. https://maths-people.anu.edu.au/~andrews/dg. australian national university.
[ ] i. t. jolliffe and j. cadima. principal component analysis: a review and recent developments. philosophical transactions of the royal society a: mathematical, physical and engineering sciences, ( ): , .
[ ] h. federer. curvature measures. transactions of the american mathematical society, ( ): – , .
[ ] p. niyogi, s. smale, and s. weinberger. finding the homology of submanifolds with high confidence from random samples. discrete & computational geometry, ( - ): – , .
[ ] u. ozertem and d. erdogmus. locally defined principal curves and surfaces. journal of machine learning research, : – , .
[ ] c. r. genovese, m. perone-pacifico, i. verdinelli, and l. wasserman. nonparametric ridge estimation. the annals of statistics, ( ): – , .
[ ] r. w. buccigrossi and e. p. simoncelli. image compression via joint statistical characterization in the wavelet domain. ieee transactions on image processing, ( ): – , .
[ ] j. malik, s. belongie, t. leung, and j. shi. contour and texture analysis for image segmentation. international journal of computer vision, ( ): – , .
[ ] a. b. lee, k. s. pedersen, and d. mumford. the nonlinear statistics of high-contrast patches in natural images. international journal of computer vision, ( - ): – , .
[ ] j. h. van hateren and a. van der schaaf. independent component filters of natural images compared with simple cells in primary visual cortex. proceedings: biological sciences, ( ): – , .
[ ] x genomics. pbmcs from a healthy donor: whole transcriptome analysis. https://support. xgenomics.com/single-cell-gene-expression/datasets/ . . /parent_ngsc _di_pbmc, .
[ ] b. pijuan-sala, j. a. griffiths, c. guibentif, t. w. hiscock, w. jawaid, f. j. calero-nieto, c. mulas, x. ibarra-soria, r. c. v. tyser, d. l. l. ho, w. reik, s. srinivas, b. d. simons, j. nichols, j. c. marioni, and b. göttgens. a single-cell molecular map of mouse gastrulation and early organogenesis. nature, ( ): – , .
[ ] x genomics. . million brain cells from e mice. https://support.
xgenomics.com/single-cell-gene-expression/datasets/ . . / m_neurons, .
[ ] d. van dijk, r. sharma, j. nainys, k. yim, p. kathail, a. j. carr, c. burdziak, k. r. moon, c. l. chaffer, d. pattabiraman, b. bierie, l. mazutis, g. wolf, s. krishnaswamy, and d. pe’er. recovering gene interactions from single-cell data using data diffusion. cell, ( ): – , .
[ ] l. haghverdi, m. büttner, f. a. wolf, f. buettner, and f. j. theis. diffusion pseudotime robustly reconstructs lineage branching. nature methods, ( ): – , .
[ ] a. klimovskaia, d. lopez-paz, l. bottou, and m. nickel. poincaré maps for analyzing complex hierarchies in single-cell data. nature communications, ( ): – , .
[ ] s. wang, j-r. lin, e. d. sontag, and p. k. sorger. inferring reaction network structure from single-cell, multiplex data, using toric systems theory. plos computational biology, ( ):e , .
[ ] m. hein, j-y. audibert, and u. von luxburg. graph laplacians and their convergence on random neighborhood graphs. journal of machine learning research, ( ): – , .
[ ] d. ting, l. huang, and m. jordan. an analysis of the convergence of graph laplacians. arxiv, .
[ ] k. v. mardia. measures of multivariate skewness and kurtosis with applications.
biometrika, ( ): – , .
[ ] p. campadelli, e. casiraghi, c. ceruti, and a. rozza. intrinsic dimension estimation: relevant techniques and a benchmark framework. mathematical problems in engineering, : – , .
[ ] a. butler, p. hoffman, p. smibert, e. papalexi, and r. satija. integrating single-cell transcriptomic data across different conditions, technologies, and species. nature biotechnology, ( ): – , .
[ ] y. hu, m. ranganathan, c. shu, x. liang, s. ganesh, a. osafo-addo, c. yan, x. zhang, b. e. aouizerat, j. h. krystal, d. c. d’souza, and k. xu. single-cell transcriptome mapping identifies common and cell-type specific genes affected by acute delta -tetrahydrocannabinol in humans. scientific reports, ( ): – , .
[ ] k. xie, y. huang, f. zeng, z. liu, and t. chen. scaide: clustering of large-scale single-cell rna-seq data reveals putative and rare cell types. nar genomics and bioinformatics, ( ), .
introduction results estimators of the laplace-beltrami operator yield inaccurate scalar curvatures curvature can be computed accurately using the second fundamental form curvature of image patch manifold is consistent with a noisy klein bottle scrnaseq datasets have non-trivial intrinsic curvature discussion methods differential geometry of theoretical manifolds details of intrinsic approach to curvature estimation approach for s infinite series truncated series eigenvalue convergence estimating the laplace-beltrami operator from data details of extrinsic approach to curvature estimation quadratic regression on local neighborhoods of data selecting local neighborhoods for regression goodness-of-fit test for quadratic regression standard error and bias of scalar curvature estimate note on length scales details of toy manifold curvature computations analytical forms hypersphere one-sheet hyperboloid ring torus hypercube practical issues for curvature estimation on real-world datasets non-uniform sampling observational noise large ambient dimension choice of manifold dimension parameters for curvature estimation details of image patch dataset and klein bottle manifolds notation and preliminaries image dataset parametric family of klein bottle embeddings associating image patches to a klein bottle embedding optimal klein bottle embedding noisy klein bottle embeddings parameters for curvature estimation details of scrnaseq datasets preprocessing cell type annotations statistical tests spatial precision of errorbars sensitivity to cell downsampling sensitivity to transcript downsampling parameters for curvature estimation acknowledgements data and code availability supplementary figures references

deephbv: a deep learning model to predict hepatitis b virus (hbv) integration sites.
canbiao wu ¶, xiaofang guo ¶, mengyuan li ¶, xiayu fu , zeliang hou , manman zhai , , jingxian shen , xiaofan qiu , zifeng cui , hongxian xie , pengmin qin , xuchu weng , zheng hu , *, jiuxing liang *

key laboratory of brain, cognition and education sciences, ministry of education, china; institute for brain research and rehabilitation, south china normal university, guangzhou, china.
department of medical oncology of the eastern hospital, the first affiliated hospital, sun yat-sen university, guangzhou, guangdong, china
department of gynecological oncology, the first affiliated hospital, sun yat-sen university, guangzhou, guangdong, china
department of thoracic surgery, the first affiliated hospital, sun yat-sen university, guangzhou, guangdong, china
school of psychology, south china normal university, guangzhou, guangdong, china
generulor company bio-x lab, guangzhou, guangdong, china
department of obstetrics and gynecology, tongji hospital, tongji medical college, huazhong university of science and technology, wuhan, hubei, china

*corresponding author email: huzheng @ .com(zh), liangjiuxing@m.scnu.edu.cn(jl)
¶these authors contributed equally to this work.

.cc-by . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . /

abstract

hepatitis b virus (hbv) is one of the main causes of viral hepatitis and liver cancer. previous studies showed that hbv can integrate into the host genome and further promote malignant transformation. in this study, we developed an attention-based deep learning model, deephbv, to predict hbv integration sites by learning local genomic features automatically.
we trained and tested deephbv using the hbv integration site data from the dsvis database. initially, deephbv showed an auroc of . and an aupr of . on the dataset. adding repeat peaks and tcga pan cancer peaks significantly improved the model performance, with an auroc of . and . and an aupr of . and . , respectively. on an independent validation dataset of hbv integration sites from visdb, deephbv with hbv integration sequences plus tcga pan cancer peaks (auroc of . and aupr of . ) performed better than deephbv with hbv integration sequences plus repeat peaks (auroc of . and aupr of . ). next, we found that transcription factor binding sites (tfbs) were significantly enriched near the genomic positions that received attention from the convolutional neural network. the binding sites of ar-halfsite, arnt, atf , bhlhe , bhlhe , bmal , clock, c-myc, coup-tfii, e a, ebf , erra and foxo were highlighted by the deephbv attention mechanism in both the dsvis dataset and the visdb dataset, revealing the hbv integration preference. in summary, deephbv is a robust and explainable deep learning model not only for the prediction of hbv integration sites but also for further mechanistic study of hbv-induced cancer.

author summary

hepatitis b virus (hbv) is one of the main causes of viral hepatitis and liver cancer. previous studies showed that hbv can integrate into the host genome and further promote malignant transformation. in this study, we developed an attention-based deep learning model, deephbv, to predict hbv integration sites by learning local genomic features automatically.
the performance of the deephbv model improves significantly after adding genomic features, with an auroc of . and an aupr of . . furthermore, we found that transcription factor binding sites were enriched near the positions highlighted by the convolutional neural network. in summary, deephbv is a robust and explainable deep learning model not only for the prediction of hbv integration sites but also for the further study of the hbv integration mechanism.

introduction

hbv is the main cause of viral hepatitis and liver cancer (hepatocellular carcinoma: hcc) [ ]. it is a small dna virus that can integrate into the host genome via an rna intermediate [ ]. first, hbv attaches to and enters hepatocytes, then transports its nucleocapsid, which contains a relaxed circular dna (rcdna), to the host nucleus. in the host nucleus, rcdna is converted into covalently closed circular dna (cccdna), which produces messenger rnas (mrna) and pregenomic rna (pgrna) by transcription. via reverse transcription in the host nucleus, pgrna produces new rcdna and double-stranded linear dna (dsldna), which tend to integrate into the host cell genome [ ]. a previous study showed that hbv integration breakpoints are distributed randomly across the whole genome with a handful of hotspots [ ]. for instance, hbv was reported to recurrently integrate into the telomerase reverse transcriptase (tert) and myeloid/lymphoid or mixed-lineage leukemia (mll , also known as kmt b) genes. the insertional events were also accompanied by altered expression of the integrated gene [ , , ], indicating important biological impacts on the local genome.
further analysis revealed an association between hbv integration and genomic instability in these insertional events [ ]. moreover, significant enrichment of hbv integration was found near the following genomic features in tumours compared to non-tumour tissue: repetitive regions, fragile sites, cpg islands and telomeres [ ]. however, the pattern and the mechanism of hbv integration remain to be explored. many hbv integration sites are distributed throughout the human genome and seem completely random [ , , ]. whether the features and patterns of these “random” viral integration events can be learned and extracted remains an open question; answering it would greatly improve the understanding of hbv integration-induced carcinogenesis. deep learning performs very well in computational biology research, for example in medical image identification [ ] and in discovering motifs in protein sequences [ ]. the convolutional neural network (cnn) is a core deep learning architecture that enables a computer to learn from training data [ ]. although deep learning performs well in a variety of fields, how it reaches its decisions is hard to explain because of its black-box nature. therefore, an approach named the attention mechanism, which highlights the most influential parts of the input, was developed to open the “black box” [ , ]. in this study, we developed deephbv, an attention-based deep learning model to predict hbv integration sites.
the attention mechanism calculates the attention weight for each position and, at the same time, connects the encoder and the decoder. it highlights the regions that deephbv focuses on and helps to identify the patterns that receive attention. deephbv can predict hbv integration sites accurately and specifically, and the attention mechanism identified positions with potentially important biological meaning.

results
deephbv effectively predicts hbv integration sites by adding genomic features.
the deephbv model structure and the scheme of encoding a kb sample into a binary matrix are described in fig . the deephbv model was tested with our hbv integration sites database (http://dsvis.wuhansoftware.com). hbv integration sequences were prepared from hbv integration sites as positive/negative samples following the steps in methods. we used twice as many negative samples as positive samples to keep the data balanced and to improve the confidence level. the positive samples were divided into and as the positive training and testing datasets. correspondingly, we extracted and negative samples as the negative training and testing datasets. deephint, an existing deep learning model for predicting hiv integration sites from their surroundings [ ], was also evaluated using hbv integration sequences for training and testing. both models were trained on the same hbv integration training dataset and evaluated on the same testing dataset. deephbv with hbv integration sequences showed an auroc of . and an aupr of .
while deephint with hbv integration sequences demonstrated an auroc of . and an aupr of . (fig ). the comparison of deephbv and deephint is described in the discussion. several previous studies showed that hbv integration has a preference for surrounding genomic features such as repeats, histone marks and cpg islands [ , ]. thus, we added these genomic features to deephbv by mixing genomic feature samples with hbv integration sequences as new datasets, then trained and tested the updated deephbv models. we downloaded the following genomic features from different datasets [ - ] and grouped them into four subgroups: (1) dnase clusters, fragile site, repeatmasker; (2) cpg islands, genehancer; (3) cons mammals, tcga pan-cancer; (4) h k me chip-seq, h k ac chip-seq (s fig). after obtaining the positions of the genomic feature data (sources are listed in s table), we extended the positions to bp and extracted the corresponding sequences from the hg reference genome. we defined these sequences as positive genomic feature samples. we then mixed hbv integration sequences, positive genomic feature samples and randomly picked negative genomic feature samples (see methods) and trained the deephbv model. once a subgroup performed well, we re-tested each genomic feature in that subgroup to determine which specific genomic feature significantly affected model performance (s fig; auroc and aupr values are recorded in s table). from the roc and pr curves, we found that deephbv with hbv integration sites plus the genomic features repeat (auroc: . and aupr: . ) and tcga pan cancer (auroc: . and aupr: .
) can significantly improve hbv integration site prediction performance compared with deephbv with hbv integration sequences alone (fig ). we also performed the same test on deephint, but did not find a subgroup that substantially improved the model performance (results are recorded in s table). together, adding repeat or tcga pan cancer peaks to the hbv integration sequences significantly improved the performance of deephbv.

validation of deephbv using independent dataset visdb
because deephbv needs to be applicable to general datasets, we tested the pre-trained deephbv models (deephbv with hbv integration sequences + repeat peaks and deephbv with hbv integration sequences + tcga pan cancer peaks) on the hbv integration sites dataset from another virus integration site (vis) database, visdb [ ]. the model trained with hbv integration sequences + repeat sequences showed an auroc of . and an aupr of . , while the model trained with hbv integration sequences + tcga pan cancer showed an auroc of . and an aupr of . . the deephbv model with hbv integration sequences + tcga pan cancer performed better than the deephbv model with hbv integration sequences + repeat and was more robust on both the testing dataset from dsvis (auroc: . and aupr: . ) and the independent testing dataset from visdb (auroc: . and aupr: . ). thus, we decided to use this model for further study of hbv integration sites.
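the two metrics reported above can be illustrated with a minimal sketch. it is not the authors' evaluation code: the function names `auroc` and `average_precision` are hypothetical, the labels and scores are toy values, and average precision is used here as one common step-wise estimate of the area under the pr curve (an assumption, since the text does not say how aupr was computed).

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a random positive sample scores higher
    than a random negative sample (ties count as one half).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of the precision at each positive in ranked order."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, y in ranked if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Toy data only: 1 = integration site, 0 = background.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(auroc(labels, scores))
print(average_precision(labels, scores))
```

the pairwise form of auroc is quadratic in the number of samples, which is fine for a sketch; a rank-based implementation would be preferable for datasets of realistic size.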
study the preference pattern of hbv integration by conserved sequence elements
deephbv can extract features with translation invariance through its pooling operation, which enables it to recognise certain patterns even if the features are slightly shifted. incorporating the attention mechanism into the deephbv framework may partly open the deep learning black box by giving an attention weight to each position. each attention weight represents the computational importance of that position in the deephbv judgement. the attention weights in the attention layer were extracted after two de-convolution operations and one de-pooling operation, and the output shape is × . each score represents the attention weight of a bp region. positions with higher attention weight scores may have a greater impact on the pattern recognition of deephbv, meaning that these positions may be critical for identifying hbv integration positive samples. we first averaged the fractions of attention scores over all hbv integration sequences and normalized them to the mean of all positions. we then visualised the fractions of attention scores and found peak-valley-peak patterns only in positive samples (fig ). we were interested in the positions with higher attention weights in the convolutional neural network. in the attention weight distribution of deephbv with hbv integration sites + tcga pan cancer, a cluster of attention weights much higher than the other weights often occurred in the positive samples.
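the averaging and normalisation step described above can be read in more than one way; the sketch below shows one plausible reading, not the authors' code. it assumes that a "fraction" is a weight divided by the sum of weights in the same sample and that all weights are positive; the function name `normalised_attention_profile` is hypothetical.

```python
def normalised_attention_profile(weights):
    """Average per-sample attention fractions and scale to a mean of 1.

    weights: list of per-sample attention weight lists, all of equal length.
    Returns one value per position; values above 1 mark positions that
    receive more attention than the average position.
    """
    if not weights:
        raise ValueError("no samples given")
    n_positions = len(weights[0])

    # Step 1: express each weight as a fraction of its sample's total.
    fractions = []
    for sample in weights:
        if len(sample) != n_positions:
            raise ValueError("all samples must have the same length")
        total = sum(sample)
        fractions.append([w / total for w in sample])

    # Step 2: average the fractions over samples, position by position.
    mean_fraction = [
        sum(f[i] for f in fractions) / len(fractions) for i in range(n_positions)
    ]

    # Step 3: divide by the mean over positions so the profile averages to 1.
    overall_mean = sum(mean_fraction) / n_positions
    return [m / overall_mean for m in mean_fraction]


# Toy example: two samples, three positions.
profile = normalised_attention_profile([[1.0, 1.0, 2.0], [2.0, 2.0, 4.0]])
print(profile)
```

scaling to a mean of 1 makes profiles from models trained on different datasets directly comparable on one axis.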
in contrast, the model of deephbv with hbv integration sites + repeat did not show this pattern (fig ). to further explore the pattern behind the positions with higher attention weights, we defined the sites with the top % highest attention weights as attention intensive sites, and the regions of bp near them as attention intensive regions. we mapped these attention intensive sites onto the hg reference genome together with genomic features (fig ), but found that the positional relationship between attention intensive sites and genomic features was not clear. these results indicated that other specific patterns closely related to hbv integration preference may exist and, when analysed carefully, could be recognised by the deephbv model. in deep learning, the convolution and pooling module learns patterns with translation invariance; as a result, the network tends to learn domains that recur among different samples in the same pooling matrix, even if the learned feature is not at the same position in these samples [ , ]. attention intensive regions are therefore more likely to be conserved, owing to the translation invariance of the convolution and pooling module, and may give hints about the selection preference of hbv integration sites. transcription factor binding site (tfbs) motifs are conserved genomic elements that can be critical to the regulation of downstream genes. therefore, we tested whether tfbs play important roles in hbv integration preference. we used all hbv integration samples whose prediction scores were higher than .
from dsvis and visdb separately to search for enriched local tfbs motifs in attention intensive regions using homer v . . [ ] with its vertebrate transcription factor database (table ). in the results of deephbv with hbv integration sequences + tcga pan cancer, binding sites of ar-halfsite, arnt, atf , bhlhe , bhlhe , bmal , clock, c-myc, coup-tfii, e a, ebf , erra, foxo , heb, hic , hif- b, lrf, meis , mitf, mnt, myog, n-myc, npas , npas, nr a , ptf a, snail , tbx , tbx , tcf , tead , tead , tead , tead, tgif , tgif , thrb, usf , usf , zac , zeb , zfx, znf , znf were enriched in the attention intensive regions of both dsvis and visdb sequences. we selected two representative samples for a more intuitive display. genomic features, hbv integration sites from dsvis and visdb, attention intensive sites and tfbs were aligned and shown on the hg reference genome (fig ). most attention intensive sites could be mapped to enriched tf motifs. the clusters of high attention weight from the output of deephbv with hbv integration sites plus tcga pan cancer covered the binding sites of the tumour suppressor gene hic and of the circadian clock related elements bmal , clock, c-myc and npas (fig ). these data provide novel insights into hbv integration site selection preference and reveal biological importance that warrants future experimental confirmation.

table . enriched tfbs from attention intensive regions of deephbv with hbv integration sites + tcga pan cancer peaks.
homer known results homer de novo results
rank name p-value rank best match/details p-value
bmal e- tead e- npas . e- ebf e- clock . e- tcf e- c-myc . e- grhl e- zfx . e- dux e- tgif . e- ptf a e- mnt . e- tead e- lrf . e- ahr::arnt . e- tbx . e- sox . e- znf . e- tead . e- n-myc . e- zic . e- znf . e- nr e . e- usf . e- sox . e- bhlhe . e- zbtb . e- rbpj . e- usf . e- zac . e- isl . e- tgif . e- znf . e- zeb . e- ascl . e- thrb . e- znf . e- ptf a . e- lrf . e- bhlhe . e- znf . e- tead . e- pknox . e- stat . e- bcl b . e- meis . e- arnt . e- c-myc . e- osr . e- usf . e- tfap a . e- npas . e- hic . e- tead . e- tead . e- ar-halfsite . e- stat . e- tcf . e- mitf . e- tead . e- atf . e- hif- b . e- foxo . e- e a . e- tead . e- mef a . e- znf . e- nkx . . e- coup-tfii . e- myog . e- nkx . . e- snail . e- heb . e- tbx . e- scrt . e- nr a . e- nanog . e- oct . e- elk . e- erra . e- gata . e- bhlha . e- amyb . e- nr a . e- nfkb-p -rel . e- zic . e- trps . e- hoxa . e- hif a . e- isl . e- cebp:ap . e- ews:fli -fusion . e- foxk . e- ets . e-

discussion
in this study, we developed an explainable attention-based deep learning model, deephbv, to predict hbv integration sites.
in the comparison of deephbv and deephint on predicting hbv integration sites (s table), deephbv outperformed deephint after adding genomic features, owing to a model structure and parameters better suited to recognising the surroundings of hbv integration sites. we applied two convolution layers (first layer: convolution kernels with a kernel size of ; second layer: convolution kernels with a kernel size of ) and one pooling layer (with a pooling size of ) in deephbv, whereas deephint has only one convolution layer ( convolution kernels with a kernel size of ) and one pooling layer (with a pooling size of ). increasing the number of convolution layers allows higher-dimensional information to be extracted, and increasing the number of convolution kernels allows more feature information to be extracted [ ]. we trained the deephbv model using three strategies: (1) dna sequences near hbv integration sites (hbv integration sequences), (2) hbv integration sequences + tcga pan cancer peaks, and (3) hbv integration sequences + repeat peaks. we found that adding either tcga pan cancer or repeat peaks to the hbv integration sequences significantly improved model performance, and that deephbv with hbv integration sequences plus tcga pan cancer peaks performed better on the independent test dataset visdb. however, the attention intensive regions could not be well aligned to these genomic features. we therefore inferred that other features, such as tfbs motifs, may influence deephbv in the prediction process, and homer was applied to recognise tfbs that might be related to hbv-related diseases or cancer development. we noticed that the attention intensive regions identified by the attention mechanism of deephbv with hbv integration sequences + tcga pan cancer were strongly concentrated on the binding sites of the tumour suppressor gene hic , the circadian clock-related elements bmal , clock, c-myc and npas , and the transcription factors tead and nr a . these dna binding proteins are closely related to tumour development [ - ]. for instance, hic is a tumour suppressor gene in hepatocarcinogenesis [ , ]. bmal , clock, c-myc and npas all participate in the regulation of the circadian clock [ ], which is reported to promote hbv-related diseases [ , ]. consistently, the binding motifs of circadian clock-related elements were also enriched in the attention intensive regions of deephbv with hbv integration sequences + repeats, further confirming the results (s table). in addition, the other transcription factors identified by deephbv are tead and nr a . tead deregulation affects well-established cancer genes such as braf, kras, myc, nf and lkb , and shows a high correlation with clinicopathological parameters in human malignancies [ ]. nr a (also known as liver receptor homolog- , lrh- ) binds to the enhancer ii (enii) of hbv genes and serves as a critical regulator of their expression [ ]. in summary, deephbv is a robust deep learning model that uses a convolutional neural network to predict hbv integration. our data provide new insight into the preference of hbv integration and into mechanistic research on hbv-induced cancer.

methods
data preparation
detailed step-by-step instructions for deephbv are provided in s and s notes. to obtain positive training and testing samples for deephbv, we extracted bp of dna sequence upstream and bp of dna sequence downstream of each hbv integration site as the positive dataset. each sample was denoted as $S = (n_1, n_2, \ldots, n_N)$, where $n_i$ represents the nucleotide at position $i$ and $N$ is the sample length. as a deep learning network, deephbv also requires negative samples that do not contain hbv integration sites as background. the existence of hbv integration hot spots, which contain several integration events within a ~ kb range [ ], prompted us to select background regions at a sufficient distance from known hbv integration sites. we therefore discarded regions of kb around known hbv integration sites on the hg reference genome and randomly selected kb dna sequences from the remaining regions as negative samples. we encoded the extracted dna sequences with a one-hot code so that distances between features during training, and similarities, could be calculated more accurately. the original dna sequences were converted to binary matrices of -bit length, where each dimension corresponds to one nucleotide type. finally, we converted a bp dna sequence into a × binary matrix.

feature extraction
the deephbv model first applied a convolution and pooling module to learn sequence features around hbv integration sites (s fig). each binary matrix representing a dna sequence entered the convolution and pooling module, where the convolution was computed.
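the one-hot encoding described in data preparation can be sketched as follows. this is an illustration under stated assumptions rather than the deephbv implementation: the channel order a, c, g, t, the all-zero row for any other character (for example n), and the function name `one_hot` are all hypothetical choices that the text does not specify.

```python
# Assumed channel order; the source does not state which order was used.
NUCLEOTIDES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as a list of rows, one binary row per base.

    Each row has one entry per nucleotide type. Characters outside
    A/C/G/T (for example N) become an all-zero row (an assumption).
    """
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    matrix = []
    for base in sequence.upper():
        row = [0] * len(NUCLEOTIDES)
        if base in index:
            row[index[base]] = 1
        matrix.append(row)
    return matrix


# Toy sequence; a real sample would be a fixed-length window around a site.
for base, row in zip("acgtn", one_hot("acgtn")):
    print(base, row)
```

a sequence of length n therefore becomes an n × 4 binary matrix, with exactly one non-zero entry in every row that holds a standard base.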
we employed multiple different convolution kernels in order to obtain different features. let $S = (n_1, n_2, \ldots, n_N)$ denote a specific dna sequence and $E$ the binary matrix encoded from $S$. the calculation in the convolution layer is $X = \mathrm{conv}(E)$, which can be described as:

$X_{k,i} = \sum_{j=0}^{p-1} \sum_{l=1}^{L} W_{k,j,l} E_{l,i+j}$ (1)

where $1 \le k \le d$, $d$ refers to the number of kernels, $1 \le i \le N - p + 1$, $i$ refers to the position index, $p$ refers to the kernel size, $N$ refers to the input sequence length, $L$ refers to the number of nucleotide channels in $E$, and $W$ refers to the kernel weights. after extracting the corresponding eigenvectors, the convolutional layer activated them using the rectified linear unit (relu). relu is an activation function in artificial neural networks which can be described as $f(x) = \max(0, x)$. we applied relu to the output matrix of each convolution layer, mapping each element onto a sparse matrix. relu imitates the activation of real neurons, which helps the model fit the data better. we then applied max-pooling to reduce dimensionality while retaining as much predictive information as possible. at this point, we obtained the final eigenvector $F_c$ from the binary matrix representing the dna sequence, after feature extraction in the convolution and pooling module.

attention mechanism in deephbv model
deephbv added an attention mechanism in order to capture and understand the contribution of each position in the extracted eigenvector $F_c$. the eigenvector entered the attention layer, which calculated a weight value for each dimension in $F_c$.
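the convolution, relu and max-pooling steps above can be written out directly from the formula. the sketch below is a plain-python illustration, not the keras layers used by deephbv: the function names, the toy kernels and the choice of non-overlapping pooling windows are assumptions, and the input matrix is stored position-first (rows are positions), whereas the formula indexes the channel first.

```python
def conv1d(E, W):
    """Valid 1D convolution: X[k][i] = sum_j sum_l W[k][j][l] * E[i + j][l].

    E: n rows (positions) by L columns (nucleotide channels).
    W: d kernels, each with p rows (offsets) by L columns.
    Returns d rows of n - p + 1 values each (0-based position index i).
    """
    n, p = len(E), len(W[0])
    out = []
    for kernel in W:
        row = []
        for i in range(n - p + 1):
            total = 0.0
            for j in range(p):
                for l, w in enumerate(kernel[j]):
                    total += w * E[i + j][l]
            row.append(total)
        out.append(row)
    return out


def relu(X):
    """Element-wise f(x) = max(0, x)."""
    return [[max(0.0, x) for x in row] for row in X]


def max_pool(X, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        [max(row[s:s + size]) for s in range(0, len(row) - size + 1, size)]
        for row in X
    ]


# One-hot matrix of the toy sequence ACGTA (channel order A, C, G, T assumed).
E = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
# Two hypothetical kernels of size 2: one matches "AC", one penalises "A".
W = [
    [[1, 0, 0, 0], [0, 1, 0, 0]],
    [[-1, 0, 0, 0], [0, 0, 0, 0]],
]
X = conv1d(E, W)
print(X)
print(max_pool(relu(X), 2))
```

the first kernel responds only where "ac" occurs, and relu removes the negative response of the second kernel before pooling, which is the sparsifying effect described in the text.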
the attention weight represents the contribution level of that position in the convolutional neural network (cnn). the output of the attention layer, $t_j$, is the contribution score; a larger $t_j$ means a bigger contribution of this position to the prediction of hbv integration sites. all contribution scores were normalized to obtain the dense eigenvector matrix, denoted as $F_a$:

$F_a = \sum_{j=1}^{q} a_j v_j$ (2)

where

$a_j = \frac{\exp(t_j)}{\sum_{i=1}^{q} \exp(t_i)}$ (3)

where $a_j$ represents the corresponding normalised score and $v_j$ represents the eigenvector at position $j$ of the input eigenmatrix. each position represents an extracted eigenvector in each convolution kernel. the convolution-pooling module and the attention mechanism module need to be combined during model prediction; in other words, the eigenvector $F_c$ and the corresponding importance score $F_a$ should work together in predicting hbv integration sites. we linked the values in the eigenvector $F_c$ and linearly mapped them to a new vector $F_v$:

$F_v = \mathrm{dense}(\mathrm{flatten}(F_c))$ (4)

in this step, the flatten layer performed the function $\mathrm{flatten}()$ to reduce dimensions and concatenate the data; the function $\mathrm{dense}()$ was executed by the dense layer, which maps the dimension-reduced data to a single value. the concatenation of $F_v$ and $F_a$ then entered a linear classifier to calculate the probability that hbv integration happened within the current sequence:

$P = \mathrm{sigmoid}(\mathrm{concat}(F_a, F_v))$ (5)

where $P$ is the predicted score, $\mathrm{sigmoid}()$ represents the activation function acting as the classifier in the final output, and $\mathrm{concat}()$ represents the concatenation operation.
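the softmax normalisation, the weighted sum $F_a$ and the final sigmoid can be traced with a small sketch. it is not the deephbv code: the function names, the classifier weights and the bias are hypothetical, and the text's "linear classifier" is assumed here to be a weighted sum of the concatenated vector followed by a sigmoid.

```python
import math


def softmax(scores):
    """a_j = exp(t_j) / sum_i exp(t_i), computed in a numerically stable way."""
    m = max(scores)
    exps = [math.exp(t - m) for t in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(scores, vectors):
    """F_a = sum_j a_j * v_j, with a = softmax(scores)."""
    a = softmax(scores)
    dim = len(vectors[0])
    return [sum(a_j * v[i] for a_j, v in zip(a, vectors)) for i in range(dim)]


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict(F_a, F_v, weights, bias=0.0):
    """Sigmoid of a linear map over concat(F_a, F_v).

    weights and bias are hypothetical classifier parameters.
    """
    features = list(F_a) + list(F_v)
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Toy values: two positions with equal scores, two-dimensional eigenvectors.
t = [0.0, 0.0]
v = [[1.0, 3.0], [3.0, 5.0]]
F_a = attention_pool(t, v)
print(softmax(t), F_a)
print(predict(F_a, [0.5], weights=[0.0, 0.0, 0.0]))
```

with equal scores every position gets the same weight, so $F_a$ is the plain average of the eigenvectors; raising one score shifts $F_a$ towards the eigenvector at that position.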
in the meantime, if we give the output eigenvector $F_c$ from the convolution-and-pooling module as input and execute the attention mechanism, the weight vector $W$ can be obtained:

$W = \mathrm{att}(a_1, a_2, \ldots, a_q)$ (6)

where $\mathrm{att}()$ refers to the attention mechanism, $a_i$ denotes the eigenvector in the $i$th dimension of the eigenmatrix, and $W$ represents the set of contribution scores of each position in the eigenmatrix extracted by the convolution-and-pooling module.

deephbv model training
after confirming each parameter in deephbv (s table), we trained the deep learning neural network model deephbv with binary cross-entropy. the loss function of deephbv can be defined as:

$\mathrm{loss} = -\sum_i \left[ y_i \log(P) + (1 - y_i) \log(1 - P) \right]$ (7)

where $y_i$ is the binary tag value of sequence $i$ (in this dataset, positive samples were labelled as 1 and negative samples as 0) and $P$ is the predicted score of that sequence. the back-propagation algorithm was adopted during training, and the nesterov-accelerated adaptive moment estimation (nadam) gradient descent algorithm was applied to optimise the parameters. the model was implemented in python . with the keras library . . [ ] and was trained and tested on three nvidia® tesla v -pcie- g cards (nvidia corporation, california, usa). with these software and hardware settings, deephbv takes around min and s for model training and testing, respectively.

data availability
deephbv is available as open-source software and can be downloaded from https://github.com/jiuxingliang/deephbv.git

reference
liang tj. hepatitis b: the virus and disease.
hepatology ; ( suppl):s - .
tu t, budzinska ma, shackel na et al. hbv dna integration: molecular mechanisms and clinical implications. viruses ; ( ).
sung wk, zheng h, li s et al. genome-wide survey of recurrent hbv integration in hepatocellular carcinoma. nat genet ; ( ): - .
zhao lh, liu x, yan hx et al. genomic and oncogenic preference of hbv integration in hepatocellular carcinoma. nat commun ; : .
ding d, lou x, hua d et al. recurrent targeted genes of hepatitis b virus in the liver cancer genomes identified by a next-generation sequencing-based approach. plos genet ; ( ):e .
tu t, budzinska ma, vondran fwr et al. hepatitis b virus dna integration occurs early in the viral life cycle in an in vitro infection model via sodium taurocholate cotransporting polypeptide-dependent uptake of enveloped virus particles. j virol ; ( ).
mason ws, gill us, litwin s et al. hbv dna integration and clonal hepatocyte expansion in chronic hepatitis b patients considered immune tolerant. gastroenterology ; ( ): - e .
litjens g, kooi t, bejnordi be et al. a survey on deep learning in medical image analysis. med image anal ; : - .
bailey tl, baker me, elkan cp. an artificial intelligence approach to motif discovery in protein sequences: application to steroid dehydrogenases. the journal of steroid biochemistry and molecular biology ; ( ): - .
yamashita r, nishio m, do rkg et al. convolutional neural networks: an overview and application in radiology. insights into imaging ; ( ): - .
bahdanau d, cho k, bengio y. neural machine translation by jointly learning to align and translate. computer science .
guidotti r, monreale a, ruggieri s et al. a survey of methods for explaining black box models. acm comput. surv. ; ( ):article .
hu z, zhu d, wang w et al. genome-wide profiling of hpv integration in cervical cancer identifies clustered genomic hot spots and a potential microhomology-mediated integration mechanism. nat genet ; ( ): - .
chollet fao. keras. .
hu h, xiao a, zhang s et al. deephint: understanding hiv- integration via deep learning with attention. bioinformatics ; ( ): - .
haeussler m, zweig as, tyner c et al. the ucsc genome browser database: update. nucleic acids res ; (d ):d -d .
inoue f, kircher m, martin b et al. a systematic comparison reveals substantial differences in chromosomal versus episomal encoding of enhancer activity. genome res ; ( ): - .
robinson jt, thorvaldsdottir h, winckler w et al. integrative genomics viewer. nature biotechnology ; ( ): - .
tang d, li b, xu t et al. visdb: a manually curated database of viral integration sites in the human genome. nucleic acids res .
zhang w, itoh k, tanida j et al. parallel distributed processing model with local space-invariant interconnections and its optical architecture. appl opt ; ( ): - .
bruna j, zaremba w, szlam a et al. spectral networks and locally connected networks on graphs. computer science .
heinz s, benner c, spann n et al. simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and b cell identities. molecular cell ; ( ): - .
seide f, gang l, dong y.
conversational speech transcription using context-dependent deep neural networks. .
taniguchi k, roberts lr, aderca in et al. mutational spectrum of beta-catenin, axin , and axin in hepatocellular carcinomas and hepatoblastomas. oncogene ; ( ): - .
zheng j, xiong d, sun x et al. signification of hypermethylated in cancer (hic ) as tumor suppressor gene in tumor progression. cancer microenviron ; ( ): - .
paibomesai mi, moghadam hk, ferguson mm et al. clock genes and their genomic distributions in three species of salmonid fishes: associations with genes regulating sexual maturation and cell cycling. bmc res notes ; : .
fekry b, ribas-latre a, baumgartner c et al. incompatibility of the circadian protein bmal and hnf alpha in hepatocellular carcinoma. nat commun ; ( ): .
mukherji a, bailey sm, staels b et al. the circadian clock and liver function in health and disease. j hepatol ; ( ): - .
huh hd, kim dh, jeong hs et al. regulation of tead transcription factors in cancer biology. cells ; ( ).
cai yn, zhou q, kong yy et al. lrh- /hb f and hnf synergistically up-regulate hepatitis b virus gene transcription and dna replication. cell research ; ( ): - .

figure legends
figure .
the deep learning framework applied in deephbv. (a) scheme of encoding a kb dna sequence into a binary matrix using one-hot code; (b) a brief flowchart of the deephbv structure; the matrix shape is given in brackets, and a detailed flowchart is in s fig. figure . evaluation of deephbv and deephint model prediction performance on the test dataset. (a) receiver-operating characteristic (roc) curves and (b) precision recall (pr) curves, respectively. “deephbv with hbv integration sequences” refers to the deephbv model with only hbv integration sequences as input; “deephint with hbv integration sequences” refers to the deephint model with only hbv integration sequences as input; “deephbv with hbv integration sequences + repeat” refers to deephbv with hbv integration sequences and repeat sequences as input; “deephbv with hbv integration sequences + tcga pan cancer” refers to deephbv with hbv integration sequences and tcga pan cancer sequences as input; “deephbv with hbv integration sequences + repeat + (test) visdb” refers to deephbv using hbv integration sequences and repeat sequences for training and using visdb as independent test dataset; “deephbv with hbv integration sequences + tcga pan cancer + (test) visdb” refers to deephbv using hbv integration sequences and tcga pan cancer sequences for training and using visdb as independent test dataset. figure . the attention weight distribution analysed by deephbv with hbv integration sequences + genomic features. (a) deephbv with hbv integration sequences + tcga pan cancer peaks; (b) deephbv with hbv integration sequences + repeat peaks.
the left graph shows the fractions of attention weight, which were averaged among all samples and normalized to the average of all positions; each index represents a bp region due to the multiple convolution and pooling operations. the graphs on the right are representative samples of attention weight distribution of positive samples and negative samples. figure . attention intensive regions highlighted essential local genomic features on predicting hbv integration sites. representative examples showed the positional relationship between the attention intensive sites and several genomic features using the deephbv with hbv integration sequences + tcga pan cancer model on (a) chr : , , - , , (hg ), (b) chr : - (hg ). each of these two sequences contains hbv integration sites from both dsvis and visdb. enriched dna binding proteins were detected by homer from the attention intensive regions using the output of deephbv; we then applied fimo [ ] to find the enriched motif positions and label the motifs on attention intensive regions. ucsc genome browser [ ] and matplotlib [ ] were used for visualisation. “hpv integration site” refers to the sites selected from our unpublished database used as testing samples. “attention intensive sites” denotes the sites with top % attention weight. “repeatmasker”, “tcga pan cancer”, “dnase clusters”, “con mammals”, “genehancer”, “layered h k ac”, “layered h k me ” are genomic features. references . grant ce, bailey tl, noble ws. fimo: scanning for occurrences of a given motif. bioinformatics ; ( ): - . . haeussler m, zweig as, tyner c et al.
the ucsc genome browser database: update. nucleic acids res ; (d ):d -d . . hunter jd. matplotlib: a d graphics environment. computing in science & engineering ; ( ): - . supporting information s fig. deephbv framework. each part represents a layer in the neural network and 𝑛 × 𝑛 stands for the output dimension, which is explained in s note. two consecutive convolution layers were used to extract features; max-pooling layers reduce the dimension while preserving the predictive information in the feature matrix; the dropout layer randomly drops some results to prevent over-fitting; the flatten layer reduces the dimensions and connects them; the dense layer maps the output from the last layer to a specific value; the attention layer and attention flatten are used to give a weight score to each dimension in the feature matrix; the concatenate layer concatenates captured features and importance scores of those features from the convolution module and the attention mechanism model. the prediction output gives the final output, which reveals the probability of hbv infection. s fig. prediction performance on the hbv integration dataset with different types of genomic features added in.
we found that character and character outperformed the deephbv model with a significant increase in aupr and auroc score on character and character , indicating that deephbv can capture genomic features from character and character effectively, so we did further analysis on each single item in character groups and , and found that repeats and tcga pan cancer are the genomic features that can be captured by deephbv and that significantly improved model performance. deephbv with hbv integration sequences + repeats reached the auroc of . and the aupr of . , while deephbv with hbv integration sequences + tcga pan cancer reached the auroc of . and the aupr of . . s table. the parameters for the deep neural network used in deephbv. s table. genomic features and sources. (access date: november th, ) s table. comparison of deephbv and deephint result record. s table. enriched tfbs from attention intensive regions of deephbv with hbv integration sites + repeat peaks. s note. deephbv framework. deephbv neural network structure design and hyperparameters involved in deephbv are noted. s note. mathematical matters of the deephbv. there are explanations for mathematical matters (i.e. encoding dna sequences, convolution layers, the max pooling layer, dropout layer, attention layer, concatenate layer, linear classifier and optimisation algorithm) of the deephbv in this part.
a validated generally applicable approach using the systematic assessment of disease modules by gwas reveals a multi-omic module strongly associated with risk factors in multiple sclerosis tejaswi v.s. badam†, hendrik a. de weerd†, david martínez-enguita, tomas olsson, lars alfredsson, ingrid kockum, maja jagodic, zelmina lubovac-pilav*, mika gustafsson* school of bioscience, systems biology research center, university of skövde, sweden bioinformatics, department of physics, chemistry and biology, linköping university, linköping, sweden department of clinical neuroscience, karolinska institutet, center for molecular medicine, karolinska university hospital, se- , stockholm, sweden institute of environmental medicine, karolinska institutet, center for molecular medicine, karolinska university hospital, se- , stockholm, sweden †these authors contributed equally to the work. *these authors share senior authorship. corresponding author: mika gustafsson (mika.gustafsson@liu.se) running title: multi-omic modules in multiple sclerosis keywords: benchmark, multi-omics, network modules, multiple sclerosis, risk factors summary: our benchmark of multi-omic modules and validated translational systems medicine workflow for dissecting complex diseases resulted in a multi-omic module of genes highly enriched for risk factors associated with multiple sclerosis. (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . .
abstract background: there are few (if any) practical guidelines for predictive and falsifiable multi-omics data integration that systematically integrate existing knowledge. disease modules are popular concepts for interpreting genome-wide studies in medicine but have so far not been systematically evaluated and may lead to corroborating multi-omic modules. methods: we assessed eight module identification methods in previously published expression and methylation studies of diseases using gwas enrichment analysis. next, we applied the same strategy for multi-omics integration of datasets of multiple sclerosis (ms), and further validated the resulting module using both gwas and risk-factor associated genes from several independent cohorts. results: our benchmark of modules showed that in immune-associated diseases modules inferred from clique-based methods were the most enriched for gwas-genes. the multi-omics case study using ms revealed the robust identification of a module of genes. strikingly, most genes of the module were differentially methylated upon the action of one or several environmental risk factors in ms (n = , p = - ) and were also independently validated for association with five different risk factors of ms, which further stressed the high genetic and epigenetic relevance of the module for ms. conclusion: we believe our analysis provides a workflow for selecting modules and our benchmark study may help further improvement of disease module methods. moreover, we also stress that our methodology is generally applicable for combining and assessing the performance of multi-omics approaches for complex diseases.
introduction complex diseases are the result of disruptions of many interconnected multimolecular pathways, reflected in multiple omics layers of regulation of cellular function, rather than perturbations of a single gene or protein[ ]. systems and network medicine aim to translate observed omics differences in patients using networks, in order to personalize medicine[ ]. importantly, genes that are associated with diseases are more likely to interact with each other rather than with non-disease associated genes, forming multi-omics network disease modules[ , ]. owing to the incompleteness of the underlying multi-omics interactions, the networks are often modeled as effective gene-gene interactions, using for example string database[ ]. thus, network modules might be ideal tools for multi-omics analysis. however, the evaluation of performance of different module inference methods remains a poorly understood topic, which creates the need for transparent evaluation of these methods based on objective benchmarks across various diseases and omics. genomic concordance has been suggested as a multi-omics validation principle[ , ], i.e., modules derived from one omic, such as gene expression or dna methylation should be enriched for disease- associated single nucleotide polymorphisms (snps). the variety of algorithms that have been proposed and applied for identification of disease modules can be categorized into two main groups. on the one hand, there are methods which rely purely on clustering of the genes in relevant disease networks[ ]. on the other hand, there are algorithms which make use of disease-associated molecules or genetic loci to reveal disease modules that correlate with disease function, such as the disease module detection (diamond) algorithm[ ], clique-based methods[ ],[ ] and weighted gene co-expression network analysis (wgcna)[ ]. 
the data-derived information can either be differentially expressed genes or differentially correlated or co-expressed genes. methods following the former approach were recently benchmarked by a metric utilizing genomic concordance within the dream consortia[ ]. however, so far, algorithms from the latter group have not been benchmarked. in this study we analyzed, assessed, and compared the performance of eight of the most popular methods for disease module analysis using the r package modifier[ ] on different diseases including expression and ten methylation datasets. we assessed the performance of the methods using genome-wide association (gwas) enrichment analysis from the summary statistics of all assayed snps similarly as in dream[ ]. the resulting workflow provided a systematic procedure for selecting the best method for each disease and set the stage for method development in the disease module area. moreover, it allowed the predictive assessment of combining multiple datasets across several omics using gwas, which we tested in multiple sclerosis (ms), a heterogeneous complex disease. briefly, we derived multi-omic modules in a stepwise optimization of gwas enrichment from transcriptomic and methylomic analyses of ms. we further evaluated the identified multi-omic ms module of genes for its enrichment across dna methylation studies of eight known lifestyle-associated risk factors of ms. additionally, we validated the identified significant enrichment risk factors in an independent dna methylation ms study which indeed showed a very strong and significant ms enrichment for both module genes and risk factor associations.
in summary, we provide a robust multi-omics strategy that can be used to disentangle networks of affected genes in complex diseases from both genetic and environmental levels. materials and methods benchmark data a total of publicly available datasets for the transcriptomic benchmark and ten publicly available datasets for the methylomic benchmark were used. to avoid bias due to subtypes of diseases and drug treatments, we searched for datasets that have only patient and control samples, and that are available for download from the geo database. we categorized the datasets into seven distinct disease types based on the disease-trait type associations used in choobdar et al[ ]., i.e. autoimmune, cardiovascular, glycemic, inflammatory, neurodegenerative, and psychiatric and social disorders. a total of complex diseases were used in the transcriptomic benchmark analysis, while six complex diseases were used in the methylation benchmark analysis. the methylation benchmark diseases belong to inflammatory, autoimmune, and glycemic disease types. ms use case data a total of publicly available and one non-publicly available transcriptomic and methylomic ms- related datasets were used in the ms multi-omics integration use case. in general, every dataset in the modifier benchmark was also used in the ms use case, with exceptions according to certain criteria. the inclusion of transcriptomic ms datasets followed the criteria: ) the largest dataset by sample number, per tissue, is shown in the modifier benchmark; ) replication cohorts are not included in the ms use case.
criteria for inclusion of methylomic ms datasets were the following: ) the largest dataset by sample number, per tissue or cell type, is included in the modifier benchmark; ) a single dataset for every cell-specific tissue was included in the benchmark; ) methylation studies that reported using whole blood as sample tissue were excluded from the ms use case, due to the high heterogeneity of this type of data. for the additional independent validation, we utilized the methylation microarray analysis of blood samples from kular et al. for each of these ms patients (nms= ) and healthy controls (nhc= ), we also collected their lifestyle-associated risk factors from questionnaires that were part of the epidemiological investigation of multiple sclerosis (eims) study. those factors were smoking status, prior ebv infection, sunbathing, nightshift work, alcohol consumption, as well as phenotypic features (age, sex, bmi at age of ). pre-processing and quality control of risk factor methylation data dna methylation datasets were downloaded from geo as raw idat files, when available, or matrices of beta values. pre-processing of the data was performed using the chip analysis methylation pipeline (champ) r package[ ], version . . . default parameters were used for probe and sample filtering. probes with a detection p-value above . , probes with a fraction of failed (bead count less than ) samples over . , non-cpg probes, snp-related probes, multi-hit probes, and probes located on chromosomes x and y, were removed. samples with a proportion of failed (na) probe p-values over . were also removed from the analysis.
post-filtering imputation of na values was conducted on the beta matrices, with default parameters (“combine” method, k = , probe cutoff = . , sample cutoff = . ). filtered imputed matrices were normalized applying the beta-mixture quantile dilation (bmiq) normalization method[ ], including correction of type-i and type-ii probe effects. data quality was assessed by producing multi-dimensional scaling (mds) plots of the top , most variable positions per sample, density plots for the distribution of beta values, and hierarchical clustering of samples, before and after normalization. singular value decomposition (svd) was used to detect the most significant components of variation in the data. unwanted sources of variation in the normalized data were corrected using combat batch effect correction[ ]. module identification the modifier r package offers nine different methods for producing disease modules, of which we included all but clique sum exact, as it is highly similar to clique sum. the included methods produce modules based on the provided omics input and background network and do not include prioritization of pathway association. the modifier methods used for module identification throughout this study are listed in supplementary table . for the methods that require a network, we used the human ppi network from string database version , consisting of , , interactions among , unique genes/proteins. we filtered the network to have high confidence interactions by using the cutoff > to reduce the number of false positives, resulting in a subset of , interactions between , unique genes/proteins.
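the confidence filtering of the ppi network described above amounts to keeping only interactions whose combined score is strictly above a cutoff, and then collecting the genes/proteins that remain. the following is a minimal python sketch under stated assumptions: the `Edge` type, the `filter_network` helper, the gene names, the scores and the cutoff value are all hypothetical and chosen for illustration; none of them is taken from string or from the modifier package.

```python
from __future__ import annotations

from typing import Iterable, NamedTuple


class Edge(NamedTuple):
    """One undirected interaction with its combined confidence score (hypothetical type)."""
    source: str
    target: str
    combined_score: float


def filter_network(edges: Iterable[Edge], min_score: float) -> tuple[list[Edge], set[str]]:
    """Keep edges whose combined score is strictly above min_score.

    Returns the retained edges and the set of nodes they connect.
    """
    kept = [e for e in edges if e.combined_score > min_score]
    nodes = {n for e in kept for n in (e.source, e.target)}
    return kept, nodes


# toy network with made-up gene names and scores (illustrative only)
toy_edges = [
    Edge("geneA", "geneB", 950),
    Edge("geneB", "geneC", 400),
    Edge("geneC", "geneD", 720),
]

# the cutoff below is a placeholder, not the value used in the study
kept_edges, kept_nodes = filter_network(toy_edges, min_score=700)
print(len(kept_edges), sorted(kept_nodes))
```

a stricter cutoff trades coverage for fewer false-positive interactions, and it also shrinks the network that the module methods have to search.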
for co-expression methods, the network is computed within the method algorithm from the gene expression matrix. in the case of the benchmark analysis, we used a stringent cutoff of score > , so that the runs were not computationally intensive. for the ms use case benchmark, we used the network combined score cutoff > . the processed matrix for each dataset and their respective phenotypic information were downloaded from geo. the input object is prepared using the create_input_microarray function from the modifier package, which is then used for creating the modules. the input function applies a linear model using limma for comparison of patients vs controls to get the differentially methylated or expressed genes. a dynamic cutoff of % in the differentially methylated or expressed genes is applied for input seed genes for the methods that require seed genes. differential methylation analysis of risk factor data differentially methylated probes (dmps) were found by fitting a linear model to the data using the limma r package[ ], version . . implemented in the champ function champ.dmp. p-values were adjusted for multiple testing using benjamini-hochberg false discovery rate (fdr) correction. differentially methylated genes (dmgs) were obtained and annotated using the org.hs.eg.db r package, version . . . dmg lists were cross-checked against the string database version ppi network used for module identification in the ms multi-omics approach (high confidence interactions, combined score > ). dmgs that were not present in the ppi network were removed. in case of the additional ms validation dataset, a linear mixed effect model with risk factors (age, sex, bmi at age of , smoking, alcohol consumption, sun exposure, night shift work, contact with organic solvents) as categorical covariates was implemented to find the differentially methylated genes after the preprocessing step, as described in the preprocessing section of the methods.
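the benjamini-hochberg adjustment applied to the probe-level p-values in this section can be illustrated with a short python sketch. this is a generic textbook implementation of the step-up procedure, not the code run inside limma or champ; the function name `benjamini_hochberg` and the toy p-values are assumptions made for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest p-value down so that adjusted values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# made-up p-values for four probes (illustrative only)
toy_p = [0.01, 0.04, 0.03, 0.20]
print(benjamini_hochberg(toy_p))
```

each raw p-value is scaled by the number of tests divided by its rank, and the running minimum ensures that a probe with a smaller raw p-value never receives a larger adjusted value than a probe ranked after it.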
since all the patients were ebv positive, we did not include ebv status in the linear mixed effect model. validation of modules the final modules produced from each single algorithm and the consensus were evaluated using pascal[ ] (pathway scoring algorithm). pascal implements a fast and rigorous gene scoring and pathway enrichment pipeline that can be run on a local machine. the snp values are converted to gene scores by computing pairwise snp-by-snp correlations and obtaining z-scores from their distribution. these obtained gene scores are fused with the pathway enrichment analysis to recompute a chi-square p-value for the given set of module genes. thus, the obtained chi-square p-value serves as the significance of the module in its enrichment of the disease-associated pathway gene loci. a combined p-value was computed using fisher’s method[ ] for each of the methods, diseases, and datasets, for ranking the performance of the modules in each criterion. integration of ms single-omic modules clique sum was ranked as the best performing method on average for both transcriptomic and methylomic data, according to the ms gwas enrichment of the modules calculated by pascal. therefore, significant clique sum modules (p < . ) were selected for further analysis (nine transcriptomic and four methylomic modules). consensus modules were generated across each omic by applying a module count-based method, where the criterion for gene inclusion in the consensus is its presence in a certain number of single-method modules. to balance the weight of each omic in the multi-omics integration, the top four significant modules per omic were used to create each consensus (fig. a, b).
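two steps from the methods above lend themselves to a compact illustration: combining p-values with fisher’s method, and building a count-based consensus module. the python sketch below is a generic illustration under stated assumptions, not the pascal or modifier implementation; the function names `fisher_combined_p` and `consensus_module`, the gene identifiers, the p-values and the `min_count` threshold are all hypothetical. fisher’s method additionally assumes that the combined p-values are independent.

```python
import math
from collections import Counter


def fisher_combined_p(p_values):
    """Combine independent p-values (each in (0, 1]) with Fisher's method.

    The statistic -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null, where k = len(p_values).
    """
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    # closed-form chi-square survival function for even degrees of freedom:
    # sf(x; 2k) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def consensus_module(modules, min_count):
    """Return genes present in at least min_count of the given modules."""
    counts = Counter(gene for module in modules for gene in set(module))
    return {gene for gene, n in counts.items() if n >= min_count}


# made-up p-values and modules (illustrative only)
print(fisher_combined_p([0.05, 0.05]))
toy_modules = [{"g1", "g2", "g3"}, {"g2", "g3"}, {"g3", "g4"}]
print(sorted(consensus_module(toy_modules, min_count=2)))
```

raising `min_count` yields a smaller consensus supported by more modules, while lowering it approaches the union of all modules.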
single-omic clique sum consensus were ranked again by gwas enrichment, and the best performing consensus per omic was selected for integration into the multi-omics module. enrichment analyses of the ms multi-omics module disease enrichment analysis of the multi-omics module was performed by fisher’s exact test, with a significance threshold of p < . . ms-associated genes were obtained from the gene-disease association summary provided by disgenet database . [ ]. all genes with a known association to the disease “multiple sclerosis” (unified medical language system unique identifier c ) were considered ms-associated genes (n = , ). pathway enrichment analysis was carried out using the function enrichkegg from the clusterprofiler r package[ ], version . . . p-values were adjusted for multiple testing using benjamini-hochberg fdr correction, with a significance threshold of adj. p < . . enrichment of the multi-omics module in ms risk-factor-associated genes was performed by fisher’s exact test, with a significance threshold of p < . . to provide a uniform comparison of ms risk factor-associated genes across datasets, the module was tested for enrichment in the top , dmgs (with at least p < . ) obtained from the differential methylation analysis with champ for each risk factor dataset. representation of the ms multi-omics module experimentally validated interactions for the multi-omics module genes were obtained from string database version (experimental score > ) and imported into cytoscape[ ] version . . . to determine representative functional clusters of module genes, overrepresented gene ontology (go) biological process (bp) terms in the module were found using bingo[ ] version . .
, with benjamini-hochberg fdr for multiple testing correction, and a significance threshold of adj. p < . . then, enriched go terms with adj. p < x - were summarized using revigo[ ] server tool (medium allowed similarity = . ) and categories of interest were selected by uniqueness (>= %), dispensability (>= %), and frequency (<= %) criteria. further manual assessment was performed to group similar terms with an adequate number of genes in the network. results a benchmark comparing transcriptionally derived disease modules from different diseases. we compiled a benchmark source of disease modules and summary statistics of gwas datasets from well-powered case-control studies (supplementary table ), some of which were previously used in the dream topological disease module challenge[ ]. for these datasets we assessed modules using the same metric as in the recent dream study[ ], based on the pathway scoring algorithm (pascal)[ ]. for each disease we compiled one to five publicly available transcriptomic datasets considering both easily accessible tissues (e.g. blood) and target tissues, thereby covering transcriptomic datasets in total (fig. a). modules were created using eight different methods from modifier[ ]. in addition, we also tested if genes detected by several methods, hereafter called consensus module genes, had higher enrichment scores than single-method module genes. enrichment scores for the non-empty modules (n = ) from this analysis were summarized for each method and dataset (fig. a). in total, we found significantly gwas-enriched modules in . % ( / ) of the single-method modules and . % ( / ) of the non-empty consensus modules that combined at least three methods as a criterion.
these numbers seemed higher than expected, which might have been a consequence of the same gwas being used to evaluate multiple transcriptomic datasets of the same disease. hence, we aggregated scores of the same disease and method as meta p-values (see methods). out of the possible disease-method combinations, % of the pairs showed a significant gwas pascal enrichment, which is more than expected by chance (n = , p = . x - ). the most enriched method was clique sum, which showed significant enrichment in seven out of diseases (binomial test p = . x - ). many methods exhibited strong enrichments in coronary artery disease (cad), type diabetes, multiple sclerosis (ms), rheumatoid arthritis (ra), and the inflammatory bowel diseases (ibd), ulcerative colitis (uc) and crohn’s disease (cd), while no significant enrichments were found for asthma, hepatitis c, type diabetes, narcolepsy, parkinson’s disease, or for any psychiatric and social diseases. if we instead ranked methods based on their respective module gwas enrichment, clique sum showed significant association in % ( / ) of the modules corresponding to seven different diseases followed by consensus modules identified by two out of three methods. lastly, diamond and co-expression-based methods all achieved significant results, although worse than clique sum. next, we tested the impact of network centrality and module size as potential confounding factors of the applied performance metric. we found a significant but very modest correlation for module size (fig. c, spearman rho = . , p = . x - ), and a non-significant correlation for interactome centrality (fig. b, rho = . , p = . ).
thus, it is meaningful to compare results with differences in those module properties. in summary, we found that the clique sum method resulted in the highest disease enrichment for most diseases, while not producing significant modules for others, such as type diabetes, where co-expression-based methods and diamond scored best. in general, we observed stronger enrichments for inflammatory diseases and weaker results for psychiatric and social diseases. considering that the transcriptomic modules showed that clique sum was the best performing method and that the cardiovascular and inflammatory diseases were the most enriched within the clique sum modules, we wanted to test whether this was true for methylomic data as well.

a benchmark comparing methylation-based disease modules from six different diseases using gwas.

following the same logic as the transcriptomic benchmark, we performed a similar benchmark study for methylation modules. we collected ten datasets from three different disease categories, including six complex diseases, and ran the eight modifier methods on them (fig. a). in addition, we constructed consensus modules for each of the datasets. modules were then tested for gwas enrichment using pascal. inspecting the overall performance, we found nine single-method modules with a significant gwas enrichment ( / , . %). though this might be due to disease and cell type heterogeneity, the enrichment is more than expected by chance (p= . x - ). interestingly, inflammatory diseases such as ms and uc showed a more significant gwas enrichment. considering that the evaluation of module performance by gwas enrichment may be biased due to differences in module sizes and interactome centrality, we again assessed the correlation between these values. we found a significant correlation between gwas enrichment and module size (fig. c, rho = . , p = . ) and a non-significant correlation between gwas enrichment and interactome centrality (fig. b, rho = . , p = . ).
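the confounder check described above relies on spearman's rank correlation between a module score and a module property such as size. a minimal, dependency-free sketch is given below; the function names are hypothetical, ties receive average ranks, and no p-value is computed.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative values only: module sizes versus enrichment scores.
rho = spearman_rho([10, 50, 120, 300], [0.4, 1.1, 1.3, 2.9])
```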
we found that . % of the disease-method combinations yielded significant gwas enrichment, which is more than expected from an independent random selection of modules (fisher’s exact test p = . , n = ). the highly enriched disease modules belong to ms, uc and cd. two out of the six diseases showed significant gwas enrichment by using the clique sum modules (p = . ). in summary, the clique sum method resulted in a more significant gwas enrichment for most diseases in the methylomic benchmark as well.

multi-omics approach revealed a module enriched for ms-associated genes.

considering genomic concordance as the guiding principle for the modules that show enrichment for gwas snps, differentially methylated genes and differentially expressed genes, we further wanted to evaluate multiple datasets of one specific disease, i.e., ms. we compiled ms transcriptomic datasets and nine methylation comparisons (supplementary table ) from geo that satisfied the pre-defined dataset criteria (see methods). for each dataset we implemented the pipeline for module identification and scoring shown in fig. b. we evaluated each module using ms snp enrichment analysis and selected the most enriched modules per omic from this metric. this analysis again showed that clique sum yielded by far the highest average enrichment score (meta p = . x - ) and was significantly enriched (p < . ) in / transcriptomic datasets (fig. a) and / of the methylation datasets (fig. b). from the significant modules generated by clique sum, we chose the top four modules from each of the gene transcription and methylation sets, and prioritized genes detected in modules from multiple datasets in each omic.
this analysis showed that the strongest ms snp enrichment was found for genes in at least three out of four transcriptomic modules (n= , ; p= . x - ) and two out of four methylomic modules (n= , p= . x - ). next, we used the same principle to combine these two and found that the intersection between the gene transcription and methylation consensus resulted in a module (n = genes, fig. ) enriched for ms-associated genes ( / , p < . x - , or = . ) and with the highest gwas enrichment (p = . x - ), which we hereafter refer to as the multi-omics ms module.

the multi-omics ms module was enriched in genes associated with major ms pathways.

as we used gwas enrichment as a selection criterion, the high gwas enrichment of the final module was partly expected, which led us to analyze its biological functions and their potential epigenetic associations to ms. first, pathway enrichment analysis showed that the multi-omics module genes are significantly involved in several inter-linked immune-related pathways, most of which have been previously associated with ms, including the t cell receptor[ ] (adjusted p = . x - ), pi k/akt[ ] (p = . x - ), erbb[ ] (p = . x - ), fc epsilon ri[ ] (p = . x - ), chemokine[ , ] (p = . x - ), mapk[ , ] (p = . x - ), and b cell receptor[ ] (p = . x - ) signaling pathways; th (p = . x - ), and th and th (p = . x - ) cell differentiation[ ]; natural killer cell mediated cytotoxicity (p = . x - ); and leukocyte transendothelial migration (p = . x - ), which indeed supports their relevance in ms. interestingly, the module was also highly enriched in morphogenetic and neurogenetic signaling pathways, such as the neurotrophin (adjusted p = . x - ), ras (p = . x - ), rap (p = .
x - ), vascular endothelial growth factor (vegf, p = . x - ), foxo (p = . x - ), and mtor (p = . x - ) signaling pathways; and in growth hormone synthesis, secretion and action (p = . x - ).

the multi-omics ms module was enriched in genes associated with five known environmental ms risk factors validated in an independent cohort.

second, from a literature study[ , ] we found nine environmental ms risk factors of varying evidence for which we could identify methylation studies in healthy controls. for each of these risk factors we derived the top differentially methylated genes (dmgs) and tested their enrichment with the module. intriguingly, the module was significantly enriched for genes associated with five risk factors (fig. b), which included the top associated risk factors, i.e., epstein-barr virus (ebv) infection (fisher's exact test p = . x - , or = . ) and smoking (p = . x - , or = . ), as well as low sun exposure (p = . x - , or = . ), high bmi (p = . , or = . ) and alcohol consumption (p = . x - , or = . ). then, we asked whether these putative gene-risk factor associations could be validated using an independent omics dataset with paired risk factor associations. for this purpose, we utilized methylation arrays of peripheral blood from ms patients and controls, which have been described previously[ ]. in this analysis we also considered risk factor associations for each individual, including age, sex, bmi at age of , smoking, alcohol consumption, sun exposure, night shift work, and contact with organic solvents. this enabled the analysis of dmgs using ms and risk factor status as covariates in a linear mixed-effects analysis.
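the enrichment of module genes among risk-factor-associated genes, reported above as a fisher's exact test p-value with an odds ratio, can be sketched as an over-representation test. the sketch below assumes a one-sided test (over-representation only) against a user-supplied background gene set; the sidedness, the choice of background, the function name `enrichment_test`, and the toy gene identifiers are all assumptions for illustration.

```python
from math import comb


def enrichment_test(module, gene_set, background):
    """One-sided Fisher's exact test for overlap between two gene sets.

    Returns (p_value, odds_ratio), where p_value is the hypergeometric
    probability of observing at least the observed overlap by chance.
    """
    background = set(background)
    module = set(module) & background
    gene_set = set(gene_set) & background

    a = len(module & gene_set)       # in module and in gene set
    b = len(module) - a              # in module only
    c = len(gene_set) - a            # in gene set only
    d = len(background) - a - b - c  # in neither

    n, m, k = len(background), len(module), len(gene_set)
    upper = min(m, k)
    tail = sum(comb(k, x) * comb(n - k, m - x) for x in range(a, upper + 1))
    p_value = tail / comb(n, m)

    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")
    return p_value, odds_ratio


# Hypothetical toy data: 20 background genes, 5 module genes, 5 risk genes.
background = {f"gene{i}" for i in range(20)}
module = {f"gene{i}" for i in range(5)}
risk_genes = {"gene0", "gene1", "gene2", "gene3", "gene10"}

p_value, odds_ratio = enrichment_test(module, risk_genes, background)
```

the background matters: restricting it to genes present in the interaction network gives a more conservative result than using all annotated genes.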
indeed, the module genes were highly significantly enriched for ms (n = ; permutation test p = . x - ), but also for all the tested risk factors (ebv was not included, see methods) and non-significantly associated with age and sex, having - of the genes in each factor ( . x - < p < . ; fig b). combining all these results, we found of the module genes to be associated with a risk factor in both risk factor studies; genes were associated with two risk factors, and seven genes were associated with three risk factors (csk, prkca, prkcz, runx , runx , stat a, and synj ) (fig. c). these associations suggest that the multi-omics module is capturing a key disease network with both genetically and epigenetically driven alterations, thereby providing the possibility to use it to identify potential novel biomarkers or therapeutic targets for ms.

discussion

the analysis of case-control data in the context of networks has gained increased interest to detect consistent, robust gene signatures of individual diseases. the application of disease modules might vary for different researchers, but here we systematically aimed at the detection of disease genes supported by genetic association. for this purpose, our study of the transcriptome and methylome profiles of diseases showed significant gwas enrichments for several inflammatory and heart diseases, while psychiatric disorders showed no enrichments and might not be suitable for gwas validation of modules, potentially due to differences in affected tissue types and sampling points.
however, analysis of the significant results showed that methods based on differentially expressed cliques in the protein-protein interaction network demonstrated the strongest enrichments (highest scoring for clique sum), while those based primarily on correlations, like wgcna, showed weak enrichments. a potential reason for this could be that gwas has been shown to be mostly associated with the central genes of the protein-protein interaction (ppi) network, but our analysis demonstrated that the correlation between gwas enrichment and centrality was non-significant. we also tested whether there was an improvement using consensus approaches that counted the frequency of the result of multiple methods but found this not to increase performance. moreover, we tested the same strategy on a set of inflammatory, glycemic, and autoimmune methylation datasets and found similar results. we would like to emphasize that, rather than scoring a single best working method, our result is a pipeline for evaluating modules using independent high-throughput enrichments. the work on transcription and methylation datasets suggested that ms is a disease highly enriched for gwas, and we therefore tested if increased enrichments could be derived by their integration. we identified publicly available datasets and ran the assessment for both omics independently, which again showed clique sum to score highest. we then tested if improved results could be obtained using modules from multiple datasets of these two omics using consensus modules from clique sum. this resulted in a module of genes highly enriched for gwas (p = . x - ). the multi-omic module was highly enriched in immune-associated pathways, such as t cell and b cell receptor
signaling, th /th differentiation, or leukocyte transendothelial migration. these results are consistent with the current hypothesis that ms is mediated by an autoreactive response of cd + t cells against myelin surrounding neuronal axons, preceded by their migration across the blood-brain barrier (bbb)[ ]. this autoproliferation of brain-targeting th cells has been shown to be driven by memory b cells, in a process mediated by hla-dr [ ]. in addition, another enriched pathway was vegf signaling. ms patients present high serum vegf levels, which is related to pro-inflammatory functions and can alter the permeability of the bbb[ ]. as gwas was used for method prioritization, we asked if modules could instead be validated using epigenetics and lifestyle risk factor genes that we identified to associate with ms. with this aim, we compiled a set of publicly available data from omics studies of these risk factors in healthy individuals. this analysis demonstrated that five out of eight risk factors were enriched in our module. in order to validate the use of an environmental assessment using public domain risk factor associations, we found an independent methylome study of ms comprising environmental data for each ms and healthy individual. this analysis showed a remarkable enrichment of the module genes by to differentially methylated genes for ms (p = . x - ), and a majority to be associated with the tested risk factors. in contrast to previously known community challenges, in our study we not only used the topological properties of the network but also combined the methods with an omics-based input to uncover the disease modules that might be dysregulated at each omics level, contributing to the diverse causative mechanisms behind complex diseases. although using the ppi network as background may lead to a certain knowledge bias, this kind of benchmark allowed us to look at the relevant risk factors.
in our assessment of the disease modules, methods such as clique sum and diamond performed better than the community-based consensus predictions. in summary, our study provides a practical integrative workflow that enables system-level analysis of heterogeneous diseases, in terms of multi-omics disease modules, as well as the validation of these by using both disease-specific gwas and risk factor enrichment. we believe that this analysis validates our integrated use of datasets and suggests a pipeline that could readily be tested in at least other autoimmune and cardiovascular diseases. lastly, our study did not aim to optimize hyper-parameters for individual disease modules, and instead used default values when possible, relying on the modifier r package implementation of the methods[ ]. however, this might be an important task for specific diseases, and our code and processed datasets are available at gitlab (https://gitlab.com/gustafsson-lab/modifier-benchmark). in future work, this approach can be expanded to include diverse and context-specific networks to determine whether our multi-omics modules are able to capture various other levels of granularity.

declarations

ethics approval and consent to participate

not applicable

availability of data and materials

the data used for the transcriptomic benchmark and the methylation benchmark were downloaded from geo.
the disease-specific gwas files were downloaded from the latest pascal version. the processed data for analysis is available at https://gitlab.com/gustafsson-lab/modifier-benchmark. the risk factor (eims) data will be made available on request. the r package modifier is available on gitlab: https://gitlab.com/gustafsson-lab/modifier; the code used for benchmark analysis and risk factor analysis is available on gitlab: https://gitlab.com/gustafsson-lab/modifier-benchmark; the latest pascal version: https://www .unil.ch/cbg/index.php?title=pascal.

competing interests

the authors declare no competing interests.

funding

this work was supported by the swedish research council (grant - (m.g.), grant - (m.j.)), the swedish foundation for strategic research (grant sb - (m.g.)), the center for industrial it (ceniit) (m.g.), european union horizon /european research council consolidator grant (epi ms, grant (m.j.)), knut and alice wallenberg foundation (grant . (m.j.)) and the knowledge foundation (grant (z.l.)). computational resources were granted by swedish national infrastructure for computing (snic; snic / - , liu- - and liu- - ).

author contributions

t.v.s.b. compiled the necessary data for the benchmark analysis. h.a.w. performed the transcriptomic benchmark analysis. t.v.s.b. performed the methylation benchmark analysis. d.m.e. and h.a.w. performed the ms use case analysis. d.m.e. performed the risk factor analysis. m.j., i.k., t.o., and l.a. provided the raw data and collected the associated risk factor data for the independent methylation dataset. t.v.s.b. performed the independent validation dataset analysis. t.v.s.b. and d.m.e. collectively made the plots and figures for the manuscript. m.g. and z.l.
designed the study. t.v.s.b. and d.m.e. prepared the manuscript. all authors discussed the results and commented on the manuscript at all stages.

references

naylor s, chen jy. nih public access. natl institutes heal. ; : – .
santiago ja, bottero v, potashkin ja. dissecting the molecular mechanisms of neurodegenerative diseases through network biology. front aging neurosci [internet]. ; : – . available from: http://journal.frontiersin.org/article/ . /fnagi. . /full
barabási al, gulbahce n, loscalzo j. network medicine: a network-based approach to human disease. nat rev genet [internet]. nature publishing group; ; : – . available from: http://dx.doi.org/ . /nrg
gustafsson m, nestor ce, zhang h, barabási a-l, baranzini s, brunak s, et al. modules, networks and systems medicine for understanding disease and aiding diagnosis. genome med [internet]. ; : . available from: http://genomemedicine.biomedcentral.com/articles/ . /s - - -
szklarczyk d, gable al, lyon d, junge a, wyder s, huerta-cepas j, et al. string v : protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. oxford university press; ; : – .
lamparter d, lin j, kutalik z, choobdar s, hescott b, tomasoni m, et al. open community challenge reveals molecular network modules with key roles in diseases. ssrn electron j. ; – .
schadt ee. molecular networks as sensors and drivers of common human diseases. nature [internet]. ; : – . available from: http://www.nature.com/doifinder/ . /nature
ghiassian sd, menche j, barabási al.
a disease module detection (diamond) algorithm derived from a systematic analysis of connectivity patterns of disease proteins in the human interactome. rzhetsky a, editor. plos comput biol [internet]. ; :e . available from: https://dx.plos.org/ . /journal.pcbi.
hellberg s, eklund d, gawel dr, köpsén m, zhang h, nestor ce, et al. dynamic response genes in cd + t cells reveal a network of interactive proteins that classifies disease activity in multiple sclerosis. cell rep. ; : – .
wang h, rogers g, benson m, jarvelin m-r, chavali s, ramasamy a, et al. highly interconnected genes in disease-specific networks are enriched for disease-associated polymorphisms. genome biol. ; :r .
langfelder p, horvath s. wgcna: an r package for weighted correlation network analysis. bmc bioinformatics. ; .
choobdar s, ahsen me, crawford j, tomasoni m, fang t, lamparter d, et al. assessment of network module identification across complex diseases. nat methods. ; : – .
de weerd ha, badam tvs, martínez-enguita d, Åkesson j, muthas d, gustafsson m, et al. modifier: an ensemble r package for inference of disease modules from transcriptomics networks. bioinformatics. ; – .
tian y, morris tj, webster ap, yang z, beck s, feber a, et al. genome analysis champ: updated methylation analysis pipeline for illumina beadchips. ; : – .
teschendorff ae, marabita f, lechner m, bartlett t, tegner j, gomez-cabrero d, et al. gene expression a beta-mixture quantile normalization method for correcting probe design bias in illumina infinium k dna methylation data. ; : – .
johnson we, li c. adjusting batch effects in microarray expression data using empirical bayes methods. ; – .
ritchie me, phipson b, wu d, hu y, law cw, shi w, et al. limma powers differential expression analyses for rna-sequencing and microarray studies. ; .
lamparter d, marbach d, rueedi r, kutalik z, bergmann s. fast and rigorous computation of gene and pathway scores from snp-based summary statistics. plos comput biol. ; : – .
mosteller f, fisher ra. questions and answers. taylor & francis, ltd. on behalf of the american statistical association; ; : – . available from: http://www.jstor.org/stable/
piñero j, ramírez-anguita jm, saüch-pitarch j, ronzano f, centeno e, sanz f, et al. the disgenet knowledge platform for disease genomics: update. nucleic acids res. ; :d – .
yu g, wang lg, han y, he qy. clusterprofiler: an r package for comparing biological themes among gene clusters. omi a j integr biol. ; : – .
shannon p, markiel a, ozier o, baliga ns, wang jt, ramage d, amin n, schwikowski b, ideker t. cytoscape: a software environment for integrated models. genome res [internet]. ; : . available from: http://ci.nii.ac.jp/naid/ /
maere s, heymans k, kuiper m. systems biology bingo: a cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. ; : – .
supek f, bošnjak m, Škunca n, Šmuc t. revigo summarizes and visualizes long lists of gene ontology terms. plos one. ; .
carbone f, de rosa v, carrieri pb, montella s, bruzzese d, porcellini a, et al. regulatory t cell proliferative potential is impaired in human autoimmune disease. nat med. ; : – .
mammana s, bramanti p, mazzon e, cavalli e, basile ms, fagone p, et al. preclinical evaluation of the pi k/akt/mtor pathway in animal models of multiple sclerosis. oncotarget. ; : – .
holley je, gveric d, newcombe j, cuzner ml, gutowski nj. astrocyte characterization in the multiple sclerosis glial scar. neuropathol appl neurobiol. ; : – .
pedotti r, devoss jj, youssef s, mitchell d, wedemeyer j, madanat r, et al. multiple elements of the allergic arm of the immune response modulate autoimmune demyelination. proc natl acad sci u s a. ; : – .
cui ly, chu sf, chen nh. the role of chemokines and chemokine receptors in multiple sclerosis. int immunopharmacol [internet]. elsevier; ; : . available from: https://doi.org/ . /j.intimp. .
krumbholz m, theil d, cepok s, hemmer b, kivisäkk p, ransohoff rm, et al. chemokines in multiple sclerosis: cxcl and cxcl up-regulation is differentially linked to cns immune cell recruitment. brain. ; : – .
krementsov dn, thornton tm, teuscher c, rincon m. the emerging role of p mitogen-activated protein kinase in multiple sclerosis and its models. mol cell biol. ; : – .
kotelnikova e, kiani na, messinis d, pertsovskaya i, pliaka v, bernardo-faura m, et al. mapk pathway and b cells overactivation in multiple sclerosis revealed by phosphoproteomics and genomic analysis. proc natl acad sci u s a. ; : – .
kunkl m, frascolla s, amormino c, volpe e, tuosto l. t helper cells: the modulators of inflammation in multiple sclerosis. cells. ; : .
waubant e, lucas r, mowry e, graves j, olsson t, alfredsson l, et al. environmental and genetic risk factors for ms: an integrated review. ann clin transl neurol. ; : – .
olsson t, barcellos lf, alfredsson l.
interactions between genetic, lifestyle and environmental risk factors for multiple sclerosis. nat rev neurol. nature publishing group; ; : – .
kular l, liu y, ruhrmann s, zheleznyakova g, marabita f, gomez-cabrero d, et al. dna methylation as a mediator of hla-drb : and a protective variant in multiple sclerosis. nat commun. ; .
compston a, coles a. multiple sclerosis. lancet [internet]. elsevier ltd; ; : – . available from: http://dx.doi.org/ . /s - ( ) -
jelcic i, al nimer f, wang j, lentsch v, planas r, jelcic i, et al. memory b cells activate brain-homing, autoreactive cd + t cells in multiple sclerosis. cell. ; : - .e .
lange c, storkebaum e, de almodóvar cr, dewerchin m, carmeliet p. vascular endothelial growth factor: a neurovascular target in neurological diseases. nat rev neurol [internet]. nature publishing group; ; : – . available from: http://dx.doi.org/ . /nrneurol. .

figures

figure . overview of the benchmark assessment of disease modules and the integration workflow for ms. (a) transcriptomic and methylomic datasets from different diseases were used as inputs for eight modifier module identification methods. the resulting single-omic disease modules (n =
) were independently assessed by gwas enrichment analysis of the same disease using pascal module scoring. modifier methods were evaluated by the combined enrichment score of their respective disease modules. (b) multi-omic integrative workflow for multiple sclerosis (ms)-associated modules. data from case-control comparisons were used as input for module detection with modifier methods. clique sum modules presented the highest gwas enrichment score and were therefore used to generate single-omic consensus modules. the intersection of the best transcriptomic and methylomic consensus modules resulted in an ms multi-omic module (n = genes) with the highest gwas enrichment, which was independently found to be enriched for genes associated with five known lifestyle ms risk factors using public omics data from healthy individuals.

figure . genomic concordance of modifier modules on transcriptomic datasets. (a) heatmap of pascal p-values for eight single-method and eight consensus modifier modules, identified for publicly available transcriptomic datasets. module performance p-values are shown in a white to blue scale, where any shade of blue represents a significant module (p < . ; the darker, the more significant), white represents a non-significant module, and grey represents a module of size zero. datasets are classified into six disease types: cardiovascular (red), glycemic (golden), inflammatory (green), neurodegenerative (fuchsia), psychiatric and social (pink), autoimmune (dark purple), and others (light purple); and two cell types: blood (maroon), and others (light yellow).
datasets are ranked by meta p-values using fisher’s method of the single-method module p-values across and within their disease types (dataset score, bottom boxplot). modifier methods are organized by algorithm type: seed-based (green), co-expression-based (yellow), and clique-based (red), plus the consensus modules (blue). single-methods and consensus were scored by meta p-values across datasets (method score, right boxplot). consensus x/ indicates that the module genes are found in at least x methods out of eight. (b) scatter plot showing spearman correlation between module score and betweenness centrality. modules are represented with a different shape depending on their method and colored based on the disease type. (c) scatter plot showing spearman correlation between module score and module size. modules are represented with a different shape depending on their method and colored based on the disease type.

figure . genomic concordance of modifier modules on methylomic datasets. (a) heatmap of pascal p-values for eight single-method and eight consensus modifier modules, identified for ten publicly available methylomic datasets. module performance p-values are shown in a white to blue scale, where any shade of blue represents a significant module (p < . ; the darker, the more significant), white represents a non-significant module, and grey represents a module of size zero.
datasets are classified into two disease types: glycemic (golden), and inflammatory (green); and two cell types: blood (maroon), and others (light yellow). datasets are ranked by fisher’s combined p of the single-method module p-values across and within their disease types (dataset score, bottom boxplot). modifier methods are organized by algorithm type: seed-based (green), co-expression-based (yellow), and clique-based (red), plus the consensus modules (blue). single-methods and consensus are scored by meta p-values across datasets (method score, right boxplot). consensus x/ indicates that the module genes are found in at least x methods out of eight. (b) scatter plot showing spearman correlation between module score and betweenness centrality. modules are represented with a different shape depending on their method and colored based on the disease type. (c) scatter plot showing spearman correlation between module score and module size. modules are represented with a different shape depending on their method and colored based on the disease type.

figure . genomic concordance of modifier modules on ms use case data. (a) heatmap of pascal p-values for eight single-method modifier modules, identified for ten ms-related transcriptomic datasets. module performance p-values are shown in a white to blue scale, where any shade of blue represents a significant module (p < .
), white represents a non-significant module, and grey represents a module of size zero. datasets are classified into the reported ms type: ms (blue), rrms (red), ppms (green), spms (orange), and cis (yellow); and four cell types: whole blood (maroon), pbmcs (light brown), white matter (light yellow), and cd + t cells (purple). datasets are ranked by meta p-values of the single-method enrichments (dataset score, bottom boxplot). modifier methods are organized by algorithm type: seed-based (green), co-expression-based (yellow), and clique-based (red). single methods are scored by p of the significant modules across datasets (method score, right boxplot). (b) heatmap of pascal p-values for four single-method modifier modules, identified for nine ms-related transcriptomic datasets. (c-d) bar plots of pascal p-values for the ms consensus modules generated with clique sum from transcriptomic (a) and methylomic (b) datasets. (e) union and intersection of the top performing modules, shown as a venn diagram.
[figure panels a-e: legends for disease type (ms, rrms, ppms, spms, cis), cell type (wb, pbmcs, wm, cd + t cells, cd + monocytes, cd + b cells) and module performance (-log p, threshold α = . ); heatmap rows for mod. discov., mcode, correl. clique, clique sum, wgcna, moda, diffcoex and diamond; bar plots for the transcriptomic and methylomic clique sum consensus modules; venn diagram of the best transcriptomic and best methylomic consensus modules (intersection, union, n genes).]
figure . risk factor enrichment and network visualization of the ms multi-omic module. (a) evidence levels and effect on ms of the risk factor. (b) enrichment overlap of multi-omic ms module genes in the top , dmgs in risk factor datasets and independent risk factor methylation dataset (see methods) shown as fisher exact test p-values (threshold α= . ). (c) visualization of the module. nodes (module genes) are arranged in functional clusters according to their overrepresented go terms. genes with a known association to ms are marked with a blue circle. node colors display the associations to an ms risk factor for which the module is significantly enriched (red, alcohol use; green, high bmi; yellow, smoking; purple, low sun exposure; light blue, ebv infection; grey, no association). edges were extracted from the stringdb v human ppi network of experimentally validated interactions (confidence score > ).
[figure panels a-c: (a) risk factors with evidence level: ebv infection (+++), smoking (+++), low sun exposure (++), adolescent obesity (++), high bmi (++), night shift work (++), organic solvent exposure (+), alcohol consumption (+), oral tobacco (+); (b) module enrichments (-log p) for the risk factor datasets and the validation dataset; (c) network of module genes with the functional clusters cell death and apoptosis, morphogenesis and neurogenesis, cell cycle and proliferation, chemotaxis and cell migration, response to hormone stimulus, and leukocyte activation and differentiation.]
supplementary materials
supplementary table : all case-control comparisons used in the transcriptomic and methylomic benchmarks.
supplementary table : all case-control comparisons used in the ms use case benchmark.
supplementary table : all methods implemented in the benchmark.
complex systems analysis informs on the spread of covid-19
xia wang , dorcas washington , georg f. weber *
university of cincinnati department of mathematical sciences, cincinnati, oh, usa
university of cincinnati health science library, cincinnati, oh, usa
university of cincinnati academic health center, cincinnati, oh, usa
* send correspondence to: georg f. weber, james l. winkle college of pharmacy, university of cincinnati, albert sabin way, oh - . e-mail: georg.weber@uc.edu, phone - - .
the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by . international license (http://creativecommons.org/licenses/by/ . /). this version posted january , . ; biorxiv preprint doi: https://doi.org/ . / . . .
abstract
the non-linear progression of new infection numbers in a pandemic poses challenges to the evaluation of its management. the tools of complex systems research may aid in attaining information that would be difficult to extract with other means. to study the covid-19 pandemic, we utilize the reported new cases per day for the globe, nine countries and six us states through october .
fourier and univariate wavelet analyses inform on periodicity and extent of change. evaluating time-lagged data sets of various lag lengths, we find that the autocorrelation function, average mutual information and box counting dimension represent good quantitative readouts for the progression of new infections. bivariate wavelet analysis and return plots give indications of containment versus exacerbation. homogeneity or heterogeneity in the population response, uptick versus suppression, and worsening or improving trends are discernible, in part by plotting various time lags in three dimensions. the analysis of epidemic or pandemic progression with the techniques available for observed (noisy) complex data can aid decision making in the public health response.
keywords: covid-19, epidemiology, new infections, complex systems, autocorrelation, fractal dimension, average mutual information, wavelet analysis
introduction
the spread of infectious diseases depends on pathogen factors (virulence), host factors (immunity), and, on the population level, on countermeasures taken by the community. such measures cover a broad spectrum of possible engagements, and they may be highly consequential for the course of an epidemic or a pandemic [ ]. the analysis of acute infectious progression in a society is critical for gauging the effectiveness of public health responses, but it is made difficult by the non-linear nature of the underlying process. conventional approaches of reductionist research or common linearization techniques are not meaningfully applicable.
various strategies have been employed to account for the complexity of infectious propagation. the spread of covid-19 has been modeled with machine learning [ ], networks of compartments [ ] and cellular automata [ ]. power laws have been inferred [ ]. such investigations are of value, even though they are inevitably based on idealizing assumptions. in addition to modeling approaches, the analysis of actually observed data is of critical importance. the numbers in such data sets are noisy, and they are eminently non-linear (also described as “complex data” or “observed chaotic data” [ ]). complex systems research has made techniques and algorithms available to extract information from observed non-linear data series.
the manifestations of the covid-19 pandemic have varied widely among geographic areas, when compared across countries [ , , ] as well as across us states [ ], depending on when the virus reached them, what the population characteristics were at the time of onset, and what actions were taken in response to the infectious spread. here, we set out to investigate underlying patterns. we apply basic tools of complex systems research to compare the spread of covid-19 in distinct countries, characterized by their varying approaches to the pandemic, from its beginning stages through early or late october . further, we compare various regions within the usa, which has left major decisions to the individual states. patterns are discernible in fourier and wavelet analyses. order can be detected in time-lagged plots. from these, quantitative measurements are obtainable, including autocorrelation, average mutual information, fractal dimension, and embedding dimension, which inform on the pandemic progression.
methods
source data: here we analyze the new infections per day, either as absolute numbers or as rates per , inhabitants. the source data utilized for the present analysis came from bing covid-19 tracker (www.bing.com/covid).
fourier spectrum and univariate wavelet analysis: fourier analysis evaluates the spectral density of the relative numbers of new infections (case rates per , inhabitants) versus frequency or versus period. wavelet analysis does not assume stationarity in the time series. thus, it allows the study of localized periodic behavior. in particular, we look for regions of high power in the frequency-time plot. the calculations for wavelet analyses of new infections were done in r with waveletcomp. in this package, the null hypothesis that there is no periodicity in the series is tested via p-values obtained from simulation, where the model to be simulated can be chosen from a range of options [ ]. the algorithm analyzes the frequency structure of uni- or bivariate time series using the morlet wavelet. the time series to be analyzed was standardized, after detrending, in order to obtain a measure of the wavelet power that is relative to unit-variance white noise and directly comparable to results of other time series. detrending is accomplished using polynomial regression. where indicated, all graphs are normalized to the same y-axis scale.
bivariate wavelet analysis: we conducted bivariate analysis of lagged data (t versus t+ or t+ or t+ ) for joint periodicity. the concepts of cross-wavelet analysis provide tools for comparing the frequency contents of two time series as well as for drawing conclusions about their synchronicity at certain periods and across certain ranges of time. while cross-wavelet power corresponds to covariance in the time domain, wavelet coherence is a time-series measure similar to correlation.
two waves are coherent if they have a constant relative phase. the bivariate analysis results include the cross-wavelet power plot, the wavelet coherence plot, the average power plot and the phase difference image. the cross-wavelet power and coherence plots contain arrows showing the areas of significant joint periods (significance level = . ). the direction of these arrows indicates the direction of phase differences. up-right pointing arrows indicate that the two series are in phase and the x(t) series leads, while down-right pointing arrows indicate that the two series are in phase and the x(t+n) series leads. similarly, up-left pointing arrows express that the two series are out of phase and the x(t+n) series leads, while down-left pointing arrows express that the two series are out of phase and the x(t) series leads. the arrows are only plotted within white contour lines indicating significance at the % level. a more explicit global view of the phase difference can be produced with (π/2, π) and (-π, -π/2) for out of phase and (-π/2, π/2) for in-phase. the time-averaged cross-wavelet power provides a summarized view of the shared periods, the corresponding power and the statistical significance. cross-wavelet plots may mark areas as significant due to one series swinging widely, rather than two series sharing a joint period. to avoid this false positive readout, it is more appropriate to examine wavelet coherence plots, like the coefficient of correlation.
wavelet coherence has a value range between 0 and 1, and it shows statistical significance only in areas where the two series actually share jointly significant periods.
return plots: from the total numbers of new infections, we generated return plots with increasing lags, plotting daily changes x(t+ ), …, x(t+ ) versus x(t) and weekly changes x(t+ ), …, x(t+ ) versus x(t). short time lags tend to cluster around the 45° angle, whereas increasing time delays reveal the structure of the oscillations. when graphed in three dimensions, these diagrams can aid in reconstructing the underlying attractor.
autocorrelation: a time series sometimes repeats patterns or has other properties, whereby earlier values display some relation to later values. the autocorrelation statistic (serial correlation statistic) measures the degree of that affiliation as it refers to linear dependence. the magnitude of its dimensionless number reflects the extent of similarity. the formula for autocorrelation $R_m$ is composed of terms for autocovariance and variance:
\[ \text{autocorrelation} = \frac{\text{autocovariance}}{\text{variance}}, \qquad R_m = \frac{\frac{1}{N}\sum_{t=1}^{N-m}(x_t-\bar{x})(x_{t+m}-\bar{x})}{\frac{1}{N}\sum_{t=1}^{N}(x_t-\bar{x})^{2}} \]
autocorrelation coefficients range from -1 to +1, with +1 indicating perfect synchrony and -1 reflecting exact mirror images. an absence of any correlation yields $R_m = 0$.
box counting dimension: the dimension of a fractal is best described as a non-integer. the dimension is a quantitative measure for the evaluation of the geometric complexity of objects. a general relationship assumes
\[ \text{dimension} \propto \frac{\log(\text{number of increments})}{\log(1/\text{scale size})} \]
here, the characteristic of dimension is that it specifies the rate at which the number of increments varies with scale size. we calculated the box counting dimension after binning into x squares of -dimensional return plots with various lags.
average mutual information: the average mutual information (ami) represents a non-linear correlation function, which indicates how much common information is shared by the measurements of x(t) and x(t+n). the average mutual information was calculated with the mutual function of the r package tserieschaos. it estimates the mutual information index for a specified number of lags. the joint probability distribution function is estimated with a simple bi-dimensional density histogram.
embedding dimension: using the r package nonlineartseries, we first apply the timelag function to decide the optimal time lag τ based on the average mutual information and then the estimateembeddingdim function to assess the optimal embedding dimension m. the optimal set of regressors related to x(t) is then x(t-τ), …, x(t-(m-1)τ), x(t-mτ).
results
. comparison across countries
across countries, a wide spectrum of measures was taken to curb the spread of sars-cov-2. this resulted in a range of very different progression curves when graphing the numbers of new infections over time (figure ). india, brazil, sweden, italy and the united states have been considered as hard-hit for their own internal reasons. france, germany, poland (over a long period), and south korea had tighter control and a less aggressive spread. all curves display close to linear ramp-up phases, followed by more or less irregular oscillations.
the levels of success at suppressing the new infection rates diverged among countries, and several are experiencing a second peak. wavelet methodology aids in studying periodic phenomena in time series, particularly in the presence of potential frequency changes over time. for cross-country evaluations, all graphs were plotted on the same scale (figure a). each country was also plotted on its own scale (figure b). the univariate analysis of the time course for the countries under study shows prominence of the recent upswing in france (heat intensity on the right margin of the graph). by contrast, there is comparatively more successful management by italy, germany, poland and south korea through october . india, brazil, sweden, and the united states display cyclical fluctuations of various durations, none of which have been contained. a period of days is prominent in the fluctuations of most countries, which may reflect real cyclicity or weekly reporting habits. the worldwide data are displayed in figure s . for cross-country comparisons, we converted the new infection total numbers to new infection rates by relating them to , members of the population (figure a). similarly, complex systems can be analyzed with fourier analysis. we first plotted fourier power spectra versus frequency for the rates of new infections (figure b). spectral density range (high in brazil, low in south korea) and frequency distribution provide a readout for infectious spread. the spectral density of the normalized rates (identically scaled y-axes) (figure c) confirmed good management of the pandemic spread in germany, poland, and south korea (and to some degree in italy). despite the progressive increase in the numbers of infections in india, on a population basis, control has apparently not been lost through october . by contrast, the power spectra for brazil, sweden, and france are reflective of potentially adverse developments. 
the united states display an anomaly with a periodic behavior that has a prominent cycle around days.
to gain a better understanding of the dynamics with which disease spread occurs, we investigated progressive numbers of new infections in comparison to their increasing time lags. this approach may reveal periodicities or aid in the visualization of attractors. as expected, short time delays were associated with little change. from a lag time of about days onward, distinct patterns emerged among countries. according to bivariate wavelet analysis for time-delayed data series (including the cross-wavelet power plot, wavelet coherence plot, average power plot and phase difference image), italy, germany and south korea shared significant joint periods of - months in the comparison x(t) versus x(t+ ). south korea has comparatively high power and significant shared periods around weeks at the early stage, and later the significant shared periods are also - months. the remaining countries all share segments of shorter periods (around days) and longer periods. for india, brazil, france, usa and poland, the shared -day period only appears significant in the later part of the series. similar results are observed in the analyses for x(t) versus x(t+ ) and x(t) versus x(t+ ). the phase difference plots show that in the shared longer periods, x(t) is mostly in phase with x(t+ ), while the series gradually become out of phase in x(t) versus x(t+ ) and x(t) versus x(t+ ), thus making longer lags more discriminating and informative (figure a and figure s a,b).
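the arrow convention used in the cross-wavelet and coherence plots can be made concrete with a small sketch. it is an illustration only and not the waveletcomp implementation: the function name `describe_phase` is hypothetical, and it assumes that the arrow angle equals the phase difference in radians, measured counter-clockwise from the positive horizontal axis, with (-π/2, π/2) taken as the in-phase range.

```python
import math


def describe_phase(angle):
    """Classify a phase difference (radians) between x(t) and x(t+n).

    Assumed convention: the arrow angle is the phase difference, measured
    counter-clockwise from the positive horizontal axis.
    Returns (relation, leading_series).
    """
    if not -math.pi <= angle <= math.pi:
        raise ValueError("angle must lie in [-pi, pi]")
    # Angles on a quadrant boundary do not identify a leading series.
    for boundary in (0.0, math.pi / 2, math.pi):
        if math.isclose(abs(angle), boundary, abs_tol=1e-12):
            return ("boundary", None)
    in_phase = abs(angle) < math.pi / 2        # arrow points to the right
    points_up = angle > 0                      # arrow points upward
    # up-right and down-left: x(t) leads; down-right and up-left: x(t+n) leads
    leader = "x(t)" if points_up == in_phase else "x(t+n)"
    return ("in phase" if in_phase else "out of phase", leader)


for name, angle in [("up-right", math.pi / 4), ("down-right", -math.pi / 4),
                    ("up-left", 3 * math.pi / 4), ("down-left", -3 * math.pi / 4)]:
    print(name, describe_phase(angle))
```

the four printed cases reproduce the four arrow directions described in the methods; an angle exactly on a quadrant boundary is reported as such rather than assigned to a leading series.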
a reduction in cross-wavelet power levels is apparent in italy, germany and south korea. poland and france are experiencing recent increases. india, brazil and the usa have had protracted periods of high cross-wavelet power levels. containment is associated with longer periodicity in the distribution of cross-wavelet power. this is the case for south korea, germany and italy. high cross-wavelet power around a periodicity of days is reflective of poor control.
to generate informative return plots, we utilized three dimensions, which allows for the visualization of two lags from x(t) (or from a later start point) and may reveal the pattern of an attractor. in this depiction, a rapid increase or decrease in new infections is reflected in a close-to-straight line, oscillations generate a near-toroid attractor, while successful management shrinks the torus and moves it closer to the origin. initially, we evaluated multiple time delays. most discriminating were x(t)/x(t+ )/x(t+ ), x(t+ )/x(t+ )/x(t+ ), and x(t+ )/x(t+ )/x(t+ ) (figure b). the progressive increase in new cases over the time period in india is reflected in a predominantly linear curve on each scale. the wide fluctuations in brazil generate a largely disordered appearance. disorder is also apparent in sweden. france initially managed the pandemic well, but is experiencing a dramatic upswing, which obscures order. cyclical patterns, suggesting the outlines of attractors, are apparent in usa, italy, germany, and south korea (where most data points are concentrated near the origin). poland initially displayed a well-contained attractor, but the recent substantial upswing in new infections is reflected in a linear progression from there (for separate analyses of the two phases, see figure s ).
we also calculated the embedding dimensions for the lagged data (figure c). germany has the highest embedding dimension of , followed by poland with . several countries have an embedding dimension of , including brazil, sweden, usa and south korea. italy and france have the embedding dimension equal to . india is unusual due to its longer lag period of days. when the lag period is set at days, the embedding dimension of india is also equal to . for the worldwide data, the calculated embedding dimension is with a time lag of (not shown).
the autocorrelation of two data strings with short time lags is expected to be high (approaching . ) because there is little opportunity for dramatic change (high infection rates on day t likely produce similarly high numbers on the consecutive day t+ , while low numbers are followed by few new infections on the next day). autocorrelation may remain high for extended lags in the initial ramp-up and at the oscillatory stage, depending on the regularity of the fluctuations. a society that succeeds in curbing the disease spread will leave the highly correlated initial ramp-up and consecutive oscillatory phases, thus displaying a gradual decrease in values at the longer lags. the decline in the autocorrelation numbers of progressively lagged data by country appeared to be reflective of the stringency with which the pandemic was addressed (figure a). from a lag of onward, poland and south korea have substantially declining values (although due to the recent steep upswing in new infections, poland deviates from the trend at very long lags); germany shows a dramatic lowering at a lag of and above. by contrast, india and brazil stay uniformly high. so do the global numbers, which are inherently heterogeneous.
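the autocorrelation statistic defined in the methods (autocovariance divided by variance) can be sketched in a few lines. this is a minimal illustration in python with numpy rather than the authors' code: it assumes the same 1/n normalization in numerator and denominator, and both the function name `autocorrelation` and the synthetic weekly series are hypothetical.

```python
import numpy as np


def autocorrelation(x, m):
    """Lag-m autocorrelation R_m = autocovariance / variance.

    Assumption: numerator and denominator both use a 1/N normalization.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if not 0 <= m < n:
        raise ValueError("lag m must satisfy 0 <= m < len(x)")
    d = x - x.mean()                            # deviations from the mean
    autocov = np.sum(d[: n - m] * d[m:]) / n    # sum over t = 1 .. N-m
    variance = np.sum(d * d) / n
    return autocov / variance


# Hypothetical data: a noiseless 7-day cycle observed for 70 days.
t = np.arange(70)
series = np.sin(2 * np.pi * t / 7)

for lag in (0, 1, 7, 14):
    print(lag, round(autocorrelation(series, lag), 3))
```

for a regular oscillation the value returns to a high level whenever the lag matches a multiple of the cycle length, which is the behavior described above for the oscillatory stage; with this normalization the values shrink slowly as the lag grows because fewer terms enter the numerator.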
the average mutual information reflects information shared by the measurements of x(t) and x(t+n). as expected, it declines with increasing lag. poland starts with a relatively low value ( . at t versus t+ ) and shows a rapid decrease with longer lag. it then stays at a low level of . for lags of to days. france displays a gradually decreasing trend with the average mutual information starting at . and ending at . at the lag of days. india shows a pattern similar to that of france but with much higher average mutual information (due to the constant uptick in numbers), ranging between . and . . four other countries, germany, usa, sweden and brazil, all express relatively flat average mutual information values, staying around levels of . for the usa and brazil, . for germany, and . for sweden. reflecting progressively improved control, italy and south korea also have decreasing trends, but much flatter at . - . for italy and . to . for south korea, respectively (figure b).
a rapid increase in new infections is reflected in a small fractal dimension (practically approximated by the box counting dimension with values between and ) of the -dimensional return plots with progressive lags. intermediate phases are characterized by higher fractal dimensions (approaching ), depending on the nature of the oscillations. conversely, successful management through the reduction in new infections should be reflected in a contraction of the attractor on the return plot, which is assessable through the box counting dimension.
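a box counting estimate on a return plot can be sketched as follows. this is an illustration under stated assumptions, not the authors' procedure: the grid sizes, the rescaling of the plot to the unit square and the least-squares fit of log(count) against log(grid size) are choices made here, and `box_counting_dimension` is a hypothetical name. it uses python with numpy.

```python
import numpy as np


def box_counting_dimension(x, lag, grid_sizes=(2, 4, 8, 16, 32)):
    """Estimate the box counting dimension of the return plot x(t) vs x(t+lag).

    Assumptions: the plot is rescaled to the unit square and the dimension is
    the least-squares slope of log(occupied boxes) against log(grid size).
    """
    x = np.asarray(x, dtype=float)
    if not 1 <= lag < x.size:
        raise ValueError("lag must satisfy 1 <= lag < len(x)")
    points = np.column_stack([x[:-lag], x[lag:]])    # (x(t), x(t+lag)) pairs
    low = points.min()
    span = points.max() - low
    if span == 0:
        raise ValueError("series is constant; the return plot is one point")
    unit = (points - low) / span                     # rescale into [0, 1]^2
    counts = []
    for k in grid_sizes:
        # Index of the k-by-k grid cell holding each point (clip the edge at 1).
        cells = np.minimum((unit * k).astype(int), k - 1)
        counts.append(len(np.unique(cells, axis=0)))
    slope, _intercept = np.polyfit(np.log(grid_sizes), np.log(counts), 1)
    return slope


ramp = np.arange(1000.0)                          # steadily rising series
noise = np.random.default_rng(0).random(20000)    # uncorrelated values
print(box_counting_dimension(ramp, lag=1))
print(box_counting_dimension(noise, lag=1))
```

the steadily rising series traces a line on the return plot and gives an estimate near 1, whereas uncorrelated values fill the plane and give an estimate near 2; observed case series would be expected to fall between these two synthetic extremes.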
a trend is displayed in the comparisons from shorter to longer lag periods. distinct management strategies across different countries generate a heterogeneous pattern worldwide, rendering the fractal dimension high regardless of the lag in x(t+n) versus x(t) plots. steep increases in new infections (poland, india) have dimensions close to . intermediate phases are characterized by higher numbers. successful fights against the pandemic (south korea) cause the dimension to decline with increasing lag (figure c).
. comparison across us states
within the usa, individual states have encountered a rather wide range of progression phenotypes in the spread of new covid-19 infections (figure ). this is due to variations in international connectedness and population density (reflected in the early peaks in the northeastern states new york and massachusetts), holiday travel (florida), policy decisions and other factors. wavelet analysis of new infections (one scale across all states) shows good control (right side of the graph) after initial affliction (left area) for massachusetts and new york, which, having had early spikes in new infections, have achieved good success in containment. through the observation period, control has not been maintained in ohio. the periodicity in individual states (each on their own scales) is poorly defined, except for florida and ohio, where days yield a prominent signal (figure a,b).
we normalized the new infection numbers to rates by relating them per , inhabitants (figure a). figure b shows the periodogram for the states under investigation with frequencies between and . (the graph is almost flat for the higher frequencies). there are clearly heterogeneous patterns in the comparison among these states. new york and massachusetts display steadily decreasing spectral density values from the longest period to around - weeks (corresponding to a frequency range around . - . ). florida and texas share similar patterns with a few low spikes in their periodograms after the first highest ones. the graph for california flattens out after the lowest three frequencies, with the longest period (the whole series) having the highest value. ohio’s pattern is unique, with fluctuating values from the longest periods through around - weeks. the fourier power spectrum for the infection rates (figure c) indicates similar periodic patterns as in the periodograms of figure b. these patterns are less prominent due to the adjustment to the same y-axis scale (the scale reflects the magnitude of the positive rates, the shape shows the evolution of the disease).
we conducted bivariate wavelet analysis on the time-lagged data (figure a and figure s ). the shared synchronicity segments between x(t) and x(t+n) can be grouped into shorter periods (around days) and longer periods (approximately weeks, month, months). new york does not display substantial joint short periods. ohio and texas mainly have correlation at the end of the series around the -day period. massachusetts experiences joint periodicity around the -day period at the early stage of the series. florida and california have joint periods in the middle of the observation time frame. the levels of average cross-wavelet power are higher in states with poor control (x-axes scales for florida, ohio). the peak power shifts toward higher periodicity with improved control (y-axes scales for new york, massachusetts).
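the periodogram readout discussed for the states can be illustrated on synthetic data. the sketch below is an assumption-laden illustration in python with numpy, not the authors' r workflow: the series `cases` is invented, the linear detrending (`degree=1`) is a choice made here, and `periodogram` is a hypothetical helper name.

```python
import numpy as np


def periodogram(x, degree=1):
    """Return (frequency, power) of a detrended daily series.

    Assumptions: a polynomial trend of the given degree is removed first,
    and frequencies are expressed in cycles per day.
    """
    x = np.asarray(x, dtype=float)
    t = np.arange(x.size)
    trend = np.polyval(np.polyfit(t, x, degree), t)   # fitted polynomial trend
    residual = x - trend
    power = np.abs(np.fft.rfft(residual)) ** 2 / x.size
    freq = np.fft.rfftfreq(x.size, d=1.0)
    return freq[1:], power[1:]                        # drop the zero frequency


# Invented example: rising case counts with a weekly reporting cycle.
days = np.arange(140)
cases = 100 + 2.0 * days + 30 * np.sin(2 * np.pi * days / 7)

freq, power = periodogram(cases)
dominant_period = 1.0 / freq[np.argmax(power)]
print(dominant_period)
```

in this invented series the largest peak sits at a period of seven days, which shows how a weekly cycle, whether real or produced by reporting habits, appears as a single prominent spike in the periodogram.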
the return plots in dimensions, utilizing the same time lags as for the countries, seemed to reflect contraction of the attractor in massachusetts, cyclicity in new york, florida and california, no containment in texas, and an ejecting diagonal in ohio which may reflect exacerbation (figure b). the embedding dimension varies among states, such that the most contained states (new york, massachusetts) have the lowest embedding dimension (table ). the autocorrelation for return plots of increasing lags shows a progressive decline in the numbers of new york and massachusetts, which implemented strong containment measures after having been afflicted early. the values decline less steeply for texas and california. ohio displays an anomaly with increasing values for very long lags. the state, while not heavily afflicted on a per capita basis, never achieved containment, only a stationary level, and has since experienced another wave (figure a). up to a maximum lag of days, the average mutual information for the us states under study ranges between . and . . overall, all states show a slightly decreasing pattern except for california, which is relatively leveled at a value of . (figure b). unexpectedly, the box counting dimension (figure c) is less discerning than it was for the evaluation across countries. this may be due to the much lower power conveyed by smaller population sizes.
discussion
in the present investigation we find that the analysis tools for observed complex data can aid in the interpretation of pandemic spread across communities.
difficulties in analyzing the non-linear patterns of infectious disease spread may be tamed by applying the tools of complex systems research. the approach can reveal patterns, where a simple time course of new cases does not. further, non-linear analysis allows the study of various facets of the process, depending on whether the starting data are new cases, hospitalizations, deaths or other readouts. maps can be generated and evaluated for their fractal dimensions [ ]. the operational approximation of lyapunov exponents may be meaningful, although they were largely uninformative for the present study (supplemental figure s ). among the countries analyzed, south korea has had the most successful control of the pandemic spread according to low intensity in univariate wavelet analysis, low spectral density range in fourier analysis, low spectral density of the normalized rates, a reduction in cross-wavelet power levels according to bivariate wavelet analysis and longer periodicity in the distribution of cross-wavelet power. further, declining box counting dimensions, autocorrelation values with increasing time lag, and decreasing trends (at a low slope) in average mutual information confirm containment. cyclical patterns in return plots, suggesting the outlines of attractors, are apparent and most data points are concentrated near the origin of the graph. germany exhibited good management through october according to univariate wavelet analysis, spectral density in the power spectrum of the normalized rates, a reduction of cross-wavelet power levels in bivariate wavelet analysis, longer periodicity in the distribution of cross-wavelet power, a dramatic lowering of autocorrelation values at a lag of and above, and relatively flat average mutual information values, staying around levels of . . cyclical patterns in return plots suggest the outlines of an attractor.
good control by italy following the early impact and through october is reflected in low intensity and fluctuation when applying univariate wavelet analysis, in a reduction of cross-wavelet power levels for bivariate wavelet analysis of time-delayed data, longer periodicity in the distribution of cross-wavelet power, and decreasing trends (at a low slope) in average mutual information. cyclical patterns in return plots, suggesting the outlines of an attractor, are apparent. poland had two distinct phases. by univariate wavelet analysis and density in the power spectrum of normalized rates, there was indication of good management through october . according to bivariate wavelet analysis for time-delayed data series and return plots, the recent substantial upswing in new infections is reflected, which also results in box counting dimensions close to . from a lag of onward, poland has substantially declining autocorrelation values, although due to the recent steep upswing in new infections, the trend reverses at very long lags. the average mutual information starts with a relatively low value ( . at t versus t+ ) and shows a rapid decrease with longer lag, staying level from lags of to days. in the united states, univariate wavelet analysis displays cyclical fluctuations of various durations, none of which have been contained. according to bivariate wavelet analysis for time-delayed data series, there have been protracted periods of high cross-wavelet power levels. cyclical patterns in return plots, suggesting the outlines of attractors, are apparent.
the usa expresses relatively flat average mutual information values, staying around levels of . . in france, univariate wavelet analysis of the time course shows prominence of the recent upswing (heat intensity on the right margin of the graph), and the power spectrum is reflective of potentially adverse developments. the second wave of infections is apparent in bivariate wavelet analysis and in the obscured order in return plots. france displays a gradually decreasing trend of average mutual information. india expresses cyclical fluctuations of various durations in univariate wavelet analysis, none of which have been contained. on a population basis, the spectral density suggests that control has not been lost through october . bivariate wavelet analysis shows protracted periods of high cross-wavelet power levels, return plots reflect the progressive increase in new cases over the time period in a predominantly linear curve on each scale, box counting dimensions are close to , and autocorrelation values stay uniformly high with increasing time lag. india displays a gradually decreasing trend of average mutual information. brazil experiences cyclical fluctuations of various durations in univariate wavelet analysis, none of which have been contained. by fourier analysis, the spectral density range is high. the power spectrum is indicative of potentially adverse developments. according to bivariate wavelet analysis, there have been protracted periods of high cross-wavelet power levels. in return plots, the wide fluctuations generate a largely disordered appearance. the autocorrelation values stay uniformly high. brazil expresses relatively flat average mutual information values, staying around levels of . . sweden shows cyclical fluctuations of various durations in univariate wavelet analysis, none of which have been contained. the power spectrum is reflective of potentially adverse developments. in return plots, disorder is apparent.
sweden expresses relatively flat average mutual information values. prima facie, the curves of new infections versus time for three western european countries, france, italy, and germany, appear similar. complex systems analysis reveals the upswing in france to be much more perilous than the increases in the curves of new infections by the other two countries. the management of infectious spread also requires improvements in the united states, sweden and brazil. the selection of the observation period can dramatically influence the results. poland was initially very successful in containing the pandemic, but then experienced a substantial upswing. analyzing these two phases individually or in conjunction yields very different data sets, which inform about distinct aspects of the infectious progression. the fluctuations of new infections in an epidemic or a pandemic pose challenges to evaluating whether a decline reflects true containment (“rounding the corner”) or just the calm before another wave. the readouts of non-linear systems analysis can aid in making such a distinction. a complex occurrence that experiences containment will strive toward a point attractor in phase space and move toward the origin. such a progression is represented in a declining fractal dimension, and the transition from fluctuations (often associated with a torus attractor) toward limitation of new cases is expected to reduce the autocorrelation. one constraint of complex systems analysis is the need for large data sets.
in this regard, the availability of about data points (daily new cases march through october ) for each geographic area in this study is somewhat low. the robustness of pertinent studies increases with larger data sets over time. reporting errors could have a non-trivial impact, and may be reflected in the frequent occurrence of a peak at days in the spectral analysis (possibly indicating weekly totals). this problem can be addressed by utilizing moving averages. the homogeneity or heterogeneity in management by the community under study determines the noise level. the worldwide numbers of new infections have a lot of background due to varying patterns across countries.
acknowledgements
gfw is supported by nih grant ca .
references
[ ] christakis na. apollo’s arrow: the profound and enduring impact of coronavirus on the way we live. new york (hachette book group) .
[ ] mehta m, julaiti j, griffin p, kumara s. early stage machine learning–based prediction of us county vulnerability to the covid- pandemic: a machine learning approach. jmir public health and surveillance ; : e .
[ ] wang k, ding l, yan y, dai c, qu m, jiayi d, hao x. modelling the initial epidemic trends of covid- in italy, spain, germany, and france. plos one ; :e .
[ ] bin s, sun g, chen c-c. spread of infectious disease modeling and analysis of different factors on spread of infectious disease based on cellular automata. int j environ res public health ; : .
[ ] blasius b. power-law distribution in the number of confirmed covid- cases. chaos ; : .
[ ] abarbanel hdi. analysis of observed chaotic data. switzerland (springer nature) .
[ ] chakraborty i, maity p. covid- outbreak: migration, effects on society, global environment and prevention. science of the total environment ; : .
[ ] bertacchini f, bilotta e, pantano ps. on the temporal spreading of the sarscov- . plos one ; :e .
[ ] white er, hébert-dufresne l. state-level variation of initial covid- dynamics in the united states. plos one ; :e .
[ ] roesch a, schmidbauer h. waveletcomp: computational wavelet analysis. r package version . . . https://cran.r-project.org/package=waveletcomp
[ ] păcurar c-m, necula b-r. an analysis of covid- spread based on fractal interpolation and fractal dimension. chaos, solitons & fractals ; , .
tables and figures
figure : time-course of disease spread by country. numbers of new cases, x(t), per day versus time (t, indicating the date). shown are the curves for (top to bottom, left to right) the globe, india, brazil, sweden, italy, usa, france, germany, poland, and south korea. note the different scales of the y-axes.
figure : univariate wavelet analysis. cross-wavelet power spectrum in the time-period domain.
the x-axis (index) displays the time progression, whereas the y-axis depicts the length of the period. white contour lines indicate significance of periodicity on the . level for probability of error. lines represent the ridge of cross-wavelet power. the color bar reveals the power gradient. a) all countries on the same scale. b) each country on its own scale.
figure : fourier analysis. a) new infection rates. daily reported new numbers of infections divided by , inhabitants. the x-axis shows the calendar date. b) power spectrum. fourier power spectra versus frequency for new infections per , inhabitants per day in each of countries. c) normalized power spectrum. spectral density (y-axis) versus period (in days) for infection rates per , inhabitants (x-axis). the curve shows the smoothed spectral density estimates. all y-axes have the same scale.
figure : time-lagged data analysis. a) bivariate wavelet analysis. shown are cross-wavelet power plot, wavelet coherence plot, average power plot and phase difference image (from left to right in each row). time-lagged data were used for x(t)/x(t+ ) (for the lags x(t)/x(t+ ) and x(t)/x(t ) see figure s ). white contour lines indicate significance for joint periodicity, black arrows depict the phase difference in the areas with significant joint periods. the solid red dots on the average power plot (the third from the left) depict significant joint periods at a probability of error of . . where shown, the color bars reveal the ranges of cross-wavelet power levels. b) return plots in dimensions. time-lagged return plots in dimensions are shown, from left to right, for x(t)/x(t+ )/x(t+ ), x(t+ )/x(t+ )/x(t+ ), and x(t+ )/x(t+ )/x(t+ ). each country of interest has its own row. c) embedding dimension. the plots show how cao’s algorithm uses functions in order to estimate the embedding dimension from the time series (the e (d) and e (d) functions), where d denotes the dimension.
figure : readouts of complexity for lagged data on covid- spread by country. a) autocorrelation. bar graph of the autocorrelation in covid- spread with each bar color representing a different country. the selected time lags are indicated on the x-axis, all are calculated versus x(t). b) average mutual information. bar graph of average mutual information in covid- spread with each bar color representing a different country. the selected time lags as indicated on the x-axis are all calculated versus x(t). c) fractal dimensions. box counting dimensions are calculated for -dimensional return plots of increasing lags, x(t+ ) versus x(t) through x(t+ ) versus x(t). countries are evaluated, and the worldwide numbers are shown on the left. poland is represented twice, over the entire evaluation period through october (which contains a steep incline) and over the shorter phase of containment through september (cont. = contained period).
figure : time-course of disease spread for individual us states. numbers of new cases, x(t), per day versus time (t, indicating the date). shown are the curves for (top to bottom, left to right) massachusetts, new york, florida, texas, california, and ohio.
figure : univariate wavelet analysis. wavelet power spectrum in the time-period domain. contour lines indicate significance of periodicity with . significance level. black lines indicate the ridge of wavelet power. the color bar reveals the power gradient. a) all states on the same scale. b) each state on its own scale.
figure : fourier analysis. a) new infection rates.
daily reported new numbers of infections divided by , inhabitants (infection rates). the x-axis shows the calendar date. b) power spectrum. periodogram plot on the series of the new infection rates. the x-axis is the frequency (per day) and the y-axis represents the spectral density. the y-axis ranges vary among graphs. c) normalized power spectrum. spectral density versus period (in days) for infection rates. all y-axes have the same scale.
figure : time-lagged data analysis by us state. a) bivariate wavelet analysis. shown are cross-wavelet power plot, wavelet coherence plot, average power plot and phase difference image (from left to right on each row). time-lagged data were used for x(t)/x(t+ ) (for the lags x(t)/x(t+ ) and x(t)/x(t ) see figure s ). white contour lines indicate significance of joint periodicity, black arrows indicate the phase difference in the areas with significant joint periods. the solid red dots on the average power plot (the third from the left) reflect significant joint periods at a significance level of . . b) return plots in dimensions. time-lagged return plots in dimensions are shown, from left to right, for x(t)/x(t+ )/x(t+ ), x(t+ )/x(t+ )/x(t+ ), and x(t+ )/x(t+ )/x(t+ ). each state under investigation has its own row.
figure : readouts of complexity for time-lagged data by u.s. state. us states have been evaluated. a) autocorrelation. bar graph of the autocorrelation in covid- spread with each bar color representing a different us state. the selected time lags are indicated on the x-axis, all are calculated versus x(t).
b) average mutual information. bar graph of average mutual information in covid- spread with each bar color representing a different state. the selected time lags are indicated on the x-axis, all are calculated versus x(t). c) fractal dimensions. box counting dimensions are calculated for -dimensional return plots of increasing lags, x(t+ ) versus x(t) through x(t+ ) versus x(t).
table : embedding dimension for time-lagged data by u.s. state. embedding dimensions were calculated according to cao’s algorithm, which uses functions in order to estimate the embedding dimension from the time series. the table shows the calculated time lags and embedding dimensions for each u.s. state under study.
supplement
figure s : power spectrum and univariate wavelet analysis for worldwide new cases. a) wavelet analysis and model fit (minimum power level: , significance level: . , only coi: false, only ridge: false). b) fourier analysis.
figure s : bivariate wavelet analysis by country. the graphs represent cross-wavelet power plot, wavelet coherence plot, average power plot and phase difference image (from left to right on each row). time-lagged data were used for x(t)/x(t+ ) (a) and x(t)/x(t ) (b). white contour lines depict joint significance of periodicity. black arrows reflect the phase difference in the areas with significantly joint periods. the solid red dots on the average power plot (the third from the left) indicate significantly joint periods at a probability of error . . the color bars reveal the cross-wavelet power levels.
figure s : return plots in dimensions for poland.
new infections per day. top) entire observation period. th march through th november . middle) contained phase. partial time frame through th september . bottom) exacerbating phase. partial time frame from st september .
figure s : bivariate wavelet analysis by us state. the graphs display cross-wavelet power plot, wavelet coherence plot, average power plot and phase difference image (from left to right on each row). time-lagged data were used for x(t)/x(t+ ) (a) and x(t)/x(t ) (b). white contour lines indicate significance of joint periodicity. black arrows indicate the phase difference in the areas with significantly joint periods. the solid red dots on the average power plot (the third from the left) indicate significance at a level of . .
figure s : evolution of lyapunov exponents over time. for a discrete mapping $x(t+1) = f(x(t))$ we calculate the local expansion of the flow by considering the difference of trajectories. the lyapunov characteristic exponent can be approximated as $\lambda \approx \ln\left(|x_{n+1} - y_{n+1}| / |x_n - y_n|\right)$ for points $x_n$, $y_n$ close to each other on the trajectory [https://www.math.tamu.edu/~mpilant/math /matlab/lyapunov/lorenzspectrum.pdf]. the changes of lyapunov exponents are presented for the return plots of lags x(t+ ) versus x(t), x(t+ ) versus x(t), x(t+ ) versus x(t), and x(t+ ) versus x(t). a) countries. shown are ranges over days. b) us states. shown are ranges over days. mass. = massachusetts.
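the two-trajectory approximation of the lyapunov exponent given in the caption of figure s above can be illustrated on a toy discrete map. the sketch below is an assumption-laden example, not the procedure used for the figure: the logistic map stands in for the unknown mapping f, and the function names, starting value, perturbation size `d0`, and step counts are all hypothetical choices.

```python
import math


def logistic(x, r):
    """Toy discrete map x(t+1) = r * x(t) * (1 - x(t)); a stand-in for f."""
    return r * x * (1.0 - x)


def lyapunov_estimate(r, x0=0.3, d0=1e-9, steps=2000, discard=200):
    """Average ln(|x_{n+1} - y_{n+1}| / |x_n - y_n|) along one trajectory.

    At every step a neighbour y_n = x_n + d0 is placed next to the current
    point, both are advanced by one iteration, and the log of the stretch
    factor is accumulated.
    """
    x = x0
    for _ in range(discard):  # drop the initial transient
        x = logistic(x, r)
    total, count = 0.0, 0
    for _ in range(steps):
        y = x + d0
        x_next, y_next = logistic(x, r), logistic(y, r)
        d1 = abs(y_next - x_next)
        if d1 > 0.0:  # skip the degenerate case where the points coincide
            total += math.log(d1 / d0)
            count += 1
        x = x_next
    return total / count


stable = lyapunov_estimate(2.5)   # trajectory settles on a fixed point
chaotic = lyapunov_estimate(3.9)  # trajectory is chaotic
```

a negative estimate indicates that nearby trajectories converge, while a positive one indicates that they separate exponentially.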
metabolite discovery through global annotation of untargeted metabolomics data
li chen , , wenyun lu , , lin wang , , xi xing , , xin teng , xianfeng zeng , , antonio d. muscarella , yihui shen , alexis cowan , , melanie r. mcreynolds , , brandon kennedy , ashley m. lato , shawn r. campagna , mona singh , , joshua rabinowitz , , ,#
institute of metabolism and integrative biology, fudan university, shanghai, , china.
lewis-sigler institute for integrative genomics, princeton university, princeton, nj, , usa.
department of chemistry, princeton university, princeton, nj, , usa.
department of molecular biology, princeton university, princeton, nj, , usa.
lotus separation llc, department of chemistry, princeton university, princeton, nj, , usa
department of chemistry, the university of tennessee at knoxville, knoxville, tn, , usa
department of computer science, princeton university, princeton, nj, , usa.
# corresponding author, e-mail: joshr@princeton.edu
abstract
a primary goal of metabolomics is to identify all biologically important metabolites. one powerful approach is liquid chromatography-high resolution mass spectrometry (lc-ms), yet most lc-ms peaks remain unidentified. here, we present a global network optimization approach, netid, to annotate untargeted lc-ms metabolomics data. we consider all experimentally observed ion peaks together, and assign annotations to all of them simultaneously so as to maximize a score that considers properties of peaks (known masses, retention times, ms/ms fragmentation patterns) as well as network constraints that arise based on mass difference between peaks. global optimization results in accurate peak assignment and trackable peak-peak relationships. applying this approach to yeast and mouse data, we identify a half-dozen novel metabolites, including thiamine and taurine derivatives. isotope tracer studies indicate active flux through these metabolites. thus, netid applies existing metabolomic knowledge and global optimization to annotate untargeted metabolomics data, revealing novel metabolites.
.cc-by-nc-nd . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . .
http://creativecommons.org/licenses/by-nc-nd/ . / introduction metabolomics provides a snapshot of small-molecule concentrations in a biological system. in so doing, it reflects the integrated impact of genetics and the environment on metabolism. one important role of metabolomics is annotating previously unknown or underappreciated metabolites. for example, metabolomics facilitated identification of -hydroxyglutarate as an oncometabolite, eventually leading to the development of inhibitors of -hydroxyglutarate synthesis as anticancer agents , . metabolomics also contributed to identification of a diversity of natural products , and disease biomarkers . a common experimental strategy in metabolomics is liquid chromatography-high resolution mass spectrometry (lc-ms). lc-ms metabolomics measures thousands of ion peaks, of which hundreds are associated with known metabolites. a much greater number of peaks, however, still remain unannotated. the standard approach to peak annotation is to compare exact mass and either retention time or ms/ms fragmentation pattern to authenticated standards. to facilitate such comparisons, extensive chemical databases have been developed (e.g. metlin , hmdb , mona , kegg , pubchem , chebi and nist ), with software tools available for automated peak picking and database comparison. modern software also includes features for annotating peaks arising from isotopes and adducts of known metabolites, based on co-elution and characteristic mass differences (e.g. xcms , , gnps , ms-dial , mzmine , and camera ). such peaks seem to account for at least half of non-background lc-ms features , . despite this progress, a great number of unknown peaks remain, and figuring out their identities is a primary challenge in the field. one promising approach is network analysis, capitalizing on peak-peak relationships to increase annotation scope and accuracy. connections can be drawn based on similar responses across experiments and/or ms similarity. 
such connections can arise either through biochemical activities or mass spectrometry phenomena, such as isotopes, adducts, or in-source fragments. while distinct metabolites typically separate chromatographically, ions connected through mass spectrometry phenomena co-elute. workflows employing the concept of molecular connectivity have been used to build networks (e.g., gnps, cliquems, metdna, biocan, and ipa), and are showing increasing utility for annotating metabolomics data in diverse contexts. for example, gnps has been used broadly in identifying natural products. existing algorithms generally focus on metabolite peaks with ms spectra available, using ms spectral data as the main annotation driver. this is an effective strategy for annotating high abundance peaks with informative ms spectra, such as major secondary metabolites. it is less effective, however, for many low abundance metabolomics peaks, due to poor quality or less informative ms spectra. we accordingly set out to develop a network algorithm for annotating the breadth of metabolomics peaks, capitalizing on available ms spectra but also including low abundance peaks lacking ms spectra. effective incorporation of peaks without ms spectra required making yet better use of peak-peak relationships to enhance annotation accuracy, which we achieved through the computational approach of global optimization: not dealing with peak annotation one by one, but instead all at once to take full advantage of all available information.
this global optimization strategy had not previously been applied in the context of molecular networking analysis. to this end, we present the algorithm “netid”. similar to existing network analysis approaches, nodes are experimentally observed non-background ion peaks and connections are mass differences between peaks. we explicitly distinguish connections due to biotransformations (“biochemical connections” linking two metabolites) from those due to mass spectrometry phenomenon (“abiotic connections” linking isotopes, adducts, and fragments to the metabolites from which they are derived). peak annotation occurs in a single global optimization step, based on linear programming, that enforces a single formula assignment for each experimentally observed ion peak. using this approach, we can annotate roughly % of untargeted metabolomics peaks, with a majority being isotopes and adducts of known metabolites. through these efforts, we provide likely formulae for several hundred novel metabolites, and confirm the identities of half-dozen species not currently in metabolomics databases. results netid algorithm netid involves three computational steps: initial annotation, scoring, and optimization (figure ). the workflow starts with a peak table that contains a list of peak m/z, rt, intensity, and (when available) associated ms spectra, with background peaks removed by comparing to a process blank sample. each peak defines a node in the network. in the initial annotation phase, we match every experimentally measured node m/z to formulae in the hmdb database. peaks matching to hmdb formula within ppm are annotated as seed nodes, from which we extend edges to build the network. edges connect two nodes via gain or loss of specific chemical moieties (atoms). the atom differences can occur either due to metabolism (biochemical connection) or due to mass spectrometry phenomena (abiotic connections). 
for example, a difference of h suggests an oxidation/reduction relationship and defines a biochemical edge. a difference of na-h suggests sodium adduct formation and is a type of abiotic edge (adduct edge). other atom differences define other types of abiotic connections (isotope or fragment edges). most atom differences are specific to biochemical, adduct, isotope, or fragment edges, but a few occur in multiple categories. for example, h o loss can be either biochemical (enzymatic dehydration) or abiotic (in-source water loss). by integrating literature and in-house data, we assembled a list of biochemical atom differences and abiotic atom differences which together define all connections in the network (supplementary table , ). using these lists, starting from the seed nodes, we draw all feasible edges such that (i) Δm/z between the connected nodes matches the atom mass difference and (ii) only co-eluting peaks are connected by abiotic edges. through the edge extension process, possible formulae are assigned to nodes outside the initial seeds. a few rounds of edge extension suffice to give thorough coverage. due to finite mass measurement precision, a single node may be assigned multiple contradictory formulae, which are resolved at the optimization step (see methods). netid then scores every node and edge annotation. node annotations are scored based on precision of m/z match to the molecular formula, precision of retention time match to known metabolite retention time and (when the relevant information is available) quality of ms spectra match to database structure.
in addition, there is a bonus for matching to formula in hmdb and a penalty for breaking basic chemical rules (seven golden rules for filtering molecular formulae ). biochemical edges receive a positive score for ms spectra similarity match between the connected nodes, and are otherwise unscored. abiotic edges are scored based on precision of co-elution with the parent metabolite, connection type (adduct, isotope, etc.), and features specific to the connection type, such as expected natural abundance for isotope peaks (see methods). the overall impact is to assign high scores to annotations that effectively align the experimentally observed ion peaks with prior metabolomics knowledge. with a score assigned for each potential node and edge annotation, we formulate the global network optimization problem as that of maximizing the network score with linear constraints that each node and edge has a single unique annotation and that these are consistent (e.g. peaks connected by h edge must have formula differing by h). such optimization is readily performed by linear programing with a typical runtime of hours in r on a personal computer, and results in an optimal and consistent network annotation. global network optimization as an example of the utility of global network optimization, where all peaks and connections are simultaneously considered to enhance annotation accuracy, we present an example network containing five peaks (figure a). we first match experimental measurements to the database, annotating node a and node b as seed nodes adenosine monophosphate (amp, c h n o p) and adenosine (c h n o ), respectively. we also identify five possible connections between the five nodes. two alternative networks are generated by extending annotations. in the left network, node c is annotated as adenosine hcl adduct (c clh n o ), whereas in the right network, node c is annotated as a putative metabolite (c h n o p) resulting from co loss from amp. 
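the optimization idea described above (pick one annotation per node so that the summed node and edge scores are maximal, with edges only counting when the two node annotations are consistent) can be illustrated with a minimal sketch. this is not the netid implementation: the text states that netid uses linear programming in r, whereas the sketch below swaps in an exhaustive search that is only practical for toy networks. the function name `best_network`, the annotation labels and all scores are hypothetical, loosely modeled on the chlorine-adduct example in this section.

```python
from itertools import product


def best_network(node_candidates, edges):
    """Exhaustively pick one annotation per node to maximize the total score.

    node_candidates: {node: [(annotation, node_score), ...]}
    edges: [(u, v, is_consistent, edge_score), ...]; an edge contributes its
           score only when is_consistent(annotation_u, annotation_v) is True.
    """
    nodes = list(node_candidates)
    best_total, best_assignment = float("-inf"), None
    # Enumerate every combination of one candidate annotation per node.
    for choice in product(*(node_candidates[n] for n in nodes)):
        assignment = {n: ann for n, (ann, _) in zip(nodes, choice)}
        total = sum(score for _, score in choice)
        for u, v, is_consistent, edge_score in edges:
            if is_consistent(assignment[u], assignment[v]):
                total += edge_score
        if total > best_total:
            best_total, best_assignment = total, assignment
    return best_assignment, best_total


# Hypothetical scores, chosen only for illustration.
node_candidates = {
    "c": [("hcl_adduct", 1.0), ("co_loss_product", 1.2)],
    "e": [("cl_isotope_of_c", 0.0), ("unannotated", 0.0)],
}
# A chlorine isotope edge is only consistent if node c actually contains cl.
edges = [
    ("c", "e", lambda c, e: c == "hcl_adduct" and e == "cl_isotope_of_c", 1.0),
]

assignment, total = best_network(node_candidates, edges)
print(assignment, total)
```

in this toy case the adduct annotation wins even though its node score alone is lower, because only that choice lets the isotope edge contribute. a linear program reaches the same kind of decision without enumerating every combination.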
node d is the c isotope of node c in both networks. node e is annotated as the cl isotope of node c in the left network, and is unannotated in the right network because there is no cl atom in the parent molecule. the left network has higher total node and edge annotation scores than the right network, and thus is selected by netid. this selection makes sense to an experienced mass spectrometrist: the cl isotope signature in node e indicates that node c should contain cl. the strength of netid is that it automatically captures such logic, and uses global computational optimization to extend such inferences across the network in an automated manner. to test the netid workflow, we applied it to both yeast and liver datasets, in both positive and negative ionization mode (figure b, c). considering the example of negative mode yeast data with a total of , non-background peaks, in the initial annotation step, roughly , potential formulae were assigned to , peaks, with about peaks receiving multiple formula annotations. these nodes were connected by just over , potential edges. edge extension expanded coverage to over , nodes with an average of twelve potential formulae each, highlighting the importance of scoring and network optimization to assign proper formulae. after scoring node and edge annotations, global network optimization settled on about , unique node annotations. about % of the annotated peaks were metabolites, % were putative novel metabolites, and the rest were mass spectrometry phenomena, such as adducts, fragments, and isotopes.
nodes were connected by about , edges, roughly evenly split between biochemical and abiotic connections (figure c, supplementary fig. a). more than % of annotated nodes fell into a single dominant connected network (supplementary fig. b), reflecting most peaks being connected to core metabolism. about % of peaks, however, remained unannotated. these unannotated peaks likely reflect deficiencies in our lists of allowed atom differences, including additional forms of mass spectrometry phenomena. for example, manual examination of the unconnected peaks revealed a dozen nickel adducts of known compounds (supplementary table ). importantly, the annotated peaks included several hundred novel metabolite formulae (supplementary fig. , supplementary data ). collectively, these provide a wealth of opportunities for metabolite discovery.
thiamine-derived metabolites
netid optimization provided not only a list of putative metabolites, but also connections linking these putative metabolites to known metabolites. in the yeast metabolomics dataset, we found three putative metabolites that have total ion current > , connected in a subnetwork around thiamine. their formulae are c h n o s (thiamine+o), c h n o s (thiamine+c h o) and c h n o s (thiamine+c h o) (figure a). while not found in hmdb, thiamine+o is documented in metlin as a thiamine oxidation product, so we focused on the other two potential thiamine derivatives. ms/ms spectra of the putative thiamine+c h o and thiamine+c h o contained characteristic thiamine fragments. both contained a classical pyrimidine fragment, with thiamine+c h o also containing an acetylated pyrimidine fragment, leading to a probable structure (figure a,b). the structural assignment is further supported by the presence of an unmodified thiazole fragment. in contrast, thiamine+c h o lacked a classical unmodified thiazole fragment, instead showing a thiazole+c h o fragment (and a fragment with further water loss) (figure a,b). isotope tracing experiments further confirm that these two peaks contain thiamine. when fed [u- c]glucose as sole carbon source, yeast synthesize thiamine de novo, resulting in fully labeled thiamine species, with carbon counts matching the netid formula assignments (figure c). when unlabeled thiamine is added to the [u- c]glucose culture media, yeast take up the unlabeled thiamine, resulting in unlabeled thiamine and m+ labeled thiamine+c h o and thiamine+c h o species. although discovered in yeast, these are conserved metabolites, found also in mammalian samples (figure d). acetylation is one of the biochemical atom transformations allowed in netid. the addition of c h o is much less common biochemically, and was captured in netid as two steps, acetylation followed by reduction. accordingly, we looked into thiamine metabolism to explore how thiamine+c h o might be produced. thiamine pyrophosphate is an important cofactor in pyruvate dehydrogenase (pdh, the entry step to the tca cycle) (figure e). the de-pyrophosphorylation product of the thiamine intermediate in the pdh reaction matches the proposed thiamine+c h o structure (figure f). based on this biochemical route, we realized that analogous products could be formed by α-ketoglutarate dehydrogenase (thiamine+c h o ) and branched-chain keto acid dehydrogenase (thiamine+c h o) (figure f).
peaks at both of these exact masses were also experimentally observed, with isotope labeling results supporting their being thiamine-derived metabolites (supplementary fig. ). thus, netid enabled the discovery of four novel thiamine-derived metabolites.
n-glucosyl-taurine
we similarly carried out netid annotation of a mouse liver dataset. we observed multiple putative metabolite peaks linked to taurine, by apparent glucosylation (+c h o ), palmitoylation (+c h o) and transamination (+o-nh ) (figure a). the latter two, while missing in hmdb, were found in metlin: n-palmitoyl taurine (c h no s) and sulfoacetaldehyde (c h o s). to elucidate the structure of the putative taurine glucosylation product (c h no s), we chemically synthesized n-glucosyl-taurine. synthetic n-glucosyl-taurine matched the retention time and ms/ms fragmentation pattern of the observed c h no s peak (figure b,c). in liver samples of mice infused with [u- c]glucose, c h no s appeared in m+ form, suggesting active synthesis of n-glucosyl-taurine from circulating glucose (figure d). n-glucosyl-taurine was not observed in yeast extract but was detected in multiple mouse tissues. quantitation using the synthetic standard shows that liver has the highest level of glucosyl-taurine at ~ μm (figure e, supplementary fig. ). this ranks among the few dozen most abundant liver metabolites.
discussion
the advent of lc-ms metabolomics revealed tens of thousands of metabolite peaks not matching known formulae, raising the possibility that the majority of metabolites remained to be discovered.
while the biosphere likely contains many novel metabolites, it has been increasingly recognized that most peaks in typical untargeted metabolomics studies do not arise from novel metabolites, but rather mass spectrometry phenomena. the goal of comprehensively annotating untargeted metabolomics peaks with molecular formulae has, however, remained elusive. one promising strategy for peak annotation involves building molecular networks where nodes are lc-ms peaks (with associated molecular formulae) and edges are atom transformations linking the peaks. here we advance this strategy by combining metabolomics knowledge with computational global optimization. we explicitly differentiate biochemical connections reflecting metabolic activity and abiotic connections arising from mass spectrometry phenomena. by formulating the peak annotation challenge as a linear program, we identify an optimal network in light of all observed peaks. rather than weeding out peaks from mass spectrometry phenomena like adducts and natural isotopes, this approach takes advantage of the information embedded in them. it further provides traceable peak-peak relationships, which illuminate the basis for assigned formulae and suggest candidate structures. applying this approach to untargeted lc-ms data from yeast and liver samples, we assign formulae to roughly three-quarters of all non-background peaks. in each of positive and negative mode, the annotated peaks cover about known metabolites, with on average more than four mass peaks for every metabolite (e.g. m+h plus three adduct or isotope peaks). this leaves a couple thousand unannotated peaks from each lc-ms run. based on the observed ratio between peaks and metabolites, this likely correspond to hundreds (but not thousands) of unidentified metabolites. this number may actually be less, due to novel adducts (e.g. nickel adducts, which we discovered via careful examination of the unannotated peaks) or other mass spectrometry phenomena. 
importantly, this approach has already generated likely formulae for many hundreds of putative novel metabolites (supplementary fig. , supplementary data ), including a half-dozen for which we assign structures (figure , ). a key benefit of molecular network-based annotation is the ability to steadily assimilate new information. each newly identified metabolite provides an additional anchor point for optimizing the network. other data types can be seamlessly added. for example, compound class categorization based on ms/ms data or retention time prediction can be added to score nodes. labeling similarity upon feeding different isotope-labeled nutrients could potentially be added to score edges. global optimization, integrating all new information comprehensively with prior knowledge to arrive at optimal annotations, is novel and potentially transformative for the field more broadly. the cycle of careful experimentation and focused computational method development holds the potential to identify most unknown metabolites over the coming decade, providing a robust blueprint of the metabolome (figure ).
methods
yeast metabolomics sample preparation and isotope labeling
s.
cerevisiae strain fy was grown for at least generations in minimal essential media containing . % [u- c] or [u- c] glucose and mm ammonium sulfate with or without . mg/l thiamine hydrochloride . then, in mid-exponential phase, ml culture broth (od = . ) was filtered and metabolites were extracted using ml extraction buffer ( : : : . acetonitrile:methanol:water:formic acid), followed by adding μl neutralization buffer ( % nh hco ). the extracts were kept at - ℃ for at least min to precipitate protein before centrifuging at , g for min. the supernatant was used for lc–ms analysis. murine metabolomics sample preparation and intravenous infusion experiment animal studies followed protocols approved by the princeton university institutional animal care and use committee. twelve-month-old female wild-type c bl/ mice (the jackson laboratory, bar harbor, me) on normal diet were sacrificed by cervical dislocation and tissues quickly dissected and snap frozen in liquid nitrogen with precooled wollenberger clamp. frozen samples from liquid nitrogen were then transferred to − °c freezer for storage. to extract metabolites, frozen liver tissue samples were first weighed (~ mg each) and transferred to ml round-bottom eppendorf safe-lock tubes on dry ice. samples were then ground into powder with a cryomill machine (retsch, newtown, pa) for seconds at hz, and maintained at cold temperature using liquid nitrogen. for every mg tissues, ul extraction buffer (as above) was added to the tube, vortexed for seconds, and allowed to sit on ice for minutes. then l neutralization buffer was added and the samples vortexed. the samples were allowed to sit on ice for minutes and then centrifuged at , g for min at °c. the supernatants were transferred to another eppendorf tube and centrifuged at , g for another min at °c. the supernatants were transferred to glass vials for lc-ms analysis. 
a procedure blank was generated identically without tissue, which was used later to remove the background ions. detailed methods for intravenous infusion of mice have been described previously . briefly, in vivo infusions were performed on – -week-old c bl/ mice pre-catheterized in the right jugular vein (charles river laboratories). mice were kept fasted for h and then infused for . h with [u- c]glucose ( mm, . l/min/g). the mouse infusion setup (instech laboratories) included a tether and swivel system so that the animal had free movement in the cage. venous samples were taken from tail bleeds. at the end of the infusion, the mouse was euthanized by cervical dislocation and tissues were collected and extracted as above. serum metabolites were extracted by adding l methanol to l of serum and centrifuging for min. the supernatant was used for lc–ms analysis. lc-ms and lc-ms/ms lc separation was achieved using a vanquish uhplc system (thermo fisher scientific) with an xbridge beh amide column ( × mm, . µm particle size; waters). solvent a is : water: acetonitrile with mm ammonium acetate and mm ammonium hydroxide at ph . , and solvent b is acetonitrile. the gradient is min, % b; min, % b; min, %; min, % b; min, %, min, % b; min, % b; min, % b; min, % b; min, % b; min, % b, . min, % b; min, % b; min, % b. total running time is min at a flow rate of µl/min. lc-ms data were collected on a q-exactive plus mass spectrometer (thermo fisher) operating in full scan mode with a ms scan range of m/z - , and resolving power of , at m/z .
other ms parameters are as follows: sheath gas flow rate, (arbitrary units); aux gas flow rate, (arbitrary units); sweep gas flow rate, (arbitrary units); spray voltage, . kv; capillary temperature, °c; s- lens rf level, ; agc target, e and maximum injection time, ms. to demonstrate the utility of inclusion of ms data for netid analysis, and ms spectra were obtained for selected peaks with intensity > in positive and negative ionization mode respectively from a previous liver dataset . targeted ms spectra were collected using the prm function at ev hcd energy with other instrument setting being, resolution , agc target , maximum it ms, isolation window . m/z. glucosyl-taurine synthesis glucosyl-taurine synthesis was carried out following previous literature reports with slight modifications . in brief, dry methanol was obtained by distillation of hplc-grade methanol (fisher; hplc grade . micron filtered) over cah (acros organics; ca. % extra pure, - mm grain size). a flame-dried round-bottom flask equipped with a reflux condenser and stir bar was charged with . g taurine (alfa aesar; %), . g d-glucose (acros organics; acs reagent), and ml of dry methanol. this mixture was sonicated under an inert atmosphere for minutes before being returned to the manifold for the reaction. to the fine-suspension of taurine and glucose in dry methanol at room temperature, . ml . m sodium methoxide in methanol (acros organics) was added via glass syringe. at this point, the suspension began to dissolve and after minutes, gave a clear and colorless solution. the solution was stirred vigorously under an inert atmosphere for hours, which resulted in a faint peach-colored solution. this solution was chilled to ˚c, and ~ ml of absolute ethanol ( proof) was added and precipitation was allowed to occur at this temperature for minutes. 
solvent was then removed by filtration over a glass filter (medium porosity), and the solid was washed with ~ ml of absolute ethanol, affording a fine pale-yellow powder ( . g; crude material). an nmr experiment was carried out to validate the structure of the synthesized n-glucosyl-taurine. selective tocsy experiments using dipsi spin-lock and with added chemical shift filter were run on a bruker avance iii hd nmr spectrometer equipped with a custom-made qci-f cryoprobe (bruker, billerica, ma) at mhz and at . k controlled temperature. the sample was dissolved in dmso-d . the spectra shown on the plots are results of ms sl mixing, scans each. data processing (mnova v. , mestrelab research s.l., santiago de compostela, spain) included zero filling, hz gaussian apodization, phase- and baseline correction. nmr analysis suggests that the final crude material contains . % n-glucosyl-taurine and unreacted substrates (supplementary figure ).
netid algorithm
i. data preparation and input
lc-ms raw data files (.raw) were converted to mzxml format using proteowizard (version . . ). el-maven (version . ) was used to generate a peak table containing m/z, retention time, and intensity for peaks. parameters for peak picking were the defaults except for the following: mass domain resolution is ppm; time domain resolution is scans; minimum intensity is ; minimum peak width is scans. the resulting peak table was exported to a .csv file. redundant peak entries due to the imperfect peak picking process are removed if two peaks are within . min and their m/z difference is within ppm.
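the redundancy filter described above can be sketched as follows. this is a minimal illustration, not the authors' code: the function name `remove_redundant_peaks`, the dictionary keys and the example peaks are hypothetical, and the tolerance values passed in the example are placeholders rather than the settings used in the study. the text does not say which member of a redundant pair is retained; keeping the most intense one is an assumption made here.

```python
def remove_redundant_peaks(peaks, rt_tol_min, ppm_tol):
    """Drop near-duplicate peak entries.

    peaks: list of dicts with keys "mz", "rt" (minutes) and "intensity".
    Two entries are treated as redundant when their retention times differ by
    at most rt_tol_min and their m/z values differ by at most ppm_tol ppm.
    Assumption: the most intense entry of each redundant group is kept.
    """
    kept = []
    # Visit peaks from most to least intense so the strongest entry survives.
    for peak in sorted(peaks, key=lambda p: p["intensity"], reverse=True):
        is_duplicate = any(
            abs(peak["rt"] - k["rt"]) <= rt_tol_min
            and abs(peak["mz"] - k["mz"]) / k["mz"] * 1e6 <= ppm_tol
            for k in kept
        )
        if not is_duplicate:
            kept.append(peak)
    return kept


# Hypothetical peaks and placeholder tolerances, for illustration only.
example_peaks = [
    {"mz": 180.06339, "rt": 5.00, "intensity": 1.0e6},
    {"mz": 180.06345, "rt": 5.03, "intensity": 4.0e5},  # near-duplicate of the first
    {"mz": 180.06340, "rt": 7.50, "intensity": 2.0e5},  # same mass, different rt
]
cleaned = remove_redundant_peaks(example_peaks, rt_tol_min=0.1, ppm_tol=2.0)
print(len(cleaned))
```

the pairwise comparison is quadratic in the number of peaks, which is acceptable for a sketch; sorting by m/z and scanning a sliding window would scale better for large peak tables.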
a peak is removed as background if its intensity in the procedure blank sample is > . -fold of that in the biological sample. the m/z values of the remaining peaks are recalibrated by applying an absolute m/z adjustment factor $\varepsilon_{absolute}$ (independent of measured m/z) and a relative m/z adjustment factor $\varepsilon_{relative}$ (linearly dependent on measured m/z). for each peak i, the recalibrated value $i_{m/z,\,adjusted}$ is computed as

$$i_{m/z,\,adjusted} = i_{m/z,\,measured} \times (1 + \varepsilon_{relative}) + \varepsilon_{absolute}$$

the $\varepsilon_{relative}$ and $\varepsilon_{absolute}$ values are fit via linear regression using measured m/z values of selected known metabolite ion peaks and their calculated m/z. that is, for each of these known metabolites k, we have the equation

$$k_{m/z,\,calculated} = k_{m/z,\,measured} \times (1 + \varepsilon_{relative}) + \varepsilon_{absolute}$$

lc-ms/ms data were extracted from the mzxml files using lab-developed matlab code. ms spectra may contain interfering product ions from co-eluting isobaric parent ions. these interfering product ions were removed by examining the extracted ion chromatogram (eic) similarity between the product ions in ms data and the parent ion in ms data. a pearson correlation coefficient of . was used as a cutoff to retain those product ions that have an eic similar to that of the parent ion. the cleaned ms data were exported to excel files for further processing. structures, formulae, m/z and ms spectra of metabolites were obtained from the human metabolome database (hmdb, version . ), and retention times of selected metabolites were determined through running authentic standards using the above-mentioned lc-ms method. the netid algorithm requires three types of input files: a peak table (in .csv format) recording m/z, retention time, and intensity for peaks; an atom difference rule table (in .csv format) containing a list of biochemical atom differences and abiotic atom differences which together define all connections in the network (supplementary table , ); and metabolite information files containing structure, formula, m/z and ms spectra of hmdb metabolites and retention time of selected metabolites under different lc conditions. an exemplary peak table from the yeast dataset, atom difference rule table and hmdb metabolite information file are provided in supplementary data .
ii. initial annotation of nodes and edges in the network
the first step of the netid algorithm is to make an initial annotation for seed nodes, determine possible annotations for other nodes, and determine edges in the network. each peak is a node in the network. we compare the experimentally measured m/z for each node to those of all metabolite formulae in the hmdb database. when the m/z difference is within ppm, candidate formulae and hmdb ids are assigned to the node, and this node is defined as a primary seed node. a primary seed node can contain more than one candidate formula and hmdb id if all are within the m/z difference range. edges connect two nodes via gain or loss of specific atoms. we assembled a list of biochemical atom differences and abiotic atom differences which together define all connections in the network (supplementary table , ). let each of these differences be denoted by $D_i$. for each node u, if there is a node v such that the difference in the measured m/z of the nodes matches one of those in the list of atom mass differences, we add an edge between u and v.
that is, if $u_{m/z}$ and $v_{m/z}$ are the experimentally measured m/z for the peaks corresponding to nodes u and v respectively (assuming $v_{m/z} > u_{m/z}$ for simplicity), then there is an edge between these nodes if there is some difference $D_i$ such that

$\lvert v_{m/z} - u_{m/z} - D_i \rvert < v_{m/z} \times$ ppm

if $D_i$ is an abiotic difference, in order to add an edge, it is additionally required that the retention times of the two nodes be within . min of each other. that is, if $u_{RT}$ and $v_{RT}$ are the retention times for u and v respectively, then it is required that

$\lvert v_{RT} - u_{RT} \rvert <$ . min

for each node, its candidate formulae set will expand due to propagating formulae from its neighboring nodes through edge atom differences. for example, when applying the atom difference of edge (u, v) to the formula assigned to primary seed node u, we can derive a new candidate formula for the connected node v. if the derived formula's calculated m/z is within ppm of node v's measured m/z, then a new candidate formula is added for node v. iterating the process over all candidate formulae of node u through edge (u, v) will further expand the candidate formulae for node v. we apply the above extension process to formulae of all primary seed nodes through atom difference edges, and these new candidate formulae can themselves be used for another round of extension. note that a primary seed node will be treated like the rest of the nodes during the subsequent rounds of extension, and may also be assigned new formulae.
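the two edge conditions above (a mass difference matching a listed atom difference within a ppm tolerance, plus co-elution for abiotic differences) can be sketched as below. this is an illustrative sketch rather than the netid code: the function name `build_edges`, the tuple layouts and the example peaks are hypothetical, and the tolerances used in the example are placeholders, not the values used in the study. only two atom differences are included, with their monoisotopic mass differences (h2 ≈ 2.01565, na−h ≈ 21.98194).

```python
def build_edges(peaks, atom_differences, ppm_tol, rt_tol_min):
    """Connect peaks whose m/z difference matches a listed atom difference.

    peaks: list of (mz, rt) tuples, rt in minutes.
    atom_differences: list of (name, mass_difference, is_abiotic) tuples.
    Returns a list of (index_u, index_v, name) with mz_v > mz_u.
    """
    edges = []
    for i, (mz_u, rt_u) in enumerate(peaks):
        for j, (mz_v, rt_v) in enumerate(peaks):
            if mz_v <= mz_u:
                continue  # consider each pair once, lighter peak first
            for name, mass_diff, is_abiotic in atom_differences:
                # Condition (i): mass difference within the ppm tolerance.
                mass_ok = abs(mz_v - mz_u - mass_diff) < mz_v * ppm_tol * 1e-6
                # Condition (ii): abiotic edges additionally require co-elution.
                rt_ok = (not is_abiotic) or abs(rt_v - rt_u) < rt_tol_min
                if mass_ok and rt_ok:
                    edges.append((i, j, name))
    return edges


# Hypothetical peaks (m/z, rt) and placeholder tolerances.
example_peaks = [
    (180.06340, 5.00),
    (182.07905, 6.20),  # +H2 relative to peak 0, different rt
    (202.04534, 5.01),  # +(Na-H) relative to peak 0, co-eluting
    (204.06099, 9.00),  # +(Na-H) relative to peak 1, but not co-eluting
]
example_differences = [
    ("H2", 2.01565, False),    # biochemical: oxidation/reduction
    ("Na-H", 21.98194, True),  # abiotic: sodium adduct
]
example_edges = build_edges(example_peaks, example_differences,
                            ppm_tol=5.0, rt_tol_min=0.1)
print(example_edges)
```

in the example, the last peak matches the sodium adduct mass difference relative to peak 1 but elutes far from it, so no abiotic edge is drawn. it is still linked to peak 2 by a biochemical h2 edge, for which no retention time restriction applies.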
to avoid duplicated effort in the extension process, we allow formulae of primary seed nodes and biotransformed formulae thereof to be extended through both biotransformation and abiotic atom difference edges, and do not allow abiotic candidate formulae to be further extended through biotransformation atom difference edges. the default extension process includes two rounds of biotransformation edge extensions and three rounds of abiotic edge extensions.

iii. scoring node annotations

netid then scores every candidate node and edge annotation assigned in the initial annotation step. the node scoring system aims to assign high scores to annotations that align observed ion peaks with known metabolites based on m/z, retention time, ms/ms, and/or isotope abundances. let the set of candidate annotations for node u be denoted as $\{a_1, \ldots, a_i, \ldots, a_n\}$. for each node u and each of its candidate annotations $a_i$, let $S(u, a_i)$ denote the score of candidate annotation $a_i$ for node u. the scoring components for candidate node annotations are defined below:

(a) $S_{m/z}(u, a_i)$ is negative when the measured m/z differs from the calculated m/z of the assigned molecular formula. a larger ppm difference between the calculated formula m/z and the measured m/z results in a lower score. the default scale factor is - . . let $a_{i,m/z}$ be the calculated formula m/z of annotation $a_i$, and $u_{m/z}$ be the measured m/z of node u, then

$S_{m/z}(u, a_i) = -$ . $\times\ |u_{m/z} - a_{i,m/z}| \,/\, u_{m/z}\ \times$ ( )

(b) $S_{rt}(u, a_i)$ is positive if the measured rt of the peak corresponding to node u matches a known standard. a smaller difference between known and measured rt results in a higher score. let $a_{i,rt}$ be the known rt of annotation $a_i$, and $u_{rt}$ be the measured rt of node u, then

$S_{rt}(u, a_i) =$ $-\ |u_{rt} - a_{i,rt}|$, if $|u_{rt} - a_{i,rt}| <$ . min; otherwise, $S_{rt}(u, a_i) =$ ( )

(c) $S_{ms}(u, a_i)$ is positive if the measured ms spectrum of node u matches the database ms spectrum of annotation $a_i$. a dot product scoring function is used to score the ms spectra similarity.
the intensities of the fragment ions in the ms spectra are rescaled so that the highest fragment ion is set to . ms spectra are represented as $W = [\text{relative intensity of ms ions}]^n\,[\text{m/z value}]^m$, with n = , m = . the dot product (dp) and the score for an ms match, $S_{ms}(u, a_i)$, are defined as below.

dp = ∑ ∑ × ∑ ( )

$S_{ms}(u, a_i) =$ dp, if dp > . ; otherwise, $S_{ms}(u, a_i) =$ ( )

(d) $S_{database}(u, a_i)$ is positive if the annotated formula $a_i$ exists in hmdb. we give a positive score to a primary seed node annotation if the annotated formula exists in hmdb.

$S_{database}(u, a_i) =$ . , if $a_i$ in hmdb; otherwise, $S_{database}(u, a_i) =$ ( )

(e) $S_{missing\_isotope}(u, a_i)$ is negative if an isotopic peak is missing. we penalize a formula annotation if it passes the intensity threshold (default at x ) but does not have isotopic peaks of specified elements. the default isotope being evaluated is cl. any other elements, such as c or o, can be included by users.

$S_{missing\_isotope}(u, a_i) = -$ , if isotopic peak is missing; otherwise, $S_{missing\_isotope}(u, a_i) =$ ( )

(f) $S_{rule}(u, a_i)$ is negative if annotation $a_i$ violates basic chemical rules. we strongly penalize formulae that violate basic chemical rules, including a negative rdbe (ring and double bond equivalents), and unlikely element ratios in metabolites (o/p < , o/si < ).

$S_{rule}(u, a_i) = -$ , if chemical rules are violated; otherwise, $S_{rule}(u, a_i) =$ ( )

(g) $S_{derivative}(u, a_i)$ is positive if the annotation $a_i$ is derived from a parent peak p with an annotation h that has a high score $S_{parent}(p, h)$, which is calculated by summing the scores in (a)-(f) for S(p, h).

$S_{derivative}(u, a_i) = S_{parent}(p, h) -$ .
( )

$S_{parent}(p, h) = S_{m/z}(p, h) + S_{rt}(p, h) + S_{ms}(p, h) + S_{database}(p, h) + S_{missing\_isotope}(p, h) + S_{rule}(p, h)$ ( )

this is particularly helpful in annotating abiotic peaks. for example, the annotation of a glutamate sodium adduct will be given a positive $S_{derivative}$ when its parent node is annotated as glutamate with a high $S_{parent}$ score. a final score $S(u, a_i)$ for each candidate annotation $a_i$ of node u is calculated by summing the scores in (a)-(g).

$S(u, a_i) = S_{m/z}(u, a_i) + S_{rt}(u, a_i) + S_{ms}(u, a_i) + S_{database}(u, a_i) + S_{missing\_isotope}(u, a_i) + S_{rule}(u, a_i) + S_{derivative}(u, a_i)$ ( )

note that for each node u, one of the candidate “annotations” corresponds to no annotation being chosen for that node. the node score for this null annotation is at default, and can be set to a negative value to promote choosing actual annotations.

iv. scoring edge annotations (biological, adduct, isotope)

the edge scoring system aims to assign high scores to edge annotations that correctly capture biochemical connections between metabolites (based on ms spectra similarity) and abiotic connections between metabolites and their mass spectrometry phenomena derivatives, such as isotopes and adducts. biochemical, isotope, and adduct edge annotations are the most common types; other less common abiotic connection types are described in the subsequent section. suppose we consider two nodes u and v that are connected by an edge (u, v).
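before the edge scores are defined, the node scoring of section iii can be recapped with a small python sketch. it covers only components (a) and (b) and the final summation with a null annotation. the functional forms are assumptions made for illustration (a penalty equal to a scale factor times the ppm error, and a reward equal to a maximum score minus the rt difference inside a window, zero outside it), as are all numeric values, the default `null_score=0.0`, and the names `mz_score`, `rt_score` and `best_annotation`; none of these is confirmed by the text.

```python
def mz_score(measured_mz, formula_mz, scale):
    """Component (a): penalty proportional to the ppm error (assumed form)."""
    ppm_error = abs(measured_mz - formula_mz) / measured_mz * 1e6
    return -scale * ppm_error


def rt_score(measured_rt, known_rt, rt_window, max_score):
    """Component (b): reward for matching a known standard's RT (assumed form)."""
    if known_rt is None:          # no standard available for this annotation
        return 0.0
    diff = abs(measured_rt - known_rt)
    return max_score - diff if diff < rt_window else 0.0


def best_annotation(candidates, null_score=0.0):
    """Sum component scores per candidate and compare with the null annotation.

    candidates maps an annotation name to its list of component scores;
    the key None stands for "no annotation chosen".
    """
    totals = {name: sum(parts) for name, parts in candidates.items()}
    totals[None] = null_score
    return max(totals.items(), key=lambda kv: kv[1])


# Toy node: measured m/z 146.0459 at 5.0 min; all parameters are placeholders.
candidates = {
    "C5H9NO4": [mz_score(146.0459, 146.04588, scale=0.5),
                rt_score(5.0, 5.1, rt_window=0.5, max_score=1.0)],
    "other":   [mz_score(146.0459, 146.0470, scale=0.5),
                rt_score(5.0, None, rt_window=0.5, max_score=1.0)],
}
best = best_annotation(candidates)
print(best)
```

lowering `null_score` below zero makes it harder for the null annotation to win, which mirrors the remark that a negative null score promotes choosing actual annotations.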
for each pair of nodes u and v such that there is an edge (u, v), let the sets of candidate formulae for nodes u and v be denoted as $\{a_1, \ldots, a_i, \ldots, a_n\}$ and $\{b_1, \ldots, b_j, \ldots, b_m\}$, respectively, and let the set of candidate atom differences for edge (u, v) be $\{D_1, \ldots, D_k, \ldots, D_l\}$. let $S(u, v, a_i, b_j, D_k)$ be the score of choosing candidate formula $a_i$ for node u, candidate formula $b_j$ for node v and candidate atom difference $D_k$ for edge (u, v). note that $S(u, v, a_i, b_j, D_k)$ is set to be if atom difference $D_k$ does not represent the formula difference of $a_i$ and $b_j$.

$S(u, v, a_i, b_j, D_k) =$ , if $a_i - b_j \neq D_k$

the scoring components for candidate edge annotations are defined below:

(h) when nodes u and v have experimentally measured ms spectra, $S_{ms\_similarity}(u, v, a_i, b_j, D_k)$ is defined for a biochemical edge, and is a positive score if the two connected nodes u and v have ms similarity, given that the formula difference of $a_i$ and $b_j$ matches the atom difference defined by $D_k$. $S_{ms\_similarity}$ is determined using the dot product (dp), as described in the previous section, and the reverse dot product (dp_r), which evaluates the neutral ion loss similarity in the ms spectra. a reverse ms spectrum is represented as $R = [\text{relative intensity of ms ions}]^n\,[\text{parent m/z} - \text{measured m/z value}]^m$, with n = , m = .

dp = ∑ ∑ × ∑ ( )

dp_r = ∑ ∑ × ∑ ( )

$S_{ms\_similarity}(u, v, a_i, b_j, D_k) = \max(\mathrm{dp}, \mathrm{dp\_r})$, if max(dp, dp_r) > . ; otherwise, $S_{ms\_similarity}(u, v, a_i, b_j, D_k) =$ ( )

(i) $S_{co\_elution}(u, v, a_i, b_j, D_k)$ is defined for an abiotic edge, and is a negative score if the rts of the two connected nodes differ by more than a threshold ( .
min), given that the formula difference of $a_i$ and $b_j$ matches the atom difference defined by $D_k$.

$S_{co\_elution}(u, v, a_i, b_j, D_k) = -$ $\times\ |u_{rt} - v_{rt}|$, if $|u_{rt} - v_{rt}| \geq$ . min; otherwise, $S_{co\_elution}(u, v, a_i, b_j, D_k) =$ ( )

(j) $S_{type}(u, v, a_i, b_j, D_k)$ is defined for all edges, given that the formula difference of $a_i$ and $b_j$ matches the atom difference defined by $D_k$, and is a non-negative score depending on the connection type of the edge, which is defined by $D_k$, including biotransformation, adduct, isotope and fragment (supplementary table , ). the magnitude of the scores reflects the empirical confidence in the annotation type when certain atom differences occur, and can be adjusted by the user.

$S_{type}(u, v, a_i, b_j, D_k) =$ , if $D_k \in$ biotransformation
$S_{type}(u, v, a_i, b_j, D_k) =$ . , if $D_k \in$ adduct
$S_{type}(u, v, a_i, b_j, D_k) =$ , if $D_k \in$ isotope
$S_{type}(u, v, a_i, b_j, D_k) =$ . , if $D_k \in$ fragment ( )

(k) for each $D_k \in$ isotope, $S_{isotope\_intensity}(u, v, a_i, b_j, D_k)$ is defined for isotope edge (u, v) where $b_j$ is the isotopic derivative of $a_i$ with atom difference $D_k$, and is a negative score if the measured isotope peaks deviate from the expected natural abundance. the score for an isotope edge depends on how likely the ratio of measured to expected isotopic intensity ($\mathrm{ratio}_{isotope}$) is to be observed in an empirical normal distribution n , σ . isotopes of all elements included in the atom difference table are evaluated.

$\mathrm{ratio}_{isotope}$ = / ( , , ) ( )

$S_{isotope\_intensity}(u, v, a_i, b_j, D_k)$ = log 𝜇 = ratio n , σ 𝜇 = n , σ ( )
$\sigma_{isotope}$ is empirically defined as below, so that when the measured isotope intensity is close to the detection limit, a larger $\sigma_{isotope}$ (a widened distribution, which is more tolerant of discrepancy) is used.

$\sigma_{isotope}$ = . + ( ) ( )

a final edge annotation score $S(u, v, a_i, b_j, D_k)$ for choosing candidate formula $a_i$ for node u, candidate formula $b_j$ for node v and candidate atom difference $D_k$ for edge (u, v) is calculated by summing the scores in (h)-(k), if other less common abiotic connection types are not considered (see next section).

$S(u, v, a_i, b_j, D_k) = S_{ms\_similarity}(u, v, a_i, b_j, D_k) + S_{co\_elution}(u, v, a_i, b_j, D_k) + S_{type}(u, v, a_i, b_j, D_k) + S_{isotope\_intensity}(u, v, a_i, b_j, D_k)$ ( )

v. additional abiotic edge types

lc-ms metabolomics data may include additional abiotic relationships. in orbitrap data, these include oligomers, multi-charge species, heterodimers, in-source fragments of known or unknown metabolites, and ringing artifact peaks surrounding high-intensity ions. these relationships were included in netid as additional edge types, which are evaluated for all m/z pairs within a predefined rt range ( . min).

(l) oligomer and multi-charge species. an oligomer/multi-charge edge is assigned between two nodes u and v if their m/z values satisfy

$|v_{m/z} - n \times u_{m/z}| < u_{m/z} \times$ ppm, $n \in \{\text{positive integers}\}$ ( )

(m) heterodimer. a heterodimer peak (node v) may be observed when one abundant metabolite (node u) forms an ion cluster with another ion species (node t). we examine nodes that have intensity above , and assign a heterodimer edge between two nodes u and v if their m/z difference satisfies

$|(v_{m/z} - u_{m/z}) - t_{m/z}| < u_{m/z} \times$ ppm ( )

(n) in-source fragments. fragmentation peaks may be observed when one abundant metabolite breaks up into fragments during the ionization process. database ms of known metabolites can be used to identify known ion fragmentation peaks.
if candidate annotation $b_j$ of node v is annotated with an hmdb id associated with a database ms spectrum, and the m/z of node u matches a fragment m/z in $b_j$’s ms spectrum, then a database fragment edge connects the two nodes. that is,

$u_{m/z} \in$ database ms spectrum of candidate annotation $b_j$ of node v ( )

measured ms spectra can be used to identify unknown ion fragmentation peaks. if node v is associated with a measured ms spectrum, and the m/z of another node u matches a fragment m/z in that spectrum, then an experiment fragment edge connects the two nodes. that is,

$u_{m/z} \in$ measured ms spectrum of node v ( )

(o) ringing artifacts. ringing peaks are artifact peaks (node v) often observed on both sides of the m/z of an intense ion peak (node u) in fourier-transform ms instruments, including orbitrap. we examine nodes that have intensity above , and assign a ringing artifact edge between two nodes if they satisfy

ppm < $|v_{m/z} - u_{m/z}| \,/\, u_{m/z}$ < ppm 𝑢 / 𝑣 > ( )

scoring of these additional abiotic edges follows the same rules described in the “scoring edge annotations” section, with additional $S_{type}$ values defined as below.

$S_{type}(u, v, a_i, b_j, D_k) =$ . , if $D_k \in$ oligomer or multi-charge
$S_{type}(u, v, a_i, b_j, D_k) =$ , if $D_k \in$ heterodimer
$S_{type}(u, v, a_i, b_j, D_k) =$ . , if $D_k \in$ database ms fragment
$S_{type}(u, v, a_i, b_j, D_k) =$ , if $D_k \in$ measured ms fragment
$S_{type}(u, v, a_i, b_j, D_k) =$ , if $D_k \in$ ringing artifacts ( )

a final edge annotation score $S(u, v, a_i, b_j, D_k)$ for choosing candidate formula $a_i$ for node u, candidate formula $b_j$ for node v and candidate atom difference $D_k$ for edge (u, v) is calculated by summing the scores in (h)-(o).
$S(u, v, a_i, b_j, D_k) = S_{ms\_similarity}(u, v, a_i, b_j, D_k) + S_{co\_elution}(u, v, a_i, b_j, D_k) + S_{type}(u, v, a_i, b_j, D_k) + S_{isotope\_intensity}(u, v, a_i, b_j, D_k)$ ( )

vi. global network optimization using linear programming

using the scores assigned to each candidate node and edge annotation, our goal is to find annotations for each node so as to maximize the sum of the scores across the network, under the constraints that each node is assigned a single annotation and that the network annotation is consistent. we use linear programming to solve this optimization problem optimally, as described next.

for each node u and each of its candidate formulae $a_i$, we define a binary node decision variable $x_{u,a_i}$ to denote whether candidate formula $a_i$ is selected as the annotation for node u. that is,

$x_{u,a_i} = 1$, if node u is annotated with formula $a_i$; otherwise, $x_{u,a_i} = 0$ ( )

we define a binary decision variable $c_{u,v,a_i,b_j,D_k}$ to denote whether candidate formulae $a_i$ and $b_j$ are chosen for nodes u and v, and the candidate atom difference $D_k$ corresponds to the formula difference of candidate formulae $a_i$ and $b_j$ of the connected nodes u and v. that is,

$c_{u,v,a_i,b_j,D_k} = 1$, if $a_i$ and $b_j$ are chosen for nodes u and v respectively, and $a_i - b_j = D_k$; otherwise, $c_{u,v,a_i,b_j,D_k} = 0$ ( )

we constrain the optimization so that each node has a single annotation, and an edge exists if and only if the atom difference of that edge annotation matches the formula difference of the nodes. as a result, the node and edge binary variables should satisfy

$\sum_i x_{u,a_i} = 1$ ( )

$c_{u,v,a_i,b_j,D_k} \leq x_{u,a_i}$, $c_{u,v,a_i,b_j,D_k} \leq x_{v,b_j}$ ( )

$c_{u,v,a_i,b_j,D_k} \geq x_{u,a_i} + x_{v,b_j} - 1$ ( )

for all variables defined above, we add the constraint that they are either 0 or 1.
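the effect of these constraints, together with the goal stated at the start of this section (maximize the summed node and edge scores), can be seen on a toy network. the python sketch below does not use linear programming or cplex: it simply enumerates every way of giving each node exactly one candidate formula, and counts an edge score only when both endpoint formulae of that edge annotation are selected, which is what the three inequalities enforce for binary variables. the function name `optimize`, the formula labels and all scores are invented for illustration.

```python
from itertools import product


def optimize(node_scores, edge_scores):
    """Exhaustively search for the highest-scoring consistent annotation.

    node_scores: {node: {formula: score}}
    edge_scores: {(u, v, formula_u, formula_v): score}
    """
    nodes = list(node_scores)
    best_total, best_choice = float("-inf"), None
    # One formula per node: the "sum of x equals 1" constraint.
    for combo in product(*(node_scores[n] for n in nodes)):
        choice = dict(zip(nodes, combo))
        total = sum(node_scores[n][choice[n]] for n in nodes)
        # An edge annotation is active only if both of its formulae are chosen,
        # i.e. c = 1 exactly when both corresponding x variables are 1.
        for (u, v, a, b), score in edge_scores.items():
            if choice[u] == a and choice[v] == b:
                total += score
        if total > best_total:
            best_total, best_choice = total, choice
    return best_choice, best_total


# Toy network: node C has two mutually incompatible candidate formulae.
node_scores = {
    "A": {"f1": 1.0},
    "B": {"f2": 1.0},
    "C": {"f3": 0.6, "f4": 0.5},
}
edge_scores = {
    ("A", "C", "f1", "f3"): 0.2,
    ("B", "C", "f2", "f4"): 1.0,
}
best_choice, best_total = optimize(node_scores, edge_scores)
print(best_choice, best_total)
```

here formula f3 has the higher node score for c, yet the network-wide optimum selects f4 because the edge it enables contributes more. exhaustive search grows exponentially with the number of nodes, which is why a dedicated solver is needed for real datasets.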
with each candidate node and edge annotation scored, the objective of the optimization is to find values for all variables $x_{u,a}$ and $c_{u,v,a,b,D}$ that maximize the sum of all node scores and edge scores in the network while satisfying the constraints.

maximize: $\sum x_{u,a} \times S(u, a) + \sum c_{u,v,a,b,D} \times S(u, v, a, b, D)$ ( )

the optimization result provides a string of binary numbers that denote whether each candidate node or edge annotation is selected for the globally optimal network. ibm ilog cplex optimization studio (version . . or later) is used to solve the linear programming problem. a cplexapi package for r is used to call the cplex optimization function in an r environment. for the yeast datasets and using the above scoring parameters, optimization finishes within an hour on a standard laptop. depending on the number of peaks in the data tables, the entries in the atom difference tables, and the parameters involved in scoring, runtimes during internal testing ranged from minutes to h.

code availability

netid was developed mainly in r, and used a mixture of ibm ilog cplex optimization studio, matlab and python. netid code is available for non-commercial use on github at https://github.com/lichenpu/netid, under the gnu general public license v . . a shinyr app is provided to visualize the network results from netid in a local environment, along with a detailed user guide and example files (supplementary note , supplementary data ).

acknowledgement

this work was supported by a department of energy (doe) grant (no.
de-sc to j.d.r.), the center for advanced bioenergy and bioproducts innovation (grant no. de-sc , subcontract to j.d.r.) and nih grant r ca to w.l. m.r.m. is funded by the howard hughes medical institute and burroughs wellcome fund via the pdep and hanna h. gray fellows programs. we thank istvan pelczer at the nmr facility of the department of chemistry, princeton university for the nmr analysis, and x. su for scientific discussion and help. the center for advanced bioenergy and bioproducts innovation and the center for bioenergy innovation are both u.s. department of energy bioenergy research centers supported by the office of biological and environmental research in the doe office of science. any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the u.s. department of energy.

competing interests

the authors declare no competing interests.

author contributions

l.c., m.s. and j.d.r. conceived the project. l.c. developed the netid algorithm. w.l., l.w., x.z., a.c. and m.m. performed experiments on mouse. l.w., w.l. and l.c. performed experiments on yeast. l.c., w.l., l.w. and x.x. analyzed lc-ms and lc-ms/ms data. x.t., a.m. and y.s. contributed to coding development. b.k., a.m.l., and s.r.c. provided chemical synthesis of taurine-related compounds. l.c. and j.d.r. wrote the manuscript. all authors discussed the results and commented on the manuscript.

figure legends

figure . a global network optimization approach for untargeted metabolomics data annotation (netid).
the input data are lc-ms peaks with m/z, retention times, intensities and optional ms spectra. the output is a molecular network with peaks (nodes) assigned unique formulae and connected by edges reflecting atom differences arising either through enzymatic reaction (biochemical connection) or mass spectrometry phenomenon (abiotic connection). peaks are classified as “metabolite” (m+h or m-h peak of formula found in hmdb), “putative metabolite” (formula not found in hmdb but with biochemical connection to a metabolite), or “artifact” (only abiotic connection to a metabolite). the netid algorithm involves three steps. initial annotation first matches peaks to hmdb formulae. these seed annotations are then extended through edges to cover most nodes, with the majority of nodes receiving multiple formula annotations. each node and edge annotation is then scored based on match to known masses, retention times, and ms/ms fragmentation patterns. global network optimization maximizes the sum of node scores and edge scores, while enforcing a unique formula for each node and a unique transformation relationship for each edge.

figure . utility of global network optimization. (a) an example network demonstrating the value of the global optimization step in netid. node a and node b match hmdb formulae and are connected by an edge of phosphate (hpo ). node c can be connected to either node a or node b through mutually incompatible annotations, resulting in two different candidate networks. the table below the two candidate networks shows the annotations and scoring criteria for each, with the left network preferred because it has more good node and edge annotations. (b) visualization of the optimal network obtained from negative mode lc-ms analysis of baker’s yeast, containing nodes and connections. metabolite and putative metabolite peaks are in green and artifact peaks in purple. (c) summary table of netid annotations of negative and positive mode lc-ms data from baker's yeast and mouse liver.
figure . netid reveals thiamine-derived metabolites in yeast. (a) subnetwork surrounding thiamine. nodes, connections, and formulae are direct output of netid. boxes with structures were manually added. (b) ms spectra of thiamine, thiamine+c h o, and thiamine+c h o, with proposed structures of the major fragments. (c) labeling fraction of thiamine and its derivatives, in [u- c]glucose with and without unlabeled thiamine in the medium. (d) the thiamine derivatives are also found in mouse tissues and urine. (e) proposed mechanism for formation of thiamine+c h o. pyruvate dehydrogenase (pdh) decarboxylates pyruvate, and adds the resulting [c h o] unit (in red) to thiamine. (f) the same enzymatic mechanism occurs in oxoglutarate dehydrogenase (ogdh) and branched-chain α-ketoacid dehydrogenase complex (bckdc), and generates thiamine+c h o and thiamine+c h o respectively.

figure . netid discovers mammalian taurine derivatives. (a) subnetwork surrounding taurine from mouse liver extract data. nodes, connections, and formulae are direct output of netid. boxes with structures were manually added. (b) lc-ms chromatogram of n-glucosyl-taurine standard and the putative glucosyl-taurine from liver extract. (c) ms spectrum of glucosyl-taurine peak from liver extract (top), and synthetic n-glucosyl-taurine standard (bottom). (d) isotope labeling pattern of putative glucosyl-taurine in mice, infused via jugular vein catheter for h with [u- c]glucose. (e) absolute n-glucosyl-taurine concentration in murine serum and tissues.

figure .
netid applies global optimization for metabolomics data annotation and metabolite discovery.

reference

dinardo, c. d. et al. durable remissions with ivosidenib in idh -mutated relapsed or refractory aml. n. engl. j. med. , – ( ).
dang, l. et al. cancer-associated idh mutations produce -hydroxyglutarate. nature , ( ).
doroghazi, j. r. et al. a roadmap for natural product discovery based on large-scale genomics and metabolomics. nature chemical biology , – ( ).
aron, a. t. et al. reproducible molecular networking of untargeted mass spectrometry data using gnps. nature protocols , – ( ).
johnson, c. h., ivanisevic, j. & siuzdak, g. metabolomics: beyond biomarkers and towards mechanisms. nature reviews molecular cell biology , – ( ).
guijas, c. et al. metlin: a technology platform for identifying knowns and unknowns. anal. chem. , – ( ).
wishart, d. s. et al. hmdb . : the human metabolome database for . nucleic acids res , d –d ( ).
tsugawa, h. et al. hydrogen rearrangement rules: computational ms/ms fragmentation and structure elucidation using ms-finder software. anal. chem. , – ( ).
kanehisa, m., sato, y., kawashima, m., furumichi, m. & tanabe, m. kegg as a reference resource for gene and protein annotation. nucleic acids res , d –d ( ).
kim, s. et al. pubchem update: improved access to chemical data. nucleic acids res , d –d ( ).
hastings, j. et al. chebi in : improved services and an expanding collection of metabolites. nucleic acids res , d –d ( ).
sherena.johnson@nist.gov. nist standard reference database a. nist https://www.nist.gov/srd/nist-standard-reference-database- a ( ).
tautenhahn, r., patti, g. j., rinehart, d. & siuzdak, g. xcms online: a web-based platform to process untargeted metabolomic data. anal. chem. , – ( ).
forsberg, e. m. et al. data processing, multi-omic pathway mapping, and metabolite activity analysis using xcms online. nature protocols , – ( ).
wang, m. et al. sharing and community curation of mass spectrometry data with global natural products social molecular networking. nature biotechnology , – ( ).
tsugawa, h. et al. a cheminformatics approach to characterize metabolomes in stable-isotope-labeled organisms. nature methods , ( ).
pluskal, t., castillo, s., villar-briones, a. & orešič, m. mzmine : modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. bmc bioinformatics , ( ).
kuhl, c., tautenhahn, r., böttcher, c., larson, t. r. & neumann, s. camera: an integrated strategy for compound spectra extraction and annotation of liquid chromatography/mass spectrometry data sets. anal. chem. , – ( ).
sindelar, m. & patti, g. j. chemical discovery in the era of metabolomics. j. am. chem. soc. , – ( ).
wang, l. et al. peak annotation and verification engine for untargeted lc–ms metabolomics. anal. chem. , – ( ).
schmid, r. et al. ion identity molecular networking in the gnps environment. http://biorxiv.org/lookup/doi/ . / . . . ( ) doi: . / . . . .
nothias, l.-f. et al. feature-based molecular networking in the gnps analysis environment. nat methods , – ( ).
senan, o. et al. cliquems: a computational tool for annotating in-source metabolite ions from lc-ms untargeted metabolomics data based on a coelution similarity network.
shen, x. et al. metabolic reaction network-based recursive metabolite annotation for untargeted metabolomics. nature communications , ( ).
alden, n. et al. biologically consistent annotation of metabolomics data. anal. chem. , – ( ).
del carratore, f. et al. integrated probabilistic annotation: a bayesian-based annotation method for metabolomic profiles integrating biochemical connections, isotope patterns, and adduct relationships. anal. chem. ( ) doi: . /acs.analchem. b .
kind, t. & fiehn, o. seven golden rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. bmc bioinformatics , ( ).
dührkop, k. et al. systematic classification of unknown metabolites using high-resolution fragmentation mass spectra. nat biotechnol ( ) doi: . /s - - - .
bonini, p., kind, t., tsugawa, h., barupal, d. k. & fiehn, o. retip: retention time prediction for compound annotation in untargeted metabolomics. anal. chem. , – ( ).
xu, y.-f. et al.
discovery and functional characterization of a yeast sugar alcohol phosphatase. acs chem. biol. , – ( ).
hui, s. et al. glucose feeds the tca cycle via circulating lactate. nature , – ( ).
lu, w. et al. improved annotation of untargeted metabolomics data through buffer modifications that shift adduct mass and intensity. anal. chem. , – ( ).
cho, h. j., you, j. s., chang, k. j., kim, k. s. & kim, s. h. anti-adipogenic effect of taurine-carbohydrate derivatives. bulletin of the korean chemical society , – ( ).
robinson, p. t., pham, t. n. & uhrıń, d. in phase selective excitation of overlapping multiplets by gradient-enhanced chemical shift selective filters. journal of magnetic resonance , – ( ).
chambers, m. c. et al. a cross-platform toolkit for mass spectrometry and proteomics. nat biotechnol , – ( ).
xue, j. et al. enhanced in-source fragmentation annotation enables novel data independent acquisition and autonomous metlin molecular identification. anal. chem. , – ( ).
mitchell, j. m. et al. new methods to identify high peak density artifacts in fourier transform mass spectra and to mitigate their effects on high-throughput metabolomic data analysis. metabolomics , ( ).
the output is a molecular network with peaks (nodes) assigned unique formulae and connected by edges reflecting atom differences arising either through enzymatic reaction (biochemical connection) or mass spectrometry phenomenon (abiotic connection). peaks are classified as “metabolite” (m+h or m-h peak of formula found in hmdb), “putative metabolite” (formula not found in hmdb but with biochemical connection to a metabolite), or “artifact” (only abiotic connection to a metabolite). netid algorithm involves three steps. initial annotation first matches peaks to hmdb formulae. these seed annotations are then extended through edges to cover most nodes, with the majority of nodes receiving multiple formula annotations. each node and edge annotation are then scored based on match to known masses, retention times, and ms/ms fragmentation patterns. global network optimization maximizes sum of node scores and edge scores, while enforcing a unique formula for each node and unique transformation relationship for each edge. .cc-by-nc-nd . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / figure . utility of global network optimization. (a) an example network demonstrating the value of the global optimization step in netid. node a and node b match hmdb formulae and are connected by an edge of phosphate (hpo ). node c can be connected to either node a or node b through mutually incompatible annotations, resulting in two different candidate networks. the table below the two candidate networks shows the annotations and scoring criteria for each, with the left network preferred for more good node and edge annotations. 
(b) visualization of the optimal network obtained from negative mode lc-ms analysis of baker's yeast, containing nodes and connections. metabolite and putative metabolite peaks are in green and artifact peaks in purple. (c) summary table of netid annotations of negative and positive mode lc-ms data from baker's yeast and mouse liver.

figure . netid reveals thiamine-derived metabolites in yeast. (a) subnetwork surrounding thiamine. nodes, connections, and formulae are direct output of netid. boxes with structures were manually added. (b) ms spectra of thiamine, thiamine+c h o, and thiamine+c h o, with proposed structures of the major fragments. (c) labeling fraction of thiamine and its derivatives, in [u- c]glucose with and without unlabeled thiamine in the medium. (d) the thiamine derivatives are also found in mouse tissues and urine. (e) proposed mechanism for formation of thiamine+c h o. pyruvate dehydrogenase (pdh) decarboxylates pyruvate, and adds the resulting [c h o] unit (in red) to thiamine. (f) the same enzymatic mechanism occurs in oxoglutarate dehydrogenase (ogdh) and branched-chain α-ketoacid dehydrogenase complex (bckdc), and generates thiamine+c h o and thiamine+c h o respectively.
figure . netid discovers mammalian taurine derivatives. (a) subnetwork surrounding taurine from mouse liver extract data. nodes, connections, and formulae are direct output of netid. boxes with structures were manually added. (b) lc-ms chromatogram of n-glucosyl-taurine standard and the putative glucosyl-taurine from liver extract. (c) ms spectrum of glucosyl-taurine peak from liver extract (top), and synthetic n-glucosyl-taurine standard (bottom). (d) isotope labeling pattern of putative glucosyl-taurine in mice, infused via jugular vein catheter for h with [u- c]glucose. (e) absolute n-glucosyl-taurine concentration in murine serum and tissues.

figure . netid applies global optimization for metabolomics data annotation and metabolite discovery.

deepstrain: a deep learning workflow for the automated characterization of cardiac mechanics

manuel a. morales, maaike van den boomen, christopher nguyen, jayashree kalpathy-cramer, bruce r. rosen, collin m.
stultz, david izquierdo-garcia*, and ciprian catana*

abstract: myocardial strain analysis from cinematic magnetic resonance imaging (cine-mri) data could provide a more thorough characterization of cardiac mechanics than volumetric parameters such as left-ventricular ejection fraction, but sources of variation including segmentation and motion estimation have limited its wide clinical use. we designed and validated a deep learning (dl) workflow to generate both volumetric parameters and strain measures from cine-mri data, including strain rate (sr) and regional strain polar maps, consisting of segmentation and motion estimation convolutional neural networks developed and trained using healthy and cardiovascular disease (cvd) subjects (n= ). dl-based volumetric parameters were correlated (> . ) with, and showed no significant bias relative to, parameters derived from manual segmentations in healthy and cvd subjects. compared to landmarks manually-tracked on tagging-mri images from healthy subjects, landmark deformation using dl-based motion estimates from paired cine-mri data resulted in an end-point-error of . ± . mm. measures of end-systolic global strain from these cine-mri data showed no significant biases relative to a tagging-mri reference method. on healthy subjects, the intraclass correlation coefficient for intra-scanner repeatability was excellent (> . ) for strain, moderate to excellent for sr ( . - . ), and good to excellent ( . - . ) in most polar map segments. absolute relative change was within ~ % for strain, within ~ % for sr, and < % in half of polar map segments. in conclusion, we developed and evaluated a dl-based, end-to-end fully-automatic workflow for global and regional myocardial strain analysis to quantitatively characterize cardiac mechanics of healthy and cvd subjects based on ubiquitously acquired cine-mri data.

index terms: cardiac cine-mri, deep learning, motion estimation, myocardial strain, segmentation.

submitted for review on dec , .
this work was supported in part by the u.s. national cancer institute under grant r ca - a . (asterisk indicates d. izquierdo-garcia and c. catana contributed equally to this work). (corresponding authors: d. izquierdo-garcia; c. catana). m.a. morales, d. izquierdo-garcia and b.r. rosen, with athinoula a. martinos center for biomedical imaging, mgh, hms, th st, boston, ma (email: moralesq@mit.edu; davidizq@nmr.mgh.harvard.edu; brrosen@mgh.harvard.edu) and with harvard-mit health science and technology, massachusetts ave, cambridge, ma, . m.v.d. boomen and c. nguyen, with cardiovascular research center and martinos center for biomedical imaging, mgh, hms, th st, boston, ma , with department of radiology, and m.v.d. boomen also with university medical center groningen, gz groningen (email: mvandenboomen@mgh.harvard.edu; christopher.nguyen@mgh.havard.edu). c.m. stultz, with electrical engineering and computer science, with harvard-mit health science and technology, massachusetts ave, cambridge, ma, , and with division of cardiology, mgh, fruit st, boston, ma, (cmstultz@mit.edu). j. kalpathy-cramer, and ciprian catana, with athinoula a. martinos center for biomedical imaging, mgh, hms, th st, boston, ma (jkalpathy-cramer@mgh.harvard.edu; ccatana@mgh.harvard.edu).

i. introduction

cardiac mechanics reflects the precise interplay between myocardial architecture and loading conditions that is essential for sustaining the blood pumping function of the heart. the ejection fraction (ef) is often used as a left-ventricular (lv) functional index, but its value is limited when mechanical impairment occurs without an ef reduction [ ]. alternatively, tissue tracking approaches for strain analysis provide a more thorough characterization through non-invasive evaluation of myocardial deformation from echocardiography or cinematic magnetic resonance imaging (cine-mri) data [ ], and could be used to identify dysfunction before ef is reduced [ ].
unfortunately, various sources of discrepancies have limited the wide clinical applicability of these techniques, including factors related to imaging modality, algorithm, and operator [ ]. more accurate measures could be obtained from tagging-mri data widely regarded as the reference standard for strain quantification [ ], [ ], but collection of these data requires highly specialized and complex sequences that have mainly remained research tools, whereas echocardiography and cine-mri data are ubiquitously acquired in clinical practice. irrespective of algorithm or modality, e.g., speckle tracking for echocardiography or feature tracking for cine-mri, the main challenge is to estimate motion within regions along the myocardial wall [ ]. operator-related discrepancies are introduced when the myocardial wall borders are delineated manually, a time-consuming process that requires considerable expertise and results in significant inter- and intra-observer variability [ ], [ ]. automatic delineation approaches have been implemented within computational pipelines [ ], but other factors related to motion tracking algorithms also influence strain assessment, including the appropriate selection of tuneable parameters whose optimal values can differ between patient cohorts and acquisition protocols (e.g., the size of the search region in block-matching methods [ ]). further, these algorithms often make assumptions about the properties of the myocardial tissue (e.g., incompressible and elastic [ ], [ ]), or use registration methods to drive the solution towards an expected geometry.
however, recent evidence has shown the validity of these assumptions varies between healthy and diseased myocardium [ ], [ ], suggesting these approaches may not accurately reflect the underlying biomechanical motion [ ]. lastly, modality-related image quality could complicate interpretation of abnormal strain values since these could reflect either real dysfunction or artifact-related inaccuracies, leading to some degree of subjectivity or non-conclusive results [ ].

deep learning (dl) methods have demonstrated the advantage of allowing real-world data to guide the learning of abstract representations that can be used to accomplish pre-specified tasks, and have been shown to be more robust to image artifacts than non-learning techniques for some applications [ ], [ ]. dl segmentation methods have been proposed [ ]–[ ] and implemented within strain computational pipelines [ ], [ ], and recent studies have shown that cardiac motion estimation can also be recast as a learnable problem [ ]–[ ]. these methods usually consist of an intensity-based loss function and a constraint term [ ], [ ], the latter using common machine learning techniques (e.g., l regularization of all learnable parameters [ ]) or direct regularization of the motion estimates (e.g., smoothness penalty [ ], anatomy-aware [ ]). however, because ground-truth cardiac motion is challenging to acquire, whether these constraints improve the accuracy of motion or strain estimates is not yet clear. further, the added value of dl-based regional strain estimation has not been demonstrated. we have recently developed a learning method for cardiac motion estimation that produces more accurate estimates than various techniques, including b-spline, diffeomorphic, and mass-preserving algorithms [ ], and showed these estimates could potentially be used to detect regional dysfunction.
thus, incorporating our method within a strain analysis framework could potentially enable accurate, user-independent, and quantitative characterization of cardiac mechanics at both the global and regional levels. once trained, such a method would not necessitate further parameter tuning or optimization, which is time-consuming for larger datasets and daily clinical practice. while this framework could be based on echocardiography images [ ], these data remain limited for strain mapping tasks by their low reproducibility of acquisition planes [ ] and temporal stability of tracking patterns [ ]. in contrast, cine-mri offers the most accurate and reproducible assessment of cardiac anatomy and function, thus providing a more thorough set of data for learning-based motion models.

we propose deepstrain, an automated workflow that derives global and regional strain measures from cine-mri data by decoupling motion estimation and segmentation tasks. after verifying the effects of smoothing and anatomical regularizers on motion and strain, convolutional neural networks for pre-processing (i.e., centering and cropping), segmentation, and motion estimation were implemented, trained, validated, and compared to state-of-the-art methods. finally, accuracy of strain values was assessed using a tagging-mri algorithm as reference standard, intra-scanner repeatability was measured using subjects with repeated scans, and potential clinical applications of global and regional myocardial strain measures were demonstrated on patient populations.

ii. method

a. datasets for development

we used the automated cardiac diagnosis challenge (acdc) dataset [ ], consisting of cine-mri data from subjects evenly divided into five groups: healthy and patients with hypertrophic cardiomyopathy (hcm), abnormal right ventricle (arv), myocardial infarction with reduced ejection fraction (mi), and dilated cardiomyopathy (dcm).
these data were publicly available as train (n= ) and test (n= ) sets, with manual segmentations included for the train set only. for validation of motion and strain measures we used the cardiac motion analysis challenge (cmac) dataset [ ], consisting of paired tagging- and cine-mri data from healthy subjects. to assess intra-scanner repeatability, four healthy volunteers were recruited to undergo repeated scans on a t mri scanner. all cine-mri frames and corresponding segmentations were resampled to a × × volume grid with . mm × . mm in-plane resolution and variable slice thickness ( - mm). see supplementary section s for acquisition protocol.

b. myocardial strain definitions

strain represents percent change in myocardial length per unit length. the three-dimensional (3d) analog for mri is given by the lagrange strain tensor

$$\boldsymbol{\epsilon}_t = \frac{1}{2}\left( \nabla \boldsymbol{u}_t + \nabla \boldsymbol{u}_t^{\mathsf{T}} + \nabla \boldsymbol{u}_t^{\mathsf{T}} \nabla \boldsymbol{u}_t \right), \quad ( )$$

where $\boldsymbol{u}_t$ denotes myocardial displacement from a fully-relaxed end-diastolic phase at $t = 0$, to a contracted frame at $t > 0$. radial and circumferential strain are the diagonal components of the tensor $\boldsymbol{\epsilon}$ evaluated in cylindrical coordinates. strain rate (sr) is the time derivative of ( ). global strain is defined as the average of $\boldsymbol{\epsilon}$ over the whole lv myocardium (lvm) volume. regional strain is defined as the average of $\boldsymbol{\epsilon}$ over the volume of specific lvm segments defined by the american heart association (aha) polar map [ ], which requires labels of the right ventricle to construct. specific parameters based on timing and magnitude are extracted from the measures evaluated over a whole cardiac cycle: end-systolic strain (ess), defined as the global strain value at end-systole; systolic strain rate (srs), defined as the peak (i.e., maximum) absolute value of global sr during systole; early-diastolic strain rate (sre), defined as the peak absolute value of global sr during diastole.
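the strain definition above can be made concrete with a short numerical sketch. the code below is only an illustration under stated assumptions, not the authors' implementation: it assumes the displacement field is stored as a numpy array of shape (3, x, y, z), approximates the displacement gradient with finite differences via `numpy.gradient`, and stays in cartesian coordinates (the conversion to radial and circumferential components in cylindrical coordinates is not shown). the function names `lagrange_strain` and `global_strain` are hypothetical.

```python
import numpy as np


def lagrange_strain(u, spacing=(1.0, 1.0, 1.0)):
    """Lagrange strain tensor 0.5 * (G + G^T + G^T G) per voxel.

    u       : array of shape (3, X, Y, Z) with the x, y, z displacement
              components (assumed layout, chosen for this sketch).
    spacing : voxel size along each axis, used by the finite differences.
    returns : array of shape (X, Y, Z, 3, 3).
    """
    # grad[..., i, j] approximates d u_i / d x_j at every voxel.
    grad = np.stack(
        [np.stack(np.gradient(u[i], *spacing), axis=-1) for i in range(3)],
        axis=-2,
    )
    grad_t = np.swapaxes(grad, -1, -2)
    return 0.5 * (grad + grad_t + grad_t @ grad)


def global_strain(strain, mask):
    """Average the tensor over a boolean myocardium mask of shape (X, Y, Z)."""
    return strain[mask].mean(axis=0)


# Usage: a uniform 10% stretch along x gives E_xx = 0.1 + 0.1**2 / 2.
x = np.arange(8, dtype=float)
u = np.zeros((3, 8, 8, 8))
u[0] = 0.1 * x[:, None, None]
strain = lagrange_strain(u)
mean_tensor = global_strain(strain, np.ones((8, 8, 8), dtype=bool))
```

the quadratic term is what separates the lagrange tensor from the small-strain approximation; for the deformations seen in a contracting myocardium it is generally not negligible.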
c. centering, segmentation, and motion estimation

deepstrain (fig. ) consists of a series of convolutional neural networks that perform three tasks: a ventricular centering network (vcn) for automated centering and cropping, a cardiac motion estimation network (carmen) to generate $\boldsymbol{u}$, and a cardiac segmentation network (carson) to generate tissue labels. estimates of $\boldsymbol{u}$ are used to calculate myocardial strain, and segmentations are used to derive volumetric parameters, identify a cardiac coordinate system for strain analysis, and generate tissue labels used for anatomical regularization of the motion estimates at training time. let $V_t$ be a cine-mri frame at time $t$ defined over a 3d spatial domain $\Omega \subset \mathbb{R}^3$. using a pair of frames $V_0, V_t$ as an input, vcn centers and crops the images around the center of mass of the lv, carson generates segmentations $M_0, M_t$ of the lv, rv, and lvm, and carmen estimates the motion $\boldsymbol{u}_t$ of the heart from $V_0$ to $V_t$. thus, for each voxel $p \in \Omega$, $\boldsymbol{u}_t(p)$ is an approximation of the myocardial displacement during contraction such that $V_0(p)$ and $(\boldsymbol{u}_t \circ V_t)(p)$ correspond to similar cardiac regions. the operator $\circ$ refers to application of a spatial transform to $V_t$ using $\boldsymbol{u}_t$ via trilinear interpolation [ ].

1) architectures

all networks have a common encoder-decoder architecture consisting primarily of convolution, batch normalization [ ], and prelu [ ] layers with residual connections [ ] (see supplementary section s ). briefly, vcn is a 3d architecture that uses a single-channel array $V$ with size × × to generate a single-channel array $G_{\mathrm{pred}}$ of equal size, where $G_{\mathrm{pred}}$
corresponds to a gaussian distribution with mean defined as the lvm center of mass. $V$ is centered and cropped around the voxel with the highest value in $G_{\mathrm{pred}}$ to generate a new cropped array of size × × , which is then the input to the segmentation and motion estimation networks. carson is a two-dimensional (2d) architecture that uses images of size × to generate a -channel segmentation $M_{\mathrm{pred}}$ of equal size, each channel corresponding to a label. carmen uses a 2-channel input volume, consisting of two concatenated arrays with size × × , to generate a 3-channel array $\boldsymbol{u}$ of equal size. the channels of $\boldsymbol{u}$ represent the $x$, $y$ and $z$ components of motion.

fig. . overview of proposed deepstrain workflow. vcn centers and crops the input pair of cine-mri frames. tissue labels generated by carson are used to build an anatomical model. motion estimates derived from carmen are used to calculate strain measures, and these estimates are combined with the anatomical model to enable global and regional strain analyses.

2) loss functions

vcn was evaluated using the mean square error

$$\mathcal{L}_{\mathrm{MSE}}(G_{\mathrm{gt}}, G_{\mathrm{pred}}) = \frac{1}{|\Omega|} \sum_{p \in \Omega} \left( G_{\mathrm{gt}}(p) - G_{\mathrm{pred}}(p) \right)^2. \quad ( )$$

for carson, we implemented a multi-class dice coefficient function $\mathcal{L}_{\mathrm{seg}}(M_{\mathrm{gt}}, M_{\mathrm{pred}})$ [...] $\mathcal{L}_{\mathrm{intensity}}$ that evaluates carmen using the input volumes and generated motion estimates

$$\mathcal{L}_{\mathrm{intensity}}(V_0, V_t, \boldsymbol{u}_t) = \frac{1}{|\Omega|} \sum_{p \in \Omega} \left| V_0(p) - (\boldsymbol{u}_t \circ V_t)(p) \right|. \quad ( )$$

second, we used a supervised function $\mathcal{L}_{\mathrm{anatomical}}$ that leverages segmentations of the input volumes at training time to impose an anatomical constraint on the estimates

$$\mathcal{L}_{\mathrm{anatomical}}(M_0, M_t, \boldsymbol{u}_t) = \mathcal{L}_{\mathrm{seg}}(M_0, \boldsymbol{u}_t \circ M_t).$$
( ) third, smooth estimates were encouraged by using a diffusion regularizer

$$\mathcal{L}_{\mathrm{smoothness}}(\boldsymbol{u}_t) = \sum_{p \in \Omega} \left\| \nabla \boldsymbol{u}_t(p) \cdot dr \right\|^2, \quad ( )$$

where $dr$ is the spatial resolution of $V$ and accounts for differences between in-plane and slice resolution. thus, the loss function for carmen is a linear combination of ( ), ( ), and ( ), weighted by $\lambda_i$, $\lambda_a$, and $\lambda_s$, respectively. we conducted optimization experiments using synthetic data [ ], [ ] to assess the impact of smoothing and anatomical regularization on motion and strain estimates (supplementary section s ). these experiments showed that smoothness improves the accuracy of the direction of the motion vectors, and anatomical regularization improves the magnitude of the vectors relative to the ground-truth motion (see supplementary fig. s and s ). the optimal values $\lambda_i$ = . , $\lambda_a$ = . , $\lambda_s$ = . were used to train carmen.

3) training and testing

networks were trained in tensorflow ver. . with adam optimizer parameters beta , = . , . , batch size = ( for carmen), and epochs = ( for carmen). ground-truth distributions for vcn were created using the manual segmentations. vcn and carson were trained using the end-diastolic and end-systolic frames of the train set, as only these included ground-truth segmentations. this provided training samples for vcn and for carson, the latter having more samples since it is a 2d architecture and all frames were resampled to a volume with slices. vcn was tested by five-fold cross-validation, whereas the accuracy of carson was assessed by submitting the results to the challenge website. once carson was trained, we generated segmentations of the test set to train carmen using the entire acdc dataset. only the [end-diastolic, end-diastolic] and [end-diastolic, end-systolic] pairs were used. the former is essential for the network to adequately learn how to scale the motion vectors, i.e., motion should be exactly zero if the frames are equal.
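the three carmen loss terms described above can be sketched in plain numpy. this is a minimal illustration under several assumptions, not the tensorflow code used for training: the warped frame and warped segmentation are taken as precomputed inputs (the spatial transform is not implemented here), the dice term is averaged over label channels, the finite differences are multiplied by `dr` as the diffusion regularizer is written, and the weights `lam_i`, `lam_a`, `lam_s` must be supplied by the caller because their values are not recoverable from this copy. all function and argument names are hypothetical.

```python
import numpy as np


def dice_loss(m_ref, m_est, eps=1e-6):
    """1 - mean Dice over label channels; masks have shape (K, ...)."""
    axes = tuple(range(1, m_ref.ndim))
    overlap = (m_ref * m_est).sum(axis=axes)
    size = m_ref.sum(axis=axes) + m_est.sum(axis=axes)
    return 1.0 - np.mean((2.0 * overlap + eps) / (size + eps))


def intensity_loss(v0, vt_warped):
    """Mean absolute intensity difference between V_0 and the warped V_t."""
    return np.mean(np.abs(v0 - vt_warped))


def smoothness_loss(u, dr=(1.0, 1.0, 1.0)):
    """Sum of squared finite-difference gradients of u, scaled by dr.

    u has shape (3, X, Y, Z); scaling by dr follows the text literally
    and is an assumption about how the resolution enters.
    """
    total = 0.0
    for component in u:  # x, y, z displacement components
        for axis, step in enumerate(dr):
            total += np.sum((np.gradient(component, axis=axis) * step) ** 2)
    return total


def carmen_loss(v0, vt_warped, m0, mt_warped, u, lam_i, lam_a, lam_s,
                dr=(1.0, 1.0, 1.0)):
    """Weighted sum of the intensity, anatomical and smoothness terms."""
    return (lam_i * intensity_loss(v0, vt_warped)
            + lam_a * dice_loss(m0, mt_warped)
            + lam_s * smoothness_loss(u, dr))


# Usage: identical frames, identical masks and zero motion give zero loss,
# which mirrors the [end-diastolic, end-diastolic] training pairs.
frame = np.random.default_rng(0).random((4, 4, 4))
masks = np.zeros((2, 4, 4, 4))
masks[0, :2] = 1.0
masks[1, 2:] = 1.0
zero_motion = np.zeros((3, 4, 4, 4))
loss = carmen_loss(frame, frame, masks, masks, zero_motion, 1.0, 1.0, 1.0)
```

keeping the three terms as separate functions makes it easy to switch one off (weight zero) when probing how each regularizer affects the estimates.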
the entire cycle is analyzed at testing time by using sequential input pairs $[V_0, \cdot\,]$ that kept the end-diastolic frame constant while we varied $V_t$ for all time frames $t > 0$. using this approach $\boldsymbol{u}_t$ was derived for all times. data augmentation included random rotations and translations, random mirroring along the x and y axes, and gamma contrast correction. all data augmentation was performed only in the x-y plane.

d. evaluation

1) segmentation and motion estimation

carson and manual segmentations were compared using the hausdorff distance (hd) and dice similarity coefficient (dsc) metrics at both end-diastole and end-systole. accuracy of lv volumetric measures derived from segmentations, including end-diastolic volume (edv), ef, and lvm, was assessed using the correlation, bias, and standard deviation metrics. the mean absolute error (mae) for the lv edv and lvm were also computed for comparison against the intra- and inter-observer variability reported by [ ]. we compared our results to top-ranked methods published for the acdc test set as these appear in the leader-board of the challenge [ ]–[ ]. the cmac organizers defined landmarks at the intersection of gridded tagging lines at end-diastole on tagging images, one landmark $p_0$ per wall per ventricular level. these landmarks were manually-tracked by two observers over the cardiac cycle. conversion from tagging to cine coordinates was done using dicom header information. we used the carmen motion estimates $\boldsymbol{u}_t$ to automatically deform the landmarks at end-diastole, and the accuracy was assessed using the in-plane end-point error (epe) between deformed $\hat{p}_t = \boldsymbol{u}_t \circ p_0$ and manually-tracked $p_t$ landmarks, defined by

$$\mathrm{EPE}(p, \hat{p}) = \sqrt{(p_x - \hat{p}_x)^2 + (p_y - \hat{p}_y)^2}. \quad ( )$$

due to temporal misalignment between the tagging and cine acquisitions, epe was evaluated only at end-systole ($t = t_{es}$). specifically, let $p_{ij}(t)$ denote the manually-tracked landmarks of subject $i$ at frame $t$ by observer $j$.
the accuracy of carmen was assessed using the average epe

$$\mathrm{AEPE} = \frac{1}{2N} \sum_{i=1}^{N} \sum_{j=1}^{2} \mathrm{EPE}\left( p_{ij}(t_{es}),\; \boldsymbol{u}_i(t_{es}) \circ p_0 \right). \quad ( )$$

our results were compared to those reported by the four groups that responded to the challenge [ ]: mevis [ ], iucl [ ], upf [ ], and inria [ ], [ ]. all groups submitted tagging-based motion estimates, but only upf and inria provided estimates based on cine-mri.

2) strain validation and intra-scanner repeatability

the tagging-mri method with the lowest aepe was used as the reference for strain analysis. the tagging-mri-based motion estimates were registered and resampled to the cine-mri space. global strain and sr values throughout the entire cardiac cycle were derived from the resampled estimates as described in [ ]. global- and regional-based analyses were performed to assess the repeatability of measures from two acquisitions. relative changes (rc) and absolute relative changes (arc) were calculated, taking the first acquisition as the reference. ess and sr were calculated for the global-based analysis, and for region-based analyses, ess values were normalized using the aha polar map, and both rc and arc were evaluated for each of the segments in the polar map.

3) statistics

bland-altman analysis was used to quantify agreement between predicted and tagging strain measures. we used the term bias to denote the mean difference and the term precision to denote the standard deviation of the differences. differences were also assessed using a paired t-test with bonferroni correction for multiple comparisons. for global- and regional-based analyses of intra-scanner repeatability, icc estimates and their % confidence intervals (ci) were calculated based on a single-rating, absolute agreement, -way mixed-effects model. analyses were performed in python v . [ ].
iii. results

a. segmentation and motion estimation

centering, segmentation, and motion estimation for an entire cardiac cycle (~ frames) were accomplished in < s on a gb gpu and < . min on a gb ram cpu. vcn located the lv center of mass with a median error of . mm. correlation of carson and manual lv volumetric measures was > . across all measures (table ), and biases in ef (+ . ± . %), end-diastolic (+ . ± . ml) and end-systolic (+ . ± . ml) volumes, and mass (+ . ± . g) were not significant. further, these biases were smaller than those obtained with other methods, which were positive for lv edv ( . to . ml), negative for lvm (- . to - . g), and close to zero (± . %) for ef. simantiris et al. [ ] obtained the best precision for lv ef ( . vs. . % variance with carson), edv ( . vs. . mm), and lvm ( . vs. . g). isensee et al. [ ] obtained the best results on geometric metrics, i.e., lower hd for the lv (end-diastole . vs. . mm; end-systole . vs. . mm) and lvm ( . vs. . mm; . vs. . mm), and higher dsc for the lvm ( . vs. . ; . vs. . ). the dsc for the lv was similar for all methods (~ . , ~ . ). mae for the lv edv and lvm were . ± . ml and . ± . g. fig. a illustrates a representative example of the tagging and cine images from a cmac subject. landmarks defined at end-diastole were deformed to end-systole using the carmen estimates and compared to manual tracking. banding artifacts on cine images showed no clear effect on derived motion estimates or landmark deformation, as shown in end-systole (fig. a, yellow arrow) or throughout the whole cardiac cycle (see supplementary video). the manual tracking inter-observer variability was . mm (fig. b, dotted line).
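the landmark comparison reported here relies on the end-point error defined in the evaluation section. a minimal sketch of that metric follows, using only the python standard library. it assumes landmarks are (x, y) tuples already expressed in cine coordinates, and that the error for one subject and observer is the mean over that subject's landmarks before averaging across observers and subjects; the latter is an assumption, since the text does not spell out how landmarks within a subject are pooled. the names `epe` and `average_epe` are hypothetical.

```python
import math


def epe(p, p_hat):
    """In-plane end-point error between two (x, y) landmarks."""
    return math.hypot(p[0] - p_hat[0], p[1] - p_hat[1])


def average_epe(tracked, deformed):
    """Average EPE over subjects and observers at end-systole.

    tracked[i][j] : list of manually tracked (x, y) landmarks for
                    subject i and observer j.
    deformed[i]   : list of automatically deformed (x, y) landmarks for
                    subject i, in the same landmark order.
    """
    per_pair = []
    for subject_tracked, subject_deformed in zip(tracked, deformed):
        for observer_landmarks in subject_tracked:
            errors = [epe(p, q)
                      for p, q in zip(observer_landmarks, subject_deformed)]
            per_pair.append(sum(errors) / len(errors))
    return sum(per_pair) / len(per_pair)


# Usage: one subject, two observers, one landmark each.
tracked = [[[(0.0, 0.0)], [(6.0, 8.0)]]]
deformed = [[(3.0, 4.0)]]
result = average_epe(tracked, deformed)
```

because both observers are compared against the same deformed landmarks, the inter-observer variability sets a practical floor for this average.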
table i. state-of-the-art methods for left-ventricular segmentation shown at end-diastole (ed) and end-systole (es) on the acdc test set compared to proposed approach. red are the best results for each metric.

fig. . validation of motion and strain. (a) landmarks at end-diastole (unfilled green) are manually-tracked (green) and deformed with carmen to end-systole (red). yellow arrow indicates a banding artifact. (b) average end-point-error (aepe) was assessed and compared to other methods. (c) mevis- and deepstrain-based strain (top) and strain rate (sr, bottom) measures are compared. black arrow shows strain inaccuracies with mevis.

within cine-based techniques, carmen ( . ± . mm) and upf ( . ± . mm) had lower (p< . ) aepe relative to inria ( . ± . mm), but there was no significant difference between carmen and upf. all tagging-based methods had lower aepe compared to cine approaches, particularly mevis ( . ± . mm) and upf ( . ± . mm).

b. strain analysis

table shows the normal ranges (mean [ % ci]) of strain derived from cine-mri data for all healthy subjects, including subjects from the training, validation, and repeatability cohorts. deepstrain generated values with narrow ci for circumferential (~ %) and radial (~ %) ess, and circumferential (~ . s- ) and radial (~ . s- ) sr. specifically, circumferential and radial values across datasets were: - . % [- . - . ] and . % [ . . ] for ess, - . s- [- . - . ] and . s- [ . . ] for srs, and . s- [ . . ] and - . s- [- . - . ] for sre, respectively.
these values were similar to the tagging-based values, although circumferential sre from cine-mri data was lower, mostly in the train set ( . ± . s- ). comparison of tagging- and cine-based strain measures with matched subjects showed an overall agreement in timing and magnitude of strain and sr throughout the cardiac cycle, although tagging-based measures of radial ess diverged after early diastole (fig. c, black arrow), and there were visual differences in peak sr parameters. visual inspection of image artifacts on cine data showed no clear evidence that these artifacts affected strain values derived with deepstrain (see supplementary fig. s ). quantitative comparisons of tagging- and cine-based measures showed biases in circumferential ess (- . ± . vs. - . ± . %; bias - . ± . %), radial ess ( . ± . vs. . ± . %; + . ± . %) and sre (- . ± . vs. - . ± . ; - . ± . s- ) were small and not significantly different from zero (see supplementary fig. s ). however, there were larger differences (p< . ) in radial srs ( . ± . vs. . ± . s- ; . ± . s- ), and circumferential srs (- . ± . vs. - . ± . s- ; . ± . s- ) and sre ( . ± . vs. . ± . s- ; . ± . s- ).

representative strain measures of a single subject derived from two acquisitions are shown in fig. . the aha polar maps from both acquisitions showed comparable regional variations in ess, particularly for circumferential ess in the inferoseptal wall (fig. a, orange arrows). global curves throughout the entire cardiac cycle also showed visual agreement in both timing and magnitude (fig. b). from these data, circumferential (- . vs. - . %) and radial ( . vs. %) ess (fig. b, purple), circumferential srs ( . vs. . s- ) and sre (- . vs. - . s- ), and radial srs ( . vs. . s- ) and sre (- . vs. - . s- ) global parameters were also found to be similar (fig. b, yellow). in addition, while not quantified in this study, the late-diastolic filling peaks were also comparable (fig. b, blue).
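the agreement and repeatability summaries used in this section (bias, precision, limits of agreement, rc and arc) reduce to a few lines of arithmetic. the sketch below uses only the python standard library and rests on assumptions that should be checked against the supplementary material: rc is taken as the percent change of the second acquisition relative to the first, arc as its absolute value, and the limits of agreement as bias ± 1.96 × precision, which corresponds to a 95% interval for normally distributed differences. the function names are hypothetical.

```python
import statistics


def relative_changes(first, second):
    """Percent relative change (RC) and absolute relative change (ARC).

    The first acquisition is the reference. For negative references
    (e.g. circumferential strain) the sign of RC describes the change
    in magnitude, which is a convention chosen for this sketch.
    """
    rc = [100.0 * (b - a) / a for a, b in zip(first, second)]
    arc = [abs(value) for value in rc]
    return rc, arc


def bland_altman(reference, predicted):
    """Bias (mean difference), precision (SD) and limits of agreement."""
    diffs = [p - r for r, p in zip(reference, predicted)]
    bias = statistics.mean(diffs)
    precision = statistics.stdev(diffs)
    limits = (bias - 1.96 * precision, bias + 1.96 * precision)
    return bias, precision, limits


# Usage with made-up numbers (illustrative only, not study data).
rc, arc = relative_changes([10.0, -20.0], [11.0, -18.0])
bias, precision, limits = bland_altman([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

reporting arc alongside rc matters because signed changes can cancel across subjects, while the absolute values cannot.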
table shows the rc, arc, icc, and loa across subjects for the global parameters. the average arc was below % for ess (circumferential: . ± . %; radial: . ± . %), below % for srs ( . ± . %; . ± . %), and below % for sre parameters ( . ± . %; . ± . %). icc results showed repeatability was excellent for ess ( . ; . ), good for srs ( . ; . ), moderate for circumferential sre ( . ), and excellent for radial sre ( . ) values. the loa, which define the interval expected to contain % of the differences assuming normally distributed data, were ~ % and ~ % for circumferential and radial ess, and < . s- for all sr measures.

the ess, rc, and arc maps averaged across subjects are shown in fig. . visually, these maps (fig. b) showed the average rc and arc were marginal ( ~ %) in more than half of the polar map segments. specifically, values were marginal for circumferential ess (~ %) in the anterior, anteroseptal, and anterolateral walls, but were larger in the inferior region, most notably in the basal- and mid-inferoseptal segments ( %). for radial ess the largest changes were found in the mid-anterolateral segment ( %), whereas changes in the anteroseptal, inferior and inferolateral walls were very small (~ %). the rc and arc per subject are provided in boxplot form in supplementary fig s . these results showed that, in most of the segments, the rc and arc were less than ~ %, although larger differences were noted in the inferoseptal wall for radial ess, and in the anterolateral wall for circumferential ess. supplementary table s shows the icc and loa per segment, including the whole-map average. for radial ess, the icc results showed excellent repeatability across all segments. circumferentially, all segments showed good to excellent repeatability, except for the basal- and mid-inferolateral segments. loas showed that % of differences occurred within ~ % and ~ % intervals for circumferential and radial ess.

c.
evaluation in patients with cardiovascular disease
regional measures of ess averaged over patient population (see supplementary figure s ), as well as global values of strain and sr across the cardiac cycle (fig. ) for all subjects in the acdc train set showed progressive decline in strain values starting with hcm, followed by arv, mi, and dcm. specifically, relative to the healthy group, radial ess was reduced in all patient populations. radial systolic and early-diastolic sr were also reduced in all patient groups, except for systolic sr in hcm. fig. shows both the cine-mri image and the circumferential ess polar map of a healthy subject and two patients with mi. strain values in the healthy polar map have a homogeneous distribution. in contrast, in one mi patient the map indicates a diffuse reduction, and inspection of the myocardium on the cine-mri image shows an anteroseptal infarct that coincides in location with segments with more prominent decreases in strain. in a different mi patient with an infarct located in a similar septal region, strain changes are focal and localized to the anteroseptal wall.
table ii normal ranges of strain with deepstrain in healthy subjects. tagging-based measures are shown for the cmac cohort. deepstrain repeatability is shown for two acquisitions (acq).
table iii intra-scanner repeatability of global circumferential (circ) and radial (rad) strain measures.
iv. discussion
learning-based methodologies have the potential to meet the technical challenges associated with myocardial strain analysis. 
in this study we developed a fast dl framework for strain analysis based on cine-mri data that does not make assumptions about the underlying physiology, and we benchmarked its segmentation, motion, and strain estimation components against the state-of-the-art. we compared our segmentations to other dl methods, motion estimates to other non-learning techniques, and strain measures to a reference tagging-mri technique. we also presented the intra-scanner repeatability of deepstrain-based global and regional strain measures, and showed that these measures were robust to image artifacts in some cases. global and regional applications were also presented to demonstrate the potential clinical utilization of our approach. a. volumetric measures segmentation from mri data is a task particularly well suited for convolutional networks given the excellent soft-tissue contrast, thus all top performing methods on the acdc test set were based on dl approaches (table ). isensee et al. [ ] had remarkable success on geometric metrics, but this and other approaches result in a systematic overestimation of the lv edv and thus underestimation of lvm. in contrast, carson generated less biased measures of lv volumes and mass, which were not significant. although simantiris et al. [ ] obtained the most precise measures, possibly due to their extensive use of augmentation using image intensity transformations, across methods the precision of ef was within the ~ - % [ ] needed when it is used as an index of lv function in clinical trials [ ]. lastly, we showed that the error in our measures of lv edv and lvm was almost half the inter-observer (~ . ml, . g), and comparable to the intra-observer (~ .. ml, . g) mae reported in [ ], but further investigations are required to assess the performance on more heterogeneous populations. b. strain measures the application of myocardial strain to quantify abnormal deformation in disease requires accurate definition of normal ranges. 
however, previously reported normal ranges vary widely between modalities and techniques, particularly for radial ess [ ]. in this study we showed deepstrain generated strain measures with narrow ci in healthy subjects from across three different datasets (table ). although direct comparison with the literature is difficult due to differences in the datasets, overall our strain measures agreed with several reported results. specifically, circumferential strain is in agreement with studies in healthy participants based on tagging (- . %, n= ) and speckle tracking echocardiography (- %, n= ) datasets [ ], [ ], as well as a recently proposed (- . % basal, n= ) tagging-based dl method [ ]. radial strain is in agreement with tagging-based ( . %, n= ; . % basal, n= ) studies [ ], [ ], but is lower than most reported values [ ]. this is a result of smoothing regularization used during training to prevent overfitting. however, lowering the regularization without increasing the size of the training set would lead to increased epe and wider ci. sr measures derived with deepstrain were also in good agreement with previous tagging-based studies [ ]. the cmac dataset enabled us to compare our results to non-learning methods using a common dataset. 
fig. . global and regional strain measures of representative subject. (a) regional end-systolic strain measures show visual agreement (orange arrow). (c) global strain and strain rate (sr) measures also show visual agreement.
we found that aepe was lower with tagging-based techniques, reflecting the advantage of estimating cardiac motion from a grid of intrinsic tissue markers (i.e., grid tagging lines). further, the tagging techniques also benefited from the fact that landmarks were placed at the center of the ventricle, whereas motion estimation from tagging data at the myocardial borders and in thin-walled regions of the lv is less accurate due to the spatial resolution of the tagging grid [ ]. in addition, some of the tagging-mri images did not enclose the whole myocardium and some contained imaging artifacts, which resulted in strain artifacts towards the end of the cardiac cycle (fig. c, black arrow). we found that mevis had the lowest aepe, which could be a result of their image term ( ) that penalizes phase shifts in the fourier domain instead of intensity values, an approach that is less affected by desaturation (i.e., fading) of the tagging grid over time. the upf approach also achieved a low aepe using multimodal integration and d tracking to leverage the strengths of both modalities and improve temporal consistency [ ]. although this approach could in principle be recast as a dl technique using recurrent neural networks [ ], this would require a significant increase in the number of learnable parameters; therefore, very large datasets would be needed to avoid overfitting. using mevis as the tagging reference standard, we found no significant differences in measures of radial and circumferential ess (fig. c). temporally, we found significant differences in sr measures between the two techniques that could be due to drift errors in the mevis implementation, i.e., errors that accumulate in sequential implementations in which motion is estimated frame-by-frame [ ]. 
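as a hedged illustration of the metric discussed above, the sketch below computes an average end-point error as the mean euclidean distance between tracked and reference landmark positions. the function name `average_end_point_error` and the toy coordinates are assumptions made for this example; they are not taken from the deepstrain code or from the challenge evaluation tools, and the averaging convention (over landmarks only) is likewise an assumption.

```python
import math


def average_end_point_error(pred, ref):
    """Mean Euclidean distance between predicted and reference landmarks.

    pred, ref: sequences of equal length, each item a coordinate tuple
    (e.g., (x, y, z) in mm). Names and conventions are illustrative only.
    """
    if len(pred) != len(ref) or len(pred) == 0:
        raise ValueError("pred and ref must be non-empty and of equal length")
    # end-point error per landmark, then the average over landmarks
    errors = [math.dist(p, r) for p, r in zip(pred, ref)]
    return sum(errors) / len(errors)


# toy example with hypothetical coordinates (mm)
ref = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (10.0, 0.0, 0.0)]
aepe = average_end_point_error(pred, ref)
```

in this toy case one landmark is displaced by 5 mm and the other matches exactly, so the average is 2.5 mm.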
although we did not observe considerable improvements in aepe compared to tagging- and cine-based methods, an important advantage of our approach is the reduced computational complexity (~ sec on gpu) relative to the proposed mevis ( - h), iucl ( - h), upf ( h) and inria ( h) approaches [ ]. specifically, because once trained our network does not optimize for a specific test subject (i.e., it does not iterate on the cine-data to generate the desired output), centering, segmentation, and motion estimation for the entire cardiac cycle can be accomplished much faster (< min on cpu). an additional advantage of non-iterative implementations is that we obtain deterministic results. since this implies the exact same motion estimates are generated given the same input, we expect strain measures not to vary meaningfully if the anatomy and function remain fixed. here we studied this property by evaluating the intra-scanner repeatability, an important aspect to consider when assessing the potential clinical utility of deepstrain. global measures of ess showed excellent repeatability with narrow loas and with absolute rcs of less than % on average, and regional analyses also showed the average rc and arc were less than % in more than half of the polar map segments, with the maximum difference being %. 
fig. . intra-scanner repeatability of regional myocardial strain measures. (a) average of subject-specific regional end-systolic strain (ess) maps during two acquisitions. (b) average changes between acquisitions.
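the limits of agreement referred to above are commonly computed in the bland-altman sense, as the mean difference between paired measurements plus or minus 1.96 standard deviations of the differences, which covers about 95% of differences when they are normally distributed. a minimal sketch follows, assuming paired global strain values from two acquisitions; the function name and the strain values are hypothetical and are not results from this study.

```python
import math
import statistics


def limits_of_agreement(acq1, acq2, z=1.96):
    """Bland-Altman bias and limits of agreement for paired measurements.

    Returns (bias, lower, upper), where lower/upper = bias -/+ z * SD of
    the paired differences (sample standard deviation).
    """
    if len(acq1) != len(acq2) or len(acq1) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [a - b for a, b in zip(acq1, acq2)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - z * sd, bias + z * sd


# hypothetical circumferential ess values (%) from two acquisitions
acq1 = [-17.0, -16.0, -18.0, -15.0]
acq2 = [-16.0, -17.0, -17.0, -16.0]
bias, lower, upper = limits_of_agreement(acq1, acq2)
```

with these toy values the differences are symmetric around zero, so the bias is zero and the limits are symmetric.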
finally, all sr measures showed good to excellent repeatability, except for circumferential sre, which was moderate.
c. clinical evaluation
deepstrain could be applied in a wide range of clinical applications, e.g., automated extraction of imaging phenotypes from large-scale databases (e.g., uk biobank [ ]). such phenotypes include global and regional strain, which are important measures in the setting of existing dysfunction with preserved ef [ ]. deepstrain generated measures of global strain and sr over the entire cardiac cycle from a cohort of subjects in < min (fig. ). these results showed that radial sre was reduced in patients with hcm and arv, despite having a normal or increased lv ef. decreased sre with normal ef is suggestive of subclinical lv diastolic dysfunction, which is in agreement with previous findings [ ], [ ]. our results also showed deepstrain-based maps could be used to characterize regional differences between groups (supplementary fig. s ). at an individual level (fig. ), we showed that in mi patients, polar segments with decreased circumferential strain matched myocardial regions with infarcted tissue. further, we showed that the changes in regional strain due to mi can be both diffuse and focal. these abnormalities could be used to discriminate dysfunctional from functional myocardium [ ], or as inputs for downstream classification algorithms [ ]. more generally, deepstrain could be used to extract interpretable features (e.g., strain and sr) for dl diagnostic algorithms [ ], which would make understanding of the pathophysiological basis of classification more attainable [ ].
d. study limitations
a limitation of our study was the absence of important patient information (e.g., age), which would be needed for a more complete interpretation of our strain analysis results, for example to assess the differences in strain values found between the healthy subjects from the acdc and cmac datasets. 
however, using publicly available data enables the scientific community to more easily reproduce our findings, and compare our results to other techniques. another limitation was the absence of longitudinal analyses, i.e., longitudinal strain was not reported because it is normally derived from long-axis cine-mri data not available in the training dataset. the size of the datasets is another potential limitation. the number of patients used for training is much smaller than the number of trainable parameters, potentially resulting in some degree of overfitting. to correct this, the training set for motion estimation could be expanded by validating the proposed segmentation network on more heterogeneous populations. also, while our repeatability results were promising despite testing in only a small number of subjects, repeatability in patient populations was not shown.
e. conclusion
we developed an end-to-end learning-based workflow for strain analysis that is fast, operator-independent, and leverages real-world data instead of making explicit assumptions about myocardial tissue properties or geometry. this approach enabled us to derive strain measures from new data without further training or parameter finetuning, and our measures were robust to image artifacts, repeatable, and comparable to those derived from dedicated tagging data. these technical and practical attributes position deepstrain as an excellent candidate for use in routine clinical studies or data-driven research.
acknowledgment
we acknowledge the support of nvidia corporation with the donation of the titan x pascal gpu used for this research. we also thank p. jodoin (acdc) and c. tobon-gomez (cmac) for their assistance with the challenge datasets.
references
[ ] m. a. konstam and f. m. abboud, “ejection fraction: misunderstood and over-rated (changing the paradigm in categorizing heart failure),” circulation, vol. , no. , pp. – , feb. . [ ] p. claus, a. m. s. omar, g. pedrizzetti, p. p. 
sengupta, and e. nagel, “tissue tracking technology for assessing cardiac mechanics,” jacc: cardiovascular imaging, vol. , no. , pp. – , dec. . [ ] o. a. smiseth, h. torp, a. opdahl, k. h. haugaa, and s. urheim, “myocardial strain imaging: how useful is it in clinical decision making?,” eur heart j, vol. , no. , pp. – , apr. .
fig. strain and strain rate measures computed on the acdc train set.
fig. . regional strain in healthy and patients with mi. myocardial infarction can result in diffused (center) and focal (right) strain reduction.
[ ] m. s. amzulescu, m. de craene, h. langet, a. pasquet, d. vancraeynest, a. c. pouleur, j. l. vanoverschelde, and b. l. gerber, “myocardial strain imaging: review of general principles, validation, and sources of discrepancies,” european heart journal - cardiovascular imaging, mar. . [ ] n. f. osman, s. sampath, e. atalar, and j. l. prince, “imaging longitudinal cardiac strain on short-axis images using strain-encoded mri,” magn. reson. med., vol. , no. , pp. – , aug. . [ ] d. kim, w. d. gilson, c. m. kramer, and f. h. epstein, “myocardial tissue tracking with two-dimensional cine displacement-encoded mr imaging: development and initial evaluation,” radiology, vol. , no. , pp. – , mar. . [ ] n. risum, s. ali, n. t. olsen, c. jons, m. g. khouri, t. k. lauridsen, z. samad, e. j. velazquez, p. sogaard, and j. kisslo, “variability of global left ventricular deformation analysis using vendor dependent and independent two-dimensional speckle-tracking software in adults,” journal of the american society of echocardiography, vol. , no. , pp. 
– , nov. . [ ] a. schuster, v.-c. stahnke, c. unterberg-buchwald, j. t. kowallick, p. lamata, m. steinmetz, s. kutty, m. fasshauer, w. staab, j. m. sohns, b. bigalke, c. ritter, g. hasenfuß, p. beerbaum, and j. lotz, “cardiovascular magnetic resonance feature-tracking assessment of myocardial mechanics: intervendor agreement and considerations regarding reproducibility,” clinical radiology, vol. , no. , pp. – , sep. . [ ] wenzhe shi, xiahai zhuang, haiyan wang, s. duckett, d. v. n. luong, c. tobon-gomez, kaipin tung, p. j. edwards, k. s. rhode, r. s. razavi, s. ourselin, and d. rueckert, “a comprehensive cardiac motion estimation framework using both untagged and -d tagged mr images based on nonrigid registration,” ieee trans. med. imaging, vol. , no. , pp. – , jun. . [ ] g. pedrizzetti, p. claus, p. j. kilner, and e. nagel, “principles of cardiovascular magnetic resonance feature tracking and echocardiographic speckle tracking for informed clinical use,” journal of cardiovascular magnetic resonance, vol. , no. , p. , dec. . [ ] m. de craene, g. piella, o. camara, n. duchateau, e. silva, a. doltra, j. d’hooge, j. brugada, m. sitges, and a. f. frangi, “temporal diffeomorphic free-form deformation: application to motion and strain estimation from d echocardiography,” medical image analysis, vol. , no. , pp. – , feb. . [ ] t. mansi, x. pennec, m. sermesant, h. delingette, and n. ayache, “ilogdemons: a demons-based registration algorithm for tracking incompressible elastic biological tissues,” int j comput vis, vol. , no. , pp. – , mar. . [ ] r. avazmohammadi, j. s. soares, d. s. li, t. eperjesi, j. pilla, r. c. gorman, and m. s. sacks, “on the in vivo systolic compressibility of left ventricular free wall myocardium in the normal and infarcted heart,” journal of biomechanics, vol. , p. , jun. . [ ] v. kumar, a. j. ryu, a. manduca, c. rao, r. j. gibbons, b. j. gersh, k. chandrasekaran, s. j. asirvatham, p. a. araoz, j. k. oh, a. c. egbe, a. behfar, b. a. 
borlaug, and n. s. anavekar, “cardiac mri demonstrates compressibility in healthy myocardium but not in myocardium with reduced ejection fraction,” international journal of cardiology, vol. , pp. – , jan. . [ ] b. zhu, j. z. liu, b. r. rosen, and m. s. rosen, “image reconstruction by domain transform manifold learning,” arxiv: . [cs], apr. . [ ] p. dong, b. provencher, n. basim, n. piché, and m. marsh, “forget about cleaning up your micrographs: deep learning segmentation is robust to image artifacts,” microsc microanal, pp. – , jul. . [ ] g. simantiris and g. tziritas, “cardiac mri segmentation with a dilated cnn incorporating domain-specific constraints,” ieee j. sel. top. signal process., vol. , no. , pp. – , oct. . [ ] f. isensee, p. jaeger, p. m. full, i. wolf, s. engelhardt, and k. h. maier-hein, “automatic cardiac disease assessment on cine-mri via time-series segmentation and domain specific features,” arxiv: . [cs], vol. , . [ ] c. zotti, z. luo, a. lalande, and p.-m. jodoin, “convolutional neural network with shape prior applied to cardiac mri segmentation,” ieee j. biomed. health inform., vol. , no. , pp. – , may . [ ] m. baldeon calisto and s. k. lai-yuen, “adaen-net: an ensemble of adaptive d– d fully convolutional networks for medical image segmentation,” neural networks, vol. , pp. – , jun. . [ ] k. hammouda, f. khalifa, h. abdeltawab, a. elnakib, g. a. giridharan, m. zhu, c. k. ng, s. dassanayaka, m. kong, h. e. darwish, t. m. a. mohamed, s. p. jones, and a. el-baz, “a new framework for performing cardiac strain analysis from cine mri imaging in mice,” sci rep, vol. , no. , p. , dec. . [ ] e. puyol-anton, b. ruijsink, w. bai, h. langet, m. de craene, j. a. schnabel, p. piro, a. p. king, and m. sinclair, “fully automated myocardial strain estimation from cine mri using convolutional neural networks,” in ieee th international symposium on biomedical imaging (isbi ), washington, dc, , pp. – . [ ] c. qin, w. bai, j. schlemper, s. e. petersen, s. k. 
piechnik, s. neubauer, and d. rueckert, “joint learning of motion estimation and segmentation for cardiac mr image sequences,” arxiv: . [cs], jun. . [ ] m. qiao, y. wang, y. guo, l. huang, l. xia, and q. tao, “temporally coherent cardiac motion tracking from cine mri: traditional registration method and modern cnn method,” med. phys., vol. , no. , pp. – , sep. . [ ] h. yu, s. sun, h. yu, x. chen, h. shi, t. s. huang, and t. chen, “foal: fast online adaptive learning for cardiac motion estimation,” in ieee/cvf conference on computer vision and pattern recognition (cvpr), seattle, wa, usa, , pp. – . [ ] p. chen, x. chen, e. z. chen, h. yu, t. chen, and s. sun, “anatomy- aware cardiac motion estimation,” arxiv: . [cs, eess], aug. . [ ] b. d. de vos, f. f. berendsen, m. a. viergever, m. staring, and i. išgum, “end-to-end unsupervised deformable image registration with a convolutional neural network,” arxiv: . [cs], vol. , pp. – , . [ ] m. a. morales, d. izquierdo-garcia, i. aganj, j. kalpathy-cramer, b. r. rosen, and c. catana, “implementation and validation of a three- dimensional cardiac motion estimation network,” radiology: artificial intelligence, vol. , no. , p. e , jul. . [ ] a. Østvik, e. smistad, t. espeland, e. a. r. berg, and l. lovstakken, “automatic myocardial strain imaging in echocardiography using deep learning,” in deep learning in medical image analysis and multimodal learning for clinical decision support, vol. , d. stoyanov, z. taylor, g. carneiro, t. syeda-mahmood, a. martel, l. maier-hein, j. m. r. s. tavares, a. bradley, j. p. papa, v. belagiannis, j. c. nascimento, z. lu, s. conjeti, m. moradi, h. greenspan, and a. madabhushi, eds. cham: springer international publishing, , pp. – . [ ] j.-u. voigt, g. pedrizzetti, p. lysyansky, t. h. marwick, h. houle, r. baumann, s. pedri, y. ito, y. abe, s. metz, j. h. song, j. hamilton, p. p. sengupta, t. j. kolias, j. d’hooge, g. p. aurigemma, j. d. thomas, and l. p. 
badano, “definitions for a common standard for d speckle tracking echocardiography: consensus document of the eacvi/ase/industry task force to standardize deformation imaging,” european heart journal - cardiovascular imaging, vol. , no. , pp. – , jan. . [ ] o. bernard, a. lalande, c. zotti, f. cervenansky, x. yang, p.-a. heng, i. cetin, k. lekadir, o. camara, m. a. gonzalez ballester, g. sanroma, s. napel, s. petersen, g. tziritas, e. grinias, m. khened, v. a. kollerathu, g. krishnamurthi, m.-m. rohe, x. pennec, m. sermesant, f. isensee, p. jager, k. h. maier-hein, p. m. full, i. wolf, s. engelhardt, c. f. baumgartner, l. m. koch, j. m. wolterink, i. isgum, y. jang, y. hong, j. patravali, s. jain, o. humbert, and p.-m. jodoin, “deep learning techniques for automatic mri cardiac multi- structures segmentation and diagnosis: is the problem solved?,” ieee trans. med. imaging, vol. , no. , pp. – , nov. . [ ] c. tobon-gomez, m. de craene, k. mcleod, l. tautz, w. shi, a. hennemuth, a. prakosa, h. wang, g. carr-white, s. kapetanakis, a. lutz, v. rasche, t. schaeffter, c. butakoff, o. friman, t. mansi, m. sermesant, x. zhuang, s. ourselin, h.-o. peitgen, x. pennec, r. razavi, d. rueckert, a. f. frangi, and k. s. rhode, “benchmarking framework for myocardial tracking and deformation algorithms: an open access database,” medical image analysis, vol. , no. , pp. – , aug. . [ ] american heart association writing group on myocardial segmentation and registration for cardiac imaging:, m. d. cerqueira, n. j. weissman, v. dilsizian, a. k. jacobs, s. kaul, w. k. laskey, d. j. pennell, j. a. rumberger, t. ryan, and m. s. verani, “standardized myocardial segmentation and nomenclature for tomographic imaging of the heart: a statement for healthcare professionals from the cardiac imaging committee of the council on clinical cardiology of the american heart association,” circulation, vol. , no. , pp. – , jan. . 
[ ] m. jaderberg, k. simonyan, a. zisserman, and k. kavukcuoglu, “spatial transformer networks,” arxiv: . [cs], jun. . [ ] s. ioffe and c. szegedy, “batch normalization: accelerating deep network training by reducing internal covariate shift,” arxiv: . [cs], mar. . [ ] b. xu, n. wang, t. chen, and m. li, “empirical evaluation of rectified activations in convolutional network,” arxiv: . [cs, stat], nov. . [ ] k. he, x. zhang, s. ren, and j. sun, “deep residual learning for image recognition,” arxiv: . [cs], dec. . [ ] w. p. segars, g. sturgeon, s. mendonca, j. grimes, and b. m. w. tsui, “ d xcat phantom for multimodality imaging research: d xcat phantom for multimodality imaging research,” medical physics, vol. , no. , pp. – , aug. . [ ] l. wissmann, c. santelli, w. p. segars, and s. kozerke, “mrxcat: realistic numerical phantoms for cardiovascular magnetic resonance,” journal of cardiovascular magnetic resonance, vol. , no. , dec. . [ ] l. tautz, a. hennemuth, and h.-o. peitgen, “motion analysis with quadrature filter based registration of tagged mri sequences,” in statistical atlases and computational models of the heart. imaging and modelling challenges, vol. , o. camara, e. konukoglu, m. pop, k. rhode, m. sermesant, and a. young, eds. berlin, heidelberg: springer berlin heidelberg, , pp. – . [ ] k. mcleod, a. prakosa, t. mansi, m. sermesant, and x. pennec, “an incompressible log-domain demons algorithm for tracking heart tissue,” in statistical atlases and computational models of the heart. imaging and modelling challenges, vol. , o. camara, e. konukoglu, m. pop, k. rhode, m. sermesant, and a. young, eds. 
berlin, heidelberg: springer berlin heidelberg, , pp. – . [ ] e. ferdian, a. suinesiaputra, k. fung, n. aung, e. lukaschuk, a. barutcu, e. maclean, j. paiva, s. k. piechnik, s. neubauer, s. e. petersen, and a. a. young, “fully automated myocardial strain estimation from cardiovascular mri–tagged images using a deep learning framework in the uk biobank,” radiology: cardiothoracic imaging, vol. , no. , p. e , feb. . [ ] r. vallat, “pingouin: statistics in python,” joss, vol. , no. , p. , nov. . [ ] n. painchaud, y. skandarani, t. judge, o. bernard, a. lalande, and p.- m. jodoin, “cardiac mri segmentation with strong anatomical guarantees,” in medical image computing and computer assisted intervention – miccai , vol. , d. shen, t. liu, t. m. peters, l. h. staib, c. essert, s. zhou, p.-t. yap, and a. khan, eds. cham: springer international publishing, , pp. – . [ ] m. khened, v. alex, and g. krishnamurthi, “densely connected fully convolutional network for short-axis cardiac cine mr image segmentation and heart diagnosis using random forest,” in statistical atlases and computational models of the heart. acdc and mmwhs challenges, vol. , m. pop, m. sermesant, p.-m. jodoin, a. lalande, x. zhuang, g. yang, a. young, and o. bernard, eds. cham: springer international publishing, , pp. – . [ ] j. a. san román, j. candell-riera, r. arnold, p. l. sánchez, s. aguadé-bruix, j. bermejo, a. revilla, a. villa, h. cuéllar, c. hernández, and f. fernández-avilés, “quantitative analysis of left ventricular function as a tool in clinical research. theoretical basis and methodology,” revista española de cardiología (english edition), vol. , no. , pp. – , may . [ ] j. p. kelly, r. j. mentz, a. mebazaa, a. a. voors, j. butler, l. roessig, m. fiuzat, f. zannad, b. pitt, c. m. o’connor, and c. s. p. lam, “patient selection in heart failure with preserved ejection fraction clinical trials,” journal of the american college of cardiology, vol. , no. , pp. – , apr. . [ ] b. a. venkatesh, s. 
donekal, k. yoneyama, c. wu, v. r. s. fernandes, b. d. rosen, m. l. shehata, r. mcclelland, d. a. bluemke, and j. a. c. lima, “regional myocardial functional patterns: quantitative tagged magnetic resonance imaging in an adult population free of cardiovascular risk factors: the multi-ethnic study of atherosclerosis (mesa): reference values of strain from tagged mri,” j. magn. reson. imaging, vol. , no. , pp. – , jul. . [ ] d. muraru, u. cucchini, s. mihăilă, m. h. miglioranza, p. aruta, g. cavalli, a. cecchetto, s. padayattil-josè, d. peluso, s. iliceto, and l. p. badano, “left ventricular myocardial strain by three-dimensional speckle-tracking echocardiography in healthy subjects: reference values and analysis of their physiologic and technical determinants,” journal of the american society of echocardiography, vol. , no. , pp. - .e , aug. . [ ] z. gan, j. tang, and x. yang, “left ventricle motion estimation based on unsupervised recurrent neural network,” in ieee international conference on bioinformatics and biomedicine (bibm), san diego, ca, usa, , pp. – . [ ] a. fry, t. j. littlejohns, c. sudlow, n. doherty, l. adamska, t. sprosen, r. collins, and n. e. allen, “comparison of sociodemographic and health-related characteristics of uk biobank participants with those of the general population,” american journal of epidemiology, vol. , no. , pp. – , nov. . [ ] s. chen, j. yuan, s. qiao, f. duan, j. zhang, and h. wang, “evaluation of left ventricular diastolic function by global strain rate imaging in patients with obstructive hypertrophic cardiomyopathy: a simultaneous speckle tracking echocardiography and cardiac catheterization study,” echocardiography, vol. , no. , pp. – , may . [ ] a. j. marian and e. braunwald, “hypertrophic cardiomyopathy: genetics, pathogenesis, clinical manifestations, diagnosis, and therapy,” circ res, vol. , no. , pp. – , sep. . [ ] m. j. w. götte, a. c. van rossum, j. w. r. twisk, j. p. a. kuijer, j. t. marcus, and c. a. 
visser, “quantification of regional contractile function after infarction: strain analysis superior to wall thickening analysis in discriminating infarct from remote myocardium,” journal of the american college of cardiology, vol. , no. , pp. – , mar. . [ ] n. zhang, g. yang, z. gao, c. xu, y. zhang, r. shi, j. keegan, l. xu, h. zhang, z. fan, and d. firmin, “deep learning for diagnosis of chronic myocardial infarction on nonenhanced cardiac cine mri,” radiology, vol. , no. , pp. – , jun. . [ ] q. zheng, h. delingette, and n. ayache, “explainable cardiac pathology classification on cine mri with motion characterization by semi-supervised learning of apparent flow,” arxiv: . [cs, stat], mar. . [ ] p. n. kampaktsis and m. vavuranakis, “diastolic function evaluation,” jacc: cardiovascular imaging, vol. , no. , pp. – , jan. .
improving variant calling using population data and deep learning
nae-chyun chen , ‡,∗, alexey kolesnikov , sidharth goel , taedong yun , pi-chuan chang , †, and andrew carroll , †,∗
department of computer science, johns hopkins university, baltimore, md , usa google health, palo alto, ca and cambridge, ma , usa
corresponding author: cnaechy @jhu.edu; awcarroll@google.com †these authors contributed equally to this work. ‡work performed while an intern at google health.
january ,
abstract large-scale population variant data is often used to filter and aid interpretation of variant calls in a single sample. 
these approaches do not incorporate population information directly into the process of variant calling, and are often limited to filtering, which trades recall for precision. in this study, we modify deepvariant to add a new channel encoding population allele frequencies from the 1000 genomes project. we show that this model reduces variant calling errors, improving both precision and recall. we assess the impact of using population-specific or diverse reference panels. we achieve the greatest accuracy with diverse panels, suggesting that large, diverse panels are preferable to individual populations, even when the population matches sample ancestry. finally, we show that this benefit generalizes to samples with different ancestry from the training data even when the ancestry is also excluded from the reference panel.
background
variant calling [ – ] identifies the positions in an individual genome which differ from a reference or population, and is used to characterize a single sample or build large research cohorts [ , ]. variant calling is non-trivial, because of sequencing errors, systematic errors in mapping to repetitive and variable regions [ ], and imbalanced sampling of alleles needed to identify a heterozygous variant from a homozygous one.
(which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . .
variant calling can be improved by jointly genotyping multiple samples together [ – ], but the raw sequence data for a cohort is not always available, and this process is computationally expensive. instead, large-scale reference panels from a wide range of populations can provide similar information [ , ]. 
recent studies use such information to improve alignment accuracy and reduce biases in alignment [ – ], but there has been little work to incorporate population data into variant calling.
because far more variants are transmitted than arise de novo, real variants in a population tend to recur at various frequencies [ ], while false positives are often either not seen elsewhere in a population, or are seen with a consistent signature [ ]. researchers use this knowledge to filter variant calls, often with rules which lose recall for a gain in precision [ ]. more sophisticated machine-learning filtering methods are used in larger cohorts, such as gnomad, but these also trade recall for precision and only operate on variant calls and summary information [ ].
we reason that including population-level information at an earlier stage in variant calling, when the full read-level data is available, might allow for more effective use of population data. to do this, we adapted deepvariant [ ], which represents bam information as a multi-dimensional pileup and uses a convolutional neural network (cnn) to call variants. because deepvariant learns the features important for variant classification directly from the data, it allows us to feed in the population allele information as an additional channel.
we trained population-aware models and compared them with the default deepvariant v . models, which are agnostic of population information. the population-aware approach reduces the number of errors for all tested datasets, including wgs and wes reads, when using the allele frequencies from genomes. it also shows stronger error reduction efficacy for lower-coverage read sets. while traditional filtering approaches will increase precision at the expense of recall, we observe improvements to both precision and recall with this method.
when incorporating population data, it is also important for fairness and equity to understand how it changes the accuracy of methods for individuals with ancestries outside of those used in the development of the population resources. it is known that many genomic databases have collected more data for the european population than others [ – ]. we demonstrate that even when using frequencies from a genetically distinct population, the population-aware model still performs similarly to the baseline. we find that a reference panel consisting of all ancestries in the genomes project ( genomes) outperforms a reference panel with only one of the genomes population groups, even when that population matches the sample being called. this implies that maximizing the diversity of ancestries in population resources has the potential to improve variant calling for all populations.
the genome in a bottle (giab) truth sets used to train deepvariant are from european, ashkenazi, and asian ancestry. to assess whether the addition of the reference panel information improves variant calling for populations outside of the populations represented in training, we use high quality pacbio hifi [ ] data from the human genome structural variation consortium for an individual of puerto rican ancestry as an evaluation set. we show that an illumina model using the reference panel has superior concordance with the highly accurate pacbio hifi variant calls compared to an illumina model without the reference panel.
results
population information improves deepvariant performance
deepvariant converts input from a bam file into a pileup image with channels representing bases, base qualities, mapping quality, strand, whether the read supports the variant, and whether the base differs from the reference. we modified deepvariant v . to take an additional input channel, the allele-frequency (af) of the variant [ ]. we trained deepvariant models with and without the af channel with the testing samples held out.
we first compared the whole-genome sequencing (wgs) variant calling accuracy for sample hg , sequenced with x coverage from the precisionfda v truth challenge [ ], using the latest giab v . . truth set [ ] (figure ). hg is not used in the training of these deepvariant models, and so acts as an independent holdout to evaluate their quality. the population-aware model has higher accuracy than default deepvariant v . in both precision and recall for both types of variants. it has an overall error reduction of ( . %). for snps, the error rate (defined as 1-f1 score) decreases from . to . ; for indels, the error rate decreases from . to . . notably, the population-aware model improves snp false discovery rate (fdr, defined as 1-precision) from . to . , equivalent to an error reduction of , ( . %) variants.
we then down-sampled the hg reads from x to x to evaluate the performance of the models with lower-coverage datasets. the population-aware method demonstrates a larger improvement in accuracy over default deepvariant v . , reducing overall errors by , ( . %). the error rate decreases from . to . for snps, and . to . for indels. similar to using the x read set, the population-aware model shows the strongest improvement in reducing false-positive snps, reducing fdr from . to . , equivalent to , ( . %) errors.
we further evaluated the performance of the models using two whole-exome sequencing (wes) datasets from a recently released set of genome and exome data [ ] (figure ).
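as an illustration only (this is not the evaluation code used in the study), the error measures quoted above can be computed from true-positive, false-positive and false-negative counts. the sketch below assumes that the error rate is one minus the f1 score, that fdr is one minus precision, that fnr is one minus recall, and that the "overall number of errors" is the sum of false positives and false negatives; the `Counts` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Benchmark counts for one variant type (hypothetical container)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def error_rates(c: Counts) -> dict:
    """Return the three error measures used in the figures."""
    precision = c.tp / (c.tp + c.fp)
    recall = c.tp / (c.tp + c.fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "one_minus_f1": 1 - f1,   # overall error rate
        "fdr": 1 - precision,     # false discovery rate
        "fnr": 1 - recall,        # false negative rate
    }


def error_reduction(baseline: Counts, candidate: Counts) -> tuple:
    """Absolute and relative reduction in total errors (FP + FN).

    Assumption: 'errors' means false positives plus false negatives.
    """
    base_errors = baseline.fp + baseline.fn
    new_errors = candidate.fp + candidate.fn
    removed = base_errors - new_errors
    return removed, removed / base_errors


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    baseline = Counts(tp=90, fp=10, fn=10)
    aware = Counts(tp=92, fp=6, fn=8)
    print(error_rates(baseline))
    print(error_reduction(baseline, aware))
```

reporting both the absolute count and the relative fraction mirrors the "n ( %)" format used in the text; the counts in the usage example are invented and carry no meaning beyond exercising the functions.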
for both wes datasets, the population-aware model outperforms deepvariant v . in overall number of errors. it has an overall error reduction of ( . %) for the idt dataset, and ( . %) for the oslo dataset. it has a slightly higher error rate for snps for the oslo dataset, from . to . , but the difference is smaller than the gain for indels for that dataset. the population-aware model tends to have a larger lead on precision for both types of variants compared to the baseline, but still has similar or better recall.
figure : wgs variant calling error rates for hg (panels: indel and snp; measures: 1-f1, 1-precision (fdr) and 1-recall (fnr)). all results are evaluated using the giab v . . truth set in the high-confidence regions. v . : deepvariant v . ; af: the population-aware model that uses the allele-frequency channel. the column label suffixes show the average coverage of the read sets. lower values correspond to better accuracy.
figure : wes variant calling error rate for hg (panels: indel and snp; measures: 1-f1, 1-precision (fdr) and 1-recall (fnr)). the idt results (“*-idt”) are grch -based and evaluated using the giab v . . truth set; the oslo datasets (“*-oslo”) are grch -based and evaluated using the giab v . . truth set. v . : deepvariant v . ; af: the population-aware model that uses the allele-frequency channel. lower values correspond to better accuracy.
model-specific errors for population-aware models
intuitively, population information helps deepvariant decide whether to make a call based on the commonness of a variant, especially for cases where the variant calling confidence levels are low. with a population-aware model, a variant caller should be more likely to make a positive variant call for a candidate with high allele frequency, and less likely to make a call when seeing a rare candidate variant.
to understand the influence of allele frequencies in the model, we design an analysis framework to compare a population-agnostic model with a population-aware model. we call this a model-specific error analysis. we stratify the errors into three groups: population-resolved, population-induced and common. the population-resolved variants are called correctly with the allele frequency model, but called incorrectly when using the baseline model. we say such errors are “rescued” by population information. the population-induced errors are specific to the population-aware model, i.e. they are induced by the extra features. the common group contains errors made by both models. the common errors are viewed as ones more difficult to solve without major changes in the data processing pipeline, such as variant caller, upstream computational methods, or sequencing technology. thus, in this analysis we focus on the first two groups. for simplicity, we only considered bi-allelic calls in this analysis, which are the majority of overall errors.
we used the x hg wgs dataset to perform the model-specific error analysis.
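the three-way stratification described above amounts to set operations on the error calls of the two models. the following is a minimal sketch under stated assumptions: each erroneous call is identified by a hypothetical `(chrom, pos, ref, alt)` tuple, and two calls are considered the same error only if these keys are identical. the study itself reports using bcftools isec for this step; the function name here is illustrative.

```python
def stratify_errors(baseline_errors, aware_errors):
    """Split errors of two models into the three groups named in the text.

    baseline_errors: errors made by the population-agnostic model
    aware_errors:    errors made by the population-aware model
    Both are iterables of hashable variant keys, e.g. (chrom, pos, ref, alt).
    """
    baseline_errors = set(baseline_errors)
    aware_errors = set(aware_errors)
    return {
        # wrong in the baseline, correct with allele frequencies ("rescued")
        "population_resolved": baseline_errors - aware_errors,
        # correct in the baseline, wrong with allele frequencies
        "population_induced": aware_errors - baseline_errors,
        # wrong in both models
        "common": baseline_errors & aware_errors,
    }


if __name__ == "__main__":
    # Made-up error calls, for illustration only.
    baseline = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
    aware = {("chr1", 200, "C", "T"), ("chr1", 300, "G", "A")}
    for group, calls in stratify_errors(baseline, aware).items():
        print(group, sorted(calls))
```

the same function can be applied separately to false positives and to false negatives, which is how the two error types are treated in the analysis that follows.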
after extracting model-specific erroneous calls, we matched the calls with the genomes variants to obtain associated allele frequencies. we first examined the relationship between allele frequency (af) and variant allele fraction (vaf), which is the fraction of reads supporting an alternate allele in a given sample, of each false-positive call. there is an observable distinction between the population-induced group and the population-resolved group in the vaf-af plots (figure , left and middle panels). among the population-resolved false-positive errors, more than two thirds ( . %) are uncommon (allele frequency ≤ %) among the genomes samples, whereas there are only . % uncommon variants for population-induced false positives. this observation supports the hypothesis that the population-aware model uses allele frequency to adjust its variant calls.
we then investigated bi-allelic false-negative errors, as shown in the right panel in figure . variant allele fraction for false negatives is not always available because many false negatives are not identified as a variant candidate due to reasons including low read coverage, incorrect mapping or insufficient sensitivity in variant candidate discovery. thus, we only evaluated the allele frequency distribution for false negatives. we noticed a significant difference in the number of common variants (with greater than % allele frequency). among all population-resolved false negatives, . % ( , out of , ) are common variants. for population-induced false negatives, . % ( out of ) are uncommon.
the model-specific analysis highlights the difference of the deepvariant models with or without the af channel. with the additional population information, deepvariant is capable of adjusting the calls according to the commonness of a variant and shows improvements in both precision and recall.
figure : errors specific to a population-agnostic model (in blue) and a population-aware model (in red) using x hg wgs data.
performance on zero-frequency variants
a potential concern for population-aware variant calling models is an increased false negative rate for novel alleles. since it is not trivial to define a set of truly novel variants in the genomes project, we extracted variants with zero allele frequency to investigate the impact when population information is included in a variant calling model. using the giab v . . truth set, there are , ( . %) snps and , ( . %) indels that have zero allele frequency for sample hg . we then use the zero-frequency variant set to evaluate recall of actual variant calls using hap.py [ ].
we observed that recall on zero-frequency variants is lower than on the remaining variants for all deepvariant models, regardless of variant type and of whether population information is used. with x reads, the false-negative rate (fnr, or 1-recall) of the population-agnostic model is . for snps and . for indels (figure ). the fnrs further increase to . for snps and . for indels when using the population-aware model. when using x reads, the drop in accuracy gets larger for both types of variants. this is consistent with our analysis that the population-aware deepvariant model requires stronger evidence (higher-quality pileup images) to call zero-frequency variants, thus reducing recall. further, the population information has a stronger influence in variant calling for low-coverage datasets. despite the disadvantages, the negative impact on zero-frequency variants is small compared to overall error reduction.
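the zero-frequency comparison above can be pictured as computing the false negative rate separately for truth variants with and without a nonzero panel frequency. the sketch below is a simplification and rests on several assumptions: variants are matched by exact key rather than by the haplotype-aware comparison that hap.py performs, a truth variant absent from the panel is given frequency 0.0, and the function and argument names are hypothetical.

```python
def fnr_by_frequency_class(truth_af, called):
    """False negative rate for zero-frequency and nonzero-frequency truth variants.

    truth_af: dict mapping a truth variant key to its panel allele frequency
              (0.0 when the allele is not observed in the panel)
    called:   set of variant keys reported by the caller
    Returns a dict of FNR per class; None when a class is empty.
    """
    totals = {"zero_frequency": 0, "nonzero_frequency": 0}
    missed = {"zero_frequency": 0, "nonzero_frequency": 0}
    for key, af in truth_af.items():
        group = "zero_frequency" if af == 0 else "nonzero_frequency"
        totals[group] += 1
        if key not in called:
            missed[group] += 1  # truth variant not recovered: false negative
    return {
        group: (missed[group] / totals[group] if totals[group] else None)
        for group in totals
    }


if __name__ == "__main__":
    # Made-up truth set and calls, for illustration only.
    truth = {"v1": 0.0, "v2": 0.0, "v3": 0.12, "v4": 0.4}
    calls = {"v1", "v3", "v4"}
    print(fnr_by_frequency_class(truth, calls))
```

running the same function on calls from the population-agnostic and the population-aware model gives the per-class comparison discussed in the text.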
to better understand the zero-frequency variants, we called variants using the deepvariant pacbio model with the precisionfda v x hg reads set sequenced with the pacbio hifi technology [ ]. the fnrs for the zero-frequency variants improve to . for snps and . for indels. the large difference in recall/fnr indicates that many of the zero-frequency variants are hard to genotype using illumina reads, and may not be novel mutations relative to samples in reference panels. in the future, reference panels utilizing high-quality long reads will likely provide better allele frequency estimates and improve the population-aware model performance.
figure : the false negative rate (fnr) of zero-frequency variants for hg with different models (panels: indel and snp; y-axis: 1-recall (fnr)). lower values correspond to better accuracy.
assessing biases using different genomes populations
it is important to understand if the inclusion of population information reduces deepvariant’s performance for populations that are not well represented, especially when they have a large genomic difference with the reference panel. we first note that ashkenazi jewish, the ethnicity of hg , is not among the ethnicities collected by genomes. using a testing sample not in the reference panel reduces the risk of bias. second, we ran inference on the population-aware model using different reference panels of allele frequencies. we split the genomes samples into five groups based on the superpopulation labels (african, afr; admixed american, amr; east asian, eas; european, eur; south asian, sas) and calculated allele frequencies for each superpopulation.
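to make the per-superpopulation calculation concrete, the sketch below computes alternate allele frequencies for one bi-allelic site, both per group and across all samples. it is an illustration, not the pipeline used in the study: the input layout (a dict of diploid genotypes and a dict of sample labels) and the function name are assumptions, and missing alleles (`None`) are counted as reference so that every site shares the same denominator, in the spirit of the missing-genotype handling described in the methods.

```python
from collections import defaultdict


def allele_frequencies_by_group(genotypes, sample_to_group):
    """Alternate allele frequency per group for a single bi-allelic site.

    genotypes:       dict sample -> tuple of alleles (0 = ref, 1 = alt,
                     None = missing, counted here as ref)
    sample_to_group: dict sample -> superpopulation label, e.g. "EUR"
    Returns a dict label -> frequency, including the pooled label "ALL".
    """
    alt_count = defaultdict(int)
    total = defaultdict(int)
    for sample, alleles in genotypes.items():
        group = sample_to_group[sample]
        for allele in alleles:
            for label in (group, "ALL"):
                total[label] += 1
                if allele == 1:
                    alt_count[label] += 1
    return {label: alt_count[label] / total[label] for label in total}


if __name__ == "__main__":
    # Made-up genotypes, for illustration only.
    genotypes = {"s1": (0, 1), "s2": (1, 1), "s3": (0, 0), "s4": (None, None)}
    labels = {"s1": "EUR", "s2": "EUR", "s3": "EAS", "s4": "EAS"}
    print(allele_frequencies_by_group(genotypes, labels))
```

excluding a population from the panel, as done later for the puerto rican samples, corresponds to dropping those samples from `genotypes` before calling the function.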
we show that all population-aware approaches outperform the baseline for snps but underperform it for indels when evaluated using hg (figure ). when considering the overall number of errors, only the model inferred with eas frequencies makes more errors than the baseline, but the deficit ( , or . %) is small.
figure : variant calling accuracy when inferring x illumina reads from hg using default deepvariant v . (v . ), allele frequencies in the entire genomes (all) and five genomes superpopulations (eur, amr, sas, afr and eas) (panels: indel and snp; measures: 1-f1, 1-precision (fdr) and 1-recall (fnr)). lower values correspond to better accuracy.
we also compared the performance of using different superpopulation frequencies and observed a correlation between variant calling accuracy and the distance between the tested sample and ethnicity groups. according to the principal component (pc) analysis performed by gnomad v [ ], ashkenazi jewish is closer to the european populations and is farther from east asian and african in the pc -pc space. we observed that using frequencies from a genetically closer population usually resulted in higher variant calling accuracy. using eur frequencies outperforms using other population frequencies, only falling behind using the entire genomes. on the other hand, using eas frequencies results in the highest number of errors among all population-aware methods.
we point out that using genomes frequencies from all populations results in the lowest number of errors among all population-aware results, suggesting an advantage to using a diverse population over finding a genetically similar group. this finding echoes our previous statement that we anticipate the population-aware variant calling model to improve further with larger-scale and more diverse population callsets.
silver-standard truth set for hg
genome-in-a-bottle (giab) truth variant sets provide gold standards to benchmark variant callers, but until now there are only three samples (hg -hg -hg , the ashkenazi trio) with curated calls in difficult-to-map regions added in the v . . release [ ]. further, the samples are from the same ancestry, making it challenging to perform a generalized benchmarking considering the genetic diversity of the human population. to deal with this difficulty, it is desirable to have other high-quality variant sets from non-giab samples, preferably from ancestries not covered by giab. thus, we called variants using the deepvariant pacbio model with x high-coverage pacbio hifi reads [ ] for hg , a puerto rican (labelled as pur under the amr superpopulation in genomes) sample.
figure : variant calling results when evaluated using hg data, compared to the pacbio-deepvariant silver-standard truth set (panel: snp; measures: 1-f1, 1-precision (fdr) and 1-recall (fnr)). lower values correspond to better accuracy.
the deepvariant pacbio model has a snp f1 score higher than . % and is one of the most accurate models using pacbio hifi data [ ].
we used the deepvariant hg pacbio snp calls as a “silver-standard” truth set and benchmarked the performance for models using illumina reads. we excluded the puerto rican population when calculating allele frequencies to avoid biases in favor of the population-aware models. we used x illumina wgs reads sequenced by the new york genome center to test all hg models. because genomes includes a collection of pur samples, we excluded all pur samples and re-calculated allele frequencies for both genomes and the amr superpopulation.
the population-aware model has a lower snp error rate ( . vs. . ), fdr ( . vs. . ) and fnr ( . vs. . ) than the baseline for hg (figure ). the number of snp errors is reduced by , ( . %). similar to the finding using hg , the population-aware model performs strongly with a down-sampled ( x) read set. the error rate for the x read set is reduced from . to . , and the snp error reduction is , ( . %).
we also tested the model using different superpopulation frequencies (figure ). all population-aware models except eas have lower snp error rates than the baseline. when inferred using the eas allele frequencies, the snp error rate increased from . to . , equivalent to ( . %) more errors. all population-aware models, including eas, outperform the baseline on fdr and only eas has a higher fnr than the baseline ( . vs. . ).
figure : number of snp errors when evaluated using x wgs reads from a puerto rican sample hg (panel: snp; measures: 1-f1, 1-precision (fdr) and 1-recall (fnr); models: v . , all, eur, amr, sas, afr and eas). all models other than v . are population-aware, inferred using allele frequencies from different populations. lower values correspond to better accuracy.
discussion
we designed a new population-aware deepvariant model which can incorporate both base- and read-level information with the population information. we find that population-aware models reduce error rates by . % for wgs and . - . % for wes compared to population-agnostic baselines (default deepvariant v . ). the relative advantage of the population-aware model increases at lower coverage ( . % reduction at x and . % at x). the increased accuracy at lower coverage suggests that population information is most valuable in difficult examples, where read-level information alone may not be sufficient for confident calling. in population sequencing projects, this finding could be relevant to the question of whether to sequence more individuals at lower coverage, or fewer at a high coverage. when sequencing for a species without a reference panel, it is possible that sequencing more, diverse individuals at lower coverage could still retain comparable accuracy to traditional methods which do not incorporate population information in calling.
we evaluate potential biases introduced by population information in variant calling by comparing population-aware models that use allele frequencies from different genomes superpopulations. this experiment simulates a scenario where the tested sample is genetically distinct from the reference panel. only one population-aware method (inferred with eas frequencies) underperforms the baseline in total number of errors, but with a small deficit. furthermore, using allele frequencies calculated from the entire genomes outperforms population-specific methods. this finding implies that a diverse population can provide more benefits than using a homogeneous one, even when the homogeneous population is more genetically similar to the tested sample.
this finding may inform efforts to build population or country-specific resources.
increasing the number of samples for a given population will improve accuracy for that population, but the inclusion of samples from diverse populations will also improve the resource. we believe that the accuracy of the population-aware model can further improve with a larger and more diverse population callset in the future, reinforcing the benefit of collaboration between nation-scale efforts.
we provide an additional “silver-standard” snp set for a puerto rican sample, hg , from a population not present in the labeled training data. we used high-coverage pacbio hifi reads and an accurate deepvariant pacbio model to generate this high-quality call set. this method can provide high-confidence snp calls for non-giab samples and increase population diversity when assessing variant calling results. similar to the results using hg data, we show that the proposed model has strong performance compared to the baseline, and only suffers slight loss of accuracy when inferred using a distinct population. when more high-coverage pacbio hifi data become available in the future, the high-quality calls generated by deepvariant can provide a more diversified dataset for variant calling benchmarking and downstream analysis.
despite greater overall accuracy, we note that the population-aware model underperforms on variants with zero allele frequencies in genomes. although the disadvantage is small compared to the overall gain, this result suggests that the decision of whether to use population-aware models should consider the end goal.
if reducing potential false positives is a larger concern, the use of a population-aware method could be recommended, but if the goal is to maximize recall of rare or novel variants, traditional methods could be preferred. we also notice that all tested illumina models performed poorly on the zero-frequency variants, regardless of using population information or not. by analyzing the variants with pacbio reads, we point out that many zero-frequency variants in genomes are located in difficult-to-map regions and are likely not genetically novel in the population. this suggests that the power of population-aware methods should increase as large panels of long-read population data become available.
methods
training the model
we trained the model following the procedure described in [ ], with additional illumina wgs datasets included [ ]. variants in chromosomes to are used as the training examples, and those in chromosome and are used for tuning. variants in chromosome are never used in the training process.
datasets
the model is evaluated using the giab v . . truth set for hg across whole genomes [ ]. we also generated another high-quality snp set using deepvariant v . and hg pacbio hifi data [ ] across the whole genome. we used the intersection of high-confidence regions of hg , hg , and hg (giab v . . ) as the high-confidence regions for the hg snp set. the read sets used for experiments are listed in table and the read sets for supporting experiments are provided in table .
table : testing datasets.
sample | ethnicity | truth variant | dataset
hg | ashkenazi jewish | v . . (grch ) | x illumina wgs [ ]; x illumina wes [ ]
hg | ashkenazi jewish | v . . (grch ) | x illumina wes [ ]
hg | puerto rican | deepvariant v . pacbio snp calls (grch ) | x illumina wgs (nygc)
table : other datasets used in this study.
sample | ethnicity | dataset
hg | ashkenazi jewish | x pacbio hifi [ ]
hg | puerto rican | x pacbio hifi [ ]
allele matching algorithm
when incorporating population information in deepvariant, we need to match a variant candidate with a cohort variant. however, this is not a straightforward task since a variant can be represented in multiple formats [ , ]. a common approach is to normalize variants, such as using bcftools norm [ ], but that’s not sufficient for complicated cases. we designed an algorithm that constructs local haplotypes and performs precise allele matching (figure ). the algorithm starts with querying all cohort variants $V_c$ overlapping a window $[\mathrm{start}_v, \mathrm{end}_v)$, where $\mathrm{start}_v$ and $\mathrm{end}_v$ are the starting and ending positions of a variant candidate $v$ respectively. the queried cohort variants and the candidate variant form the set $V \equiv \{v\} \cup V_c$. then the window is extended to the smallest starting position and the largest ending position within $V$, as $[\mathrm{start}_V, \mathrm{end}_V)$, where $\mathrm{start}_V \equiv \min_{u \in V} \mathrm{start}_u$ and $\mathrm{end}_V \equiv \max_{w \in V} \mathrm{end}_w$. the local reference haplotype is queried from the reference genome in the window $[\mathrm{start}_V, \mathrm{end}_V)$. for each variant allele in $V$, its allele haplotype is updated in this window. if there’s a perfect match between a cohort allele haplotype and a candidate allele haplotype, the allele frequency of the cohort allele is added to an allele frequency dictionary, using the alternate allele of the candidate variant as key. afterwards, deepvariant looks up the dictionary when processing reads overlapped with the candidate variant.
allele frequency channel for deepvariant
to take full advantage of the cnn-based classifier of deepvariant, allele frequencies need to be encoded in pileup images. we apply a logarithmic transformation to gain resolution for low-frequency signals.
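the allele matching procedure described in the preceding subsection can be sketched in python as follows. this is one illustrative reading of the text, not the deepvariant implementation: the `Variant` class, the function names, the 0-based half-open coordinates and the toy reference sequence are all assumptions made for the example. the toy case uses a deletion of one repeat unit that is written at two different positions by the candidate and by the cohort variant, which position-and-allele comparison alone would not match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A variant on a single contig (hypothetical container)."""
    start: int        # 0-based start of `ref` on the reference
    ref: str          # reference allele
    alts: tuple       # alternate alleles
    afs: tuple = ()   # one frequency per alt (cohort variants only)

    @property
    def end(self) -> int:
        # Half-open end coordinate of the reference allele.
        return self.start + len(self.ref)


def allele_haplotype(reference, win_start, win_end, variant, alt):
    """Reference in [win_start, win_end) with `alt` substituted for variant.ref."""
    return reference[win_start:variant.start] + alt + reference[variant.end:win_end]


def match_allele_frequencies(reference, candidate, cohort):
    """Return {candidate alt allele: cohort allele frequency} for matching haplotypes."""
    # 1. Query cohort variants overlapping the candidate window.
    overlapping = [c for c in cohort
                   if c.start < candidate.end and candidate.start < c.end]
    # 2. Extend the window to cover the candidate and all queried variants.
    group = overlapping + [candidate]
    win_start = min(v.start for v in group)
    win_end = max(v.end for v in group)
    # 3. Build one haplotype per cohort allele inside the extended window.
    cohort_haplotypes = {}
    for c in overlapping:
        for alt, af in zip(c.alts, c.afs):
            cohort_haplotypes[allele_haplotype(reference, win_start, win_end, c, alt)] = af
    # 4. Keep frequencies of cohort haplotypes that equal a candidate haplotype.
    matched = {}
    for alt in candidate.alts:
        hap = allele_haplotype(reference, win_start, win_end, candidate, alt)
        if hap in cohort_haplotypes:
            matched[alt] = cohort_haplotypes[hap]
    return matched


if __name__ == "__main__":
    # Toy reference with a tandem repeat of "TTCCA" (made up for illustration).
    reference = "GATTCCATTCCAGC"
    candidate = Variant(start=1, ref="ATTCCA", alts=("A",))
    cohort = [
        Variant(start=6, ref="ATTCCA", alts=("A",), afs=(0.25,)),  # same deletion
        Variant(start=1, ref="A", alts=("G",), afs=(0.1,)),        # unrelated SNP
    ]
    print(match_allele_frequencies(reference, candidate, cohort))
```

in this sketch a later cohort allele overwrites an earlier one if both produce the same haplotype; the text does not say how such ties are handled, so that choice is an assumption.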
for each variant candidate, an additional allele frequency channel is added to the pileup image. in this channel, a read is colored by the transformed frequency of its allele at the variant candidate position. a read can carry multiple alternate alleles with different frequencies, so its color intensity may vary across pileup images, where the variant candidates differ. an alternative method to encode allele frequencies is to include the information as features in the fully-connected layers [ ], but this approach sacrifices the capability to incorporate allele frequencies with base- and read-level information and thus is not adopted.
(figure content, panels a to d: two cohort variants (ref=tttcca, alt=t,tttccattcca; ref=ttccag, alt=t), a variant candidate (ref=attccag, alt=at), the local reference tttccattccag, the updated haplotypes, and the resulting dictionary of candidate allele frequencies.)
figure : an example for the allele matching algorithm. this algorithm first queries cohort variants overlapped with the variant candidate. these cohort variants and the candidate determine the window where haplotypes are updated. the frequencies of matched allele haplotypes are then updated for the variant candidate as a dictionary.
in this diagram, haplotypes are updated with dashes to keep sequences aligned for better visualization. in practice, dash-free haplotypes are generated by the allele matching algorithm.
to enable the allele frequency channel, users need to set the flag --use_allele_frequency and provide deepvariant with cohort variants in vcf format via the flag --population_vcfs.
model-specific error analysis
we compared actual variant calls with giab v . . truth variants using bcftools isec. variants specific to actual calls are regarded as false positives, and those specific to the truth set are regarded as false negatives. we generated the false-positive and false-negative sets for two models, and then applied bcftools isec again to obtain model-specific false positives and false negatives. for both sets, we applied the allele matching algorithm to obtain allele frequencies for the variants. for the false-positive sets, we extracted variant allele fractions from the vcf files generated by deepvariant.
genomes frequencies from the deepvariant-glnexus pipeline
we used the genomes reference panel generated with the deepvariant-glnexus pipeline (v ) [ ] for all population-aware experiments, including training and inferring the models. we fill the missing genotypes with the reference genotypes with bcftools +missing ref to make sure all variants have the same denominator.
availability of data and materials
the deepvariant source code is available at https://github.com/google/deepvariant under the bsd- -clause license. the pacbio-based hg snp set is available at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/allele_frequency/hg _snp_set.
the pre-trained population-aware deepvariant models are available at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/allele_frequency/pretrained_model_wgs (wgs) and https://console.cloud.google.com/storage/browser/brain-genomics-public/research/allele_frequency/pretrained_model_wes (wes). the vcf files used in this study are available at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/ kgp/cohort_dv_glnexus_opt/v _missing ref (grch ) and https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/ kgp/cohort_dv_glnexus_opt/v _grch _missing ref (grch ).

ethics approval and consent to participate
not applicable.

consent for publication
not applicable.

competing interests
ak, sg, ty, pc and ac are employees of google llc and own alphabet stock as part of the standard compensation package. this study was funded by google llc.
funding
all compute resources used in this work were provided by google, llc. ak, sg, ty, pc and ac are full-time, salaried employees of google, llc. nc contributed to this work as a salaried intern of google, llc.
acknowledgments
we thank babak alipanahi, gunjan baid, daniel cook, alexander d’amour, hojae lee, cory mclean, maria nattestad and other colleagues at google for their feedback on this manuscript and the project in general. the hg illumina data were generated at the new york genome center with funds provided by nhgri grant um hg - s .

authors’ contributions
nc, ak, pc and ac designed the method. nc, ak and pc implemented the software. nc and pc performed the experiment. nc, ak, sg, ty, pc and ac analyzed the results. nc, pc and ac wrote the manuscript. all authors read and approved the final manuscript.

references
depristo, m. a., banks, e., poplin, r., garimella, k. v., maguire, j. r., hartl, c., philippakis, a. a., del angel, g., rivas, m. a., hanna, m., et al. a framework for variation discovery and genotyping using next-generation dna sequencing data. nature genetics , ( ).
poplin, r., chang, p.-c., alexander, d., schwartz, s., colthurst, t., ku, a., newburger, d., dijamco, j., nguyen, n., afshar, p. t., et al. a universal snp and small-indel variant caller using deep neural networks. nature biotechnology , – ( ).
krusche, p., trigg, l., boutros, p. c., mason, c. e., francisco, m., moore, b. l., gonzalez-porta, m., eberle, m. a., tezak, z., lababidi, s., et al. best practices for benchmarking germline small-variant calls in human genomes. nature biotechnology , – ( ).
karczewski, k. j., francioli, l. c., tiao, g., cummings, b. b., alföldi, j., wang, q., collins, r. l., laricchia, k. m., ganna, a., birnbaum, d. p., et al. the mutational constraint spectrum quantified from variation in , humans. nature , – ( ).
genomes project consortium et al. a global reference for human genetic variation. nature , – ( ).
li, h. toward better understanding of artifacts in variant calling from high-coverage samples. bioinformatics , – ( ).
lin, m. f., rodeh, o., penn, j., bai, x., reid, j. g., krasheninina, o. & salerno, w. j. glnexus: joint variant calling for large cohort sequencing. biorxiv, ( ).
yun, t., li, h., chang, p.-c., lin, m. f., carroll, a. & mclean, c. y. accurate, scalable cohort variant calls using deepvariant and glnexus. biorxiv ( ).
poplin, r., ruano-rubio, v., depristo, m. a., fennell, t. j., carneiro, m. o., van der auwera, g. a., kling, d. e., gauthier, l. d., levy-moonshine, a., roazen, d., et al. scaling accurate genetic variant discovery to tens of thousands of samples. biorxiv, ( ).
chen, n.-c., solomon, b., mun, t., iyer, s. & langmead, b. reducing reference bias using multiple population reference genomes. biorxiv ( ).
rautiainen, m. & marschall, t. graphaligner: rapid and versatile sequence-to-graph alignment. genome biology , – ( ).
garrison, e., sirén, j., novak, a. m., hickey, g., eizenga, j. m., dawson, e. t., jones, w., garg, s., markello, c., lin, m. f., et al. variation graph toolkit improves read mapping by representing genetic variation in the reference. nature biotechnology , – ( ).
witherspoon, d. j., wooding, s., rogers, a. r., marchani, e. e., watkins, w. s., batzer, m. a. & jorde, l. b. genetic similarities within and between human populations. genetics , – ( ).
abramovs, n., brass, a. & tassabehji, m. hardy-weinberg equilibrium in the large scale genomic sequencing era. frontiers in genetics , ( ).
pedersen, b. s., brown, j. m., dashnow, h., wallace, a. d., velinder, m., tvrdik, t., mao, r., best, h. d., bayrak-toydemir, p. & quinlan, a. r. effective variant filtering and expected candidate variant yield in studies of rare human disease. biorxiv ( ).
sirugo, g., williams, s. m. & tishkoff, s. a. the missing diversity in human genetic studies. cell , – ( ).
martin, a. r., kanai, m., kamatani, y., okada, y., neale, b. m. & daly, m. j. clinical use of current polygenic risk scores may exacerbate health disparities. nature genetics , – ( ).
mcguire, a. l., gabriel, s., tishkoff, s. a., wonkam, a., chakravarti, a., furlong, e. e., treutlein, b., meissner, a., chang, h. y., lópez-bigas, n., et al. the road ahead in genetics and genomics. nature reviews genetics , – ( ).
wenger, a. m., peluso, p., rowell, w. j., chang, p.-c., hall, r. j., concepcion, g. t., ebler, j., fungtammasan, a., kolesnikov, a., olson, n. d., et al. accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. nature biotechnology , – ( ).
carroll, a. & chang, p.-c. improving the accuracy of genomic analysis with deepvariant . https://ai.googleblog.com/ / /improving-accuracy-of-genomic-analysis.html. . (accessed: - - ).
olson, n. d., wagner, j., mcdaniel, j., stephens, s. h., westreich, s. t., prasanna, a. g., johanson, e., boja, e., maier, e. j., serang, o., et al. precisionfda truth challenge v : calling variants from short-and long-reads in difficult-to-map regions. biorxiv ( ).
wagner, j., olson, n. d., harris, l., khan, z., farek, j., mahmoud, m., stankovic, a., kovacevic, v., wenger, a. m., rowell, w. j., et al. benchmarking challenging small variants with linked and long reads. biorxiv ( ).
baid, g., nattestad, m., kolesnikov, a., goel, s., yang, h., chang, p.-c. & carroll, a. an extensive sequence dataset of gold-standard samples for benchmarking and development. biorxiv. eprint: https://www.biorxiv.org/content/early/ / / / . . . .full.pdf. https://www.biorxiv.org/content/early/ / / / . . . ( ).
porubsky, d., ebert, p., audano, p. a., vollger, m. r., harvey, w. t., marijon, p., ebler, j., munson, k. m., sorensen, m., sulovari, a., et al. fully phased human genome assembly without parental data using single-cell strand sequencing and long reads. nature biotechnology. issn: - . https://doi.org/ . /s - - - (dec. ).
zook, j. m., catoe, d., mcdaniel, j., vang, l., spies, n., sidow, a., weng, z., liu, y., mason, c. e., alexander, n., et al. extensive sequencing of seven human genomes to characterize benchmark reference materials. scientific data , – ( ).
sun, c. & medvedev, p. varmatch: robust matching of small variant datasets using flexible scoring schemes. bioinformatics , – ( ).
li, h. a statistical framework for snp calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. bioinformatics , – ( ).
yi, r., chang, p.-c., baid, g. & carroll, a. learning from data-rich problems: a case study on genetic variant calling. arxiv preprint arxiv: . ( ).
liquidcna: tracking subclonal evolution from longitudinal liquid biopsies using somatic copy number alterations

eszter lakatos*, helen hockings, maximilian mossner, weini huang, michelle lockley, trevor a. graham*

centre for genomics and computational biology, barts cancer institute, queen mary university of london, london, uk
centre for cancer cell and molecular biology, barts cancer institute, queen mary university of london, london, uk
barts health nhs trust, st bartholomew’s hospital, west smithfield, london, uk
school of mathematical sciences, queen mary university of london, london, uk
department of gynaecological oncology, cancer services, university college london hospital, london, uk
* correspondence: e.lakatos@qmul.ac.uk; t.graham@qmul.ac.uk

abstract
cell-free dna (cfdna) measured via liquid biopsies provides a way for minimally-invasive monitoring of tumour evolutionary dynamics during therapy. here we present liquidcna, a method to track subclonal evolution from longitudinally collected cfdna samples based on somatic copy number alterations (scnas).
liquidcna utilises scna profiles derived through cost-effective low-pass whole genome sequencing to automatically and simultaneously genotype and quantify the size of the dominant subclone without requiring prior knowledge of the genetic identity of the emerging clone. we demonstrate the accuracy of liquidcna in synthetically generated sample sets and in in vitro and in silico mixtures of cancer cell lines. application in vivo in patients with metastatic lung cancer reveals the progressive emergence of a novel tumour sub-population. liquidcna is straightforward to use, computationally inexpensive and enables continuous monitoring of subclonal evolution to understand and control therapy-induced resistance.

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license (http://creativecommons.org/licenses/by-nc-nd/ . /).

introduction
liquid biopsies, primarily the analysis of cell-free dna (cfdna) present in blood samples, offer the potential for regular longitudinal and minimally invasive monitoring of cancer dynamics [ , , , , , , ]. circulating cfdna is released into the blood via apoptosis or necrosis of cells. tumour-derived cfdna in the blood is detectable from tumours as small as million cells [ ], shows correlation with disease stage [ , ], and offers the same diagnostic potential as tissue-based biopsies [ ]. cfdna is an aggregate of dna shed from multiple locations and multiple malignant cells across the body, and hence a single sample can provide a comprehensive overview of systemic disease. consequently, cfdna is an exceptional resource for non-invasive tracking of tumour composition and for monitoring response to therapy or clinical relapse.
typically, cfdna analysis has focused on the detection of driver gene single nucleotide variants (snvs), with the size of mutation-bearing clones inferred from the relative sequencing read count at the mutation site. for instance, in high-grade serous ovarian cancer (hgsoc) the frequency of tp mutation in cfdna is a measure of tumour burden and is predictive of treatment response [ ]. in colorectal cancer, kras mutation frequency in cfdna is predictive of response to anti-egfr therapy [ ].

somatic copy number alterations (scnas) are widespread in cancers [ , , ], and have been used extensively to track tumour composition and dynamics over time [ , , , ]. scnas can be detected in cfdna without prior knowledge of the tumour scna profile, through measurement of the relative number of reads mapping within ‘bins’ spaced across the genome [ ]. relative differences in read count between bins can be sensitively detected even when the total read count is low [ , , ], meaning that scnas can be detected with a fraction of the sequencing depth required for snv detection. therefore scna profiling offers a high-throughput and cost-effective means to evaluate cfdna samples [ , , , , , ].

whilst measuring clone sizes based on the frequency of snvs is straightforward, deriving quantitative information on the proportion of the tumour population that carries a particular scna is challenging. tumour cells are not the only contributors to the cfdna pool, and an scna can in theory change the copy number to any non-negative integer value. thus the total read count per bin is a noisy compound function of the relative tumour cell contribution to the total cfdna pool and the specific copy number of the alteration.

here we present a new method to identify and track tumour subclonal evolution based solely on measurement of scnas from longitudinal cfdna samples. our algorithm, named liquidcna, firstly determines the contribution of tumour dna to the total cfdna pool (i.e.
cellularity/purity) and then uses scna data to characterise and quantify the size of the most pervasive (putatively resistant) subclone emerging or contracting over time. the efficacy of the method is demonstrated using synthetic datasets, in vitro cell line mixtures, and in vivo via longitudinal analysis of cfdna from lung cancer patients undergoing targeted treatment.

results

emergent subclone tracking from copy number information
first, we derive a mathematical definition of the problem of tracking an emergent (putatively resistant) tumour subclone from longitudinal cfdna samples, typically taken throughout the course of treatment. we consider a tumour cell population undergoing continuous evolution characterised by two cell types, ancestral tumour cells (a) and an emerging subclone (s). we assume that liquid biopsies contain dna originating from ancestral and subclonal tumour cells, as well as contaminating dna from normal cells (n). the proportion of dna arising from cells of the emergent subclone within the tumour is expressed by the subclonal-ratio, $r_i$, while the overall proportion of tumour-originating dna is termed the purity or tumour fraction of the sample, denoted by $p_i$. we consider that the copy number (cn) profile of each sample has been measured, for example using low-pass whole genome sequencing (lpwgs), and so the genome can be divided into segments, contiguous regions of constant cn.
each measured segment cn in sample i ($c_i^j$) is the combination of each cell population's cn at the jth genomic location (2 for normal cells, and $c^{(a)}_j$ and $c^{(s)}_j$ for ancestral and subclonal tumour cells, respectively), weighted by the proportions of the three populations (fig. ):

$$c_i^j = 2 + p_i\left((1 - r_i)\,c^{(a)}_j + r_i\,c^{(s)}_j - 2\right).$$

we assume that each segment can fall into one of three categories depending on its cn in ancestral and subclonal tumour cells. clonal alterations (and unaltered segments) are at the same cn in both tumour populations, and their measured cn is only affected by the purity of a sample. subclonal segments represent scnas that are unique to the emerging subclone. their measured cn is influenced by the subclonal-ratio of a sample, as well as sample purity. finally, segments that do not follow either of these patterns, due to uncertain measurements or ongoing instability, are termed unstable. our aim is to estimate the underlying purity and subclonal-ratio, $p_i$ and $r_i$, from longitudinal cn measurements of clonal and subclonal segments (fig. ).

estimation of subclonal-ratio
estimation is carried out in three steps (fig. a and methods). first, the purity of each sample is assessed using the distribution of segment cn values. we assume that the majority of segments have integer cn in all tumour cells, hence the distribution is expected to have distinct peaks at regular intervals of $p_i$, corresponding to clonal segments with cn of , , , etc. (fig. b). we derive the purity estimate as the value that minimises the squared error between observed and expected peaks (fig. c). the inferred purity values are used to correct the segment cn values, thus estimating the tumour-specific cn of each segment. liquidcna does not require a mainly diploid tumour genome (i.e. major peak at cn= ) to derive correct estimates, but will derive erroneous conclusions if the cn values
(as measured by the cn quantification software, e.g. qdnaseq [ ]) are incorrectly centred (e.g. major peak is defined as copy number , but the true value is copy number ). to control for this, we recommend an initial manual check of the cn profile prior to applying liquidcna, and renormalisation to the correct ploidy if required.

next, for every segment we compute the change in cn, Δcn, between each sample and a baseline sample that is assumed to have negligible proportions of the emerging (putatively resistant) subclone, for example a sample taken upon diagnosis or before the start of therapy. Δcn values naturally highlight subclone-associated segments altered in non-baseline samples, as these segments display markedly positive (cn gain compared to baseline) or negative (cn loss) values (fig. d). from these Δcns we then establish the set of segments that are subclonal and the sample ordering that reflects increasing subclonal proportions. to do this, we examine each possible order of samples and, according to that order, classify each segment as clonal (if the variance of its Δcns across samples is below a pre-defined threshold), subclonal (if its Δcn changes monotonically along the order of the samples, i.e. if the Δcns are consistent with an emerging subclone) or unstable (if it does not correlate with sample order) (fig. e). the order with the highest proportion of segments classified as subclonal is selected, and these subclonal segments are used for downstream computation of tumour composition (fig. f).
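the ordering and classification step described above can be sketched in python as follows. this is a minimal illustration under stated assumptions, not the liquidcna implementation: the function names, the variance threshold value and the strict monotonicity test (anchored at a baseline Δcn of zero) are hypothetical choices made for the example, and the published method may use a softer criterion.

```python
from itertools import permutations

import numpy as np


def classify_segments(delta_cn, order, var_threshold):
    """Label each segment as clonal, subclonal or unstable for one sample order.

    delta_cn: array of shape (n_segments, n_samples) holding the change in CN
              of each non-baseline sample relative to the baseline sample.
    order:    tuple of column indices giving a candidate sample order.
    """
    labels = []
    for row in delta_cn:
        # Clonal: the segment barely changes across samples.
        if np.var(row) < var_threshold:
            labels.append("clonal")
            continue
        # Assumption: prepend the baseline (delta CN = 0) so that the
        # direction of change is anchored to the baseline sample.
        path = np.concatenate(([0.0], row[list(order)]))
        steps = np.diff(path)
        if np.all(steps >= 0) or np.all(steps <= 0):
            labels.append("subclonal")  # monotone gain or loss along the order
        else:
            labels.append("unstable")   # does not follow the sample order
    return labels


def best_order(delta_cn, var_threshold=0.01):
    """Try every sample order and keep the one with most subclonal segments."""
    delta_cn = np.asarray(delta_cn, dtype=float)
    best = None
    for order in permutations(range(delta_cn.shape[1])):
        labels = classify_segments(delta_cn, order, var_threshold)
        fraction = labels.count("subclonal") / len(labels)
        if best is None or fraction > best[0]:
            best = (fraction, order, labels)
    return best


# Toy input: three non-baseline samples, four segments (gain, loss, flat, erratic).
toy_delta_cn = np.array([
    [0.2, 0.6, 0.4],
    [-0.2, -0.6, -0.4],
    [0.01, -0.01, 0.0],
    [0.5, -0.3, 0.4],
])
fraction, order, labels = best_order(toy_delta_cn)
```

because every permutation is examined, the cost grows factorially with the number of samples, which is tolerable only for the short sample series typical of longitudinal sampling.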
the methodology ensures that the dominant subclone associated with the most pervasive scnas is evaluated and that subclonal-ratio inference is robust to segments with unstable cn.

finally, we compute the relative and absolute subclonal-ratio of each sample using the identified set of subclonal segments. relative subclonal-ratios are defined as the median ratio of segment Δcns compared to the sample with the maximum subclonal proportion (fig. g). the absolute subclonal-ratio is computed based on the assumption that subclonal segment cn values correspond to distinct scnas that differ between ancestral and subclonal cells. the subclonal-ratio of sample i is therefore derived as the shared mean ($r_i$) of a mixture of gaussian distributions with constrained means $-r_i$, $+r_i$, etc., fitting the Δcn distribution of subclonal segments (fig. h). we also provide the % confidence interval of the absolute subclonal-ratio estimate based on the shared variance of the fitted gaussians (fig. i).

liquidcna outputs both relative and absolute subclonal-ratio measures, since for most applications the relative value holds sufficient information on how the subclonal (putatively resistant) population changes between time-points. relative proportions are also less susceptible to noise in the measured segment cns, while a combination of low subclonal proportion and high sequencing noise can cause the fitting of absolute subclonal-ratio estimates to fail to converge.

synthetic mixed populations
we first evaluated the performance of liquidcna using synthetic datasets where input values of subclonal proportion and purity were known. we generated synthetic datasets
with characteristics matching typical longitudinal measurements of patients. in order to simulate imperfect measurements, we added varying levels of normally distributed measurement noise (defined by the dimensionless parameter σ) to bin-wise cn values (fig. a-c and methods).

we evaluated the accuracy of the purity estimation on synthetic samples (fig. d), and found that purity p could be estimated within % of the true tumour fraction in % of samples at noise levels σ ≤ . the error on the purity estimation was greater when the noise was increased (fig. e), and was most pronounced in samples with high noise and low tumour fraction. consequently, we restricted our subsequent analysis to cases of higher purity ($p_i$ ≥ . ).

next, we derived subclonal-ratios using purity-corrected cn profiles on the higher purity subset of synthetic mixtures. we set a threshold to filter out clonal segments (see fig. e) such that at least segments were retained and the proportion of retained segments classified as subclonal was maximal following segment classification. fig. f shows the true and estimated subclonal-ratios for synthetic experiments. overall, we found that subclonal-ratio was estimated with ~ % error, and the accuracy was influenced by measurement noise (fig. g). relative subclonal-ratios (calculated compared to the sample with highest subclonal proportion) were estimated with higher accuracy (error ~ %, fig. s a-b). we found that computing absolute subclonal-ratios in a two-step process from these values yielded similar results to direct estimation by fitting a gaussian mixture model, and provided an estimate even in cases where the direct estimation did not converge (fig. s c and methods). the proportion of unstable segments, unlike noise, had little effect on the estimation accuracy (fig. s ).
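the way purity and subclonal-ratio combine into a measured segment cn, and the way noise enters a synthetic sample, can be illustrated with a short python sketch. it assumes that normal cells are diploid (cn of 2), writes the measured cn as the proportion-weighted mix of normal, ancestral and subclonal cn, and adds gaussian noise per segment rather than per bin for brevity. all function names are hypothetical and the code is not taken from liquidcna.

```python
import numpy as np

NORMAL_CN = 2.0  # assumption: contaminating normal cells are diploid


def expected_cn(purity, ratio, cn_ancestral, cn_subclonal):
    """Noise-free measured CN per segment for one sample.

    purity: tumour fraction p of the sample (0..1).
    ratio:  subclonal-ratio r within the tumour (0..1).
    """
    cn_ancestral = np.asarray(cn_ancestral, dtype=float)
    cn_subclonal = np.asarray(cn_subclonal, dtype=float)
    # CN of the tumour component: ancestral and subclonal cells mixed by r.
    tumour_cn = (1.0 - ratio) * cn_ancestral + ratio * cn_subclonal
    # Dilute the tumour signal with normal DNA according to the purity.
    return (1.0 - purity) * NORMAL_CN + purity * tumour_cn


def simulate_measured_cn(purity, ratio, cn_ancestral, cn_subclonal, sigma, rng):
    """Add normally distributed measurement noise (std sigma) to each segment."""
    clean = expected_cn(purity, ratio, cn_ancestral, cn_subclonal)
    return clean + rng.normal(0.0, sigma, size=clean.shape)


def purity_corrected_cn(measured_cn, purity):
    """Invert the dilution by normal DNA to recover the tumour-specific CN."""
    measured_cn = np.asarray(measured_cn, dtype=float)
    return (measured_cn - (1.0 - purity) * NORMAL_CN) / purity


# Toy example: one clonal segment (CN 2 in both populations) and one
# subclonal gain (CN 3 in ancestral cells, CN 4 in the subclone).
rng = np.random.default_rng(0)
clean = expected_cn(0.5, 0.4, [2, 3], [2, 4])
noisy = simulate_measured_cn(0.5, 0.4, [2, 3], [2, 4], sigma=0.05, rng=rng)
recovered = purity_corrected_cn(clean, 0.5)
```

dividing by the purity in the correction step also scales the noise by 1/p, which is one way to see why estimates degrade in low-purity samples.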
mixtures of ovarian cancer cell lines
next, we evaluated liquidcna on real data derived from in vitro mixtures of two paired high grade serous ovarian cancer (hgsoc) cell lines [ ] (see methods and table s ). hgsoc cells were ideally suited for this evaluation as high levels of chromosomal instability are a hallmark of the disease [ , ]. we anticipate that liquidcna will be most applicable for the tracking of subclonal evolution in malignancies with high cna burden [ ].

we divided a population of ovcar cells into two aliquots. the first aliquot was untreated and classified as ‘sensitive’. in a process described in detail by hoare et al. [ ], cells from the second aliquot were cultured so that they evolved resistance to platinum-containing chemotherapy and thus were termed ‘resistant’. in addition to the high scna burden inherited from the ancestral sensitive cell line, resistant cells acquired new scnas during the in vitro evolution of resistance (fig. a).

we then mixed, in varying known proportions, the genomic dna extracted from the two cell lines, with sensitive cells representing the ancestral and resistant cells the emerging subclonal population. the mixtures were further diluted with dna from blood samples of healthy volunteers assumed to have a diploid genome; this modelled the effect of normal contamination in patient samples (table s ). these dna mixtures were sequenced to mean depth . x and composite scna profiles were generated (see methods).
in addition, we generated further in silico mixtures by sampling and mixing genome-aligned reads from sequencing data from each of the three cell types sequenced individually. in these mixtures, we controlled the total number of reads per sample to study the effect of variable read depth and associated measurement noise.

first, we used liquidcna to estimate the purity of in vitro mixed samples (samples s -s ). the purity of each sample was estimated to be lower than the theoretical mixing proportion (fig. b). in the in silico mixed samples, we found that there was a strong linear relationship between estimated and true purity (fig. c). the underestimation of purity in the in vitro samples might be explained by our definition of theoretical purity in the in vitro and in silico mixing procedures (respectively defined as the proportion of dna weight versus the proportion of read counts). a highly aneuploid genome will likely have a higher weight than a diploid genome, therefore mixing of equal weights results in a higher proportion of normal genomes than expected. our purity estimates were in agreement with observed peaks of the cn distribution (fig. s a), further confirming that there was no bias in the estimation. by fitting a linear model to the estimates, the theoretical tumour fraction could be fully recovered, as illustrated by the ‘corrected’ estimates of samples s -s (fig. b). the number of reads (sequencing depth) did not systematically influence the accuracy of estimating tumour fraction, but purity estimates of samples with low tumour fraction were noisier at low read depth (fig. c). in summary, liquidcna provided an accurate estimate for purity values when true purity was above %. decreased measurement accuracy below % purity is consistent with our observations on synthetic data and is similar to reported limitations of other methods quantifying tumour fraction from lpwgs cfdna [ , , ].
therefore, for samples below % predicted purity, we advise discarding the sample from downstream analysis, although low-purity samples may be usable if a very accurate purity estimate can be derived by other means.

next, we inferred the subclonal-ratio for cell line mixtures using purity-corrected Δcn values, with sample s used as the baseline sample for both in vitro and in silico sample sets. we could correctly order cell line mixtures according to subclonal-ratios without any a priori information (fig. s b), and both absolute subclonal-ratio and relative subclonal changes were estimated on average within % and % of the true subclonal percentage (fig. d,f). in particular, we noted that samples s and s were accurately estimated as having an equal subclonal-ratio, despite originating from different biological replicates with different tumour purity, which was reflected in the small confidence intervals of their estimates. we also note that even though there were no truly unstable segments in this dataset, as measurements were not taken over time, three non-clonal segments were classified as such, probably due to higher noise in their measured cn value.

using datasets of randomly selected in silico samples with million reads, we confirmed that our algorithm could accurately infer the subclonal-ratio of samples, in particular when considering relative proportions (fig. e,g). although the estimation quality decreased with lower read counts (fig.
s ), in most cases the estimated absolute and relative subclonal-ratio was within % and % of the true subclonal proportion, respectively. furthermore, we found that cases with high estimation error were typically caused by low-purity samples, which could be easily identified and removed without a priori information, as demonstrated in fig. s .

using the known theoretical mixing values of tumour-dna content, instead of data-derived estimates, to derive purity-corrected cn values increased the estimation error, especially in low read count samples (fig. s ). this finding emphasises that non-diploid genomes might bias alternative measurement methods, and that internal consistency in the method of deriving sample characteristics (purity and subclonal-ratio) is crucial when assessing the dynamics of the subclonal population.

subclonal analysis of patient samples
we used liquidcna to analyse emergent subclones in longitudinal cfdna samples from patients with non-small cell lung cancer (nsclc) undergoing therapy, as previously reported by chen and colleagues [ ]. the liquid biopsies were collected as part of the figaro study (go , nct ), a randomised phase ii trial designed to evaluate the efficacy of pictilisib, a selective inhibitor of phosphatidylinositol kinase [ ]. pictilisib or placebo was given in combination with a standard chemotherapy regimen, which was determined based on the subtype of nsclc. blood samples were taken at baseline (day of the first treatment cycle) and at -week intervals up to the end of treatment (eot). dna was isolated from the plasma of liquid biopsies and sequenced using lpwgs to an average depth of . x, as described in detail in [ ]. chen et al. [ ] identified several scnas in eot samples that were absent at baseline and described several genes within these regions that might be associated with resistance.
we sought to apply liquidcna to these cases to corroborate their observations, and further to quantify the size of emergent subclones over time in these patients. we obtained the lpwgs data (fastq files) and performed cn profiling (see methods) on patients with cfdna samples from ≥ time-points (n = ). we identified three patients ( , and ) whose sample series fulfilled the following criteria: (i) a cfdna sample taken on the first day of therapy with purity above ∼ %; and (ii) at least two non-baseline samples with purity above ∼ %. patients and were in the experimental arm of the study, while patient was assigned to the control arm; all three patients progressed during the course of the trial.
we ran liquidcna on data from the three selected patients (discarding samples with purity below % (fig. s )) and examined the genomic segments that liquidcna identified as subclonal relative to baseline samples (fig. ). while we observed a good overlap with the cns previously reported to be associated with subclonal evolution through therapy (figures and s of [ ]), we also found a few segments that were missed or additionally identified by liquidcna. the original study focused on the comparison of pre- and post-treatment samples and highlighted scnas occurring between the first and last time-points. as our analysis put equal focus on all time-points, it classified some of the previously identified segments as unstable if the cn progression was not consistent with subclone evolution. furthermore, some segments were too small to pass our initial filtering.
on the other hand, liquidcna was able to identify subclonal segments which were at an abnormal cn in the baseline sample, and subsequently showed diploid cn or a further gain/loss in subclonal tumour cells. for example, in the samples from patient , whilst liquidcna identified subclonal scnas on chromosomes , and that overlapped with the findings of the original study, it also detected additional subclonal changes on chromosomes and . however, we did not observe the previously described focal loss on chromosome (harbouring the gene mll ), probably due to its small size.
overall, we identified , and subclone-associated scnas in patients , and , respectively. a further segments in patient were classified as non-clonal but 'unstable', as the cn over time was not consistent with the pattern defined by the emerging subclone. as samples from patient had lower purity, these inconsistent cn changes might have resulted from measurement noise.
we found that the emerging subclone accounted for to % of the tumour-derived dna in the cfdna in the three patients evaluated. patient showed evidence of a subclonal proportion consistently around %, which could be explained by samples from this patient taken at later time-points. samples from patient obtained at weeks and end of therapy contained below % dna derived from subclonal tumour cells (fig. ). patient , on the other hand, showed a contracting subclone that reduced in proportion from % presence at week to < % at the end of therapy. if the total population size were known (which might be accessible from additional measurements of the tumour-associated cfdna pool), the tumour subclone fractions established here could also be converted into growth rates to enable future predictions of the tumour dynamics.
discussion
we present liquidcna, a computational algorithm to infer longitudinal subclonal dynamics using copy number measurements. our algorithm performs simultaneous analysis of several longitudinal samples to identify sample purity, subclonal scnas and the abundance of an emerging subclone. liquidcna distinguishes between scnas that are associated with the emerging subclone and those showing unstable behaviour, and consequently is not confounded by uncertain cn measurements.
we validate our method both on synthetic scna datasets, and on in vitro and in silico mixtures of two ovarian cancer cell lines. we successfully infer the proportion of the dominant subclone in all of the above datasets, with good accuracy across a range of sample qualities defined by the noise level or the number of sequenced reads. in patients with lung cancer, liquidcna applied to lpwgs data derived from longitudinal liquid biopsies (cfdna) shows the emergence of subclones during therapy and identifies genomic regions associated with the emergent tumour cells. we demonstrate that liquidcna can identify and quantify emerging subclones from cfdna samples, thereby enabling tracking of tumour subclone evolution through the course of therapy.
deciphering the evolutionary trajectory of cancer can aid prognostic and therapeutic decision-making and further our understanding of therapy-induced drug resistance [ ]. measuring the dynamics of tumour composition is particularly crucial for prospective monitoring during an adaptive therapy regime aiming to control resistant subclones [ , , ].
furthermore, the proportion of cfdna that is tumour-derived (what we term 'purity') is in itself a promising biomarker for determining initial therapy response and prognosis [ , ], as well as for tracking tumour progression during and after therapy [ , , , ].
we note that there are limitations in our liquidcna method. since our inference relies on heterogeneous copy number profiles and subclone-specific scnas, we cannot analyse cancer (sub)types with very low chromosomal instability, for example microsatellite-unstable tumours. conversely, extremely high levels of ongoing instability might bias our analysis due to the lack of a stable subclone-associated scna profile, and therefore liquidcna is not suitable for oligo-metastatic disease if spatially separate metastases carry distinct karyotypes. furthermore, the accuracy of our estimation reduces at low purity (below %). however, tumour fractions above this level have been observed in a substantial number of patients, especially in late-stage disease where liquidcna can offer the largest benefit [ , , , , , ]. in addition, recent studies have shown that the unique fragment length of tumour-derived cfdna can be utilised to enrich for tumour purity either experimentally or bioinformatically [ , , ]. finally, liquidcna tracks a single dominant subclone associated with the largest set of subclone-specific scnas, and if there are multiple smaller subclones (with fewer or no associated scnas), these will be ignored by the algorithm.
in summary, we provide a robust tool to derive quantitative information about dynamic changes in clonal composition from scna measurements derived from cfdna. liquidcna enables real-time non-invasive tracking of subclonal tumour evolution, which can provide new insights into the evolution of scnas and the dynamical emergence of therapy-associated resistance.
acknowledgements
we thank ann-marie baker for reviewing the clarity of the text, and steve gendreau and craig cummings from genentech, inc. for providing access to patient cfdna sequencing results and for their critical comments on the presentation of the data. this work was supported by the wellcome trust (grant /z/ /z to t.a.g.) and cancer research uk (grant a to t.a.g. supporting e.l.; advanced clinician scientist fellowship c /a to m.l.; clinical research training fellowship to h.h.). m.l. also received support from a barts and the london charity strategic research grant ( / ). t.a.g. also received funding from the national institutes of health, national cancer institute (grant u ca ).
author contributions
e.l., w.h., m.l. and t.a.g. conceived and designed the study. m.l. and t.a.g. acquired funding for the study. e.l. developed the inference method and performed bioinformatic analysis. h.h. and m.m. performed in vivo experiments and sequencing. e.l. and t.a.g. wrote the original draft, and all authors reviewed and approved the manuscript.
competing interests
the authors declare no competing interest.
[figure panels: copy number of ancestral/sensitive and subclonal/resistant tumour cells, tumour copy number of the tumour cell mixture, and measured copy number, each plotted per genomic segment for every sample; segment classes: clonal, subclonal, unstable; pie charts show purity (pi), subclonal-ratio (ri) and normal contamination]
figure : schematic of copy number measurements. the first panel shows the scna profile of ancestral (in yellow) and subclonal (in red) tumour cells. at different sampling time-points, the overall tumour scna profile is a mixture of these profiles (second panel), influenced by the composition of tumour-derived dna depicted on the pie-charts. clonal, subclonal and unstable segments are indicated in yellow, red and blue, respectively. note that the cn of clonal segments remains the same. in the liquid biopsies taken at each time-point, contamination from normal cells leads to 'flattened' measured scna profiles (last panel) due to normal cells having a neutral karyotype. this contamination affects the cn of each segment. our aim is to estimate purity (pi) and subclonal-ratio (ri) based on clonal and subclonal scnas.
[figure panels a-i: (a) flow chart: segment cn distribution, purity, purity-corrected segment cns, baseline sample, segment classification, maximal relative subclonal-ratio, subclonal-ratio; plot axes: segment cn vs density; purity estimate vs error of fit; Δcn and segment cn vs ordered samples (clonal/normal, unstable, subclonal); sample order vs score; subclonal-ratio compared to the maximal sample; Δcn vs number of segments; subclonal-ratio per sample]
figure : illustration of the estimation algorithm. (a) outline of the steps of the estimation algorithm. (b) purity estimation based on the peaks of the distribution of segment cns. green lines show the peaks expected at an example purity of . . (c) the error of a range of purity estimates, computed from the distance of observed and estimated peaks in (b). each line corresponds to a smoothing kernel applied to the raw segment cn distribution. the optimal purity is indicated with an arrow. (d) change in segment cn values (Δcns) plotted according to an example sample order. the number of subclonal segments computed in (e) is indicated below. (e) classification of segments based on the sample order in (d). segments with low variance are classified as clonal (in grey). non-clonal segments are evaluated on whether they follow a quasi-monotone pattern (indicated by the shaded regions) and classified as unstable (outside of shaded region, in blue) or subclonal (in red). (f) Δcn values plotted according to the optimal sample order maximising subclonal segments. line colours indicate the class of each segment as in (e). (g) relative subclonal-ratio estimation compared to the maximal subclonal-ratio sample (right-most in (f)). points show individual segment-wise estimates, with an example segment highlighted in black. black line shows the median.
(h-i) subclonal-ratios and confidence intervals inferred by fitting a gaussian mixture model to the Δcn distribution of subclonal segments. the components of the best fit with means −r and r are shown in green and magenta in (h).
[figure panels a-g: sampling parameters (ranges of r and p, ancestral and subclonal cn states, number of segments); an example sample with its p and r; estimated purity vs true purity; error in purity estimation vs noise level (sigma); estimated subclonal-ratio vs true subclonal-ratio; error in subclonal-ratio estimation vs noise level (sigma)]
figure : estimation of mixtures of synthetic cell populations. (a) parameters used to randomly sample synthetic datasets including simulated measurement noise. the font-size of copy number states indicates their probability. (b) a randomly generated sample. the heatmap depicts the distribution of segment cns in ancestral and subclonal cells, and the proportion of cell populations is shown on the pie-chart (red: subclonal, yellow: ancestral, grey: normal). (c) copy number profile of the sample in (b), with raw bin-wise and segmented copy number values shown in black and red, respectively. (d) estimated purity of , synthetic samples with varying levels of noise (σ), plotted against the true theoretical purity. the y = x line is indicated with dashes. (e) error of purity estimation (absolute difference to true purity) for samples with noise level indicated on the x axis.
(f) true and estimated subclonal-ratio of synthetic datasets ( , samples) with varying levels of noise (σ). (g) error in subclonal-ratio estimation for datasets with increasing noise level. box-plot elements in (e) and (g) stand for: center line, median; box limits, upper and lower quartiles; whiskers, . × interquartile range; points, outliers.
[figure panels a-g: copy number profiles of the ancestral/sensitive cell line (b ) and the subclonal/resistant cell line (b ); estimated purity vs true purity; subclonal-ratio per sample; estimated subclonal-ratio vs true subclonal-ratio]
figure : estimation of mixtures of high grade serous ovarian cancer cell lines. (a) copy number profile of the ancestral/sensitive and subclonal/resistant hgsoc cell lines. raw bin-wise and segmented copy number values are shown in black and red, respectively. resistant-specific subclonal scnas are highlighted. (b) purity estimates of samples s - s . corrected values are computed using the linear fit in (c). theoretical purity values are indicated by maroon diamonds. (c) true (theoretical) and estimated tumour purity of in silico hgsoc cell line mixtures. y = x and the linear fit of the estimates (y = . x) are shown with dashed and solid lines, respectively. point shape and shade indicate total number of reads per sample. (d) subclonal-ratio estimates for samples s -s . shaded and empty bars indicate estimates derived using direct (gaussian fit) and two-step (from relative ratios in (f)) methods, respectively.
error bars show the % confidence interval of the direct estimate; maroon diamonds indicate theoretical values. (e) true and estimated subclonal-ratio of in silico datasets constructed of samples from (c) with million reads. (f) relative subclonal-ratio estimates for samples s -s , compared to s . estimates from each subclonal segment are shown with dots, the median estimates are indicated by black lines, and true values with maroon diamonds. (g) true and estimated relative subclonal-ratio in the datasets shown in (e).
[figure panels a-c, one per patient: copy number by chromosome at baseline, intermediate weeks and end of therapy, marking baseline scnas and subclone-associated scnas, with the percentage of subclonal cells in each sample]
figure : estimation in cfdna samples from patient data. subclone-specific copy number changes and subclonal-ratio in lung cancer patients (a) , (b) , and (c) from [ ]. left: purity-corrected scna profiles. yellow bars show the cn of each segment in the baseline sample, and red bars indicate subclonal deviations from this value in non-baseline samples. regions of subclone-specific cnas are also indicated by darker shades. right: estimated resistant proportion of each sample with % confidence intervals. note that only samples with > % purity were analysed (cf. s ). a bar of cn= on chromosome (indicated by asterisk) has been omitted from (c) for better visualisation.
methods
formal definition of the problem
copy number measurements
we consider a tumour that consists of two distinct cell populations, ancestral (a) and subclonal (s) tumour cells, and continuously sheds cell-free dna (cfdna) into the blood circulation. a typical scenario would be ancestral cells representing drug-sensitive tumour cells present before cancer therapy, and subclonal cells denoting the emerging subclone with resistance to therapy. the proportion of dna originating from these two cell types changes over time as we take measurements via blood samples (fig. ).
since cell-free dna found in blood can also originate from normal (non-tumour) cells of the body, the measured dna is contributed by a mixture of the two tumour cell populations (a and s) and normal cells (n). at each time-point $i$, the proportion of these three populations in the measured sample, $S_i$, depends on the proportion of all tumour-derived dna (the purity of the sample, $p_i$) and the proportion of subclone-derived dna from the tumour (the subclonal-ratio, $r_i$):
$$n_i = 1 - p_i; \quad a_i = p_i \cdot (1 - r_i); \quad s_i = p_i \cdot r_i. \qquad ( )$$
our aim is to track the dynamics of the subclonal (putatively resistant) population by determining the subclonal-ratio for each time-point, $r_i$, or the change in subclonal-ratio between time-points, $r_i / r_k = r_{ik}$. to this end, we use the copy number values as typically measured by lpwgs of the sequential cfdna samples.
let us consider distinct genomic regions with a homogeneous copy number state, termed segments. we assume that the copy number (cn) state of most segments stays constant over time in a particular population.
therefore the $j$th segment is characterised by a set of three time-independent absolute cn states, $c(n)^j$, $c(a)^j$, $c(s)^j$, corresponding to the local cn in normal, ancestral and subclonal cells, respectively. the copy number of segment $j$ as measured in the $i$th sample, $c_i^j$, is the combination of these three absolute cns, weighted by the proportions of dna derived from the three cell populations at that time-point ($n_i$, $a_i$, $s_i$). we know that normal cells are at a diploid state, hence $c(n)^j = 2$ for all $j$. therefore, using the purity and subclonal-ratio defined in eq. ( ),
$$c_i^j = 2 + p_i \left( (1 - r_i)\, c(a)^j + r_i\, c(s)^j - 2 \right). \qquad ( )$$
since all cells in a cell population share the absolute cn for a given segment, the values $c(s)^j$ and $c(a)^j$ are always integers. therefore, in theory, measured cns from a given sample should be limited to a discrete set of values defined by these integer states, making it possible to solve the set of equations formed by eq. ( ) for $p_i$ and $r_i$ using linear algebra. however, we have to take into account that all real sequencing measurements have a level of imprecision, introducing variation on top of this relationship. using the term $\sigma_{ij}$ to represent the noise in the $i$th measurement of segment $j$, eq. ( ) becomes
$$c_i^j = 2 + p_i \left( (1 - r_i)\, c(a)^j + r_i\, c(s)^j - 2 \right) + \sigma_{ij}, \qquad ( )$$
with the magnitude and family of this noise depending on the specifics of the technology used for cn measurement, especially the sequencing depth [ ]. this measurement noise, which is associated with a continuous distribution, broadens the set of $c_i^j$ values, rendering a linear algebra solution impossible.
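The forward model above can be sketched in a few lines of Python. This is a minimal illustration rather than the liquidCNA implementation: the function name `measured_cn`, its arguments, and the choice of independent Gaussian noise with a single standard deviation `sigma` are assumptions made for the example.

```python
import numpy as np

def measured_cn(purity, subclonal_ratio, cn_ancestral, cn_subclonal,
                sigma=0.0, rng=None):
    """Measured copy number of each segment in one sample.

    Hypothetical helper: mixes diploid normal cells with ancestral and
    subclonal tumour cells, then optionally adds Gaussian noise.
    """
    cn_a = np.asarray(cn_ancestral, dtype=float)
    cn_s = np.asarray(cn_subclonal, dtype=float)
    # Tumour-wide CN: ancestral and subclonal states weighted by the subclonal-ratio.
    tumour_cn = (1.0 - subclonal_ratio) * cn_a + subclonal_ratio * cn_s
    # Normal cells are diploid (CN = 2), so contamination pulls values towards 2.
    expected = 2.0 + purity * (tumour_cn - 2.0)
    if sigma > 0:
        rng = np.random.default_rng() if rng is None else rng
        # Assumption of this sketch: zero-mean Gaussian noise, same sd everywhere.
        expected = expected + rng.normal(0.0, sigma, size=expected.shape)
    return expected

# Three segments: neutral, a clonal gain, and a subclonal change from 1 to 3 copies.
noise_free = measured_cn(0.5, 0.4, [2, 3, 1], [2, 3, 3])
```

With `sigma=0` the output takes only the discrete values implied by the integer states; any positive `sigma` spreads these values out, which is why the parameters cannot simply be solved for exactly.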
hence, our aim becomes to derive an inference of $p_i$ and $r_i$ despite this unknown noise, $\sigma_{ij}$.
segment classification
each segment falls into one of three categories, depending on its copy number states in the two types of tumour cells.
(i) clonal segments have the same absolute cn in ancestral and subclonal tumour cells, $c(a)^j = c(s)^j$. a special case of clonal segments are segments of neutral cn, where $c(a)^j = c(s)^j = 2$.
(ii) subclonal segments have different absolute cns in the ancestral and subclonal tumour population, $c(a)^j \neq c(s)^j$. these segments represent scnas that distinguish the subclone from its ancestor, even though they are not necessarily associated with a selective/phenotypic difference (e.g. drug-resistance) directly.
(iii) unstable segments are neither clonal nor associated with the emergent subclone, and therefore are best described by a time-dependent tumour-wide cn value, $\zeta(t)_i^j$, that does not depend on $r_i$. these segments can arise if a genomic region cannot be measured reliably or if on-going genomic instability introduces novel scnas during the time tracked by our samples. we can assume that the number of such segments is small compared to the total number of measured segments.
depending on whether segments are clonal, subclonal or unstable, their measured cn across samples will change according to the subclonal-ratio and purity of each sample. for simplicity, we omit the term $\sigma_{ij}$ and its derivatives, but the reader should keep in mind that all equations are subject to measurement noise:
$$c_i^j = 2 + p_i \left( c(a)^j - 2 \right), \quad \text{if the segment is clonal,} \qquad ( )$$
$$c_i^j = 2 + p_i \left( c(a)^j - 2 + r_i \left( c(s)^j - c(a)^j \right) \right), \quad \text{if the segment is subclonal,} \qquad ( )$$
$$c_i^j = 2 + p_i \left( \zeta(t)_i^j - 2 \right), \quad \text{if the segment is unstable.} \qquad ( )$$
figure illustrates how the measured cn of segments depends on the parameters $r_i$ and $p_i$ highlighted above. in the following sections, we use eqs. ( ) & ( ) to estimate the underlying parameters, $p_i$ and $r_i$, via three steps (fig. ).
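As a check on the clonal and subclonal cases, the short derivation below shows how both follow from the general noise-free mixture equation. The layout and the numerical values in the example are ours, chosen only for illustration.

```latex
\begin{align*}
c_i^j &= 2 + p_i\left[(1 - r_i)\,c(a)^j + r_i\,c(s)^j - 2\right] \\
      &= 2 + p_i\left[c(a)^j - 2 + r_i\left(c(s)^j - c(a)^j\right)\right]
         && \text{(subclonal case)} \\
      &= 2 + p_i\left(c(a)^j - 2\right)
         && \text{when } c(s)^j = c(a)^j \text{ (clonal case)}
\end{align*}
```

For instance, with illustrative values $p_i = 0.5$, $r_i = 0.4$, $c(a)^j = 3$ and $c(s)^j = 4$, the subclonal form gives $2 + 0.5 \cdot (3 - 2 + 0.4 \cdot 1) = 2.7$, whereas a clonal segment with $c(a)^j = c(s)^j = 3$ would be measured at $2.5$ regardless of $r_i$.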
estimation algorithm
purity estimation
purity estimation is carried out based on clonal (including neutral) segments. in general, we expect the majority of segments to fall into this category. consequently, for the majority of segments the measured copy number follows eq. ( ). since $c(a)^j$ can take only integer values, the distribution of segment cns is expected to have distinct peaks at regular intervals of $p_i$.
using a peak-finder algorithm on the smoothed distribution of measured cn values, we directly compare the observed peaks to the values expected at a given purity, $\{2 - p_i,\ 2,\ 2 + p_i,\ 2 + 2p_i,\ \ldots\}$, as shown in fig. b. the error of the fit to a purity, $p_i$, is evaluated as the summed squared distance between each expected peak and the closest observed peak,
$$\sum_{c(a)} \min \left( 2 + p_i \left( c(a) - 2 \right) - \text{peaks} \right)^2. \qquad ( )$$
as the detected peaks of the data depend on the smoothing kernel used on the distribution, we perform this computation for a wide range of smoothing bandwidths ( . × – . × the default value) and derive the purity estimate, $\hat{p}_i$, as the value that minimises the mean and/or median error across the range (fig. c).
then, we use the derived $\hat{p}_i$ to re-normalise the measured copy number values and thus eliminate normal contamination. we gain an estimate of the tumour-specific cn, $c(t)_i^j$, a mixture of ancestral and subclonal cns:
$$\hat{c}(t)_i^j = \frac{1}{\hat{p}_i} \left( c_i^j - 2 \right) + 2 \approx c(a)^j + r_i \left( c(s)^j - c(a)^j \right). \qquad ( )$$
note that, due to the noise in measurements, peaks from close absolute cns can become indistinguishable in low-purity samples.
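The peak-matching idea can be sketched as follows. This is an illustrative NumPy-only version under several assumptions that are not taken from the text: a hand-rolled Gaussian kernel density and local-maximum peak finder, ancestral states limited to 1 to 4 copies, three arbitrary bandwidths, and a candidate grid starting at 0.1 (the one-sided error shrinks trivially as the expected peaks collapse onto cn = 2). All function names are hypothetical.

```python
import numpy as np

def find_peaks_1d(values, bandwidth, grid=None):
    """Locate peaks of a Gaussian-smoothed distribution of segment CNs."""
    grid = np.linspace(0.5, 4.5, 801) if grid is None else grid
    z = (grid[:, None] - values[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).sum(axis=1)
    mid = density[1:-1]
    # Local maxima that reach at least 5% of the tallest peak (arbitrary cut-off).
    is_peak = (mid > density[:-2]) & (mid >= density[2:]) & (mid > 0.05 * density.max())
    return grid[1:-1][is_peak]

def purity_error(purity, peaks, cn_states=(1, 2, 3, 4)):
    """Summed squared distance from each expected peak to the closest observed peak."""
    expected = 2.0 + purity * (np.asarray(cn_states, dtype=float) - 2.0)
    nearest = np.abs(expected[:, None] - peaks[None, :]).min(axis=1)
    return float((nearest ** 2).sum())

def estimate_purity(segment_cn, candidates=None, bandwidths=(0.02, 0.03, 0.04)):
    """Pick the candidate purity with the lowest mean error across bandwidths."""
    segment_cn = np.asarray(segment_cn, dtype=float)
    candidates = np.linspace(0.1, 1.0, 91) if candidates is None else candidates
    errors = np.zeros((len(bandwidths), len(candidates)))
    for b, bw in enumerate(bandwidths):
        peaks = find_peaks_1d(segment_cn, bw)
        errors[b] = [purity_error(p, peaks) for p in candidates]
    return float(candidates[errors.mean(axis=0).argmin()])

# Simulated clonal segments at a true purity of 0.4 with small measurement noise.
rng = np.random.default_rng(0)
states = np.repeat([1, 2, 3, 4], [40, 100, 40, 20])
example_cn = 2.0 + 0.4 * (states - 2) + rng.normal(0.0, 0.01, states.size)
example_purity = estimate_purity(example_cn)
```

Averaging the error over several bandwidths mirrors the idea of not trusting a single smoothing kernel; a median could be substituted for the mean in `estimate_purity` without other changes.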
because of this peak overlap, we expect purity values below % to be indistinguishable (unless high sequencing depth is available), and we also advise discarding samples with low purity (typically $p_i$ < . ), as erroneous purity estimates can bias downstream computation.
identifying subclonal segments and sample order
next, we aim to identify the subset of segments with subclone-specific scnas that reflect the changes in subclonal-ratio over time. to easily assess the change in segment cns, we designate a sample as baseline, and compute the change in segment cn, Δcn, between each sample and this baseline sample. typically, the sample taken upon diagnosis or before start of therapy (usually the first time-point, s ) can be used. we can assume that this sample has no or only a negligible population of the emerging subclone, and therefore represents a pure ancestral population: $r_{\text{baseline}} \approx 0 \rightarrow c(t)_{\text{baseline}}^j \approx c(a)^j$. hence the change in cn of a subclonal segment compared to the baseline becomes
$$\Delta c(t)_i^j = c(t)_i^j - c(t)_{\text{baseline}}^j = r_i \left( c(s)^j - c(a)^j \right). \qquad ( )$$
furthermore, eq. ( ) provides an informative quantity even if the baseline sample is not pure, as $\Delta c(t)_i^j$ nonetheless describes the change in subclone-specific scnas.
in order to uncover which segments are truly subclonal, and how the subclonal-ratio changes over measurements, we need to identify a pervasive pattern across samples, and the subset of segments that consistently follows it. if the samples were taken so that the subclonal population increases over time-points, this pattern would be a monotone increase or decrease for all segments with subclone-specific scnas. while we cannot assume that the samples are taken in order of increasing subclonal proportions (e.g. a change of treatment between sampling times might lead to fluctuating population size in a resistance-associated subclone), we can aim to re-arrange them to follow this rule. consequently, we rephrase our aim as deriving (i) a set of subclonal segments that follow a monotone pattern across ordered samples; and (ii) an ordering of samples that is followed by the maximum number of (subclonal) segments.
formally, we are looking for a subset of segments, $\{j_1, j_2, \ldots\}$, and a permutation of samples (starting from the designated baseline sample), $S_{\text{baseline}}, S_i, \ldots, S_n$, where for every segment $j \in \{j_1, j_2, \ldots\}$ either
$$\Delta c(t)_{i+1}^j - \Delta c(t)_i^j > -\epsilon \ \ \forall i \quad \text{or} \quad \Delta c(t)_{i+1}^j - \Delta c(t)_i^j < \epsilon \ \ \forall i \qquad ( )$$
holds for a pre-defined accuracy level, $\epsilon$. we use an $\epsilon > 0$ accuracy level to allow for samples with near-equal subclonal-ratio measured with uncertainty. we find that, for typical lpwgs datasets, $\epsilon \approx$ . – . works well to account for the underlying measurement noise.
figs. d-f illustrate the derivation of the optimal sample order and subclonal segment set. we first separate clonal segments: since these have relative cn values of 0, apart from some measurement noise, we filter out any segment that has a standard deviation below a pre-defined threshold. we then evaluate eq. ( ) over all remaining segments and over all orderings of the samples. as we expect - time-points per dataset, an exhaustive search of all possible permutations is feasible. given a permutation, each segment is classified according to whether it follows eq. ( ) (candidate subclone-specific segments) or not (unstable segments) (fig. e). the optimal sample order is defined as the permutation that maximises the number of subclonal segments (fig. f).
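The exhaustive search over sample orders can be sketched as below. This is a simplified illustration, not the liquidCNA code: the function names, the default `eps` and `clonal_sd` thresholds, and the convention that the input matrix holds Δcn values relative to the baseline (with the baseline column included as zeros) are all assumptions of the example.

```python
from itertools import permutations

import numpy as np

def follows_monotone_pattern(delta_cn, order, eps):
    """True for segments whose ΔCN is quasi-monotone along the given sample order."""
    steps = np.diff(delta_cn[:, list(order)], axis=1)
    increasing = (steps > -eps).all(axis=1)
    decreasing = (steps < eps).all(axis=1)
    return increasing | decreasing

def best_sample_order(delta_cn, baseline=0, eps=0.05, clonal_sd=0.05):
    """Try every permutation that starts at the baseline; keep the best-scoring one."""
    delta_cn = np.asarray(delta_cn, dtype=float)
    # Clonal segments barely change across samples, so filter them by low spread.
    non_clonal = delta_cn.std(axis=1) >= clonal_sd
    candidates = delta_cn[non_clonal]
    others = [i for i in range(delta_cn.shape[1]) if i != baseline]
    best_order, best_score = None, -1
    for perm in permutations(others):
        order = (baseline, *perm)
        score = int(follows_monotone_pattern(candidates, order, eps).sum())
        if score > best_score:
            best_order, best_score = order, score
    return best_order, best_score, non_clonal

# Toy data: 3 subclonal segments, 1 unstable segment, 3 clonal segments, 4 samples.
ratios = np.array([0.0, 0.3, 0.1, 0.5])            # subclonal-ratio of each sample
subclonal = np.outer([1.0, -1.0, 2.0], ratios)     # ΔCN = r_i * (c(s) - c(a))
unstable = np.array([[0.0, 0.4, -0.3, 0.1]])
toy = np.vstack([subclonal, unstable, np.zeros((3, 4))])
toy_order, toy_score, toy_non_clonal = best_sample_order(toy)
```

In this toy case the search recovers the order of increasing subclonal-ratio, and the unstable segment is the only non-clonal segment that fails the monotone test. The cost grows factorially with the number of samples, which is acceptable only for the small number of time-points expected per dataset.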
subclonal-ratio estimation
finally, we use the set of segments identified as subclonal, and compute the subclonal-ratio of each time-point. we derive the (absolute) subclonal-ratio, $r_i$, for each sample using eq. ( ). as both $c(a)^j$ and $c(s)^j$ are assumed to be integers, and we know that $c(a)^j \neq c(s)^j$,
$$\Delta c(t)_i^j \in \{\ldots, -2r_i,\ -r_i,\ r_i,\ 2r_i,\ \ldots\}, \quad j \in \{j_1, j_2, \ldots\}. \qquad ( )$$
to take into account that the measured Δcns compared to the baseline, $\Delta \hat{c}(t)_i^j$, are influenced by noise, we fit these values with a mixture of gaussian distributions where the means of the gaussians follow eq. ( ), as illustrated in fig. h. the subclonal-ratio of a sample is derived as the constrained mean parameter, $r_i$, of the gaussian mixture optimising the fit (fig. i). the % confidence interval of the inferred subclonal-ratio is computed based on the (shared) variance of the fitted constrained gaussians.
the measurement noise propagated from segment cns can lead to a high spread in values, making estimates less robust and rendering the resolution of low subclonal-ratios ($r_i \leq$ . ) challenging, occasionally causing the gaussian-fitting step to fail. therefore we also derive relative subclonal-ratios, which allow for a more general application not limited to good quality samples. in particular, relative values are compared to the maximal sample since its subclonal-ratio is assumed to be the most robust against measurement noise.
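The direct estimate can be illustrated with a deliberately simple stand-in for the constrained mixture fit: a maximum-likelihood grid search over $r$ and a shared standard deviation, with component means fixed at integer multiples of $r$. The equal component weights, the restriction to multiples of ±1 and ±2, the grid ranges and the function name are assumptions of this sketch, and the fitting procedure used by liquidCNA may well differ.

```python
import numpy as np

def fit_subclonal_ratio(delta_cn, multiples=(-2, -1, 1, 2),
                        r_grid=None, sigma_grid=None):
    """Grid-search ML fit of a Gaussian mixture with means k*r and shared sigma."""
    x = np.asarray(delta_cn, dtype=float)
    k = np.asarray(multiples, dtype=float)
    r_grid = np.linspace(0.02, 1.0, 99) if r_grid is None else np.asarray(r_grid)
    sigma_grid = np.linspace(0.01, 0.2, 20) if sigma_grid is None else np.asarray(sigma_grid)
    best_ll, best_r, best_sigma = -np.inf, None, None
    for r in r_grid:
        resid = x[:, None] - k[None, :] * r        # shape: (points, components)
        for sigma in sigma_grid:
            log_comp = -0.5 * (resid / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
            # Log-density of the equal-weight mixture, computed stably.
            top = log_comp.max(axis=1, keepdims=True)
            log_mix = top[:, 0] + np.log(np.exp(log_comp - top).mean(axis=1))
            ll = log_mix.sum()
            if ll > best_ll:
                best_ll, best_r, best_sigma = ll, float(r), float(sigma)
    return best_r, best_sigma

# Toy ΔCN values of subclonal segments in a sample with a true subclonal-ratio of 0.3.
rng = np.random.default_rng(1)
toy_delta = np.tile([-2.0, -1.0, 1.0, 2.0], 10) * 0.3 + rng.normal(0.0, 0.03, 40)
r_hat, sigma_hat = fit_subclonal_ratio(toy_delta)
```

The fitted shared `sigma_hat` is the quantity from which a confidence interval for `r_hat` could be derived. Note that if all observed Δcn values sat at ±r only, a fit at r/2 would explain them equally well, which is one reason the set of allowed multiples matters.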
we compute the relative deviation of each normalised subclonal tumour segment cn,

$$\Delta c_{j,in} = \frac{\Delta c^{(t)}_{j,i}}{\Delta c^{(t)}_{j,n}} = \frac{r_i\,(c^{(s)}_j - c^{(a)}_j)}{r_n\,(c^{(s)}_j - c^{(a)}_j)} = \frac{r_i}{r_n}, \qquad ( )$$

giving rise to a distribution of relative subclonal-ratio estimates (fig. g). we derive a point estimate for the relative $r_i$ of each sample as the median of this set,

$$\hat{r}_{in} = \operatorname{median}\left(\Delta c_{j,in}\right), \quad j \in \{j_1, j_2, \dots\}. \qquad ( )$$

absolute subclonal-ratio estimates can then be derived using these relative estimates in a two-step estimation process (as opposed to the direct estimation above): we derive $r_n$ based on eq. ( ), and subsequently compute $r_{in} \cdot r_n$ to retrieve $r_i$.

generating synthetic datasets

we constructed synthetic datasets of segments (of length varying between and bins) and time-points as illustrated in fig. a. for each segment, we generated sensitive segment copy number states ($c^{(a)}_j$) by randomly sampling from { , , , , }, with neutral and close-to-neutral states occurring with higher frequency. subclone-specific absolute cns ($c^{(s)}_j$) were assigned by randomly sampling from $c^{(a)}_j$ + {− , − , , , }, with no change (giving rise to clonal segments) having a higher weight. for each sample, $s_i$, we assigned purity and subclonal-ratio randomly from the ranges . < $p_i$ < . and . < $r_i$ < . , with the exception of the baseline samples, where r < . . we then recreated the measurement procedure of computing noise-ridden raw cn values in a given segment, j, by adding normally distributed noise. the magnitude (standard deviation) of the noise was controlled by the noise level parameter, $\sigma$ (representing differences arising from e.g. sequencing depth), and the cn of the segment (reflecting higher variance in higher cn states):

$$\mathrm{raw}c^{\mathrm{bin}}_{i} = 2 + p_i\left((1 - r_i)\,c^{(a)}_j + r_i\,c^{(s)}_j - 2\right) + \mathrm{normal}\left(0, f(\sigma, c_{j,i})\right).$$

the final cn value of each segment, $\hat{c}_{j,i}$, was computed as the mean of all $\mathrm{raw}c^{\mathrm{bin}}_{i}$ contained in the segment. in addition, we selected .
- % of segments as unstable, and re-sampled their tumour-specific cn value to be independent of $r_i$. fig. b-c show parameters of a synthetic sample and its copy number profile.

generating in vitro and in silico cell line mixtures

hgsoc cell line ovcar was obtained from prof fran balkwill (barts cancer institute, uk) and grown in dmem media containing % fbs and % penicillin/streptomycin. a resistant/subclonal hgsoc cell line (ov cis) was generated by culturing an aliquot of the ancestral ovcar cell line in increasing concentrations of cisplatin. for further details on cell culture and the cell lines, see [ ]. we then extracted genomic dna from both cell lines and from blood samples from healthy volunteers using qiaamp dna micro kit (qiagen, hilden, germany). genomic dna from the three sources was mixed in varying proportions (table s ), measured as the mass of dna input from each source, to a total of ng dna per sample and subjected to sonication with the covaris m system. libraries were prepared using the nebnext ultra ii kit (new england biolabs, hitchin, united kingdom) with cycles of pcr amplification, indexed with unique dual indexing primers and sequenced on illumina novaseq to a mean depth of . x.

in silico mixtures were generated by bioinformatically mixing sequencing reads of dna derived from the ancestral/sensitive and subclonal/resistant tumour cell lines and healthy blood cells. similarly to synthetic samples, for each in silico sample we randomly assigned purity, . < $p_i$ < . , and subclonal-ratio, . < $r_i$ < . .
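the purity and subclonal-ratio assigned to an in silico sample fix the share of reads that each source contributes: a fraction p of the reads is tumour-derived, and a fraction r of those tumour reads is subclonal. the sketch below turns these two numbers into per-source subsampling fractions of the kind one would pass to a read subsampler. the function names, the dictionary keys and the read counts in the usage example are hypothetical and only serve the illustration.

```python
def mixing_proportions(purity, subclonal_ratio):
    """Share of reads drawn from each pure source for one mixed sample."""
    if not (0.0 <= purity <= 1.0 and 0.0 <= subclonal_ratio <= 1.0):
        raise ValueError("purity and subclonal_ratio must lie in [0, 1]")
    return {
        "ancestral": purity * (1.0 - subclonal_ratio),
        "subclonal": purity * subclonal_ratio,
        "normal": 1.0 - purity,
    }


def subsampling_fractions(proportions, total_reads, available_reads):
    """Fraction of each pure read set to keep to reach total_reads overall.

    available_reads: number of reads in each pure source (assumed known).
    """
    fractions = {}
    for source, share in proportions.items():
        frac = share * total_reads / available_reads[source]
        if frac > 1.0:
            raise ValueError(f"not enough reads available for {source}")
        fractions[source] = frac
    return fractions
```

for example, a sample with purity 0.8 and subclonal-ratio 0.25 draws 60% of its reads from the ancestral line, 20% from the subclonal line and 20% from normal cells; varying `total_reads` then mimics different sequencing depths.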
we then sampled reads (using samtools view -s) from aligned read (bam) files of ‘pure’ ancestral, subclonal and normal samples (b , b and n ) in proportions to match $p_i(1 - r_i)$, $p_i r_i$ and $1 - p_i$, respectively. we also varied the total number of reads per sample (as a proxy for sequencing depth and consequently measurement noise), and generated - samples with , , , and million total reads each.

processing lpwgs samples

fastq files derived from lpwgs samples (generated via sequencing cell line mixtures or obtained from [ ]) were aligned to the human reference genome (version hg , using bwa). we then processed bam files using the qdnaseq r package [ ] employing dnacopy for segmentation [ ]. qdnaseq produced two copy number values for each genomic bin: a raw pre-segmentation value and a segmented value grouping bins of equal cn together. the cn of bins on the pre-defined blacklist of qdnaseq and of those with < % mappability was set to na. raw and segmented cn values for all cell line samples are available from https://github.com/elakatos/liquidcna_data.

since qdnaseq returns normalised cn values (with neutral state at ), we multiplied all values by before proceeding with the estimation algorithm and re-normalised segment cn values to be centred at exactly. we then re-defined segment boundaries using the ensemble of samples as regions of constant cn in all samples. this way break-points present in only a sub-set of samples (such as a subclone-specific scna) gave rise to segments handled separately for all samples. updated segments with length below megabases ( bins of kb (cell line mixtures) or bins of kb (patient cfdna samples))
were excluded from the downstream analysis to filter out short segments sensitive to localised measurement biases. finally, we curated each segment cn by discarding bins with the most extreme . % of raw segment values, and re-calculating the segment cn value as the mean of a normal distribution fitted to the remaining raw cns. we found that this curation had negligible effect for most segments, but successfully improved assigned segment cn values for more error-prone genomic regions.

data availability

aligned sequencing data from hgsoc cell lines and in vitro mixtures (listed in table s ) are available from the european nucleotide archive (accession prjeb ). raw and post-segmentation copy number values for these samples are available from https://github.com/elakatos/liquidcna_data.

code availability

estimation functions of liquidcna implemented in r (version . . ), an illustrative example in a jupyter notebook and code generating and analysing synthetic and in silico data are available from https://github.com/elakatos/liquidcna.

references

[ ] siravegna, g., marsoni, s., siena, s. & bardelli, a. integrating liquid biopsies into the management of cancer. nature reviews clinical oncology , – ( ). url https://doi.org/ . /nrclinonc. . .
[ ] ng, s. b. et al.
individualised multiplexed circulating tumour dna assays for monitoring of tumour presence in patients after colorectal cancer surgery. scientific reports , – ( ). url https://pubmed.ncbi.nlm.nih.gov/ .
[ ] rothwell, d. g. et al. utility of ctdna to support patient selection for early phase clinical trials: the target study. nat med , – ( ).
[ ] khan, k. h. et al. longitudinal liquid biopsy and mathematical modeling of clonal evolution forecast time to treatment failure in the prospect-c phase ii colorectal cancer clinical trial. cancer discov , – ( ).
[ ] fernandez-garcia, d. et al. plasma cell-free dna (cfdna) as a predictive and prognostic marker in patients with metastatic breast cancer. breast cancer research , ( ). url https://doi.org/ . /s - - - .
[ ] conteduca, v. et al. plasma tumour dna as an early indicator of treatment response in metastatic castration-resistant prostate cancer. british journal of cancer ( ). url https://doi.org/ . /s - - - .
[ ] nakamura, y. et al. clinical utility of circulating tumor dna sequencing in advanced gastrointestinal cancer: scrum-japan gi-screen and gozila studies. nature medicine ( ). url https://doi.org/ . /s - - - .
[ ] diaz, l. a. j. et al. the molecular evolution of acquired resistance to targeted egfr blockade in colorectal cancers. nature , – ( ).
[ ] bettegowda, c. et al. detection of circulating tumor dna in early- and late-stage human malignancies. sci transl med , ra ( ).
[ ] newman, a. m. et al. an ultrasensitive method for quantitating circulating tumor dna with broad patient coverage. nature medicine , – ( ). url https://doi.org/ . /nm. .
[ ] parkinson, c. a. et al. exploratory analysis of tp mutations in circulating tumour dna as biomarkers of treatment response for patients with relapsed high-grade serous ovarian carcinoma: a retrospective study. plos medicine , e ( ). url https://europepmc.org/articles/pmc .
[ ] beroukhim, r. et al. the landscape of somatic copy-number alteration across human cancers.
nature , – ( ). url https://doi.org/ . /nature .
[ ] hanahan, d. & weinberg, r. a. hallmarks of cancer: the next generation. cell , – ( ). url https://doi.org/ . /j.cell. . . .
[ ] sansregret, l., vanhaesebroeck, b. & swanton, c. determinants and clinical implications of chromosomal instability in cancer. nature reviews clinical oncology , – ( ). url https://doi.org/ . /nrclinonc. . .
[ ] li, x. et al. temporal and spatial evolution of somatic chromosomal alterations: a case-cohort study of barrett’s esophagus. cancer prev res (phila) , – ( ).
[ ] hieronymus, h. et al. tumor copy number alteration burden is a pan-cancer prognostic factor associated with recurrence and death. elife ( ).
[ ] rubin, c. e. et al. dna aneuploidy in colonic biopsies predicts future development of dysplasia in ulcerative colitis. gastroenterology , – ( ).
[ ] zaccaria, s. & raphael, b. j. accurate quantification of copy-number aberrations and whole-genome duplications in multi-sample tumor sequencing data. nature communications , ( ). url https://doi.org/ . /s - - -y.
[ ] scheinin, i. et al. dna copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly. genome res , – ( ).
[ ] adalsteinsson, v.
a. et al. scalable whole-exome sequencing of cell-free dna reveals high concordance with metastatic tumors. nature communications , ( ). url https://doi.org/ . /s - - -y.
[ ] van roy, n. et al. shallow whole genome sequencing on circulating cell-free dna allows reliable noninvasive copy-number profiling in neuroblastoma patients. clin cancer res , – ( ).
[ ] hovelson, d. h. et al. rapid, ultra low coverage copy number profiling of cell-free dna as a precision oncology screening strategy. oncotarget , – ( ).
[ ] chin, s.-f. et al. shallow whole genome sequencing for robust copy number profiling of formalin-fixed paraffin-embedded breast cancers. experimental and molecular pathology , – ( ). url http://www.sciencedirect.com/science/article/pii/s .
[ ] chen, x. et al. low-pass whole-genome sequencing of circulating cell-free dna demonstrates dynamic changes in genomic copy number in a squamous lung cancer clinical cohort. clinical cancer research , – ( ). url https://clincancerres.aacrjournals.org/content/ / / . https://clincancerres.aacrjournals.org/content/ / / .full.pdf.
[ ] belic, j. et al.
mfast-seqs as a monitoring and pre-screening tool for tumor-specific aneuploidy in plasma dna. adv exp med biol , – ( ).
[ ] vanderstichele, a. et al. chromosomal instability in cell-free dna as a highly specific biomarker for detection of ovarian cancer in women with adnexal masses. clin cancer res , – ( ).
[ ] taylor, f., bradford, j., woll, p. j., teare, d. & cox, a. unbiased detection of somatic copy number aberrations in cfdna of lung cancer cases and high-risk controls with low coverage whole genome sequencing. adv exp med biol , – ( ).
[ ] wei, t. et al. genome-wide profiling of circulating tumor dna depicts landscape of copy number alterations in pancreatic cancer with liver metastasis. mol oncol , – ( ).
[ ] hoare, j. et al. platinum resistance induces diverse evolutionary trajectories in high grade serous ovarian cancer. biorxiv ( ). url https://www.biorxiv.org/content/early/ / / / . . . . https://www.biorxiv.org/content/early/ / / / . . . .full.pdf.
[ ] nelson, l. et al. a living biobank of ovarian cancer ex vivo models reveals profound mitotic heterogeneity. nature communications , ( ). url https://doi.org/ . /s - - - .
[ ] network, c. g. a. r. integrated genomic analyses of ovarian carcinoma. nature , – ( ). url https://pubmed.ncbi.nlm.nih.gov/ .
[ ] soria, j.-c. et al. a phase ib dose-escalation study of the safety and pharmacokinetics of pictilisib in combination with either paclitaxel and carboplatin (with or without bevacizumab) or pemetrexed and cisplatin (with or without bevacizumab) in patients with advanced non–small cell lung cancer. european journal of cancer , – ( ). url http://www.sciencedirect.com/science/article/pii/s .
[ ] housman, g. et al. drug resistance in cancer: an overview. cancers (basel) , – ( ).
[ ] gatenby, r. a., silva, a. s., gillies, r. j. & frieden, b. r. adaptive therapy. cancer res , – ( ).
[ ] enriquez-navas, p. m., wojtkowiak, j. w. & gatenby, r. a. application of evolutionary principles to cancer therapy.
cancer res , – ( ).
[ ] zhang, j., cunningham, j. j., brown, j. s. & gatenby, r. a. integrating evolutionary dynamics into treatment of metastatic castrate-resistant prostate cancer. nature communications , ( ). url https://doi.org/ . /s - - - .
[ ] choudhury, a. d. et al. tumor fraction in cell-free dna as a biomarker in prostate cancer. jci insight ( ). url https://doi.org/ . /jci.insight. .
[ ] phallen, j. et al. direct detection of early-stage cancers using circulating tumor dna. science translational medicine , eaan ( ). url https://pubmed.ncbi.nlm.nih.gov/ .
[ ] mouliere, f. et al. high fragmentation characterizes tumour-derived circulating dna. plos one , – ( ). url https://doi.org/ . /journal.pone. .
[ ] underhill, h. r. et al. fragment length of circulating tumor dna. plos genetics , – ( ). url https://doi.org/ . /journal.pgen. .
[ ] mouliere, f. et al. enhanced detection of circulating tumor dna by fragment size analysis. science translational medicine ( ). url https://stm.sciencemag.org/content/ / /eaat . https://stm.sciencemag.org/content/ / /eaat .full.pdf.
[ ] venkatraman, e. s.
& olshen, a. b. a faster circular binary segmentation algorithm for the analysis of array cgh data. bioinformatics , – ( ).

analysis and forecasting of global rt-pcr primers for sars-cov-

gowri nayar ,*, edward e. seabolt , mark kunitomi , akshay agarwal , kristen l. beck , vandana mukherjee , and james h. kaufman
ibm research, san jose, , usa
*gowri.nayar@ibm.com
+these authors contributed equally to this work

abstract

rapid tests for active sars-cov- infections rely on reverse transcription polymerase chain reaction (rt-pcr). rt-pcr uses reverse transcription of rna into complementary dna (cdna) and amplification of specific dna (primer and probe) targets using polymerase chain reaction (pcr). the technology makes rapid and specific identification of the virus possible based on nucleic acid sequence homology and is much faster than tissue culture or animal cell models. however, the technique can lose sensitivity over time as the virus evolves and the target sequences diverge from the selective primer sequences. different primer sequences have been adopted in different geographic regions.
as we rely on these existing rt-pcr primers to track and manage the spread of the coronavirus, it is imperative to understand how sars-cov- genomes, as they mutate over time and across regions, diverge from the primers used today. in this study, we analyze the performance of the sars-cov- primers in use today by measuring the number of mismatches between primer sequence and genome targets over time and spatially. we find that there is a growing number of mismatches, an increase by % per month, as well as a high specificity of the virus based on geographic location.

introduction

as the sars-cov- pandemic grows, an essential method for controlling its spread and determining readiness for the re-opening of public life is rapid testing. rapid tests for active sars-cov- infections are based on reverse transcription polymerase chain reaction (rt-pcr). these tests consist of a forward primer, reverse primer, and probe that together are used to amplify the signal from the targeted virus within a sample. the approach supports rapid and specific identification of the virus, and does not depend on tissue culture or animal cell models. however, rna viruses evolve over time and a specific pcr test may lose sensitivity as the genotypic distribution of the virus changes or shifts. phylodynamic studies suggest the mutation rate of sars-cov- is in the range . x – to . x – substitutions per site per year, approximately . % variation increase per month, consistent with mutation rates reported for other coronaviridae. sequence drift also leads to geospatial differences in the virus, resulting in varying test sensitivity by region.

this study investigates the effectiveness of current sars-cov- pcr tests over the development of the virus in space and time, and projects how the performance of each may change as the virus undergoes mutation.
by taking a global perspective, using specific pcr protocols from several different countries together with genomic data from around the globe, our analysis shows how the existing tests respond differently over both time and location. by analyzing the number of mismatches of the pcr primers with respect to the sequenced sars-cov- genomes, we can measure how the targeted proteins are mutating. this provides an understanding of possible shortcomings of current tests, and suggests how often we may need to update those tests in the future. through this work, we observe an average rate of amino acid sequence change of approximately % per month for the targeted proteins. furthermore, we see that the virus genotype is spatially differentiated to the point that inter-country pcr testing already leads to a much higher rate of mismatches.

in support of the global pandemic response, several countries have published their rt-pcr protocols. we have collected the primer sequences and protocols developed for six different regions – usa, germany, china, hong kong, japan, and thailand – as provided by the who. for all six protocols, we collect the forward, reverse, and probe sequences for each specific gene target. table details the different gene targets for each protocol. most commonly, the pcr tests target the nucleoprotein (np), followed by targets in the rna-directed rna polymerase (rdrp) gene, and the envelope small membrane protein (e protein). np is a structural protein that encapsidates the viral rna. for other rna viruses including influenza, the np sequence is often used for species identification. rna-dependent rna polymerase (rdrp) is an enzyme that catalyzes the replication of rna from an rna template. the membrane associated rdrp is an essential protein for coronavirus replication, and may be a primary target for the antiviral drug remdesivir. the e protein is a small membrane protein involved in assembly, budding, envelope formation, and pathogenesis. the sars-cov e protein also forms a ca2+ permeable ion channel that alters homeostasis within cells, which leads to the overproduction of il- beta.

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license. http://creativecommons.org/licenses/by-nc-nd/ . /

results

primer comparison

using these methods, we observed high sequence homology for at least % of all genomes for most of the pcrs, showing that each primer is able to detect most of the sars-cov- genomes sequenced at the time of this report. table shows the percent of genomes hit by each pcr test, labelled by the country and target gene region. the america rp is an additional primer/probe set to detect the human rnase p gene to control for non-viral genes in the sample, and therefore, as expected, % of the sars-cov- genomes match with this set. however, when we look at the number of mismatches for each pcr for those hit genomes, we can see that there is a significant difference in performance between the tests. figure shows the number of mismatches for all genomes created by each pcr, where we can see the range varying from , created by the american n primer, to mismatches, created by the french ip primer. thus we observe that the number of mismatches can be used as a proxy for the amount of variation found within the gene sequences that are being targeted by the worldwide tests.

time analysis

following the methods described in section , all genomes that fall within the day range are segmented by date of collection and analyzed for mismatches to the various primer tests.
figure shows the average number of mismatches seen for all primers each day within this range, normalized by the number of genomes sampled in each day. from this analysis, we can see an average of . mismatches, with a % increase in mismatches over the day time range. this corresponds to a ∼ % increase per month. to estimate the mutation rate from figure , we calculate the best-fit line using least squares, which results in an r value of . . this mutation rate is consistent with the expected rate of mutation of the sars-cov- virus. figure shows the distribution of total, and time averaged, mismatches for each primer set over time. the figure indicates a larger distribution of mismatches for primer sets that target nucleoprotein regions. it is important to note that the total number of mismatches occurring is increasing and that many of these mismatches are being sustained in the evolving population.

if there is a sustained trend, genomes that occur close in time should show a smaller change in mismatches than genomes that occur further apart in time. figure shows this comparison between delta time and delta mismatches for every pair of genomes for the france pcr targeting the rdrp gene (ip ). the graphs for the other pcrs may be found in the supplemental files. each point represents a pairwise comparison of the difference in mismatches plotted over the difference in time. we observe that the delta mismatches grow in variance as the genomes occur further apart in time. furthermore, the pearson coefficient is . between mismatches and the number of genomes sampled in a day for each pcr. this positive linear relationship between the number of genomes and the number of mismatches per day shows that the mismatches occur uniformly across the genomes sampled within a day (rather than a few genomes creating noise in the signal).
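the time analysis combines three simple computations: a per-day mismatch average normalized by the number of genomes sampled, a least-squares line through those daily averages, and a pearson correlation. a minimal sketch of each is given below; the input layout (one `(day_index, mismatches)` pair per genome) and all function names are assumptions made for illustration, and the values in any usage are not taken from the study.

```python
from collections import defaultdict
from math import sqrt


def daily_mean_mismatches(records):
    """Average mismatches per day, normalized by genomes sampled that day.

    records: iterable of (day_index, mismatches) pairs, one per genome.
    """
    totals = defaultdict(lambda: [0, 0])  # day -> [sum of mismatches, count]
    for day, mismatches in records:
        totals[day][0] += mismatches
        totals[day][1] += 1
    return {day: s / n for day, (s, n) in sorted(totals.items())}


def least_squares_line(xs, ys):
    """Slope and intercept of the ordinary least-squares fit y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)
```

the slope of the fitted line, expressed relative to the starting level and scaled to a month, gives a growth rate of the kind quoted in the text.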
the data indicate that the virus demonstrates sequence variability in the targeted gene regions and that this variability causes sequence mismatches to increase over time.

geographical analysis

geographical stratification is occurring as the sars-cov- virus mutates within each geographic location. following the methods described in section , geospatial analysis is conducted to identify patterns in mismatches found in genomes sequenced within versus outside the country of primer origin. figure shows the number of mismatches, normalized by the number of genomes within each category, for each pcr, grouped by same and other countries. there are countries in which the number of mismatches in the country is lower than the number of mismatches that occur with genomes sampled outside of the country. this shows that the virus displays localized tendencies within the targeted gene regions, in addition to the spike glycoprotein region. the two outliers, the hong kong and france primers, show a higher percent of mismatches within the country than from different countries. figure shows the average number of mismatches over time, grouped by the genomes sampled within and outside the country, for one american primer. while the in-country average number of mismatches shows low variability, the out-country average number of mismatches shows an increasing diversity in these targeted regions. the full set of graphs for each pcr tested is available in the supplement.

clade analysis

figure shows the number of mismatches for each pcr per clade, normalized by the number of genomes in the pcr and clade. this shows definite trends which confirm the geographic specificity of the virus; for example, the american nucleoprotein
primers have the highest number of mismatches for clade a, which nextstrain defines as originating from predominantly asian genomes, while the chinese primer has the lowest number of mismatches for this clade. however, the clades are defined by specific mutations at nucleotide locations, which only overlap with the primer binding region for . % of the genomes. therefore, the relationship between the primer mismatches and the genome clades is correlational rather than causal.

discussion

by taking a global perspective on both the sars-cov- genomes and the common rt-pcr protocols, we are able to highlight important trends within the data. we observe an increasing number of mismatches between the primer and target genome sequence as time progresses. we can also see that the number of mismatches is higher for genomes sampled outside of the country that designed the test than for genomes sampled within it. while these metrics do not quantify the performance of the test, they demonstrate a growing divergence between the targeted gene sequences and the test primers. as shown by d. bru et al. , a single mutation can result in an underestimation of the gene copy number by up to -fold. our results reveal, today, an average of . mismatches between the primer and target sequences, with a growth of % each month. understanding copy number is critical to correct interpretation of a pcr assay. if the genome being tested has sufficient mismatches, this can lead to an erroneous copy number and, therefore, a misinterpretation of the assay result.
in the case of sars-cov- , for each targeted gene sequence there are at least different sequence variants. with this sequence diversity in the targeted genes, mismatched pcr primers may not amplify each variant at the same rate, leading to false negatives. the given primers have an average length of bases, and it has been demonstrated for primers of such length that to mismatches reduce the yield by approximately percent . our data indicate that this level of mismatches will be reached within months, or fewer if the rate of infection, and thus mutation, increases significantly.

the results of this study also demonstrate that each primer target develops a different number of mismatches over time (see: figure ). from the total number of mismatches created by primer target, we can see that the nucleoprotein targets from america, china, hong kong, and thailand develop the greatest number of mismatches. furthermore, when looking at the distribution of average number of mismatches over time, the primers targeting nucleoprotein have the largest distribution. the results indicate that primers targeting the envelope small membrane protein and the rna-dependent rna polymerase are the most resistant to mismatches. this may suggest more stable targets for future primer test designs.

the mutations that lead to mismatches between gene pcr primers and their targets reflect the sequence evolution of the virus. comparing the difference in time of collection of two genomes with the number of mismatches by which they differ shows evidence for this evolution (figure ). genomes that occur on the same day (delta time= ) have approximately zero difference, while genomes that occur at delta time= [days] have an average of . mismatches per nucleotide. this is consistent with the observed increasing number of mismatches over time, and shows that evolution of sars-cov- genomes is being sustained.
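the pairwise comparison behind this argument can be sketched as follows: every pair of genomes contributes one point, the difference in collection day against the difference in primer mismatches, and the points are then averaged per time lag. the tuple layout and the function names are hypothetical, and the sketch enumerates all pairs, so its cost grows quadratically with the number of genomes.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_deltas(genomes):
    """(delta_time, delta_mismatches) for every unordered pair of genomes.

    genomes: list of (collection_day, mismatches) tuples (assumed layout).
    """
    return [
        (abs(day_b - day_a), abs(mm_b - mm_a))
        for (day_a, mm_a), (day_b, mm_b) in combinations(genomes, 2)
    ]


def mean_delta_by_lag(pairs):
    """Average difference in mismatches for each observed time lag."""
    grouped = defaultdict(list)
    for delta_time, delta_mm in pairs:
        grouped[delta_time].append(delta_mm)
    return {lag: sum(v) / len(v) for lag, v in sorted(grouped.items())}
```

under sustained evolution one would expect the averaged difference to be near zero at lag zero and to rise with increasing lag, which is the pattern described above.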
the continual branching of the genetic tree due to mutation is further supported by the analysis of the number of mutations within and outside the country that designed the particular primer. figure shows that most countries' primers perform better when tested against genomes sequenced within the country than against globally sequenced genomes. in two cases, hong kong and france, the primers have a smaller percentage of mismatches with genomes outside the country. for france, the primer target ip , a region of the rdrp gene, creates a disproportionate number of mismatches when compared to genomes sequenced within france. this suggests that this region of the genome has deviated more from the original reference used to generate the primer set. hong kong has the fewest genomes sequenced within the country in this dataset, so it is possible that the larger percentage of mismatches for genomes within versus outside the country is an artifact of bias in the data. nextstrain categorizes the various genetic phylogenies by clade, which is designed to denote long-term genetic changes based on mutation. each defined clade requires significant geographical spread and frequency. this study shows that less than . % of the regions on the genome that define the clades overlap with the regions that the primers target. this indicates that variations in the primer target sequences have not yet reached large enough statistical significance to define a new clade in the nextstrain phylogeny, although the variants that are present in the primer region may cause a decrease in amplification signal within the assay. with the emergence of specific mutations that are spreading at faster rates, this analysis becomes more important in evaluating the possible need for primer re-design. the emerging b. . . strain contains mutations in the regions encoding the envelope small membrane protein and the nucleoprotein, both targeted by the current primers.
with the number of cases of sars-cov- globally, it is highly probable that the genome will mutate in the primer target regions.

methods

data description

gisaid has emerged as a leading source of sars-cov- genomes, containing the largest number of genome sequences from around the world with metadata about the location and time of collection . sars-cov- genomes from the gisaid repository were curated, collecting high quality genomes within the date range aug , – july , . while this date range precedes the start of the current outbreak, the genome sequences from the earlier points in time serve as a control for comparison. we define high quality genomes as those with less than % n within the sequence and less than . % unique non-synonymous mutations. by taking these measures, we reduce the noise generated from random mutations or sequencing errors found within the genome. this resulted in a set of , sars-cov- genomes, for which we evaluated primer homology. the who has published primers from seven countries - china, france, usa, japan, germany, hong kong, and thailand . each protocol published is an rt-pcr assay method, and for each primer set, a forward, reverse and probe sequence is provided . for this study, we use the sequences as provided with no modifications made.

pcr primer comparison

using the primer sequences and sars-cov- genomes described above, we perform a sequence comparison. specifically, we used blastn with parameters similar to primer-blast . this procedure was verified to account for full alignments of the forward, reverse, and probe sequences of primers .
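the quality filter and the mismatch metric described above can be sketched in a few lines. this is a simplified stand-in and not the blastn-based procedure used in this study: it covers only the fraction-of-n criterion (not the non-synonymous mutation criterion), compares a primer against an ungapped window of equal length, and ignores degenerate bases and reverse complements. all function names and the threshold passed in the example are hypothetical.

```python
def n_fraction(sequence: str) -> float:
    """Fraction of ambiguous 'n' bases in a genome sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sequence.lower().count("n") / len(sequence)


def is_high_quality(sequence: str, max_n_fraction: float) -> bool:
    """Keep a genome only if its share of 'n' bases is below the cut-off.

    The cut-off is a caller-supplied assumption; the study's value is not
    reproduced here.
    """
    return n_fraction(sequence) < max_n_fraction


def count_mismatches(primer: str, target: str) -> int:
    """Count positions where a primer differs from an equally long target.

    Ungapped, case-insensitive comparison (a Hamming distance), used here
    only to illustrate what a per-primer mismatch count measures.
    """
    if len(primer) != len(target):
        raise ValueError("primer and target window must have equal length")
    return sum(1 for p, t in zip(primer.lower(), target.lower()) if p != t)


# Toy sequences, not real primers or genomes.
toy_genome = "acgtacgtan"
print(is_high_quality(toy_genome, max_n_fraction=0.2))
print(count_mismatches("acgtac", "acgaac"))
```

an alignment tool is still needed in practice to locate the target window on each genome; the comparison above only applies once that window is known.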
the blast results are then parsed, ensuring that the forward, reverse, and probe sequences match a given genome and that the probe sequence is matched spatially in the forward and reverse directions on the genome; the number of mismatches is then aggregated for each pcr sequence and genome. this metric does not necessarily predict whether the pcr test would generate a positive or negative outcome for the particular genome, but rather measures variability within the targeted gene region. since all genomes included in this corpus are associated with sars-cov- , it can be assumed that they were collected by a positive assay. mutations in the targeted gene region, over time, can affect the sensitivity of the primers.

time analysis methods

for each regional test, the primers each target a particular section of the genome derived from various reference genomes. however, as replication and mutation of the virus occur, these targeted regions of circulating virus genomes accumulate sequence differences from the reference. thus, the efficacy of the primer may decrease over time. as more mutations accumulate, it is important to measure the rate of mismatch growth between primer sequence and targeted section as a function of time. from this rate it is possible to anticipate when target sequences used in a regional test should be updated. to estimate the mutation rate of the targeted genes over time, we group the genomes by their date of sampling and aggregate the number of mismatches for each day. in order to reduce noise from days with few genomes collected, for any time-based analysis, we consider only those days that have over unique genomes sequenced. with this restriction, data is available for a time range between jan , - july , , for a total of days. this process removes outlier data that was sequenced prior to the start of the pandemic, including sequences that were collected from non-human hosts.
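the per-day grouping and the minimum-genome restriction described above can be sketched as follows. this is an illustrative outline, not the code used in this study: the record layout, the function name, the threshold value, and the dates in the example are all assumptions.

```python
from collections import defaultdict
from datetime import date


def daily_mismatch_summary(records, min_genomes):
    """Aggregate mismatches by sampling day.

    `records` is an iterable of (genome_id, sampling_date, mismatch_count).
    Days are kept only if they have MORE than `min_genomes` unique genomes,
    mirroring the "over N unique genomes" restriction in the text.
    """
    per_day = defaultdict(dict)
    for genome_id, day, mismatches in records:
        per_day[day][genome_id] = mismatches  # one entry per unique genome

    summary = {}
    for day in sorted(per_day):
        counts = per_day[day]
        if len(counts) > min_genomes:
            total = sum(counts.values())
            summary[day] = {
                "genomes": len(counts),
                "total": total,
                "mean": total / len(counts),
            }
    return summary


# Hypothetical records: three genomes on one day, a single genome on the next.
example_records = [
    ("g1", date(2020, 3, 1), 0),
    ("g2", date(2020, 3, 1), 1),
    ("g3", date(2020, 3, 1), 2),
    ("g4", date(2020, 3, 2), 5),
]
example_summary = daily_mismatch_summary(example_records, min_genomes=2)
for day, stats in example_summary.items():
    print(day.isoformat(), stats)
```

reporting both the total and the per-genome mean for each day keeps days with many sequenced genomes from dominating a time trend, which is the reason the averaged view is useful alongside the raw totals.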
geographical analysis methods

as the virus has spread throughout the world, we see particular mutations that are specific to outbreaks by geospatial location. as studies using bayesian coalescent analysis have shown, high evolutionary rates and fast population growth of the sars-cov- virus result in increasing diversification of the virus by geographic location . to understand how the pcr tests respond differently for genomes collected by country, we first extract the country of sampling for each genome from the fasta header provided by gisaid and then group the number of mismatches found in the genome by in country versus out of country.

clade analysis methods

sars-cov- genomes have been categorized into clades to define groups of mutations. for this analysis, we use the clades as indicated by nextstrain, which are defined by frequency and geographic spread. their script to categorize genomes within the specific clade definitions was used to classify each genome within the dataset . furthermore, nextstrain publishes the genome locus that defines each clade, and these loci were compared to the genome locations that the primer targets bind to. by grouping the number of mismatches for each pcr by the genomes’ clade we see how different genetic variations affect the pcr test performance.

references

hill v., r. a. phylodynamic analysis of sars-cov- | update - - . virological.org ( ). https://virological.org/t/phylodynamic-analysis-of-sars-cov- -update- - - / .
gytis, e. a., dudas. mers-cov spillover at the camel-human interface. elife ( ).
cotten, e. a., matthew.
spread, circulation, and evolution of the middle east respiratory syndrome coronavirus. mbio ( ).
baric, e. a., ralph s. episodic evolution mediates interspecies transfer of a murine coronavirus. j. virology , – ( ).
organization, w. h. who in-house assays ( ).
burger h, e. a. sequence of the nucleoprotein gene of influenza a/parrot/ulster/ . virus res. , – , doi: https://doi.org/ . / - ( ) - ( ).
y gao, e. a. structure of the rna-dependent rna polymerase from covid- virus. science – ( ).
elfiky, a. ribavirin, remdesivir, sofosbuvir, galidesivir, and tenofovir against sars-cov- rna dependent rna polymerase (rdrp): a molecular docking study. life sci. ( ).
schoeman d, e. a. coronavirus envelope protein: current knowledge. virol j ( ).
surya w, e. a. mers coronavirus envelope protein has a single transmembrane domain that forms pentameric ion channels. virus res. ( ).
nieto-torres jl, e. a. severe acute respiratory syndrome coronavirus e protein transports calcium ions and activates the nlrp inflammasome. virology ( ).
d. bru, l. p., f. martin-laurent. quantification of the detrimental effect of a single primer-template mismatch by real-time pcr using the s rrna gene as an example. appl. environ. microbiol. doi: https://doi.org/ . /aem. - ( ).
cindy christopherson, s. k., john sninsky. phylodynamic analysis of sars-cov- genomes- - jan- . nucleic acids res. ( ).
shu, y. & mccauley, j. gisaid: global initiative on sharing all influenza data–from vision to reality. eurosurveillance , ( ).
seabolt, e., nayar, g. et al. ibm functional genomics platform, a cloud-based platform for studying microbial life at scale. ieee/acm transactions on comput. biol. bioinforma. doi: . /tcbb. . ( ).
camacho, c., coulouris, g., avagyan, v. et al. blast+: architecture and applications. bmc bioinforma. doi: . / - - - ( ).
ye, j., coulouris, g., zaretskaya, i. et al. primer-blast: a tool to design target-specific primers for polymerase chain reaction.
bmc bioinforma. , doi: https://doi.org/ . / - - - ( ).
castells, e. a., m. evidence of increasing diversification of emerging sars-cov- strains. j med virol doi: https://doi.org/ . /jmv. ( ).
hadfield, j. et al. nextstrain: real-time tracking of pathogen evolution. bioinformatics doi: https://doi.org/ . /bioinformatics/bty ( ).

acknowledgements

the authors would like to acknowledge the gisaid initiative and ncbi for the provision of data.

author contributions statement

g.n. conceived the experiment and analysis, m.k. verified the results, e.s. was the architect of the platform used, a.a. and k.l.b. performed genome quality analysis, j.h.k. and v.m. provided scientific guidance and domain specific knowledge.

additional information

competing interests

the corresponding author is responsible for submitting a competing interests statement on behalf of all authors of the paper. this statement must be included in the submitted article file.

country: target
usa: nucleoprotein
china: orf ab, nucleoprotein
germany: rna-directed rna polymerase, envelope small membrane protein
hong kong: nucleoprotein
thailand: nucleoprotein
france: rna-directed rna polymerase (ip , ip ), envelope small membrane protein
japan: nucleoprotein

table . targeted genes by name by primers from the countries in the study

pcr: percent of hit genomes
america|rp: *
china|orf ab: .
japan|niid -ncov n: .
america| -ncov n: .
hongkong|hku-n: .
thailand|wh-nic-n: .
china|n: .
germany|e sarbeco: .
france|e sarbeco: .
france|ncov ip: .
america| -ncov n: .
france|ncov ip: .
america| -ncov n: .

table . percent of genomes that are hit by the described pcr test, identified by the country and target gene. *indicates that the primer is designed to separate any errant samples within the assay.

figure . total number of mismatches each pcr test creates when tested against the full corpus of sars-cov- genomes. each pcr test is identified by the country of use and the targeted gene name.

figure . average number of mismatches for all genomes and all pcr primers separated by the day on which the genome is collected. the dates shown are aggregated over every day period.
figure . distribution of mismatches for each primer. a shows the total number of mismatches aggregated for each day within the time range. b shows the number of mismatches for each day averaged by the number of genomes that occur on a day within the time range.

figure . change in number of mismatches between two occurrences over delta time between the two occurrences for the ip primer developed in france. the increasing slope shows that mutations are being sustained as we compare genomes that occur further apart in time. graphs for all primers are included in the supplement.

figure . number of mismatches for each pcr test tested on all sars-cov- genomes, split between genomes collected within the same country as the test and outside the country. for japan, % of genomes, both in and out of the country, have mismatch, and are therefore not shown in the figure. for out of the pcr tests, there is a higher number of mismatches for genomes that occur outside the country than for genomes that occur inside the country.
figure . number of mismatches in and out of country for an american nucleoprotein primer separated by time of genome collection. all other primers are included in the supplement.

figure . average number of mutations for each pcr test that occur within each clade, as defined by nextstrain.
comprehensive comparison of transcriptomes in sars-cov- infection: alternative entry routes and innate immune responses

yingying cao ∗, xintian xu , simo kitanovski , lina song , jun wang , pei hao , ∗, daniel hoffmann ∗

bioinformatics and computational biophysics, faculty of biology and center for medical biotechnology, university of duisburg-essen, essen , germany
key laboratory of molecular virology and immunology, institut pasteur of shanghai, center for biosafety mega-science, chinese academy of sciences, shanghai , china
translational skin cancer research, german consortium for translational cancer research, essen, germany
the joint program in infection and immunity: a. guangzhou women and children’s medical center, guangzhou medical university, guangzhou , china; b. institut pasteur of shanghai, chinese academy of sciences, shanghai , china

∗to whom correspondence should be addressed; e-mail: daniel.hoffmann@uni-due.de, phao@ips.ac.cn, yingying.cao@uni-due.de.

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission.

the pathogenesis of covid- emerges as complex, with multiple factors leading to injury of different organs. several studies on underlying cellular processes have produced contradictory claims, e.g. on sars-cov- cell entry or innate immune responses. however, clarity in these matters is imperative for therapy development.
we therefore performed a meta-study with a diverse set of transcriptomes under infections with sars-cov- , sars-cov and mers-cov, including data from different cells and covid- patients. using these data, we investigated viral entry routes and innate immune responses. first, our analyses support the existence of cell entry mechanisms for sars and sars-cov- other than the ace route with evidence of inefficient infection of cells without expression of ace ; expression of tmprss /tmprss is unnecessary for efficient sars-cov- infection with evidence of efficient infection of a cells transduced with a vector expressing human ace . second, we find that innate immune responses in terms of interferons and interferon stimulated genes are strong in relevant cells, for example calu cells, but vary markedly with cell type, virus dose, and virus type.

introduction

coronaviruses are non-segmented positive-sense rna viruses with a genome of around kilobases. the genome has a ’ cap structure along with a ’ poly (a) tail, which acts as mrna for translation of the replicase polyproteins. the replicase gene occupies approximately two thirds of the entire genome and encodes non-structural proteins (nsps). the remaining third of the genome contains open reading frames (orfs) that encode accessory proteins and four structural proteins, including spike (s), envelope (e), membrane (m), and nucleocapsid (n) ( ). over the past years, three epidemics or pandemics of life-threatening diseases have been caused by three closely related coronaviruses – severe acute respiratory syndrome coronavirus (sars-cov), which emerged with nearly % mortality ( , ) in - and spread to countries before being contained; middle east respiratory syndrome coronavirus (mers-cov), with mortality around % ( , ) starting in and since then spreading to countries;
sars-cov- , emerging in late ( ), which has caused many millions of confirmed cases and > million deaths worldwide ( ). infection with sars-cov, mers-cov or sars-cov- can cause a severe acute respiratory illness with similar symptoms, including fever, cough, and shortness of breath. sars-cov- is a new coronavirus, but its similarity to sars-cov (amino acid sequences about % identical ( )) and mers-cov suggests comparisons to these earlier epidemics. despite the difference in the total number of cases caused by sars-cov and sars-cov- ( , ) due to different transmission rates, the outbreak caused by sars-cov- resembles the outbreak of sars: both emerged in winter and were linked to exposure to wild animals sold at markets. although mers-cov has high morbidity and mortality rates, lack of autopsies from mers-cov cases has hindered our understanding of mers-cov pathogenesis in humans. until now there are no specific anti-sars-cov- , anti-sars-cov or anti-mers-cov therapeutics approved for human use. there are several points of attack for potential anti-sars-cov- /sars-cov/mers-cov therapies, e.g. intervention on cell entry mechanisms to prevent virus invasion, or acting on the host immune system to kill the infected cells and thus prevent replication of the invading viruses. a better understanding of virus entry mechanisms and the immune responses can therefore guide the development of novel therapeutics. virus entry into host cells is the first step of the viral life cycle. it is an essential component of cross-species transmission and an important determinant of virus pathogenesis and infectivity ( , ), and also constitutes an antiviral target for treatment and prevention ( ). it seems that sars-cov and sars-cov- use similar virus entry mechanisms ( ).
the infection of sars-cov or sars-cov- in target cells was initially identified to occur by cell-surface membrane fusion ( , ). some later studies have shown that sars-cov can infect cells through receptor mediated endocytosis ( , ) as well. both mechanisms require the s protein of sars-cov or sars-cov- to bind to angiotensin converting enzyme (ace ), and s protein of mers-cov to dipeptidyl peptidase (dpp ) ( ), respectively, through their receptor-binding domain (rbd) ( ). in addition to ace and dpp , some recent studies suggest that there are possible other coronavirus-associated receptors and factors that facilitate the infection of sars-cov- ( ), including the cell surface proteins basigin (bsg or cd ) ( ), and cd ( ). recently, clinical data have revealed that sars-cov- can infect several organs where ace expression could not be detected in healthy individuals ( , ), which highlights the need for closer inspection of virus entry mechanisms. the binding of s protein to a cell-surface receptor is not sufficient for infection of the host cell ( ). in the cell-surface membrane fusion mechanism, after binding to the receptor, the s protein requires proteolytic activation by cell surface proteases like tmprss , tmprss , or other members of the tmprss family ( , , ), followed by the fusion of virus and target cell membranes. in the alternative receptor mediated endocytosis mechanism, the endocytosed virion is subjected to an activation step in the endosome, resulting in the fusion of virus and endosome membranes and the release of the viral genome into the cytoplasm.
the endosomal cysteine proteases cathepsin b (ctsb) and cathepsin l (ctsl) ( ) might be involved in the fusion of virus and endosome membranes. availability of these proteases in target cells largely determines whether viruses infect the cells through cell-surface membrane fusion or receptor mediated endocytosis. how the presence of these proteases impacts efficiency of infection with sars-cov- , sars-cov and mers-cov still remains elusive. when the virus enters a cell, it may trigger an innate immune response, a crucial component of the defense against viral invasion. compounds that regulate innate immune responses can be introduced as antiviral agents ( ). the innate immune system is activated when pattern recognition receptors (prrs) such as toll-like receptors (tlrs) and cytoplasmic retinoic acid-inducible gene i (rig-i) like receptors (rlrs) recognize molecular structures of the invading virus ( , ). this pattern recognition activates several signaling pathways and then downstream transcription factors such as interferon regulatory factors (irfs) and nuclear factor κb (nf-κb). transcriptional activation of irfs and nf-κb stimulates the expression of type i (α or β) and type iii (λ) interferons (ifns). ifn-α (ifna , ifna , etc), ifn-β (ifnb ) and ifn-λ (ifnl - ) are important cytokines of the innate immune responses. ifns bind and induce signaling through their corresponding receptors (ifnar for ifn-α/β and ifnlr for ifn-λ), and subsequently induce expression of ifn-stimulated genes (isgs) (e.g. mx , isg and oasl) and pro-inflammatory chemokines (e.g. cxcl and ccl ) to suppress viral replication and dissemination ( , ).
dysregulated inflammatory host response results in acute respiratory distress syndrome (ards), a leading cause of covid- mortality ( ). one attractive therapy option to combat covid- is to harness the ifn-mediated innate immune responses. clinical trials with type i and type iii ifns for treatment of covid- have been conducted and many more are still ongoing ( , ). in this regard, the kinetics of the secretion of ifns in the course of sars-cov- infection needs to be defined. unfortunately, some results on the host innate immune responses to sars-cov- are apparently at odds with each other ( – ), e.g. it is unclear whether sars-cov- infection induces low ifns and moderate isgs ( ), or robust ifn responses and markedly elevated expression of isgs ( – ). this has to be clarified. the use of ifns as a treatment in covid- is now a subject of debate as well ( ). thus, the kinetics of ifn secretion relative to the kinetics of virus replication need to be thoroughly examined to better understand the biology of ifns in the course of sars-cov- infection and thus provide guidance to identify the temporal window of therapeutic opportunity. we have collected and analyzed a diverse set of publicly available transcriptome data ( , – ): ( ) bulk rna-seq data with different types of cells, including human non-small cell lung carcinoma cell line (h ), human lung fibroblast-derived cells (mrc ), human alveolar basal epithelial carcinoma cell line (a ), a cells transduced with a vector expressing
human ace (a -ace ), primary normal human bronchial epithelial cells (nhbe), heterogeneous human epithelial colorectal adenocarcinoma cells (caco ), and african green monkey (chlorocebus sabaeus) kidney epithelial cells (vero e ) infected with sars-cov- , sars-cov and mers-cov (table ); ( ) rna-seq data of lung samples, peripheral blood mononuclear cell (pbmc) samples, and bronchoalveolar lavage fluid (balf) samples of covid- patients and their corresponding healthy controls (table and table ). using this collection, we systematically evaluated the replication and transcription status of virus in these cells, expression levels of coronavirus-associated receptors and factors, as well as the innate immune responses of these cells during virus infection.

results

different infection efficiency of sars-cov- , sars-cov and mers-cov in different cell types

the rna-seq data for all samples can be aligned to the genome of the corresponding virus to evaluate the infection efficiency in cells, estimated by the mapping rate to the virus genome, i.e. the percentage of viral rnas in intracellular rnas. to assess the infection efficiency of sars-cov- , sars-cov, and mers-cov in different types of cells, we collected and analyzed a comprehensive set of public rna-seq data of cells infected with these viruses at hours post infection (hpi) with comparable multiplicity of cellular infection (moi) (table ). moi refers to the number of viruses that are added per cell in infection experiments. for example, if viruses are added to cells, the moi is . our analysis shows that the infection efficiency of viruses can be both cell type dependent and virus dose dependent (fig. ). mers-cov can efficiently infect mrc and vero e cells. however, the infection efficiency is influenced strongly by moi in the same type of cells. cells infected with low moi, say . , have significantly lower mapping rates than those with high
moi, say (fig. ). for sars-cov and sars-cov- , the infection efficiency is influenced strongly by cell type. for sars-cov- , there is efficient virus infection in a -ace , calu , caco , and vero e cells, but not in a , h , or nhbe cells (fig. and table s ). the mapping rates in a , h , and nhbe cells are low even at high mois (fig. and table s ). similar to sars-cov- , the infection by sars-cov is also cell type dependent: vero e cells and calu cells show high mapping rates to the sars-cov genome, but the mapping rates of sars-cov in mrc and h cells are close to zero even at the high moi of (fig. and table s ). since “total rna” (see methods/data collection) includes additional negative-strand templates of virus, the mapping rates are usually much higher than those obtained with the polya+ selection method in the same condition (fig. and table s ).

evidence for multiple entry mechanisms for sars-cov- and sars-cov

to examine the detailed replication and transcription status of these viruses in the cells, we calculated the number of reads (depth) mapped to each site of the corresponding virus genome (fig. ). for better comparison, these read numbers were log transformed. the replication and transcription of mers-cov, sars-cov- and sars-cov share an uneven pattern of expression along the genome, typically with a minimum depth in the first half of the viral genome, and the maximum towards the end. among the parts with very high levels, there are especially coding regions for structural proteins, including s, e, m, and n proteins, as well as the first coding regions with nsp and nsp . interestingly, there is an exception for balf samples in covid- patients, which show a more irregular, fluctuating behavior along the genome (fig. b).
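the two quantities used in this part of the results, the mapping rate and the log-transformed per-site depth, can be written down compactly. this is a minimal sketch rather than the pipeline used in this study: the function names are hypothetical, and the logarithm base and the pseudocount added before the transform are assumptions, since the text only states that read numbers were log transformed.

```python
import math


def mapping_rate(viral_reads: int, total_reads: int) -> float:
    """Percentage of reads in a sample that map to the virus genome."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= viral_reads <= total_reads:
        raise ValueError("viral_reads must lie between 0 and total_reads")
    return 100.0 * viral_reads / total_reads


def log_depth(depths, pseudocount: float = 1.0):
    """Log-transform per-site read depths.

    Base 10 and the pseudocount (which keeps zero-depth sites defined)
    are assumptions made for this illustration.
    """
    return [math.log10(depth + pseudocount) for depth in depths]


# Toy numbers, not taken from any sample in this study.
example_rate = mapping_rate(viral_reads=25, total_reads=100)
example_depths = log_depth([0, 9, 99])
print(example_rate, example_depths)
```

the transform compresses the several orders of magnitude between weakly and strongly covered regions, which is what makes the uneven coverage pattern along the genome comparable between efficiently and inefficiently infected cells.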
the deviation from the cellular expression pattern is not surprising because balf is not a well-organized tissue but a mixture of many components, some of which will probably digest viral rna. interestingly, the uneven transcription pattern seen in efficient infections with sars-cov- , sars-cov, and mers-cov is also visible for inefficient infection with sars-cov- in a , nhbe, and h cells, and sars-cov in h and mrc cells (fig. c, d), although there the total mapping rates to the corresponding virus genomes are much lower (fig. ). to further elucidate the corresponding entry mechanisms for different types of cells, we examined the expression levels of those receptors and proteases that have already been described as facilitating target cell infection (fig. ). our analysis shows that mers-cov can efficiently infect mrc and vero e cells (fig. and fig. e), which both express dpp (fig. a). compared to vero e cells, however, mrc cells infected with mers-cov have higher expression levels of dpp (fig. a) but lower mapping rates to the virus genome (fig. ). these observations show that higher expression levels of the receptor (dpp ) do not guarantee higher mers-cov infection efficiency in cells. this is also true for the sars-cov- receptor ace , which is expressed three orders of magnitude higher in a -ace cells than in vero e cells (fig. b), while both cells produce about the same amount of virus (fig. ). although sars-cov- can efficiently infect a -ace cells (fig. and fig. ), there is no expression of tmprss or tmprss (fig. c, d), which are needed for the canonical cell-surface membrane fusion mechanism (fig. j). however, there are considerable expression levels of ctsb and ctsl (fig.
e, f), which are involved in endocytosis (fig. j). in a , h , and mrc cells, which do express small amounts of sars-cov- and sars-cov virus (fig. , fig. c, d), there is no ace expression at all (fig. b). this could point to an alternative ace -independent entry mechanism for sars-cov- and sars-cov (fig. j). since there were already reports about alternative sars-cov- receptors such as bsg/cd and cd ( , ), we examined their expression in these cells as well (fig. g, h). for all cells, the expression of bsg is at the same level of - (fig. g), and the expression of cd is very low. certainly, cd and bsg alone cannot explain the differences in virus expression (fig. ), nor can we exclude other low efficiency entry mechanisms. for example, relatively inefficient alternative entry paths may often be present but masked in some cells by more efficient entry via ace /tmprss. to gain a comprehensive overview we clustered cells with respect to gene expression levels of coronavirus-associated receptors and factors (fig. i), and summarized conceivable mechanisms accordingly (fig. j). since all cells show high expression levels of ctsb and ctsl, the major differences between these cells lie in the expression levels of ace , tmprss and tmprss . cell-surface membrane fusion (fig. j, a) might be mainly used in sars-cov- infection of calu , caco , and nhbe cells, where there is low to moderate expression of ace and moderate expression of tmprss and tmprss . endocytosis (fig. j, b) might be mainly used in sars-cov- infection of a -ace cells, where ace is expressed at high levels but there is no expression of tmprss or tmprss . an alternative ace -independent way (fig.
j, c) in the absence of ace , tmprss , or tmprss could be mainly employed in sars-cov- infection of mrc , a , and h cells. note that although the expression pattern of coronavirus-associated receptors and factors of nhbe cells is similar to that in caco cells, nhbe cells are not infected efficiently by sars-cov- . vero e cells have moderate expression of ace , and low expression of tmprss and tmprss , so all the entry mechanisms mentioned above could contribute to sars-cov- infection of vero e cells.

strength of ifn/isg response varies between cell lines and viruses, with strong response to sars-cov- in relevant cells

as a virus enters a cell, it may trigger an innate immune response, i.e. the cell may start expression of various types of innate immunity molecules at different strengths. there is currently an intense debate about which of these molecules, especially ifns and isgs, are expressed and how strongly ( – ). we therefore focused in our analysis on innate immunity molecules such as ifns, isgs, and pro-inflammatory cytokines. to broaden the basis for conclusions, we analyzed, apart from cell lines, bulk rna-seq data of lung, pbmc, and balf samples of covid- patients, and single-cell rna-seq data of balf samples from moderate and severe covid- patients; for each type of patient data, we also included healthy controls. gene expression was compared quantitatively in terms of tpm (transcripts per million), as well as log fold changes (logfc) with respect to healthy controls (human samples) or mock-infected cultures (cell lines) (fig. s , fig. s ). the heatmap and clustering dendrogram of the logfc of ifns, isgs and pro-inflammatory cytokines in fig.
a reveal broadly two groups of samples with fundamentally different expression of isgs, ifns, and pro-inflammatory cytokines. the top cluster in fig. a comprises samples that show a weaker innate immune response, including the two pbmc samples of covid- patients, a , nhbe, caco , and h cells infected with sars-cov- , a -ace cells infected with sars-cov- at lower moi ( . ), mrc cells infected with sars, and mrc and vero e cells infected with mers. the bottom cluster in fig. a comprises samples that show a stronger innate immune response, including balf and lung samples of covid- patients, calu cells infected with sars-cov- , a -ace cells infected with sars-cov- at higher moi ( ), as well as vero e cells infected with sars-cov- and sars. most of the samples in the bottom part show markedly elevated levels of isgs and elevated pro-inflammatory cytokines. four samples in the bottom cluster are an exception, namely lung. / and balf. / , with a mixture of up- and down-regulation of isgs and pro-inflammatory cytokines. in this respect, these four samples from patients with unknown covid- severity differ from the balf samples from moderate and severe covid- patients. ifn expression is not upregulated either in most of these lung, pbmc and balf samples of covid- patients, for which no information about the severity of infection is available. however, we estimated the severity of their infection by aligning all the samples to the sars-cov- virus genome. there are no ( . %) reads mapping to the sars-cov- genome in the pbmc samples. for the two balf samples, there are low mapping rates ( . % and . %) to the sars-cov- genome.
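the comparison described above rests on three steps: expression in tpm, logfc against a control, and clustering of the logfc profiles. the following python sketch illustrates these steps under stated assumptions. it is not the authors' workflow, which used sleuth, scater, pheatmap and complexheatmap in r; the log base (2), the pseudocount of 1, the function names, and all sample names and numbers are hypothetical. the only settings taken from the source are “complete” linkage and “euclidean” distance.

```python
import math


def tpm(counts, lengths_kb):
    """Transcripts per million from read counts and gene lengths in kb."""
    rpk = [c / l for c, l in zip(counts, lengths_kb)]
    scale = sum(rpk) / 1e6  # assumes at least one non-zero count
    return [r / scale for r in rpk]


def log_fc(case_tpm, control_tpm, pseudocount=1.0):
    """Gene-wise log2 fold change of case over control.

    Assumption: base 2 and a pseudocount of 1; the source does not
    state which base or offset was used.
    """
    return [math.log2((a + pseudocount) / (b + pseudocount))
            for a, b in zip(case_tpm, control_tpm)]


def euclidean(u, v):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def complete_linkage(profiles, n_clusters):
    """Agglomerative clustering with complete linkage.

    Repeatedly merges the two clusters whose farthest members are
    closest, until n_clusters clusters remain.
    """
    clusters = [[name] for name in profiles]

    def dist(c1, c2):
        return max(euclidean(profiles[a], profiles[b])
                   for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    # Hypothetical logFC profiles (three genes per sample).
    profiles = {
        "sample_a": [0.1, 0.2, 0.0],
        "sample_b": [0.3, 0.1, 0.2],
        "sample_c": [4.0, 5.1, 3.8],
        "sample_d": [4.4, 4.9, 4.1],
    }
    print(complete_linkage(profiles, 2))
```

cutting the tree at two clusters mirrors the reading of the heatmap as a weak-response group and a strong-response group; a real analysis would inspect the full dendrogram rather than fix the number of clusters in advance.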
the expression levels of ace in these tissues (pbmc, lung and balf samples) of healthy individuals are around zero (fig. s ), which explains why there are almost no virus reads in these tissues. one of the two lung samples (accession number: samn ) has slightly upregulated ifnl (fig. s ), which had been ignored in the original publication ( ), although the total mapping rates to the virus genome are both . % for these two lung samples. we then checked the detailed coverage along the virus genome. there were a small number of virus reads aligned to the sars-cov- genome in this sample (fig. s ). unlike the other lung samples, which did not express ace , this lung sample expressed ace at a considerable level ( . tpm, table s ). this result implies that when sars-cov- enters the lung successfully, or when the lung tissue chosen for sequencing is successfully infected by sars-cov- , ifns (at least ifnl ) can be upregulated. calu cells infected with sars-cov and sars-cov- , and a -ace cells infected with sars-cov- at a high moi of have upregulated ifnb , ifnl , ifnl and ifnl (fig. b-e). a , h , nhbe (fig. b-e), and mrc cells (fig. s ), which do not support efficient virus infection, show no upregulation of ifns. low levels of ifn expression are also observed in caco cells, which are efficiently infected with sars-cov and sars-cov- . the same is true for a -ace cells infected with sars-cov- at low moi of . . in vero e cells, ifnl is upregulated as well when infected with sars-cov and sars-cov- , but not with mers-cov (fig. f). in balf samples of moderate and severe covid- patients, upregulation of ifns was not as obvious as in calu cells, but is still present in some patients.
these observations demonstrate that the innate immune response depends in complex ways on cell line, viral dose, and virus. several studies ( – ) reported robust ifn responses and markedly elevated expression of isgs in sars-cov- infection of different cells and patient samples. conversely, the study by ( ) concluded that weak ifn response and moderate isg expression are characteristic for sars-cov- infection. this apparent contradiction can be resolved if we consider that ref. ( ) generalized from patient samples and cells that were only weakly infected, and that in such cases the host, in fact, responds with low levels of ifns and isgs. on the other hand, ref. ( ) treated efficiently infected cells, such as calu and a -ace (at moi of ) as exceptions. however, our meta-analysis shows that these are not exceptions but typical for severely infected target cells that have robust ifn responses and isg expression (cluster in fig. a).

discussion

one attractive potential anti-sars-cov- therapy is intervention in the cell entry mechanisms ( ). however, the entry mechanisms of sars-cov- into human cells are partly unknown. during the last few months scientists have confirmed that sars-cov- and sars-cov both use human ace as entry receptor, and human proteases like tmprss and tmprss ( , , ), and lysosomal proteases like ctsb and ctsl ( ) as entry activators. since ace is beneficial in cardiovascular diseases such as hypertension or heart failure ( ), treatments targeting ace could have a negative effect. inhibitors of ctsl ( ) or tmprss ( ) are seen as potential treatment options for sars-cov and sars-cov- . however, recently alternate coronavirus-associated receptors and factors including bsg/cd ( ) and cd ( ) have been proposed to facilitate virus invasion. additionally, clinical data of sars-cov- infection have shown that sars-cov- can infect several organs where ace expression could not be detected ( , ), urging us to explore other potential entry routes. first, our analyses here have shown that even without expression of tmprss or tmprss , high sars-cov- infection efficiency in cells is possible (fig. a, c) with considerable expression levels of ctsb and ctsl (fig. e, f). this suggests receptor mediated endocytosis ( , , ) as an alternative major entry mechanism. given this tmprss-independent route, tmprss inhibitors will likely not provide complete protection. the studies designed to predict the tropism of sars-cov- by profiling the expression levels of ace and tmprss across healthy tissues ( , ) may need to be reconsidered as well. second, the evidence presented in our study suggests a further, possibly undiscovered entry mechanism for sars-cov- and sars-cov (fig. ). although bsg/cd has recently been proposed as an alternate receptor ( ), later experiments reported there was no evidence supporting the role of bsg/cd as a putative spike-binding receptor ( ). the expression patterns of bsg/cd in different types of cells observed in our study could not explain the difference in virus loads observed in these cells either. cd and cd l were recently reported as attachment factors that contribute to sars-cov- infection in human cells as well ( ). however, cd expression in the cell lines included here is low. another reasonable hypothesis is that the inefficient ace -independent entry mechanism we observed could be macropinocytosis, an endocytic pathway that does not require receptors ( ). until now there is still no direct evidence for macropinocytosis involvement in the sars-cov- and sars-cov entry mechanisms. to confirm such an involvement, specific experiments are needed.
moreover, this ace -independent entry mechanism only enables inefficient infection by sars-cov and sars-cov- (fig. ) and therefore cannot be a major entry mechanism. fig. j summarizes the outcomes of our study with respect to entry mechanisms. the observations with the broad range of transcriptome data can only be explained if there are several entry routes. this is certainly a challenge to be reckoned with in the development of antiviral therapeutics ( ). another attractive potential anti-sars-cov- point of attack is supporting the human innate immune system to kill the infected cells and thus disrupt viral replication. not surprisingly, research in this area is flourishing but sometimes generates conflicting results, especially on the involvement of type i and iii ifns and isgs ( – ). the results of our analyses could help to resolve the confusion on the involvement of ifns and isgs. we found that immune responses in calu cells infected with sars-cov and sars-cov- resemble those of balf samples of moderate and severe covid- patients, with elevated levels of type i and iii ifns, robust isg induction as well as markedly elevated pro-inflammatory cytokines, in agreement with recent studies ( – ). however, this picture differs from the one reported by ( ) with low levels of ifns and moderate isgs. this latter study was partially based on a cells and nhbe cells with nearly no ace expression and very low mapping rate to the viral genome, and lung samples of two patients (both show . % mapping rate to the virus genome). hence, given that there was no efficient virus infection in these cells, the low levels of ifns and isgs were to be expected.
however, in one of the lung samples sequenced by ( ) (accession number: samn ), we observed a slight upregulation of ifnl (fig. s ), which was ignored in the original publication, together with considerable ace expression (table s ) ( . tpm), and a few virus reads aligned to the sars-cov- genome (fig. s ). this result suggests that levels of ifns and isgs are associated with viral load and severity of virus infection. we found low induction of ifns and moderate expression of isgs in pbmc samples and balf samples of covid- patients (fig. , fig. s ). in these pbmc samples, there are no ( . %) virus reads mapping to the sars-cov- genome. the failure to detect virus reads in these three pbmc samples can be explained by the absence of efficient entry routes (e.g. no expression of ace in pbmc samples of healthy individuals, fig. s ), or by the cell types being otherwise incompatible with viral replication. this observation is consistent with the studies on sars-cov ( – ) that found abortive infections of macrophages, monocytes, and dendritic cells; moreover, replication of sars-cov in pbmc samples is also self-limiting. however, due to the limited number of pbmc, balf and lung samples included in this study, and the lack of information on the infection stage and infection severity of these covid- patients, the assessment of ifns and isgs as well as the infection of sars-cov- in these samples may not be representative of the host response against sars-cov- . future studies that also include other affected organs of more patients with different infection stages and severity are necessary for a better understanding of the immune responses. several unexpected observations need further investigation.
first, a -ace and caco cells are efficiently infected with low moi of . and . , respectively (fig. ), but fail to upregulate ifn expression (fig. b-e). their cellular immune responses are more similar to those of cells that cannot support efficient virus infection (fig. a). these results suggest that in caco and a -ace cells the invasion of sars-cov- or sars-cov at low moi shuts down or fails to activate the innate immune system. based on the results observed above, multiple factors including disease severity, different organs, cell types and virus dose contribute to the variability in the innate immune responses. for a better characterization of the innate immune responses, a more comprehensive profiling is necessary, including patients with infections in different stages, different levels of severity, and different clinical outcomes of the infection. further, a larger array of cell types should be profiled over time after infection with different virus doses. in this way we would be better able to understand the kinetics of ifns and isgs in response to sars-cov- infection. in summary, our study has comparatively analyzed an extensive data collection from different cell types infected with sars-cov- , sars-cov and mers-cov, and from covid- patients. we have presented evidence for multiple sars-cov- entry mechanisms. we could also resolve apparent conflicts on innate immune responses in sars-cov- infection ( – ), by drawing upon a larger set of cell types and infection severity. the results emphasize the complexity of interactions between host and sars-cov- , offer new insights into the pathogenesis of sars-cov- , and can inform the development of antiviral drugs.
materials and methods

data collection

after the successful release of the virus genome into the cytoplasm, a negative-strand genomic-length rna is synthesized as the template for replication. negative-strand subgenome-length mrnas are formed as well from the virus genome as discontinuous rnas, and used as the templates for transcription. in the public data we collected for the analysis, there are two main library preparation methods to remove the highly abundant ribosomal rnas (rrna) from total rna before sequencing. one is polya+ selection, the other is rrna-depletion ( ). it is known that coronavirus genomic and subgenomic mrnas carry a polya tail at their ’ ends, so in the polya+ rna-seq, we have ( ) virus genomic sequence from virus replication, i.e. replicated genomic rnas from negative-strand as template, and ( ) subgenomic mrnas from virus transcription; in the rrna-depletion rna-seq we have ( ) virus genomic sequence from virus replication: both replicated genomic rnas from negative-strand as template and the negative-strand templates themselves, and ( ) subgenomic mrnas from virus transcription. polya+ selection was used unless stated otherwise; in this study, “total rna” is used to specify that the rrna-depletion method was used to prepare the sequencing libraries. the raw fastq data of different cell types infected with sars-cov- , sars-cov and mers-cov, and lung samples of covid- patients and healthy controls were retrieved from ncbi ( ) (https://www.ncbi.nlm.nih.gov/) and ena ( ) (https://www.ebi.ac.uk/ena) (accession numbers gse ( ), gse , gse ( ) and gse ( )).
the raw fastq data of pbmc and balf samples of covid- patients and corresponding controls were downloaded from big data center ( ) (https://bigd.big.ac.cn/) (accession number cra ) ( ), and the raw fastq data for balf healthy control samples were downloaded from ncbi (accession numbers srr , srr , and srr under project prjna ( )). the preprocessed single cell rna-seq data of balf samples from severe covid- patients and moderate covid- patients were downloaded from ncbi with accession number gse ( ). the preprocessed single cell rna-seq data of a balf sample from a healthy control was retrieved from ncbi (accession number gsm under project prjna ( )). detailed information about these public datasets is available in the supplementary file: supplementary.pdf. for analysis, the human grch release transcriptome and the green monkey (chlorocebus sabaeus) chlsab . release transcriptome and their corresponding annotation gtf files were downloaded from ensembl ( ) (https://www.ensembl.org). the reference virus genomes were downloaded from ncbi: sars-cov- (genbank: mn . ), sars-cov (genbank: ay . ), mers-cov (genbank: jx . ).

data analysis workflow

the workflow of this study is summarized in fig. s and fig. s in the supplementary file: supplementary.pdf. the quality of the raw fastq data was examined with fastqc ( ). trimmomatic- . ( ) was used to remove adapters and filter out low quality reads with parameters “-threads -phred illuminaclip:adapters.fasta: : : headcrop: leading: trailing: slidingwindow: : minlen: ”. the clean rna sequencing reads were then pseudo-aligned to the reference transcriptome and quantified using kallisto (version . . ) ( ) with parameters “-b –single -l -s ” for single-end sequencing data and with parameter “-b ” for paired-end sequencing data. expression levels were calculated and summarized as transcripts per million (tpm) on gene levels with sleuth ( ), and logfc was then calculated for each condition. the single cell rna-seq data were summarized across all cells to obtain “pseudo-bulk” samples. r packages edaseq ( ) and org.hs.eg.db ( ) were used to obtain gene length, and tpm was calculated with the “calculatetpm” function of r package scater ( ). logfc was then calculated for each patient. the clean rna-seq data were also aligned to the virus genome with bowtie ( ) (version . . ) to create the aligned bam files and obtain the mapping rates to the virus genomes. samtools ( ) (version . ) was then used for sorting and indexing the aligned bam files. the “samtools depth” command was used to produce the number of aligned reads per site along the virus genome. the heatmap in fig. i was made with the pheatmap r package ( ); the “complete” clustering method was used for clustering the rows and “euclidean” distance was used to measure the cluster distance. the heatmap in fig. a was made with the complexheatmap r package ( ); the “complete” clustering method was used for clustering the rows and columns and “euclidean” distance was used to measure the cluster distance.

references

. a. r. fehr, s. perlman, coronaviruses (springer, ), pp. – .
. t. kuiken, r. a. fouchier, m. schutten, g. f. rimmelzwaan, g. van amerongen, d. van riel, j. d. laman, t. de jong, g. van doornum, w. lim, a. e. ling, p. k. chan, j. s. tam, m. c. zambon, r. gopal, c. drosten, s. van der werf, n. escriou, j. c. manuguerra, k. stöhr, j. s. peiris, a. d. osterhaus, newly discovered coronavirus as the primary cause of severe acute respiratory syndrome. the lancet , – ( ).
. who, summary of probable sars cases with onset of illness from november to july .
. a. m. zaki, s. van boheemen, t. m. bestebroer, a. d. osterhaus, r. a. fouchier, isolation of a novel coronavirus from a man with pneumonia in saudi arabia. new england journal of medicine , – ( ).
. who, middle east respiratory syndrome coronavirus (mers-cov) – saudi arabia.
. f. wu, s. zhao, b. yu, y. m. chen, w. wang, z. g. song, y. hu, z. w. tao, j. h. tian, y. y. pei, m. l. yuan, y. l. zhang, f. h. dai, y. liu, q. m. wang, j. j. zheng, l. xu, e. c. holmes, y. z. zhang, a new coronavirus associated with human respiratory disease in china. nature , – ( ).
. who, who coronavirus disease (covid- ) dashboard.
. x. xu, p. chen, j. wang, j. feng, h. zhou, x. li, w. zhong, p. hao, evolution of the novel coronavirus from the ongoing wuhan outbreak and modeling of its spike protein for risk of human transmission. science china life sciences , – ( ).
. s. belouzard, j. k. millet, b. n. licitra, g. r. whittaker, mechanisms of coronavirus cell entry mediated by the viral spike protein. viruses , – ( ).
. z. lou, y. sun, z. rao, current progress in antiviral strategies. trends in pharmacological sciences , – ( ).
. e. teissier, f. penin, e.-i. pécheur, targeting cell entry of enveloped viruses as an antiviral strategy. molecules , – ( ).
. i. s. mahmoud, y. b. jarrar, w. alshaer, s. ismail, sars-cov- entry in host cells-multiple targets for treatment and prevention. biochimie ( ).
. z. qinfen, c. jinming, h. xiaojun, z. huanying, h. jicheng, f. ling, l. kunpeng, z.
jingqiang, the life cycle of sars coronavirus in vero e cells. journal of medical virology , – ( ).
. m. hoffmann, h. kleine-weber, s. schroeder, n. krüger, t. herrler, s. erichsen, t. s. schiergens, g. herrler, n. h. wu, a. nitsche, m. a. müller, c. drosten, s. pöhlmann, sars-cov- cell entry depends on ace and tmprss and is blocked by a clinically proven protease inhibitor. cell ( ).
. z.-y. yang, y. huang, l. ganesh, k. leung, w.-p. kong, o. schwartz, k. subbarao, g. j. nabel, ph-dependent entry of severe acute respiratory syndrome coronavirus is mediated by the spike glycoprotein and enhanced by dendritic cell transfer through dc-sign. journal of virology , – ( ).
. h. wang, p. yang, k. liu, f. guo, y. zhang, g. zhang, c. jiang, sars coronavirus entry into host cells through a novel clathrin-and caveolae-independent endocytic pathway. cell research , – ( ).
. w. widagdo, s. sooksawasdi na ayudhya, g. b. hundie, b. l. haagmans, host determinants of mers-cov transmission and pathogenesis. viruses , ( ).
. f. li, structure, function, and evolution of coronavirus spike proteins. annual review of virology , – ( ).
. m. singh, v. bansal, c. feschotte, a single-cell rna expression map of human coronavirus entry factors. biorxiv ( ).
. k. wang, w. chen, y.-s. zhou, j.-q. lian, z. zhang, p. du, l. gong, y. zhang, h.-y. cui, j.-j. geng, b. wang, x.-x. sun, c.-f. wang, x. yang, p. lin, y.-q. deng, d. wei, x.-m. yang, y.-m. zhu, k. zhang, z.-h. zheng, j.-l. miao, t. guo, y. shi, j. zhang, l. fu, q.-y. wang, h. bian, p. zhu, z.-n. chen, sars-cov- invades host cells via a novel route: cd -spike protein. biorxiv ( ).
. r. amraie, m. a. napoleon, w. yin, j. berrigan, e. suder, g. zhao, j. olejnik, s.
gummuluru, e. muhlberger, v. chitalia, n. rahimi, cd l/l-sign and cd /dc-sign act as receptors for sars-cov- and are differentially expressed in lung and kidney epithelial and endothelial cells. biorxiv ( ).
. f. hikmet, l. méar, Å. edvinsson, p. micke, m. uhlén, c. lindskog, the protein expression profile of ace in human tissues. molecular systems biology , e ( ).
. l. zou, f. ruan, m. huang, l. liang, h. huang, z. hong, j. yu, m. kang, y. song, j. xia, q. guo, t. song, j. he, h. l. yen, m. peiris, j. wu, sars-cov- viral load in upper respiratory specimens of infected patients. new england journal of medicine , – ( ).
. g. simmons, j. d. reeves, a. j. rennekamp, s. m. amberg, a. j. piefer, p. bates, characterization of severe acute respiratory syndrome-associated coronavirus (sars-cov) spike glycoprotein-mediated viral entry. proceedings of the national academy of sciences , – ( ).
. r. zang, m. f. g. castro, b. t. mccune, q. zeng, p. w. rothlauf, n. m. sonnek, z. liu, k. f. brulois, x. wang, h. b. greenberg, m. s. diamond, m. a. ciorba, s. p. whelan, s. ding, tmprss and tmprss promote sars-cov- infection of human small intestinal enterocytes. science immunology ( ).
. p. zmora, m. hoffmann, h. kollmus, a.-s. moldenhauer, o. danov, a. braun, m. winkler, k. schughart, s. pöhlmann, tmprss a activates the influenza a virus hemagglutinin and the mers coronavirus spike protein and is insensitive against blockade by hai- . journal of biological chemistry , – ( ).
. x. ou, y. liu, x. lei, p. li, d. mi, l. ren, l. guo, r. guo, t. chen, j. hu, z. xiang, z. mu, x. chen, j. chen, k. hu, q. jin, j. wang, z.
qian, characterization of spike glycoprotein of sars-cov- on virus entry and its immune cross-reactivity with sars-cov. nature communications , – ( ).
. y.-m. loo, m. gale jr, immune signaling by rig-i-like receptors. immunity , – ( ).
. a. g. bowie, i. r. haga, the role of toll-like receptors in the host response to viruses. molecular immunology , – ( ).
. c. chiang, m. u. gack, post-translational control of intracellular pathogen sensing pathways. trends in immunology , – ( ).
. a. park, a. iwasaki, type i and type iii interferons–induction, signaling, evasion, and application to combat covid- . cell host & microbe ( ).
. q. ruan, k. yang, w. wang, l. jiang, j. song, clinical predictors of mortality due to covid- based on an analysis of data of patients from wuhan, china. intensive care medicine , – ( ).
. i. f. n. hung, k. c. lung, e. y. k. tso, r. liu, t. w. h. chung, m. y. chu, y. y. ng, j. lo, j. chan, a. r. tam, h. p. shum, v. chan, a. k. l. wu, k. m. sin, w. s. leung, w. l. law, d. c. lung, s. sin, p. yeung, c. c. y. yip, r. r. zhang, a. y. f. fung, e. y. w. yan, k. h. leung, j. d. ip, a. w. h. chu, w. m. chan, a. c. k. ng, r. lee, k. fung, a. yeung, t. c. wu, j. w. m. chan, w. w. yan, w. m. chan, j. f. w. chan, a. k. w. lie, o. t. y. tsang, v. c. c. cheng, t. l. que, c. s. lau, k. h. chan, k. k. w. to, k. y. yuen, triple combination of interferon beta- b, lopinavir–ritonavir, and ribavirin in the treatment of patients admitted to hospital with covid- : an open-label, randomised, phase trial. the lancet , – ( ).
. e. andreakos, s. tsiodras, covid- : lambda interferon against viral load and hyperinflammation. embo molecular medicine p. e ( ).
. d. blanco-melo, b. e. nilsson-payant, w.
c. liu, s. uhl, d. hoagland, r. møller, t. x. jordan, k. oishi, m. panis, d. sachs, t. t. wang, r. e. schwartz, j. k. lim, r. a. albrecht, b. r. tenoever, imbalanced host response to sars-cov- drives development of covid- . cell ( ).
z. zhou, l. ren, l. zhang, j. zhong, y. xiao, z. jia, l. guo, j. yang, c. wang, s. jiang, d. yang, g. zhang, h. li, f. chen, y. xu, m. chen, z. gao, j. yang, j. dong, b. liu, x. zhang, w. wang, k. he, q. jin, m. li, j. wang, heightened innate immune responses in the respiratory tract of covid- patients. cell host & microbe ( ).
a. broggi, s. ghosh, b. sposito, r. spreafico, f. balzarini, a. lo cascio, n. clementi, m. de santis, n. mancini, f. granucci, i. zanoni, type iii interferons disrupt the lung epithelial barrier upon viral recognition. science ( ).
l. wei, s. ming, b. zou, y. wu, z. hong, z. li, x. zheng, m. huang, l. luo, j. liang, x. wen, t. chen, q. liang, l. kuang, h. shan, x. huang, viral invasion and type i interferon response characterize the immunophenotypes during covid- infection. available at ssrn ( ).
j. y. zhang, x. m. wang, x. xing, z. xu, c. zhang, j. w. song, x. fan, p. xia, j. l. fu, s. y. wang, r. n. xu, x. p. dai, l. shi, l. huang, t. j. jiang, m. shi, y. zhang, a. zumla, m. maeurer, f. bai, f. s. wang, single-cell landscape of immunological responses in patients with covid- . nature immunology pp. – ( ).
e. sallard, f. x. lescure, y. yazdanpanah, f. mentre, n. peiffer-smadja, type interferons as a potential treatment against covid- . antiviral research p. ( ).
e. wyler, k. mösbauer, v. franke, a. diag, t. g. lina, r. arsie, f. klironomos, d. koppstein, s. ayoub, c. buccitelli, a. richter, i. legnini, a. ivanov, t. mari, s. d.
giudice, p. p. jan, a. m. marcel, d. niemeyer, m. selbach, a. akalin, n. rajewsky, c. drosten, m. landthaler, bulk and single-cell gene expression profiling of sars-cov- infected human cell lines identifies molecular targets for therapeutic intervention. biorxiv ( ).
y. xiong, y. liu, l. cao, d. wang, m. guo, a. jiang, d. guo, w. hu, j. yang, z. tang, h. wu, y. lin, m. zhang, q. zhang, m. shi, y. liu, y. zhou, k. lan, y. chen, transcriptomic characteristics of bronchoalveolar lavage fluid and peripheral blood mononuclear cells in covid- patients. emerging microbes & infections , – ( ).
d. michalovich, n. rodriguez-perez, s. smolinska, m. pirozynski, d. mayhew, s. uddin, s. van horn, m. sokolowska, c. altunbulakli, a. eljaszewicz, b. pugin, w. barcik, m. kurnik-lucka, k. a. saunders, k. d. simpson, p. schmid-grendelmeier, r. ferstl, r. frei, n. sievi, m. kohler, p. gajdanowicz, k. b. graversen, k. lindholm bøgh, m. jutel, j. r. brown, c. a. akdis, e. m. hessel, l. o’mahony, obesity and disease severity magnify disturbed microbiome-immune interactions in asthma patients. nature communications , – ( ).
m. liao, y. liu, j. yuan, y. wen, g. xu, j. zhao, l. cheng, j. li, x. wang, f. wang, l. liu, i. amit, s. zhang, z. zhang, single-cell landscape of bronchoalveolar immune cells in patients with covid- . nature medicine pp. – ( ).
c. morse, t. tabib, j. sembrat, k. l. buschur, h. t. bittar, e. valenzi, y. jiang, d. j. kass, k. gibson, w. chen, a. mora, p. v. benos, m. rojas, r. lafyatis, proliferating spp /mertk-expressing macrophages in idiopathic pulmonary fibrosis. european respiratory journal ( ).
c. tikellis, m.
thomas, angiotensin-converting enzyme (ace ) is a key modulator of the renin angiotensin system in health and disease. international journal of peptides ( ).
g. simmons, d. n. gosalia, a. j. rennekamp, j. d. reeves, s. l. diamond, p. bates, inhibitors of cathepsin l prevent severe acute respiratory syndrome coronavirus entry. proceedings of the national academy of sciences , – ( ).
s. lukassen, r. l. chua, t. trefzer, n. c. kahn, m. a. schneider, t. muley, h. winter, m. meister, c. veith, a. w. boots, b. p. hennig, m. kreuter, c. conrad, r. eils, sars-cov- receptor ace and tmprss are primarily expressed in bronchial transient secretory cells. the embo journal , e ( ).
r. ueha, t. sato, t. goto, a. yamauchi, k. kondo, t. yamasoba, expression of ace and tmprss proteins in the upper and lower aerodigestive tracts of rats. biorxiv ( ).
j. shilts, g. j. wright, no evidence for basigin/cd as a direct sars-cov- spike binding receptor. biorxiv ( ).
j. mercer, a. helenius, virus entry by macropinocytosis. nature cell biology , – ( ).
d. l. mckee, a. sternberg, u. stange, s. laufer, c. naujokat, candidate drugs against sars-cov- and covid- . pharmacological research p. ( ).
h. k. law, c. y. cheung, h. y. ng, s. f. sia, y. o. chan, w. luk, j. m. nicholls, j. peiris, y. l. lau, chemokine up-regulation in sars-coronavirus–infected, monocyte-derived human dendritic cells. blood , – ( ).
c. y. cheung, l. l. m. poon, i. h. y. ng, w. luk, s.-f. sia, m. h. s. wu, k.-h. chan, k.-y. yuen, s. gordon, y. guan, j. s. m. peiris, cytokine responses in severe acute respiratory syndrome coronavirus-infected macrophages in vitro: possible relevance to pathogenesis. journal of virology , – ( ).
l. li, j.
wo, j. shao, h. zhu, n. wu, m. li, h. yao, m. hu, r. h. dennin, sars-coronavirus replicates in mononuclear cells of peripheral blood (pbmcs) from sars patients. journal of clinical virology , – ( ).
w. zhao, x. he, k. a. hoadley, j. s. parker, d. n. hayes, c. m. perou, comparison of rna-seq by poly (a) capture, ribosomal rna depletion, and dna microarray for expression profiling. bmc genomics , – ( ).
e. w. sayers, r. agarwala, e. e. bolton, j. r. brister, k. canese, k. clark, r. connor, n. fiorini, k. funk, t. hefferon, j. b. holmes, s. kim, a. kimchi, p. a. kitts, s. lathrop, z. lu, t. l. madden, a. marchler-bauer, l. phan, v. a. schneider, c. l. schoch, k. d. pruitt, j. ostell, database resources of the national center for biotechnology information. nucleic acids research , d –d ( ).
r. leinonen, r. akhtar, e. birney, l. bower, a. cerdeno-tárraga, y. cheng, i. cleland, n. faruque, n. goodgame, r. gibson, g. hoad, m. jang, n. pakseresht, s. plaister, r. radhakrishnan, k. reddy, s. sobhany, p. t. hoopen, r. vaughan, v. zalunin, g. cochrane, the european nucleotide archive. nucleic acids research , d –d ( ).
l. riva, s. yuan, x. yin, l. martin-sancho, n. matsunaga, l. pache, s. burgstaller-muehlbacher, p. d. de jesus, p. teriete, m. v. hull, m. w. chang, j. f. w. chan, j. cao, v. k. m. poon, k. m. herbert, k. cheng, t. t. h. nguyen, a. rubanov, y. pu, c. nguyen, a. choi, r. rathnasinghe, m. schotsaert, l. miorin, m. dejosez, t. p. zwaka, k. y. sit, l. martinez-sobrido, w. c. liu, k. m. white, m. e. chapman, e. k. lendy, r. j. glynne, r. albrecht, e. ruppin, a. d. mesecar, j. r. johnson, c. benner, r. sun, p. g. schultz, a. i. su, a. garcía-sastre, a. k. chatterjee, k. y. yuen, s. k.
chanda, discovery of sars-cov- antiviral drugs through large-scale compound repurposing. nature , – ( ).
z. zhang, et al., database resources of the national genomics data center in . nucleic acids research , d ( ).
a. d. yates, et al., ensembl . nucleic acids research , d –d ( ).
s. andrews, fastqc: a quality control tool for high throughput sequence data ( ).
a. m. bolger, m. lohse, b. usadel, trimmomatic: a flexible trimmer for illumina sequence data. bioinformatics , – ( ).
n. l. bray, h. pimentel, p. melsted, l. pachter, near-optimal probabilistic rna-seq quantification. nature biotechnology , – ( ).
h. pimentel, n. l. bray, s. puente, p. melsted, l. pachter, differential analysis of rna-seq incorporating quantification uncertainty. nature methods , ( ).
d. risso, k. schwartz, g. sherlock, s. dudoit, gc-content normalization for rna-seq data. bmc bioinformatics , ( ).
m. carlson, s. falcon, h. pages, n. li, org.hs.eg.db: genome wide annotation for human. r package version ( ).
d. j. mccarthy, k. r. campbell, a. t. lun, q. f. wills, scater: pre-processing, quality control, normalization and visualization of single-cell rna-seq data in r. bioinformatics , – ( ).
b. langmead, s. l. salzberg, fast gapped-read alignment with bowtie . nature methods , ( ).
h. li, b. handsaker, a. wysoker, t. fennell, j. ruan, n. homer, g. marth, g. abecasis, r. durbin, the sequence alignment/map format and samtools. bioinformatics , – ( ).
r. kolde, pheatmap: pretty heatmaps ( ). r package version . . .
z. gu, r. eils, m. schlesner, complex heatmaps reveal patterns and correlations in multidimensional genomic data. bioinformatics , – ( ).
acknowledgements: the authors thank professor ke xu from wuhan university and professor dimitri lavillette from institut pasteur of shanghai for helpful conversations.
funding: this work was partially funded by grant kl b (secovit) of the german federal ministry of education and research.
author contributions: pei hao and yingying cao conceived the research. daniel hoffmann, pei hao, and yingying cao designed the analyses. yingying cao and xintian xu conducted the analyses. all authors wrote the manuscript.
competing interests: the authors declare that they have no competing financial interests.
data and materials availability: additional data and materials are available online.

figures and tables:

table . data of cell lines (cells) included in this study
columns: virus; virus strain; virus dose (moi); time; replicates; species of origin; cell type; library preparation; accession number
sars-cov- usa-wa / h homo sapiens nhbe polya+ selection gse
mock mock mock h homo sapiens nhbe polya+ selection gse
sars-cov- usa-wa / . h homo sapiens a polya+ selection gse
mock mock mock h homo sapiens a polya+ selection gse
sars-cov- usa-wa / h homo sapiens a polya+ selection gse
mock mock mock h homo sapiens a polya+ selection gse
sars-cov- usa-wa / . h homo sapiens a -ace polya+ selection gse
mock mock mock h homo sapiens a -ace polya+ selection gse
sars-cov- usa-wa / h homo sapiens a -ace polya+ selection gse
mock mock mock h homo sapiens a -ace polya+ selection gse
sars-cov- usa-wa / h homo sapiens calu polya+ selection gse
mock mock mock h homo sapiens calu polya+ selection gse
sars-cov- munich/bavpat / . h homo sapiens calu rrna-depletion gse
mock mock mock h homo sapiens calu rrna-depletion gse
sars-cov- munich/bavpat / . h homo sapiens calu polya+ selection gse
mock mock mock h homo sapiens calu polya+ selection gse
sars-cov- munich/bavpat / . h homo sapiens caco polya+ selection gse
mock mock mock h homo sapiens caco polya+ selection gse
sars-cov- munich/bavpat / . h homo sapiens h polya+ selection gse
mock mock mock h^ homo sapiens h polya+ selection gse
sars-cov- usa-wa / . h * chlorocebus sabaeus vero e rrna-depletion gse
mock mock mock h chlorocebus sabaeus vero e rrna-depletion gse
sars-cov frankfurt strain . h homo sapiens calu polya+ selection gse
sars-cov frankfurt strain . h homo sapiens calu rrna-depletion gse
sars-cov frankfurt strain . h homo sapiens caco polya+ selection gse
sars-cov frankfurt strain . h homo sapiens h polya+ selection gse
sars-cov urbani strain . h homo sapiens mrc polya+ selection gse
sars-cov urbani strain h homo sapiens mrc polya+ selection gse
sars-cov urbani strain . h chlorocebus sabaeus vero e polya+ selection gse
sars-cov urbani strain h chlorocebus sabaeus vero e polya+ selection gse
mers-cov emc/ . h homo sapiens mrc polya+ selection gse
mers-cov emc/ h homo sapiens mrc polya+ selection gse
mers-cov emc/ . h chlorocebus sabaeus vero e polya+ selection gse
mers-cov emc/ h chlorocebus sabaeus vero e polya+ selection gse
mock mock mock h homo sapiens mrc polya+ selection gse
mock mock mock h homo sapiens vero e polya+ selection gse
^ no corresponding h mock control samples were available for h cells; h mock control samples were used instead.
* there are three replicates, but only two of them were available for download when the manuscript was in preparation.

table . data of covid- patients included in this study
columns: individuals; tissue; data type; accession number
bronchoalveolar lavage fluid from covid- patients bulk rna-seq cra
bronchoalveolar lavage fluid from healthy negative control bulk rna-seq prjna ^
peripheral blood mononuclear cells from covid- patients bulk rna-seq cra
peripheral blood mononuclear cells from healthy negative control bulk rna-seq cra
lung biopsy from postmortem covid- patients bulk rna-seq gse
lung biopsy from healthy negative control bulk rna-seq gse
bronchoalveolar lavage fluid from covid- patients (severe) single cell rna-seq gse
bronchoalveolar lavage fluid from covid- patients (moderate) single cell rna-seq gse
bronchoalveolar lavage fluid from healthy negative control single cell rna-seq prjna *
^ three samples under project prjna : srr , srr , and srr were used.
* one sample with accession number gsm under project prjna was used.

[figure: bars with per-replicate dots; x-axis, cell line and moi condition; y-axis, mapping rate to virus genome (%); legend, mers-cov, sars-cov, sars-cov- ]
fig. . mapping rate to virus genome. the dots represent the mapping rates to the virus genome for each individual replicate under the given conditions (cell line, moi, and virus).
bar heights are mean mapping rates to the virus genome for each condition.

fig. . the number of reads mapped to the corresponding virus genome. (a-e) the dot plots show the number of reads mapped to each site of the corresponding virus genome. the annotation of the genome of each virus is from ncbi (sars: gcf_ . , sars-cov- : gcf_ . , mers: gcf_ . ). labels in grey title bars correspond to conditions as in fig. .

[figure: panels a-h, dot plots of log (tpm+ ) per cell line (a , a .ace , caco , calu , h , mrc , nhbe , veroe ) for dpp (a), ace (b), tmprss (c), tmprss (d), ctsb (e), ctsl (f), bsg (g) and cd (h); panel i, heatmap of log (tpm+ ) for ace , bsg, cd , ctsb, ctsl, dpp and tmprss genes across the cell lines, with clusters labelled a, b, c; panel j, schematic]
fig. . the expression levels of the receptors and proteases. (a-h) each dot represents the expression value in each sample. (i) heatmap of the expression levels of coronavirus associated receptors and factors of different cell types. labels a, b, c mark cell clusters that likely share entry routes sketched in panel j. (j) entry mechanisms involved in sars-cov- entry into cells. schematic is based on a figure by vega asensio - own work, cc by-sa . , https://commons.wikimedia.org/w/index.php?curid= .

[figure: panel a, heatmap of logfc across samples (infected cell lines at the given moi, pbmc, balf and lung patient samples) and genes (ddx, ifih, dhx, tlr, irf, tbk, nfkb, ifn, ifn receptor, jak, tyk, stat, isg, mx, oas, ifit, ifitm, ccl and cxcl genes); panels b-g, dot plots of tpm for ifnb and ifnl genes in mock and infected cell lines at the given moi and in healthy, moderate and severe balf samples]
fig. . expression levels of genes related to immune responses (a) heatmap of the logfc of ifns, isgs and pro-inflammatory cytokines.
the clustering of samples produces a cluster (top) with little ifn/isg expression comprising mers infections and non-infectable cells/sars-cov- / (except for caco cells), and a cluster (bottom) with strong ifn/isg expression comprising sars-cov- / infectable cells and patient samples. (b-g) expression levels of ifns. each dot represents the expression value of a sample. bars indicate mean expression levels (in tpm) of respective ifn at different moi values.

supplementary materials:

additional information about public data
all data can be downloaded from public repositories; the three main sources are ncbi ( ) (https://www.ncbi.nlm.nih.gov/), ena ( ) (https://www.ebi.ac.uk/ena) and big data center ( ) (https://bigd.big.ac.cn/).

gse dataset ( )
from this dataset we downloaded:
biological triplicates of primary human lung epithelium (nhbe) which were mock treated or infected with sars-cov- (usa-wa / ) at an moi of ;
biological triplicates of transformed lung alveolar (a ) cells which were mock treated or infected with sars-cov- (usa-wa / ) at an moi of . or ;
biological triplicates of transformed lung alveolar (a ) cells transduced with a vector expressing human ace , which were also mock treated or infected with sars-cov- (usa-wa / ) at an moi of . or ;
biological triplicates of transformed lung-derived calu- cells which were mock treated or infected with sars-cov- (usa-wa / ) at an moi of ;
covid- patient samples: uninfected human lung biopsies derived from one male (age ) and one female (age ) and used as control biological replicates, and lung samples derived from a single male covid- deceased patient (age ) which were processed in technical replicates.
library preparation method: polya+ selection was used to remove rrnas before sequencing.

gse dataset ( )
from this dataset we downloaded biological replicates of calu- , caco- and h cells which were mock treated or infected with sars-cov- (patient isolate betacov/munich/bavpat / /epi_isl_ ) or sars-cov (frankfurt strain) at an moi of . .
library preparation method: polya+ selection was used to remove rrnas before sequencing caco- and h cells. for calu- cells, two library preparation methods, polya+ selection and rrna-depletion, were each used to remove rrnas before sequencing.

gse dataset ( )
from this dataset we downloaded rna sequencing data of vero e cells which were either mock-infected or infected with sars-cov- usa-wa / (moi = . ) with three replicates. however, when we downloaded the data, one sample with accession number gsm was not available for download. cells were harvested at hours after infection, and the rrna-depletion method was used to extract rna for sequencing.

gse dataset
from this dataset we downloaded biological triplicates of mrc and vero e cells which were mock treated or infected with sars-cov (urbani strain) or mers-cov (emc/ ) at an moi of . or .
library preparation method: polya+ selection was used to remove rrnas before sequencing.

cra dataset ( )
this dataset is publicly available at https://bigd.big.ac.cn/gsa/browse/cra . from this dataset we downloaded the raw fastq data of pbmc and balf samples of covid- patients and corresponding pbmc controls.

prjna dataset ( )
from this dataset we downloaded the raw fastq data for balf healthy control samples with accession numbers srr , srr , and srr .
gse dataset ( )
from this dataset we downloaded the preprocessed single cell rna-seq data of balf samples from severe covid- patients and mild covid- patients.

prjna dataset ( )
from this dataset we downloaded the preprocessed single cell rna-seq data of balf sample from a healthy control with accession number gsm .

supplementary figures

fig. s . workflow of bulk rna-seq.
[figure: flow diagram with labels: bulk rna-seq raw data; fastqc; trimmomatic; align to virus genome; pseudoalign to host transcriptome; bowtie ; kallisto; samtools; sleuth; reads coverage along virus genome; gene level tpm values; bulk rna-seq clean data]

fig. s . workflow of single cell rna-seq data.
[figure: flow diagram with labels: count matrix of scrna-seq of covid- patients; sum counts across all cells to obtain “pseudo-bulk” samples; edaseq; obtain gene length; org.hs.eg.db; scater; obtain gene level tpm values]

fig. s . expression levels of ifns in mrc cells infected with sars-cov and mers-cov.
[figure: dot plots of tpm for ifnb and ifnl genes in mrc cells at mock, . moi and moi, shown separately for sars-cov and mers-cov]

fig. s . expression levels of ifns in balf samples of patients.
[figure: panels a-d, dot plots of tpm for ifnb and ifnl genes in healthy.balf and patient.balf samples]

fig. s . expression levels of ifns in pbmc samples of patients.
[figure: panels a-d, dot plots of tpm for ifnb and ifnl genes in healthy.pbmc and patient.pbmc samples]

fig. s . expression levels of ifns in lung samples of patients.
[figure: panels a-d, dot plots of tpm for ifnb and ifnl genes in healthy.lung and patient.lung samples]

fig. s . the number of reads mapped to the sars-cov- genome in lung samples of patients.
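the single-cell workflow listed under the supplementary figures (sum counts across all cells into a "pseudo-bulk" sample, obtain gene lengths, then obtain gene-level tpm values) can be sketched as below. this is a minimal illustration only, not the authors' code: the source names the r packages edaseq, org.hs.eg.db and scater for these steps, whereas this sketch swaps in python with numpy. the function name `pseudo_bulk_tpm`, the toy inputs, and the log base 2 with a pseudocount of 1 are assumptions made for the example.

```python
import numpy as np


def pseudo_bulk_tpm(counts, gene_lengths_bp):
    """Collapse a genes x cells count matrix into one pseudo-bulk TPM vector.

    counts          : 2-D array-like, rows are genes, columns are cells
    gene_lengths_bp : 1-D array-like, one gene length (base pairs) per row
    (hypothetical helper, written for illustration only)
    """
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float) / 1000.0
    if counts.shape[0] != lengths_kb.shape[0]:
        raise ValueError("need exactly one gene length per row of counts")

    # Step 1: sum counts across all cells to obtain the pseudo-bulk sample.
    pseudo_bulk = counts.sum(axis=1)

    # Step 2: normalise by gene length (reads per kilobase).
    per_kb = pseudo_bulk / lengths_kb

    # Step 3: rescale so that the values sum to one million (TPM).
    total = per_kb.sum()
    if total == 0:
        return np.zeros_like(per_kb)
    return per_kb / total * 1e6


# Toy input (made-up values): 3 genes x 2 cells.
toy_counts = [[1, 1], [2, 2], [0, 0]]
toy_lengths = [1000, 2000, 500]
tpm = pseudo_bulk_tpm(toy_counts, toy_lengths)

# Log-scaled values of the kind plotted on the figure axes
# (base 2 and pseudocount 1 are assumptions).
log_tpm = np.log2(tpm + 1)
```

summing before length normalisation keeps the result comparable to a bulk sample, because tpm is defined on the pooled counts rather than averaged over per-cell values.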
[figure: per-sample dot plots of read coverage (samples samn , samn ); x-axis, genomic position with gene annotations (nsp nsp nsp nsp −nsp s orf a e m orf − n); y-axis, sars-cov- reads (log ); panel title, sars-cov- ]

fig. s . the expression levels of ace in the pbmc, lung and balf samples of healthy individuals.
[figure: dot plot of ace expression (tpm) in healthy.balf, healthy.pbmc and healthy.lung samples]

additional files that are too large to be embedded into the .tex file: table s to tables .xlsx table s to tables .csv

capsule network for protein ubiquitination site prediction
qiyi huang , ¶ jiulei jiang ¶ yin luo * weimin li & ying wang
(school of computer science and engineering, north minzu university, yinchuan , ningxia, china)
(school of life sciences, east china normal university, shanghai , china)
(school of computer science and engineering, changshu institute of technology, suzhou , jiangsu, china)
(school of computer engineering and science, shanghai university, shanghai , china)
*corresponding author. e-mail: yluo@bio.ecnu.edu.cn (yl)
¶ these authors contributed equally to this work.
& this author also contributed equally to this work.
copyright: © huang et al.
this is an open-access article distributed under the terms of the creative commons attribution license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
funding: this project is supported by the national key r&d program of china ( yfe ), the national natural science foundation of china ( ), and the national statistical science research project ( ly ).
competing interests: the authors have declared that no competing interests exist.
preprint notice: biorxiv preprint, not certified by peer review. the copyright holder is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by . international license (http://creativecommons.org/licenses/by/ . /). this version posted january , . ; doi: https://doi.org/ . / . . .

abstract
ubiquitination is one of the most important protein posttranslational modifications and is involved in many biological processes. traditional experimental methods for determining ubiquitination sites are expensive and time-consuming, whereas computational prediction methods can identify ubiquitination sites accurately and efficiently. this study combined a convolutional neural network with a capsule network to design a deep learning model, “caps-ubi,” for multispecies ubiquitination site prediction. two encoding methods, one-of-k and amino acid continuous encoding, were used to characterize the sequence pattern of ubiquitination sites. the proposed caps-ubi predictor achieved an accuracy of . , a sensitivity of . , a specificity of . , a matthews correlation coefficient of . , and an area under the receiver operating characteristic curve of . , which outperformed the other tested predictors.

introduction
ubiquitination is an important posttranslational modification of proteins, consisting of the covalent binding of ubiquitin to a variety of cellular proteins. ubiquitin was discovered in 1975 by goldstein et al. [ ]; it is a small protein composed of 76 amino acids [ ]. ubiquitination is the process by which the small ubiquitin molecule is covalently attached to a lysine of a substrate protein under the action of a series of enzymes. three enzymes are involved in the process: e1 activation, e2 conjugation, and e3 ligation. ubiquitination plays a very important role in basic processes such as signal transduction, cell diseases, dna repair, and transcription regulation [ – ]. because of these biological roles, identifying potential ubiquitination sites helps to understand protein regulation and molecular mechanisms.

determining ubiquitination sites with traditional experimental techniques such as mass spectrometry [ ] and antibody recognition [ ] is costly and time-consuming. therefore, it is necessary to develop a computational method that can accurately and efficiently recognize protein ubiquitination sites. in recent years, several computational methods have been developed to predict potential ubiquitination sites. huang et al. [ ] used amino acid composition (aac), a position weighting matrix, amino acid pair composition (aapc), a position-specific scoring matrix (pssm), and other information to develop a predictor called ubisite using a support vector machine (svm). nguyen et al. [ ] used an svm to combine three kinds of information (aac, evolution information, and aapc) to develop a predictor. qiu et al. [ ] developed a predictor called “iubiq-lys” that uses sequence evolution information and a gray system model. chen et al. [ ] also applied an svm to build the ubiprober predictor. wang et al. [ ] introduced physical–chemical attributes into an svm to develop the esa-ubisite predictor. radivojac et al. [ ] developed the predictor ubpred using a random forest algorithm. lee et al. [ ] developed ubsite using efficient radial basis functions.
all of those machine learning-based methods and predictors have advanced ubiquitination site prediction research and achieved good prediction performance. however, most of them rely on manual feature selection, which may lead to imperfect features [ ], and their datasets are small despite the large volume of accumulated biomedical data. deep learning, a leading machine learning technology, can handle large-scale data well. its multilayer networks and nonlinear mapping operations can fit the complex structure of data. in recent years, deep learning has developed rapidly [ ] and has been successfully applied in various fields of bioinformatics [ , ].

some methods based on deep learning have been used for ubiquitination site identification. for example, fu et al. [ ] applied one-hot and composition of k-spaced amino acid pairs encoding methods to develop deepubi with text-cnn. liu et al. [ ] used deep transfer learning methods to develop the deeptl-ubi predictor for multispecies ubiquitination site prediction. he et al. [ ] established a multimodal predictor using one-hot encoding, physical–chemical properties of amino acids, and a pssm.

although various ubiquitination site predictors and tools have been developed, they still have limitations, and their accuracy and other performance measures need further improvement. in this paper, a deep learning model, “caps-ubi,” is proposed that uses a capsule network for protein ubiquitination site prediction.
in caps-ubi, the protein fragments are first encoded with the one-of-k and amino acid continuous methods. then three convolutional layers and a capsule network layer act as a feature extractor to capture the functional domains in the protein fragments and produce the prediction result. relative to existing tools, caps-ubi shows a significant improvement in prediction performance. researchers could use the predictor to select candidate ubiquitination sites and verify them experimentally, which narrows the range of protein candidates and saves time.

materials and methods

benchmark dataset
the ubiquitination dataset came from the largest online protein lysine modification database, plmd . , which contains protein lysine modifications. the database has , proteins and , protein lysine modification sites, including , proteins and , ubiquitination sites. to eliminate errors caused by homologous sequences, we used cd-hit [ ] to filter out homologous sequences with sequence similarities greater than %. we obtained , proteins and , ubiquitination sites, which were used as the positive sample set. based on those annotated sequences, , nonubiquitinated sites were extracted from the proteins as the negative sample set, and cd-hit- d [ ] was used to filter out sequences whose similarity to the positive sample set was greater than %. to train a balanced model, we randomly selected as many negative samples as there were positive samples, then used % of the data as the training and validation sets and % as the independent test set. finally, , ubiquitination sites and , nonubiquitination sites were obtained. the final data division is shown in table .

table . data of protein ubiquitination sites
dataset | no. of positive data | no. of negative data
training | , | ,
validation | , | ,
testing | , | ,

input sequence coding
the encoding method strongly influences the quality of the prediction results; a good feature can capture the correlation between the ubiquitination feature and the targets from peptide sequences [ ]. encoding converts the sequence information of a protein into numerical information, on which deep learning is then performed. in this study, two methods were used to encode the amino acid sequence around the protein ubiquitination site: one-of-k encoding and amino acid continuous encoding.

one-of-k encoding
the one-of-k encoding method was adopted for protein fragments, and each protein fragment was encoded into an m × k two-dimensional matrix, where m is the number of amino acids in each sequence (that is, the length of the input sequence) and k is the number of amino acid types. there are 20 kinds of common amino acids. when the length of the input sequence did not reach the window length, the fragment was padded with “-” on its left or right side, and “-” was treated as one more amino acid type, so the alphabet consisted of 21 symbols.

continuous coding of amino acids
the continuous amino acid coding method [ ] was proposed by venkatarajan; it uses physical–chemical properties to characterize amino acids quantitatively. five principal components were used to capture the variation in the physical–chemical properties of amino acids. in this paper, each amino acid is represented by a 6-d vector, in which the first five dimensions hold the five principal components shown in table of [ ] and the last dimension marks a gap in the input protein fragment of length m.
the gap is represented by a dash “-”, meaning that when the sequence length does not reach the window length, the bit is coded as ; otherwise, it is . finally, each protein fragment is coded into an m × 6 matrix. this continuous coding scheme takes the physical and chemical properties of the amino acids into account and has a smaller dimension than one-of-k coding. the smaller input dimension leads to a relatively simple network structure, which helps to avoid overfitting.

capsule network
in a cnn, the pooling layer can extract valuable information from the data, but some location information is lost [ ]. also, a cnn outputs scalar values in neurons, and the information represented by scalar neurons is limited and cannot reflect the spatial relations among the internal features of the neural network. to solve the problems of scalar neurons, hinton and colleagues proposed a deep learning architecture called a capsule network [ ]. the main building module of a capsule network is the capsule [ ], which is a vector of neurons. the length of the capsule represents the probability that an entity exists (the longer the capsule, the greater the probability), and the direction of the capsule represents the state of the entity. the capsule network thus provides a building block that can better model the complex relations within a neural network.
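as an illustration of the input sequence coding described above, the following python sketch builds both encodings for one fragment. it is a minimal sketch under stated assumptions, not the authors' implementation: the helper names (`extract_window`, `one_of_k`, `continuous`), the alphabet ordering, and the gap flag value of 1 are hypothetical, and the five descriptors per residue are random placeholders standing in for the published physical–chemical components.

```python
import numpy as np

# 20 common amino acids plus '-' as the padding symbol (ordering is an assumption)
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

# Placeholder descriptors: five random numbers per residue, NOT the published values
_rng = np.random.default_rng(0)
PLACEHOLDER_DESCRIPTORS = {aa: _rng.normal(size=5) for aa in ALPHABET[:-1]}


def extract_window(sequence, site, half_width):
    """Return the fragment centred on 0-based position `site`, padded with '-'."""
    left = sequence[max(0, site - half_width):site]
    right = sequence[site + 1:site + 1 + half_width]
    return ("-" * (half_width - len(left)) + left + sequence[site]
            + right + "-" * (half_width - len(right)))


def one_of_k(fragment):
    """Encode a fragment as an m x k binary matrix with a single 1 per row."""
    encoded = np.zeros((len(fragment), len(ALPHABET)), dtype=np.float32)
    for row, aa in enumerate(fragment):
        encoded[row, INDEX[aa]] = 1.0
    return encoded


def continuous(fragment, descriptors):
    """Encode each residue as its descriptor vector plus a trailing gap flag."""
    n_desc = len(next(iter(descriptors.values())))
    encoded = np.zeros((len(fragment), n_desc + 1), dtype=np.float32)
    for row, aa in enumerate(fragment):
        if aa == "-":
            encoded[row, -1] = 1.0          # assumed convention: 1 marks padding
        else:
            encoded[row, :n_desc] = descriptors[aa]
    return encoded


if __name__ == "__main__":
    fragment = extract_window("MKTAYIAKQR", site=1, half_width=3)
    print(fragment, one_of_k(fragment).shape,
          continuous(fragment, PLACEHOLDER_DESCRIPTORS).shape)
```

for a fragment of length m, `one_of_k` returns an m × 21 matrix and `continuous` an m × 6 matrix, which reflects the smaller input dimension noted for the continuous scheme.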
a cnn uses scalar activation functions, such as the rectified linear unit (relu), sigmoid, and tanh, whereas the capsule network uses a vector activation function called squash. the calculation equation is

$$ v_j = \frac{\lVert s_j \rVert^2}{1 + \lVert s_j \rVert^2} \, \frac{s_j}{\lVert s_j \rVert} $$

where $v_j$ is the output of capsule $j$, and $s_j$ is the weighted sum of the input vectors of capsule $j$. this function compresses the vector length to the interval [0, 1], which can be regarded as a kind of compression and reallocation of the vector length. except in the first capsule layer, the input $s_j$ of a capsule is the weighted sum of the prediction vectors $\hat{u}_{j|i}$ from the capsules in the layer below, and each prediction vector is obtained by multiplying the output $u_i$ of a lower-layer capsule by the weight matrix $w_{ij}$:

$$ s_j = \sum_i c_{ij} \, \hat{u}_{j|i} $$

$$ \hat{u}_{j|i} = w_{ij} \, u_i $$

where $c_{ij}$ is the coupling coefficient, which is obtained by a softmax transformation from $b_{ij}$; its calculation equation is

$$ c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} $$

in eq. ( ), the coupling coefficients between capsule $i$ in the previous layer and all capsules in the next layer sum to 1. the coupling coefficient is obtained through a dynamic routing mechanism; the pseudocode is as follows:

procedure routing($\hat{u}_{j|i}$, r, l)
    for all capsules i in layer l and capsules j in layer (l + 1): $b_{ij} \leftarrow 0$
    for r iterations do:
        for all capsules i in layer l: $c_i \leftarrow \mathrm{softmax}(b_i)$
        for all capsules j in layer (l + 1): $s_j \leftarrow \sum_i c_{ij} \hat{u}_{j|i}$
        for all capsules j in layer (l + 1): $v_j \leftarrow \mathrm{squash}(s_j)$
        for all capsules i in layer l and capsules j in layer (l + 1): $b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j$
    return $v_j$

the loss function of the capsule network is the margin loss function, and the calculation equation is

$$ L_k = T_k \max(0, m^+ - \lVert v_k \rVert)^2 + \lambda (1 - T_k) \max(0, \lVert v_k \rVert - m^-)^2 $$

where $K$ is the number of categories, $T_k$ is the true label (1 for ubiquitinated and 0 for nonubiquitinated), and $\lVert v_k \rVert$ is the output length of the kth capsule, which is the probability of predicting the kth class. the upper boundary $m^+$ is . ; it penalizes a present class whose capsule is too short (false negatives). the lower boundary $m^-$ is . ; it penalizes an absent class whose capsule is too long (false positives). $\lambda$ is a proportional coefficient of . , which down-weights the loss for categories that do not appear, to prevent the capsule vector lengths of all categories from shrinking in the early stage of training. the total loss is the sum of the losses of the $K$ categories.

architecture design
as shown in figure , the proposed model contains two identical subnetworks that process the one-of-k and amino acid continuous encoding modes. after training in their respective subnetworks, the two models merge their features as the final output. each subnetwork consists of the same three 1d convolutional layers (conv1, conv2, conv3) and a capsule network layer. the first convolutional layer (conv1) is a 1d convolutional layer that comprises convolution kernels with a size of and a step size of and uses the relu activation function. a convolution kernel with a length of 1 first appeared in network in network [ ]; such a kernel can reduce the complexity of the model and can make the network deeper and wider. applied in this study, it acted as a feature filter and could pool features in the two encoding modes.

the second convolutional layer, conv2, is a conventional convolutional layer with 1d convolution kernels with a length of and a step size of , which functions as a local feature detector that converts the protein sequence input into corresponding local features. the output of conv2 can be understood as the functional domain characteristics of the protein, and it is used as the input of the next layer, conv3. the third convolutional layer, conv3, has 1d convolution kernels with a size of and a step size of . it uses the relu activation function and a dropout mechanism with a random deletion rate of . . the dropout mechanism is used to prevent the model from overfitting and to increase its generalization ability. these two convolutional layers increase the feature representation ability of the capsule network and convert the original features of protein fragments into more advanced and abstract features.

then the local features of conv3 are used as the input of the primarycapsule layer. the dimension of each capsule in the primarycapsule layer is , the step size is , the convolution kernel length is , and the squash activation function is used. the last layer, labelcapsule, is a capsule with a dimension of , which is used to represent the two states of the input protein fragment: ubiquitination site or non-ubiquitination site. finally, the outputs of the two subnetworks are merged as the final prediction result. figure .
network structure of the proposed model

model training
for model training, we used the adam [ ] optimization algorithm. adam automatically adjusts the learning rate of the parameters, which improves the training speed and the stability of the model. the learning rate was . , the exponential decay rate of the first-moment estimate was . , and the exponential decay rate of the second-moment estimate was . . the dynamic routing mechanism was consistent with that in the original paper [ ]. the number of routing iterations was , and the margin loss function, whose form is shown in eq. ( ), was used as the loss function of the model. the model was trained for epochs. the deep learning framework used by this model was keras . . . keras is a highly modular deep learning framework based on theano and written in python; it supports both cpu and gpu. the programming language was python . , and the model was trained and tested on a windows system equipped with an nvidia rtx gpu.

results

model evaluation and performance indicators
a confusion matrix is a visual display tool used to evaluate the quality of classification models. each row of the matrix represents the actual condition of the sample, and each column represents the sample condition predicted by the model. there are four values in the matrix, as shown in the following equations, where fn is the number of false negatives, fp is the number of false positives, tn is the number of true negatives, and tp is the number of true positives.
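the routing iterations and margin loss mentioned under model training can be made concrete with a small numpy sketch of the capsule computations from the methods section: the squash activation, routing by agreement, and the margin loss. this is a hedged illustration rather than the caps-ubi code; the function names are hypothetical, and the values in the usage lines (three routing iterations, m+ = 0.9, m− = 0.1, λ = 0.5) are the settings reported by sabour et al. for the original capsule network, assumed here for illustration only.

```python
import numpy as np


def squash(s, axis=-1, eps=1e-9):
    """Shrink each vector to a length below 1 while keeping its direction."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)


def softmax(b, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def dynamic_routing(u_hat, n_iter):
    """Route prediction vectors u_hat of shape (n_lower, n_upper, dim)."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))               # routing logits b_ij start at 0
    for _ in range(n_iter):
        c = softmax(b, axis=1)                     # c_ij, sums to 1 over upper capsules j
        s = np.einsum("ij,ijd->jd", c, u_hat)      # s_j = sum_i c_ij * u_hat_j|i
        v = squash(s)                              # v_j = squash(s_j)
        b = b + np.einsum("ijd,jd->ij", u_hat, v)  # agreement u_hat_j|i . v_j
    return v


def margin_loss(lengths, targets, m_plus, m_minus, lam):
    """Margin loss summed over classes; `lengths` are capsule lengths ||v_k||."""
    present = targets * np.maximum(0.0, m_plus - lengths) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, lengths - m_minus) ** 2
    return np.sum(present + absent, axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u_hat = rng.normal(size=(6, 2, 4))             # 6 lower capsules, 2 classes, dim 4
    v = dynamic_routing(u_hat, n_iter=3)           # 3 iterations: assumed value
    lengths = np.linalg.norm(v, axis=-1)
    print(lengths, margin_loss(lengths, np.array([1.0, 0.0]), 0.9, 0.1, 0.5))
```

because the squash output always has a length below 1, the two labelcapsule lengths can be read directly as class scores, which is what the margin loss compares against its two boundaries.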
the following indicators based on the confusion matrix are usually used to evaluate the prediction performance of the model:

$$ \mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}, \qquad \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN} $$

$$ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}} $$

among them, sn stands for sensitivity, which evaluates the prediction performance on positive samples; sp is the specificity, which evaluates the prediction performance on negative samples; acc is the accuracy, which is the proportion of all samples that are predicted correctly; and mcc is the matthews correlation coefficient, which is an overall evaluation of the model. the receiver operating characteristic (roc) curve and the area under the roc curve (auc) are usually used to evaluate binary classifiers: the larger the auc value, the better the model performance.

experimental results
first, we ran a series of experiments on the selection of the window size of protein fragments. because the correlation information between amino acids had a direct effect on the prediction results, we needed to determine an appropriate window size. previous studies directly used empirical values such as , , or . however, different data models and classifiers tend to favor different window sizes [ ]. therefore, a window length of n was selected from a range of to , and we ran experiments with each window length. for each window length, we encoded all training data in the two input modes and trained their respective subnetworks. the appropriate window size was then selected according to the prediction results on the validation set.
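the four confusion-matrix indicators can be computed directly from the counts. the sketch below is illustrative only: the function name `confusion_metrics` and the example counts are hypothetical and are not taken from the experiments.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy and MCC from confusion-matrix counts.

    Assumes at least one actual positive and one actual negative sample.
    """
    sn = tp / (tp + fn)                      # sensitivity: recall on positive samples
    sp = tn / (tn + fp)                      # specificity: recall on negative samples
    acc = (tp + tn) / (tp + tn + fp + fn)    # share of all samples predicted correctly
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sn": sn, "sp": sp, "acc": acc, "mcc": mcc}


if __name__ == "__main__":
    # Hypothetical counts chosen only to exercise the formulas
    print(confusion_metrics(tp=40, tn=45, fp=5, fn=10))
```

unlike accuracy, mcc uses all four counts in both numerator and denominator, so it stays informative when the positive and negative classes are unbalanced.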
figure shows the performance of various window sizes in the one-of-k and amino acid continuous encoding modes.

figure . accuracy of the validation set for various window lengths

in figure , the abscissa represents the window length, and the ordinate represents the accuracy of the model. figure shows that the two encoding modes had the highest accuracy when the window length was . therefore, we set the window length of this model to .

to compare the performance of the model under different encoding schemes, we compared the capsule network with a cnn that had a similar hierarchical structure and was trained on a training set of the same size. the cnn structure replaced only the primarycapsule layer with a conv layer, and the labelcapsule layer was replaced with a × fully connected layer. the comparison results are shown in table .

table . comparison of various coding schemes
feature | model | acc (%) | sn (%) | sp (%) | auc | mcc
one-of-k | capsnet | . | . | . | . | .
one-of-k | cnn | . | . | . | . | .
amino acid continuous | capsnet | . | . | . | . | .
amino acid continuous | cnn | . | . | . | . | .
one-of-k and amino acid continuous | capsnet | . | . | . | . | .
one-of-k and amino acid continuous | cnn | . | . | . | . | .
acc: accuracy of the model; sn: sensitivity of the model; sp: specificity of the model; auc: area under curve; mcc: matthews correlation coefficient.

from table , it can be concluded that the capsule network's accuracies were . , . , and . percentage points higher than those of the cnn under the one-of-k, amino acid continuous, and combined one-of-k and amino acid continuous encodings, indicating that the capsule network, which models hierarchical relations internally, has advantages over the cnn.
among them, the capsule network performed best under the combined one-of-k and amino acid continuous encoding modes: the proposed caps-ubi model achieved an accuracy, sensitivity, specificity, area under curve, and matthews correlation coefficient of . %, . %, . %, . , and . , respectively. the proposed caps-ubi was obtained from balanced data. the roc curve of caps-ubi on the test set is shown in figure , which indicates that the predictions agreed closely with the true labels.

figure . receiver operating characteristic curve of caps-ubi on the test set

the model was trained on balanced data, whereas in an experimentally verified ubiquitination and nonubiquitination dataset [ ] the ratio of positive peptides to negative peptides was : , so we also tested caps-ubi using natural-distribution data. the test results are shown in table . according to the test results, the performance was slightly worse than that under the balanced data.

table . results of testing caps-ubi under natural-distribution data
protein fragment | acc (%) | sn (%) | sp (%) | auc | mcc | positive–negative ratio
, | . | . | . | . | . | :
, , | . | . | . | . | :
acc: accuracy of the model; sn: sensitivity of the model; sp: specificity of the model; auc: area under curve; mcc: matthews correlation coefficient.

comparison with other methods
in the past years, many researchers have contributed to the prediction and study of protein ubiquitination sites. we compared the proposed model with other sequence-based prediction tools.
the corresponding data and results are shown in table , which shows that the performance of the caps-ubi model exceeded that of the best-performing deep learning model, deepubi, and several other prediction models. the accuracy, sensitivity, specificity, area under curve, and matthews correlation coefficient of caps-ubi were . , . , . , . , and . percentage points higher, respectively, than those of deepubi.

table . proposed caps-ubi compared with other methods
predictor | acc (%) | sn (%) | sp (%) | auc | mcc
ubipred | . | . | . | . | .
ubsite | . | . | , | – | –
cksaap_ubsite | . | . | . | . | .
ubiprober | – | . | . | . | .
iubiq-lys | . | . | . | – | .
deepubi | . | . | , | . | .
caps-ubi | . | . | . | . | .
acc: accuracy of the model; sn: sensitivity of the model; sp: specificity of the model; auc: area under curve; mcc: matthews correlation coefficient.

conclusion and outlook
in this paper, a new deep learning model for predicting protein ubiquitination sites is proposed, using the one-of-k and amino acid continuous coding modes. we used the largest available protein ubiquitination site dataset, and the experimental results above verify the effectiveness of this model. the operation of the model has four main steps: encoding protein sequences, constructing convolutional layers, constructing a capsule network layer, and constructing an output layer. the capsule network introduces a new building block for deep learning. relative to a cnn, the capsule network, which uses a dynamic routing mechanism to update parameters, requires more training time, but the time required for prediction is similar. the capsule network can also characterize the complex relations among amino acids in various sequence positions and can explore the internal data distribution related to biochemical significance. the proposed caps-ubi prediction tool will facilitate the sequence analysis of ubiquitination and can also be used to identify other posttranslational modification sites in proteins. in the future, we will study other features that may better extract sample attributes to construct deeper models.

references
. goldstein g, scheid m, hammerling u, schlesinger dh, niall hd, boyse ea. isolation of a polypeptide that has lymphocyte-differentiating properties and is probably represented universally in living cells. proc natl acad sci u s a. ; : - .
. wilkinson kd. the discovery of ubiquitin-dependent proteolysis. proc natl acad sci u s a. ; : - .
. hicke l, schubert hl, hill cp. ubiquitin-binding domains. nat rev mol cell biol. ; : .
. hicke l. protein regulation by monoubiquitin. nat rev mol cell biol. ; : - .
. pickart cm. ubiquitin enters the new millennium. mol cell. ; : - .
. haglund k, dikic i. ubiquitylation and cell signaling. embo j. ; : - .
. peng j, schwartz d, elias je, et al. a proteomics approach to understanding protein ubiquitination. nat biotechnol. ; : - .
. gentry ms, worby ca, dixon je. insights into lafora disease: malin is an e ubiquitin ligase that ubiquitinates and promotes the degradation of laforin. proc natl acad sci u s a. ; ( ): - .
. huang ch, su mg, kao hj, jhong jh, weng sl, lee ty. ubisite: incorporating two-layered machine learning method with substrate motifs to predict ubiquitin-conjugation site on lysines. bmc syst biol. ; suppl (suppl ): .
. nguyen vn, huang ky, huang ch, lai kr, lee ty. a new scheme to characterize and identify protein ubiquitination sites. ieee/acm trans comput biol bioinform. ; : - .
. qiu wr, xiao x, lin wz, chou kc. iubiq-lys: prediction of lysine ubiquitination sites in proteins by extracting sequence evolution information via a gray system model.
j biomol struct dyn. ; : - .
. chen x, qiu jd, shi sp, suo sb, huang sy, liang rp. incorporating key position and amino acid residue features to identify general and species-specific ubiquitin conjugation sites. bioinformatics. ; : - .
. wang jr, huang wl, tsai mj, hsu kt, huang hl, ho sy. esa-ubisite: accurate prediction of human ubiquitination sites by identifying a set of effective negatives. bioinformatics. ; : - .
. radivojac p, vacic v, haynes c, et al. identification, analysis, and prediction of protein ubiquitination sites. proteins. ; ( ): - .
. lee ty, chen sa, hung hy, ou yy. incorporating distant sequence features and radial basis function networks to identify ubiquitin conjugation sites. plos one. ; :e .
. wang d, zeng s, xu c, et al. musitedeep: a deep-learning framework for general and kinase specific phosphorylation site prediction. bioinformatics. ; : - .
. shaw d, chen h, jiang t. deepisofun: a deep domain adaptation approach to predict isoform functions. bioinformatics. ; ( ): - .
. sun d, wang m, feng h, li a. prognosis prediction of human breast cancer by integrating deep neural network and support vector machine: supervised feature extraction and classification for breast cancer prognosis prediction. th international congress on image and signal processing, biomedical engineering and informatics (cisp-bmei). ieee. ( ).
. fu h, yang y, wang x, wang h, xu y. deepubi: a deep learning framework for prediction of ubiquitination sites in proteins. bmc bioinformatics. ; : .
. liu y, li a, zhao xm, wang m.
deeptl-ubi: a novel deep transfer learning method for effectively predicting ubiquitination sites of multiple species. methods. ;s - ( ) - . . he f, wang r, li j, bao l, xu d, zhao x. large-scale prediction of protein ubiquitination sites using a multimodal deep architecture. bmc syst biol. ; (suppl ): . . huang y, niu b, gao y, fu l, li w. cd-hit suite: a web server for clustering and comparing biological sequences. bioinformatics. ; : - . . huang ch, su mg, kao hj, jhong jh, weng sl, lee ty. ubisite: incorporating two-layered machine learning method with substrate motifs to predict ubiquitin-conjugation site on lysines. bmc syst biol. ; suppl (suppl ): . . plewczynski d, tkacz a, wyrwicz ls, rychlewski l. automotif server: prediction of single residue post-translational modifications in proteins. bioinformatics. ; : - . . venkatarajan m s , braun w . new quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical–chemical properties[j]. molecular modeling annual, , ( ): - . .cc-by . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . / . dombetzki la. an overview over capsule networks. network architectures and services . . sabour s , frosst n , hinton g e . dynamic routing between capsules[j]. . . hinton,g.e. et al. ( ) transforming auto-encoders. international conference on artifificial neural networks. springer, finland, pp. – . . lin m., chen q., yan s. network in network[j]. arxiv preprint arxiv: . , : . kingma,d. and ba,j. ( ) adam: a method for stochastic optimization, arxiv preprint arxiv: . .cc-by . 
triplex and other dna motifs show motif-specific associations with mitochondrial dna deletions and species lifespan

authors
kamil pabis. georg august university of göttingen, göttingen, germany. mail: kamil.pabis@gmail.com

abstract
the “theory of resistant biomolecules” posits that long-lived species show resistance to molecular damage at the level of their biomolecules. here, we test this hypothesis in the context of mitochondrial dna (mtdna) as it implies that predicted mutagenic dna motifs should be inversely correlated with species maximum lifespan (mls). first, we confirmed that guanine-quadruplex and direct repeat (dr) motifs are mutagenic, as they associate with mtdna deletions in the human major arc of mtdna, while also adding mirror repeat (mr) and intramolecular triplex motifs to a growing list of potentially mutagenic features. what is more, triplex motifs showed disease-specific associations with deletions and an apparent interaction with guanine-quadruplex motifs. surprisingly, even though dr, mr and guanine-quadruplex motifs were associated with mtdna deletions, their correlation with mls was explained by the biased base composition of mtdna. only triplex motifs negatively correlated with mls even after adjusting for body mass, phylogeny, mtdna base composition and effective number of codons.
taken together, our work highlights the importance of base composition for the comparative biogerontology of mtdna and suggests that future research on mitochondrial triplex motifs is warranted.

abbreviations
bps, mtdna deletion break points
dr, direct repeats
er, everted repeats
gq, guanine-quadruplexes
ir, inverted repeats
mls, species maximum lifespan
mr, mirror repeats
nbmst, non-b dna motif search tool
nc, number of effective codons
pgls, phylogenetic generalized least squares
sd, standard deviation
trip, triplex forming motif
xr, any repeat half-site or motif
mtdna, mitochondrial dna

(which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprint this version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint

introduction
macromolecular damage to lipids, proteins and dna accumulates with aging (richardson and schadt , gladyshev ), whereas cells isolated from long-lived species are resistant to genotoxic and cytotoxic drugs, giving rise to the multistress resistance theory of aging (miller , hamilton and miller ). by extension of this idea, the “theory of resistant biomolecules” posits that lipids, proteins and dna itself should be resilient in long-lived species (pamplona and barja ). in support of this theory, it was shown that long-lived species possess membranes that contain fewer lipids with reactive double bonds (valencak and ruf ) and perhaps a lower content of oxidation-prone cysteine and methionine in mitochondrially encoded proteins (see aledo et al. for a discussion).
mitochondrial dna (mtdna) mutations constitute one type of macromolecular damage that accumulates over time. point mutations accumulate in proliferative tissues like the colon and in some progeroid mice (kauppila et al. ), while the accumulation of mtdna deletions in postmitotic tissues may underpin certain age-related diseases like parkinson’s and sarcopenia (lawless et al. , bender et al. ). if the theory of resistant biomolecules can be generalized, the mtdna of long-lived species should resist both point mutation and deletion formation. however, we will focus on deletions because they are more pathogenic than point mutations at the same level of heteroplasmy (gamamge et al. ) and because human tissues do not accumulate the high levels of point mutations observed in progeroid mouse models (khrapko et al. ). since deletion formation depends on the primary sequence of the mtdna (sequence motifs), it is amenable to bioinformatic methods.

ever since a link between direct repeat (dr) motifs and deletion formation became known, variations of the theory of resistant biomolecules have been tested, although not necessarily under this name. it was reasoned that long-lived species evolved to resist deletion formation and mtdna instability by reducing the number of mutagenic motifs in their mtdna (khaidakov et al. , yang et al. ). we aim to extend these findings by re-evaluating known candidate motifs and establishing new ones, which we then correlate with species maximum lifespan (mls). studying multiple motif classes at once also allows us to reveal relationships between potentially overlapping mtdna motifs that may affect the data. we define candidate motifs as those that are associated with deletion formation inside the major arc of human mtdna, because during asynchronous replication the major arc is single stranded for extended periods of time (persson et al. ), which should favor the formation of secondary structures.
finally, we test if these motifs correlate with the mls of mammals, birds and ray-finned fishes after correcting for potential biases, especially global mtdna base composition, which is an important confounder (aledo et al. ) yet is neglected in some studies (yang et al. ).

the choice of motifs to study is based on biological plausibility and published literature that will be briefly reviewed below. mutagenic motifs include repeats as well as guanine-quadruplex (gq)- and triplex-forming motifs. dr motifs can lead to dna instability through strand-slippage if two dr motifs mispair during replication (persson et al. ), whereas inverted repeat (ir), g-quadruplex and triplex motifs destabilize progression of the replication fork through the formation of stable secondary structures. some of the structures formed include hairpins for ir motifs (tremblay-belzile et al. ), triple stranded dna for triplex motifs and bulky stacks of guanines for g-quadruplex motifs (bacolla et al. ; fig. ). mirror repeat (mr) and everted repeat (er) motifs, in contrast, do not allow stable watson-crick base pairing and are thus less likely to be mutagenic, although a subset of mr motifs may form triplex structures (kamat et al. ).

thus, many motifs can be mutagenic in principle, but what is the evidence that these motifs are related to mtdna instability, particularly deletions, and mls? paradoxically, while drs are the motif most consistently associated with mtdna deletion breakpoints (bps), despite preliminary reports (khaidakov et al. , lakshmanan et al. , yang et al. ), no correlation with species mls was seen in recent studies (lakshmanan et al. ). in contrast, with the exception of one preprint (mikhailova et al.
), irs are not known to be associated with mtdna deletions (dong et al. ), although they do show a negative relationship with species mls (yang et al. ) and may contribute to inversions (tremblay-belzile et al. ). whether age-related mtdna inversions underlie any pathology, however, requires further study. finally, g-quadruplex motifs are associated with both deletions (dong et al. ) and point mutations (butler et al. ), but no study has tested if they correlate with mls. triplex motifs are poorly studied, with one report finding no association between these motifs and deletions (oliveira et al. ).

based on these studies we decided to test the theory of resistant biomolecules by quantifying dr, mr, ir, er, g-quadruplex- and triplex-forming motifs. we stipulate that if a motif class played a causal role in aging, it should be involved in deletion formation and its abundance should be negatively correlated with species mls.

figure
a. direct repeat, both half-sites have the same orientation.
b. inverted repeat, the half-sites are complementary and have mirror symmetry.
c. everted repeat, the half-sites are complementary.
d. mirror repeat, the half-sites have mirror symmetry.
e. triplex motifs can form a triple helical dna structure also called h-dna.
f. in a g-quadruplex multiple g-quartets (depicted as blue rectangles) stack on top of each other.
adapted from gurusaran et al. ( ) and khristich and mirkin ( ) with permission. half-sites shown in red.

methods

detection of dna motifs
repeats were detected by a script written in r (vr- . . ). briefly, to find all repeats with n basepairs (bps), the mtdna light strand is truncated by to n bps and each of the n truncated mtdnas is then split every n bps.
this generates every possible substring (and thus repeat) of length n. in the next step, duplicate strings are removed. afterwards we can find dr (a substring with at least two matches in the mtdna), mr (at least one match in the mtdna and on its reverse), ir (at least one match in the mtdna and on its reverse-complement) and er motifs (at least one match in the mtdna and on its complement). overlapping and duplicate repeats were not counted for the correlation between repeats and mls. the code for the analyses performed in this paper can be found on github (pabisk/aging_triplex ). unless stated otherwise, all analyses were performed in r.

g-quadruplex motifs were detected by the pqsfinder package (v . . , hon et al. ). intramolecular triplex-forming motifs were detected by the triplex package (v . . , hon et al. ) and duplicates were removed. we also compared the data with two other publicly available tools, triplexator (buske et al. ) and the non-b dna motif search tool (nbmst; cer et al. ). triplexator was run on a virtual machine in an oracle vm virtualbox (v . ) in -ss mode on the human mitochondrial genome and its reverse complement; the results were combined and overlapping motifs from the output were removed. we used the web interface of nbmst to detect mirror repeats/triplexes (v . ).

association between motifs and major arc deletions
the major arc was defined as the region between position and of the human mtdna (nc_ . ). the following deletions and their breakpoints were located in this region and included: deletions from the mitobreak database (damas et al. , mtdna breakpoints.xlsx), from persson et al. ( ) and from hjelm et al. ( ). each deletion is defined by two breakpoints. a breakpoint pair was considered to associate with a motif if the motif fell within a defined window around one or both breakpoints, depending on the analysis.
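The four repeat classes defined in this section (dr, mr, ir, er) can be illustrated with a minimal Python sketch. It is not the author's R script. The function names, the uppercase ACGT alphabet and the rule that a substring cannot serve as its own partner are assumptions made for illustration, and overlapping half-sites are not filtered here.

```python
# Minimal sketch of repeat classification by exhaustive substring matching.
# Assumptions (not from the source): uppercase ACGT input, hypothetical
# function names, and a half-site may not pair with itself.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_positions(seq, sub):
    """Return all 0-based start positions of sub in seq (overlaps allowed)."""
    positions = []
    start = seq.find(sub)
    while start != -1:
        positions.append(start)
        start = seq.find(sub, start + 1)
    return positions


def has_partner(seq, kmer, partner):
    """True if partner occurs in seq at a site other than the kmer itself."""
    hits = find_positions(seq, partner)
    if partner == kmer:
        # A self-symmetric substring needs a second occurrence to count.
        return len(hits) >= 2
    return len(hits) >= 1


def classify_repeats(seq, n):
    """Classify every distinct substring of length n into DR, MR, IR, ER."""
    kmers = {seq[i:i + n] for i in range(len(seq) - n + 1)}
    classes = {"DR": set(), "MR": set(), "IR": set(), "ER": set()}
    for kmer in kmers:
        reverse = kmer[::-1]
        complement = kmer.translate(COMPLEMENT)
        reverse_complement = complement[::-1]
        if has_partner(seq, kmer, kmer):
            classes["DR"].add(kmer)  # second match on the same strand
        if has_partner(seq, kmer, reverse):
            classes["MR"].add(kmer)  # match on the reversed sequence
        if has_partner(seq, kmer, reverse_complement):
            classes["IR"].add(kmer)  # match on the reverse complement
        if has_partner(seq, kmer, complement):
            classes["ER"].add(kmer)  # match on the complement
    return classes
```

Searching for the transformed substring in the original sequence is equivalent to searching for the original substring in the transformed sequence, which keeps the sketch to a single pass over the distinct substrings.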
the window size was chosen in relation to the length of the studied motifs ( bp for repeats and bp for other motifs). three different motif orientations relative to the breakpoints were considered. for motifs with half-sites (i.e. repeats), two orientations were considered: either both half-sites at any one breakpoint of a deletion, or one half-site per breakpoint of a deletion. motifs with overlapping half-sites were not counted. in the third case, distinct g-quadruplex and triplex motifs could associate with one or both breakpoints of a deletion, but were at most counted once, since the latter case is sufficiently rare.

in order to exclude overlapping “hybrid” motifs, mr and dr motifs with the same sequence were removed, whereas triplex and g-quadruplex motifs were removed if they were in proximity.

to generate controls, the mtdna deletions as a whole were randomly redistributed inside the major arc which, because of the fixed deletion size, allowed us to approximate the original distribution of breakpoints (as suggested by oliveira et al. ). significance was determined via one-sample t-test in prism (v . ) by comparing actual breakpoints to such randomized controls. alternative controls were generated by shifting each breakpoint by bp towards the midpoint of the major arc or as in fig. s .

cancer associated breakpoints
we obtained all autosomal breakpoints available from the catalogue of somatic mutations in cancer (cosmic; release v , th august ), which includes deletions, inversions, duplications and other abnormalities (n= in total). after removing breakpoints whose sequences could not be retrieved (< .
%), we quantified the number of predicted g-quadruplex and triplex motifs in a bp window centered on the breakpoints using default settings for the detection of these motifs. sequences of breakpoint regions were obtained from the grch build of the human genome using the bsgenome package (v . . ). each breakpoint shifted by + bps served as its own control.

lifespan, base composition and life history traits
we included three phylogenetic classes in our analysis for which we had sufficient data (n> ): mammals, birds and ray-finned fishes (actinopterygii). mls and body mass were determined from the anage database (tacutu et al. ) and, for mammals, supplemented with data from pacifici et al. ( ). the mtdna accessions were obtained from an updated version of mitoage (unpublished; toren et al. ). species were excluded if body mass data was unavailable, if the sequence could not be obtained using the genbankr package (v . . ), or if the extracted cytochrome b dna sequence did not allow for an alignment, precluding phylogenetic correction. the species data can be found in the supplementary (species data.xlsx).

we analyzed the full mtdna sequence, heuristically defined as the mtdna sequence between the first and last encoded trna, excluding the d-loop, which is rarely involved in repeat-mediated deletion formation (yang et al. ). the effective number of codons was calculated using wright’s nc (smith et al. ). base composition was calculated for the light-strand. gc skew was calculated as the fraction (g − c)/(g + c) and at skew as (a − t)/(a + t). all correlations are pearson’s r. partial correlations were performed using the ppcor package (v . ).

phylogenetic generalised least squares and phylogenetic correction
observed correlations between traits and lifespan can be spurious due to shared species ancestry (speakman ). to correct for this, we use phylogenetic generalised least squares (pgls) implemented in the caper package (v . . ).
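As a small worked example of the base-composition measures defined in this section, gc skew as (g − c)/(g + c) and at skew as (a − t)/(a + t), the Python sketch below computes both for a single strand. The function names are hypothetical, and raising an error for a zero denominator is an assumption of this sketch; the source does not say how that case was handled.

```python
# Minimal sketch of the skew formulas for one strand (e.g. the light strand).
# Function names and the zero-denominator behaviour are assumptions.

def base_counts(seq):
    """Count A, C, G and T in an uppercase sequence; other symbols are ignored."""
    return {base: seq.count(base) for base in "ACGT"}


def gc_content(seq):
    """Fraction of G and C among the counted A, C, G, T bases."""
    counts = base_counts(seq)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (counts["G"] + counts["C"]) / total


def gc_skew(seq):
    """(G - C) / (G + C)."""
    counts = base_counts(seq)
    denominator = counts["G"] + counts["C"]
    if denominator == 0:
        raise ValueError("sequence contains no G or C")
    return (counts["G"] - counts["C"]) / denominator


def at_skew(seq):
    """(A - T) / (A + T)."""
    counts = base_counts(seq)
    denominator = counts["A"] + counts["T"]
    if denominator == 0:
        raise ValueError("sequence contains no A or T")
    return (counts["A"] - counts["T"]) / denominator
```

For the toy sequence GGGCAATT (three g, one c, two a, two t) the gc skew is (3 − 1)/(3 + 1) = 0.5 and the at skew is 0.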
species phylogenetic trees were constructed via neighbor joining based on aligned cytochrome b dna sequences using clustal omega from the msa package (v . . ). in the resulting mammalian and bird trees, four branch edge lengths were equal to zero; these were set to the lowest non-zero value in the dataset.

results

direct repeats and mirror repeats are over-represented at mtdna deletion breakpoints
in order to define candidate mtdna motifs that could be linked with lifespan, we started by reanalyzing motifs that associate with mtdna deletion breakpoints reported in the mitobreak database (damas et al. ; fig. s ; mtdna breakpoints.xlsx). in the analysis below, we consider dr and ir motifs, which are thought to be mutagenic, as well as mr and er motifs, which are so far not known to be mutagenic, and we pool all to bp long repeats, since the data is similar between different repeat lengths (fig. s ).

as shown by others, we found that dr motifs often flank mtdna deletions (fig. a). in contrast, no strong association was seen for er and ir motifs, even considering a larger window around the breakpoint to allow for the fact that irs could bridge and destabilize mtdna over long distances (persson et al. ; fig. s ). surprisingly, we also found mr motifs flanking deletion breakpoints more often than expected by chance (fig. a). however, dr and mr motifs are known to correlate with each other (shamanskiy et al. ; fig. b) and indeed we noticed a large sequence overlap between mr and dr motifs (fig. b), which could explain an apparent over-representation of mrs at breakpoints. removal of overlapping mr-dr hybrid motifs confirmed this suspicion. after this correction, the degree of enrichment was strongly attenuated (fig.
c) and the total number of breakpoints flanked by mr motifs was reduced by > %. nevertheless, long mr motifs remained particularly over-represented around deletions (fig. s ).

since the prior analysis only considered motifs that flank both breakpoints, we next tested the idea that ir and other motifs could be mutagenic if both half-sites are found at any of the breakpoints. however, in this analysis no motif class showed enrichment around breakpoints (fig. d).

figure
direct repeat (dr) and mirror repeat (mr) motifs are significantly enriched around actual deletion breakpoints (bps) compared to reshuffled bps, but the same is not true for inverted repeat (ir) and everted repeat (er) motifs (a, d). the surprising correlation between mr motifs and deletion bps is attenuated when mrs that have the same sequence as dr motifs are removed (b, c). controls were generated by reshuffling the deletion bps while maintaining their distribution (n= , mean ±sd shown). the schematic drawings above (a, d) depict the orientation of the repeat (xr) half-sites in relation to the bps. *** p < . ; ** p < . by one sample t-test.
a) the number of deletions associated with dr, mr, ir or er motifs at both bps compared with reshuffled controls.
b) venn diagram showing the number of mr, dr and hybrid mr-dr motifs that were identified within the major arc.
c) the number of deletions associated with mr motifs, before (mr) and after removal of hybrid mr-dr motifs (mrdr-), compared with reshuffled controls.
d) the number of deletions associated with dr, mr, ir or er motifs at either bp compared with reshuffled controls.
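The reshuffled controls used in these comparisons, described in the methods as deletions randomly redistributed inside the major arc with their size fixed, can be sketched in Python as below. This is an illustrative sketch only. The function names, the coordinates, the window size and the number of controls are hypothetical, a motif is reduced to a single position, and only the t statistic is computed because the source reports running the one-sample t-test in prism.

```python
# Minimal sketch of size-preserving reshuffling of deletions and a
# one-sample t statistic against the observed count. All names and
# numeric values are hypothetical placeholders.
import math
import random
import statistics


def reshuffle_deletions(deletions, arc_start, arc_end, rng):
    """Move each (start, end) deletion to a random place inside the arc,
    keeping its length unchanged."""
    shuffled = []
    for start, end in deletions:
        size = end - start
        new_start = rng.randint(arc_start, arc_end - size)
        shuffled.append((new_start, new_start + size))
    return shuffled


def count_associated(deletions, motif_positions, window):
    """Count deletions with at least one motif within `window` bp of
    either breakpoint."""
    count = 0
    for bp1, bp2 in deletions:
        if any(abs(m - bp1) <= window or abs(m - bp2) <= window
               for m in motif_positions):
            count += 1
    return count


def control_counts(deletions, motif_positions, arc_start, arc_end,
                   window, n_controls, seed=0):
    """Repeat the reshuffling n_controls times and count associations."""
    rng = random.Random(seed)
    return [
        count_associated(
            reshuffle_deletions(deletions, arc_start, arc_end, rng),
            motif_positions, window)
        for _ in range(n_controls)
    ]


def one_sample_t(controls, observed):
    """t statistic for the control mean against the observed value;
    returns None when the controls have no variance."""
    sd = statistics.stdev(controls)
    if sd == 0:
        return None
    mean = statistics.mean(controls)
    return (mean - observed) / (sd / math.sqrt(len(controls)))
```

A strongly negative t statistic would indicate that the observed number of motif-associated deletions exceeds what the reshuffled controls produce.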
predicted triplex-forming motifs are over-represented at mtdna breakpoints
given the association between mr motifs and breakpoints we decided to analyze triplex motifs, a special case of homopurine and homopyrimidine mirror repeats (khristich and mirkin , bissler ), and their association with deletion breakpoints in the mitobreak database.

here, we use the triplex package to predict intramolecular triplex motifs because it has several advantages compared to other software (hon et al. ). for example, using the nbmst tool, as in a previous study of mtdna instability (oliveira et al. ), we only identified two potential triplex motifs within the major arc that did not overlap with the six motifs identified by the triplex package (table s ). in contrast, using triplexator (buske et al. ) we were able to detect four of the six triplex motifs and the motifs detected by triplexator were also enriched at breakpoints (table s ).

we noticed that predicted triplexes are g-rich and thus could be related to g-quadruplex motifs (doluca et al. ). in a comparison of the two motif types, however, we found several differences (table s , s ). triplex motifs were shorter and less abundant than predicted g-quadruplexes, associated with fewer breakpoints altogether (fig. ) and, in contrast to g-quadruplexes, which were almost exclusive to the g-rich mtdna heavy-strand, triplex motifs were also common on the light-strand.

the six triplex motifs detected by the triplex package were significantly enriched around deletion breakpoints and when we excluded triplex-g-quadruplex hybrid motifs the result was attenuated but remained significant (fig. a).
given the higher risk of spurious findings with only six motifs, we repeated the analysis using a relaxed definition of triplex and the results were fundamentally unchanged (fig. b). furthermore, our results were not sensitive to reasonable changes in the size of the search window around breakpoints (fig. s a, b), motif quality scores (fig. s c, d) or inclusion of overlapping motifs (fig. s e-g).

analogous to the situation with mr motifs we tested if overlapping triplex-dr hybrid motifs could bias our results. given the rarity of triplex motifs and the many drs in the mitochondrial genome we chose an alternative approach rather than excluding triplex motifs that overlapped any dr half-site. we compared the fraction of triplex and g-quadruplex positive deletions associated with drs (gq+, dr+ and trip+, dr+) and not associated with drs (gq+, dr- and trip+, dr-). we considered a deletion to be dr+ if both breakpoints were flanked by the same dr sequence. in this case, only % of trip+ deletions associated with drs whereas % of gq+ deletions did (table s ).

figure
triplex motifs are significantly enriched around actual breakpoints (bps) compared to reshuffled bps (a, b) even after removal of g-quadruplex (gq)-triplex hybrid motifs (tripgq-). the number of unique triplex motifs, gq motifs and of hybrid triplex-gq motifs, within the mtdna major arc, is shown in the venn diagrams above (a, b). enrichment of gq motifs around bps is shown for comparison in (c). controls were generated by reshuffling the deletion bps while maintaining their distribution (n= , mean ±sd shown). the schematic drawing above (c) depicts the orientation of the gq and triplex motifs (xr) in relation to the bps. *** p < .
by one sample t-test.
a) the number of deletion bps associated with triplex motifs compared with reshuffled controls. analysis including (left side) or excluding triplex-gq hybrid motifs (right side).
b) same as (a) but with relaxed criteria for the detection of triplex motifs (min score= ) and gq motifs (min score= ).
c) the number of deletion bps associated with gq motifs compared with reshuffled controls. relaxed settings (left side, min score= ) and default settings (right side, min score= ).

triplex forming motifs may be associated with mitochondrial disease breakpoints
next, we sought to validate our findings on two recently published next generation sequencing datasets (hjelm et al. , persson et al. ; mtdna breakpoints.xlsx; table s ). we were able to confirm the enrichment of dr (fig. s a, s a), mr (fig. s a, s a) and g-quadruplex motifs (fig. a, b; s c, d) around deletion breakpoints. additionally, we confirmed that hybrid mr-dr motifs are responsible in large part for the enrichment of mr motifs around breakpoints (fig. s b, s b).

in contrast, we found that triplex motifs were not consistently enriched around breakpoints in the dataset of hjelm et al. (fig. s c, d), which is based on post-mortem brain samples from patients without overt mitochondrial disease, whereas we saw enrichment in the dataset by persson et al. (fig. a, b), which is based on muscle biopsies from patients with mitochondrial disease. this unexpected discrepancy prompted us to take a second look at the mitobreak data. in this dataset triplex motifs were significantly more enriched at breakpoints in the mtdna single deletion subgroup compared to the healthy tissues subgroup (fig. s ). in addition, we found more broadly that mitochondrial disease status might explain the heterogeneous results we have seen across datasets (fig. c).

further strengthening our findings, triplex motifs were enriched in the mitobreak and persson et al.
dataset regardless of the breakpoint shuffling method chosen and of our statistical assumptions (fig. s ). what is more, triplex motifs were also enriched at breakpoints when we pooled all three datasets (fig. d), although to a lesser extent. finally, g-quadruplex motifs close to triplex motifs were more strongly enriched at deletion breakpoints than solitary g-quadruplex motifs (fig. e; fig. s ), suggesting that triplex formation may further contribute to dna instability.

figure
in the persson et al. ( ) dataset, triplex and g-quadruplex (gq) motifs are enriched around deletion breakpoints (bps), using either default (a) or relaxed scoring criteria (b). although triplex motifs predominate in mitochondrial disease datasets (c), we also find that triplex motifs are significantly enriched around bps (d) after pooling the data from mitobreak, persson et al. ( ) and hjelm et al. ( ). finally, gq and triplex motifs show stronger enrichment around bps than either of them in isolation (e). controls were generated by reshuffling the deletion bps while maintaining their distribution (n= , mean ±sd shown). the schematic drawing above (d) depicts the orientation of the motifs (xr) in relation to the bps. *** p< . , **p< . by one sample t-test.
a) the number of deletion bps associated with gq and triplex motifs compared with reshuffled controls (min score = default).
b) the number of deletion bps associated with gq and triplex motifs compared with reshuffled controls (min score = relaxed).
c) the number of deletion bps associated with triplex motifs (relaxed settings, min score= ) stratified by mitochondrial disease status. mitobreak data includes single and multiple mitochondrial deletion syndromes.
d) the number of deletion bps associated with triplex motifs, or with triplex motifs excluding triplex-gq hybrid motifs (tripgq-), compared with reshuffled controls. default settings (left side, min score= ) and relaxed settings (right side, min score= ).
e) the fold-enrichment of gq and triplex motifs around deletion bps is shown. motifs were considered overlapping if their midpoints were within bp.

repeats and lifespan: no support for the theory of resistant biomolecules
for our analysis, we focus on bp long repeat motifs because short repeats are less likely to allow stable base pairing, longer repeats are rare (fig. s ), and results considering repeat motifs of different lengths usually agree with each other (table s ; yang et al. ). to allow comparability with other studies (lakshmanan et al. ) we analyzed non d-loop motifs, but results for major arc motifs are numerically similar (table s ).

first, consistent with yang et al. ( ) we found that ir motifs show a negative correlation with the mls of mammals in the unadjusted model. in addition, we identified er motifs, a class of symmetrically related repeats, that show an even stronger inverse relationship with longevity (fig. a; table ). however, these inverse correlations vanished after taking into account body mass, base composition and phylogeny in a pgls model (table ).

second, in agreement with lakshmanan et al. ( ) we found that dr motifs do not correlate with the mls of mammals. the same was true for the symmetrically related mr motifs. just as with ir motifs, modest inverse correlations vanished in the fully adjusted model (table ).
we also found the same null results in two other vertebrate classes, birds and ray-finned fishes (table s ). to gain hints as to causality, we finally tested if longer repeats, allowing more stable base pairing, show stronger correlations with mls, but to our surprise we noticed the opposite (fig. s a-d). considering all four types of repeats together, we noticed that repeats with both half-sites on the same strand (dr and mr) or half-sites opposite strands (ir and er) were correlated with each other (fig. b) and with the same mtdna compositional biases (fig. c). thus, for dr and mr motifs, an apparent relationship with mls may be explained by their inverse relationship with gc content and for ir and er motifs by an inverse relationship with gc content and a positive relationship with gc skew. figure the number of everted repeat (er) motifs is negatively correlated with species mls in an unadjusted analysis (a). repeats with a similar orientation correlate with each other (b). direct repeat (dr) and mirror repeat (mr) motifs have a similar orientation since both half-sites are found on the same strand and in the case of er and inverted repeat (ir) motifs the half-sites are on opposite strands. finally, we show the major mtdna compositional biases that co-vary with the four repeat classes (c) and may (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . explain an apparent correlation with mls. data is for bp long repeats and pearson’s r is shown in (a- c). table . correlation between potentially mutagenic motifs and species lifespan motif type raw adjusted dr bp - . . mr bp - . - . ir bp - . . er bp - . - . triplex default - . - . ** triplex relaxed - . - . ^ gq default . . gq relaxed . - . 
** the adjusted model takes into account body mass, gc content, gc skew, at skew and number of effective codons. significant correlations in the raw or adjusted model are bolded/underlined (p< . ). the pgls model additionally considers phylogeny. ^ denotes p-values of .

$\lambda$ is the regularization parameter of the nuclear norms. the hyperparameter $\lambda$ can differ among the sources $k$, so the regularizer in ( ) can be replaced by $\sum_{k=1}^{K} \lambda_k \|\mathbf{T}_k\|_*$ if necessary. please see the supplementary information for the reason to select subtype-specific regularization terms.

optimization of swcam objective function

the objective function in ( ) is bi-convex w.r.t. the two block-wise variables, i.e. $\mathbf{A} \triangleq [\mathbf{a}_1^T, \dots, \mathbf{a}_M^T]^T$ and $\mathbf{T} \triangleq [\mathbf{T}_1^T, \dots, \mathbf{T}_K^T]^T \in \mathbb{R}^{KL \times M}$. accordingly, we can solve ( ) by alternately solving the following two convex subproblems until convergence:

$$\mathbf{T}^{p+1} \in \arg\min_{\Delta\mathbf{S}_i \succeq -\bar{\mathbf{S}},\ \forall i} \mathcal{J}(\mathbf{A}^{p}, \mathbf{T})$$ ( )

$$\mathbf{A}^{p+1} \in \arg\min_{\mathbf{A} \succeq \mathbf{0}_{M \times K},\ \mathbf{A}\mathbf{1}_K = \mathbf{1}_M} \mathcal{J}(\mathbf{A}, \mathbf{T}^{p+1})$$ ( )

where

$$\mathcal{J}(\mathbf{A}, \mathbf{T}) \triangleq \frac{1}{2}\sum_{i=1}^{M} \|\mathbf{x}_i - \mathbf{a}_i(\bar{\mathbf{S}} + \Delta\mathbf{S}_i)\|_2^2 + \lambda \sum_{k=1}^{K} \|\mathbf{T}_k\|_*$$

the cam-estimated subtype-specific expression matrix serves as the initial reference $\bar{\mathbf{S}}$. note that in ( ) ( ), we have implicitly used the following relationship for concise representation:

$$\mathbf{T} \triangleq [\mathrm{vec}(\Delta\mathbf{S}_1^T), \dots, \mathrm{vec}(\Delta\mathbf{S}_M^T)].$$

the subproblem ( ) can be decoupled w.r.t. each row of $\mathbf{A}$:

$$\mathbf{a}_i^{p+1} \in \arg\min_{\mathbf{a}_i \succeq \mathbf{0}_K,\ \mathbf{a}_i\mathbf{1}_K = 1} \frac{1}{2}\|\mathbf{x}_i - \mathbf{a}_i(\bar{\mathbf{S}} + \Delta\mathbf{S}_i^{p+1})\|_2^2$$

which can be solved using quadratic programming. if a prior proportion matrix or the cam-estimated proportion matrix is already of high quality, we can skip the alternating optimization over the $\mathbf{A}$ matrix and obtain the $\mathbf{T}$ matrix by solving the subproblem ( ) only once.
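the row-wise subproblem for $\mathbf{a}_i$ is a least-squares fit restricted to the probability simplex. the text solves it by quadratic programming; the sketch below swaps in projected gradient descent with a sort-based simplex projection, purely to make the constraint set concrete. the function names, the fixed iteration count and the toy numbers are assumptions made for illustration.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {a : a >= 0, sum(a) = 1}."""
    cumulative, theta = 0.0, 0.0
    for j, value in enumerate(sorted(v, reverse=True), start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / j
        if value - candidate > 0:      # holds for a prefix of the sorted values
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def fit_proportions(x, S_i, n_iter=2000):
    """Minimise 0.5 * ||x - a S_i||^2 over the simplex (S_i is K x L, x has length L)."""
    K, L = len(S_i), len(x)
    gram = [[sum(S_i[k][j] * S_i[m][j] for j in range(L)) for m in range(K)]
            for k in range(K)]
    rhs = [sum(S_i[k][j] * x[j] for j in range(L)) for k in range(K)]
    # The largest absolute row sum bounds the largest eigenvalue of the Gram matrix
    # (assumes S_i is not identically zero).
    step = 1.0 / max(sum(abs(g) for g in row) for row in gram)
    a = [1.0 / K] * K
    for _ in range(n_iter):
        grad = [sum(gram[k][m] * a[m] for m in range(K)) - rhs[k] for k in range(K)]
        a = project_to_simplex([a[k] - step * grad[k] for k in range(K)])
    return a


# Toy check: two subtypes, three genes, a noiseless mixture with proportions (0.3, 0.7).
S_i = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
x = [0.3, 0.7, 1.3]
print(fit_proportions(x, S_i))
```

for realistic sizes a dedicated quadratic programming solver, as used in the text, would be the more natural choice.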
to solve ( ), we notice that the main bottleneck is the huge dimension of its variables (typically, l is several tens of thousands), which prevents conventional convex solvers from being readily applicable here. we propose to solve ( ) by adapting the alternating direction method of multipliers (admm), which has been widely applied to many large-scale problems in areas such as statistical learning, image processing and computational biology (boyd, parikh et al. ). admm naturally allows decoupling the non-smooth regularization term from the smooth loss term, which is computationally advantageous. specifically, we reformulate ( ) in a form in which the primal variable can be "split" into several parts, with the associated objective function "separable" across this splitting (boyd, parikh et al. ). we will use the following definitions:

$$
\begin{aligned}
\mathbf{T} &\triangleq [\mathrm{vec}(\Delta\mathbf{S}_1^T), \dots, \mathrm{vec}(\Delta\mathbf{S}_M^T)] = [\mathbf{T}_1^T, \dots, \mathbf{T}_K^T]^T \in \mathbb{R}^{KL \times M} \\
\mathbf{S} &\triangleq [\mathrm{vec}(\mathbf{S}_1^T), \dots, \mathrm{vec}(\mathbf{S}_M^T)] \in \mathbb{R}^{KL \times M} \\
\mathbf{V} &\triangleq \mathbf{X}^T \in \mathbb{R}^{L \times M} \\
\mathbf{W} &\triangleq \begin{bmatrix} \mathbf{T} \\ \mathbf{S} \end{bmatrix} \in \mathbb{R}^{2KL \times M} \\
\mathbf{C}_1 &\triangleq \begin{bmatrix} \mathbf{I}_{KL} \\ \mathbf{I}_{KL} \end{bmatrix} \in \mathbb{R}^{2KL \times KL} \\
\mathbf{C}_2 &\triangleq -\mathbf{I}_{2KL} \in \mathbb{R}^{2KL \times 2KL} \\
\mathbf{C}_3 &\triangleq \begin{bmatrix} \mathbf{1}_M^T \otimes \mathrm{vec}(\bar{\mathbf{S}}^T) \\ \mathbf{0}_{KL \times M} \end{bmatrix} \in \mathbb{R}^{2KL \times M} \\
\mathbf{B}_0 &\triangleq [\mathbf{0}_{KL \times KL}, \mathbf{I}_{KL}] \in \mathbb{R}^{KL \times 2KL} \\
\mathbf{B}_k &\triangleq [\mathbf{0}_{L \times (k-1)L}, \mathbf{I}_L, \mathbf{0}_{L \times (K-k)L}, \mathbf{0}_{L \times KL}] \in \mathbb{R}^{L \times 2KL}, \quad k = 1, \dots, K
\end{aligned}
$$

then we can simplify ( ) as the equivalent form:

$$\min_{\mathbf{U} \in \mathbb{R}^{KL \times M},\ \mathbf{W} \in \mathbb{R}^{2KL \times M}} \frac{1}{2}\|\mathcal{A}(\mathbf{U}) - \mathbf{V}\|_F^2 + \lambda \sum_{k=1}^{K} \|\mathbf{B}_k\mathbf{W}\|_* + I_+(\mathbf{B}_0\mathbf{W})$$ ( )

$$\text{s.t.}\quad \mathbf{C}_1\mathbf{U} + \mathbf{C}_2\mathbf{W} = \mathbf{C}_3,$$

where $I_+(\cdot)$ is the indicator function for the non-negative orthant; $I_+(\mathbf{B}_0\mathbf{W}) = I_+(\mathbf{S}) = 0$ if $\mathbf{S} \succeq \mathbf{0}_{KL \times M}$ ($I_+(\mathbf{S}) = +\infty$ otherwise). the linear transformation in the first term is $\mathcal{A}(\mathbf{U}) = \mathcal{A}([\mathbf{u}_1, \dots, \mathbf{u}_M]) = [\mathbf{H}_1\mathbf{u}_1, \dots, \mathbf{H}_M\mathbf{u}_M]$ with $\mathbf{H}_i = \mathbf{a}_i^p \otimes \mathbf{I}_L$, $i = 1, \dots, M$. note that ( ) is in admm form w.r.t. the two split block variables $\mathbf{U}$ and $\mathbf{W}$ and, once ( ) is solved, the solution of ( ) can be obtained by $\mathbf{T}^{p+1} = [\mathbf{I}_{KL}, \mathbf{0}_{KL \times KL}]\mathbf{W}^*$.

given a penalty parameter $\gamma > 0$ (empirically, $\gamma \coloneqq$  generally guarantees good convergence speed), the augmented lagrangian (ignoring some irrelevant terms) of problem ( ) is defined by

$$\mathcal{L}(\mathbf{U}, \mathbf{W}, \mathbf{Z}) = \frac{1}{2}\|\mathcal{A}(\mathbf{U}) - \mathbf{V}\|_F^2 + \lambda \sum_{k=1}^{K} \|\mathbf{B}_k\mathbf{W}\|_* + I_+(\mathbf{B}_0\mathbf{W}) + \frac{\gamma}{2}\|\mathbf{C}_1\mathbf{U} + \mathbf{C}_2\mathbf{W} - \mathbf{C}_3 - \mathbf{Z}\|_F^2$$

where "$-\gamma\mathbf{Z}$" $\in \mathbb{R}^{2KL \times M}$ is the dual variable (or lagrange multiplier) associated with the constraint $\mathbf{C}_1\mathbf{U} + \mathbf{C}_2\mathbf{W} = \mathbf{C}_3$. then, admm solves ( ) via the following iterative procedure:

$$\mathbf{U}^{q+1} \in \arg\min_{\mathbf{U} \in \mathbb{R}^{KL \times M}} \mathcal{L}(\mathbf{U}, \mathbf{W}^q, \mathbf{Z}^q)$$ ( a)

$$\mathbf{W}^{q+1} \in \arg\min_{\mathbf{W} \in \mathbb{R}^{2KL \times M}} \mathcal{L}(\mathbf{U}^{q+1}, \mathbf{W}, \mathbf{Z}^q)$$ ( b)

$$\mathbf{Z}^{q+1} = \mathbf{Z}^q - (\mathbf{C}_1\mathbf{U}^{q+1} + \mathbf{C}_2\mathbf{W}^{q+1} - \mathbf{C}_3)$$ ( c)

where $\mathbf{W}$ can be initialized by $[(\mathbf{T}^0)^T, (\mathbf{U}^0)^T]^T$ with $\mathbf{T}^0 = \mathbf{0}_{KL \times M}$ and $\mathbf{U}^0 = \mathbf{1}_M^T \otimes \mathrm{vec}(\bar{\mathbf{S}}^T)$; $\mathbf{Z}$ can be simply initialized by $\mathbf{0}_{2KL \times M}$. as we will show, both ( a) and ( b) can be solved with closed-form expressions, thanks to the decomposability of admm.

fig. . the objective function of swcam for the sample-specific deconvolution problem and its reformulation by admm. (for convenient illustration, the $\mathbf{T}$ matrices in all figures are the transposed versions of those in the text and equations.)

notice that ( a) is a column-wise separable optimization problem, so we can decouple it w.r.t. each column of $\mathbf{U}$:

$$\mathbf{u}_i^{q+1} \in \arg\min_{\mathbf{u}_i \in \mathbb{R}^{KL}} \frac{1}{2}\|\mathbf{H}_i\mathbf{u}_i - \mathbf{v}_i\|_2^2 + \frac{\gamma}{2}\|\mathbf{C}_1\mathbf{u}_i + \mathbf{y}_i^q\|_2^2$$ ( )

where $[\mathbf{y}_1^q, \dots, \mathbf{y}_M^q] \triangleq \mathbf{C}_2\mathbf{W}^q - \mathbf{C}_3 - \mathbf{Z}^q$. the subproblem ( ) is an unconstrained quadratic problem, which can be solved by

$$\mathbf{u}_i^{q+1} = (\mathbf{H}_i^T\mathbf{H}_i + \gamma\mathbf{C}_1^T\mathbf{C}_1)^{-1}(\mathbf{H}_i^T\mathbf{v}_i - \gamma\mathbf{C}_1^T\mathbf{y}_i^q).$$
( ) the matrix inversion can be sped up by

$$(\mathbf{H}_i^T\mathbf{H}_i + \gamma\mathbf{C}_1^T\mathbf{C}_1)^{-1} = \left((\mathbf{a}_i^p)^T\mathbf{a}_i^p + 2\gamma\mathbf{I}_K\right)^{-1} \otimes \mathbf{I}_L,$$

which follows from $\mathbf{H}_i^T\mathbf{H}_i = ((\mathbf{a}_i^p)^T\mathbf{a}_i^p) \otimes \mathbf{I}_L$ and $\mathbf{C}_1^T\mathbf{C}_1 = 2\mathbf{I}_{KL}$. the right term in ( ) can also be simplified as

$$\mathbf{H}_i^T\mathbf{v}_i - \gamma\mathbf{C}_1^T\mathbf{y}_i^q = (\mathbf{a}_i^p)^T \otimes \mathbf{x}_i^T - \gamma(\mathbf{y}_{1i}^q + \mathbf{y}_{2i}^q),$$

where $\mathbf{y}_i^q = [(\mathbf{y}_{1i}^q)^T, (\mathbf{y}_{2i}^q)^T]^T$ with $\mathbf{y}_{1i}^q \in \mathbb{R}^{KL}$ and $\mathbf{y}_{2i}^q \in \mathbb{R}^{KL}$ being the first and second half of $\mathbf{y}_i^q$, respectively. finally, the column vectors of $\mathbf{U}^{q+1}$ in ( a) can be computed quickly by

$$\mathbf{u}_i^{q+1} = \mathrm{vec}\left\{ \mathrm{devec}\left\{ (\mathbf{a}_i^p)^T \otimes \mathbf{x}_i^T - \gamma(\mathbf{y}_{1i}^q + \mathbf{y}_{2i}^q) \,\middle|\, L, K \right\} \left((\mathbf{a}_i^p)^T\mathbf{a}_i^p + 2\gamma\mathbf{I}_K\right)^{-1} \right\}$$ ( )

to solve ( b), we remove some irrelevant terms from its objective function:

$$\min_{\mathbf{W} \in \mathbb{R}^{2KL \times M}} \lambda \sum_{k=1}^{K} \|\mathbf{B}_k\mathbf{W}\|_* + I_+(\mathbf{B}_0\mathbf{W}) + \frac{\gamma}{2}\|\mathbf{C}_1\mathbf{U}^{q+1} + \mathbf{C}_2\mathbf{W} - \mathbf{C}_3 - \mathbf{Z}^q\|_F^2,$$ ( )

and then, by defining $\mathbf{U}_k^{q+1} \in \mathbb{R}^{L \times M}$, $k = 1, \dots, K$, as the block matrices from top to bottom in $\mathbf{U}^{q+1} \in \mathbb{R}^{KL \times M}$, and $\mathbf{Z}_k \in \mathbb{R}^{L \times M}$, $k = 1, \dots, K$, and $\mathbf{Z}_0 \in \mathbb{R}^{KL \times M}$ as the block matrices from top to bottom in $\mathbf{Z} \in \mathbb{R}^{2KL \times M}$ (i.e., $\mathbf{Z} \triangleq [\mathbf{Z}_1^T, \dots, \mathbf{Z}_K^T, \mathbf{Z}_0^T]^T$), we decouple the objective function ( ) into functions of $\mathbf{T}_k$, $k = 1, \dots, K$, and $\mathbf{S}$:

$$\min_{\mathbf{W} \in \mathbb{R}^{2KL \times M}} \sum_{k=1}^{K} \left\{ \lambda\|\mathbf{T}_k\|_* + \frac{\gamma}{2}\|\mathbf{U}_k^{q+1} - \mathbf{T}_k - \mathbf{1}_M^T \otimes \bar{\mathbf{s}}_k - \mathbf{Z}_k^q\|_F^2 \right\} + \left\{ I_+(\mathbf{S}) + \frac{\gamma}{2}\|\mathbf{U}^{q+1} - \mathbf{S} - \mathbf{Z}_0^q\|_F^2 \right\}$$

therefore, $\mathbf{W}^{q+1}$ can be solved by the proximal point algorithm (ppa) (parikh and boyd ). specifically, we have $\mathbf{W}^{q+1} = [(\mathbf{T}_1^{q+1})^T, \dots, (\mathbf{T}_K^{q+1})^T, (\mathbf{S}^{q+1})^T]^T$ in which

$$\mathbf{T}_k^{q+1} \in \arg\min_{\mathbf{T}_k \in \mathbb{R}^{L \times M}} \lambda\|\mathbf{T}_k\|_* + \frac{\gamma}{2}\|\mathbf{U}_k^{q+1} - \mathbf{T}_k - \mathbf{1}_M^T \otimes \bar{\mathbf{s}}_k - \mathbf{Z}_k^q\|_F^2$$ ( a)

$$\mathbf{S}^{q+1} \in \arg\min_{\mathbf{S} \in \mathbb{R}^{KL \times M}} I_+(\mathbf{S}) + \frac{\gamma}{2}\|\mathbf{U}^{q+1} - \mathbf{S} - \mathbf{Z}_0^q\|_F^2$$ ( b)

note that ( a) and ( b) are exactly the proximal operators of $\|\mathbf{T}_k\|_*$ and $I_+(\mathbf{S})$, respectively (parikh and boyd ), and their closed-form solutions are given by

$$\mathbf{T}_k^{q+1} = \sum_{\ell=1}^{r} \left(\sigma_{k\ell} - \frac{\lambda}{\gamma}\right)_+ \boldsymbol{\mu}_{k\ell}\boldsymbol{\nu}_{k\ell}^T, \quad k = 1, \dots, K,$$ ( )

$$\mathbf{S}^{q+1} = [\mathbf{U}^{q+1} - \mathbf{Z}_0^q]_+,$$ ( )

where the singular value decomposition (svd) is performed ahead of the computation of ( ), i.e. $\mathbf{U}_k^{q+1} - \mathbf{1}_M^T \otimes \bar{\mathbf{s}}_k - \mathbf{Z}_k^q = \sum_{\ell=1}^{r} \sigma_{k\ell}\boldsymbol{\mu}_{k\ell}\boldsymbol{\nu}_{k\ell}^T$. a reasonable termination criterion is that the primal residual, $pri = \|\mathbf{C}_1\mathbf{U} + \mathbf{C}_2\mathbf{W} - \mathbf{C}_3\|$, and the dual residual, $dual = \|\gamma\mathbf{C}_1^T\mathbf{C}_2(\mathbf{W}^{q+1} - \mathbf{W}^q)\|$, are smaller than a predefined tolerance.

model parameter tuning

in noisy scenarios, the setting of the penalty parameter $\lambda$ is critical in determining how much variation is preserved as patterns of interest or ignored as noise. an extremely large $\lambda$ will force the individual variation to zero. decreasing $\lambda$ allows more subtype-specific patterns to be detected, up to the point of overfitting. cross-validation is a popular strategy for tuning parameters to balance underfitting and overfitting. one round of cross-validation excludes a certain portion of samples and uses the model learned from the other samples to predict the excluded ones. every model is then assessed by summarizing prediction performance across multiple rounds. however, our sample-specific deconvolution estimates the individual expression of each sample in each subtype, which cannot be used to predict excluded samples directly. thus, we propose to randomly exclude entries rather than samples in the $\mathbf{X}$ matrix (fig. ), similar to the strategy used in missing value imputation. the approach rests on the premise that the low-rank patterns in each $\mathbf{T}_k$ matrix are detectable from only a portion of the $\mathbf{X}$ entries and are able to predict the excluded $\mathbf{X}$ entries.

fig. . -fold cross-validation strategy for model parameter tuning. a part of the entries is randomly removed before applying swcam. the removed entries are reconstructed from the estimated $\mathbf{T}$ matrix and compared to the observed expressions to compute the rmse, which decides the optimal parameter $\lambda$.

specifically, we fix $\mathbf{A}$ and $\bar{\mathbf{S}}$ at their initialization values (from cam estimation or a priori knowledge) and randomly remove entries in the $\mathbf{X}$ matrix, leading to the objective function w.r.t. $\Delta\mathbf{S}_i$, $i = 1, \dots, M$:

$$\min_{\{\Delta\mathbf{S}_i\}_{i=1}^{M}} \frac{1}{2}\sum_{i=1}^{M} \|P_{\Omega_i}(\mathbf{x}_i) - P_{\Omega_i}(\mathbf{a}_i(\bar{\mathbf{S}} + \Delta\mathbf{S}_i))\|_2^2 + \lambda \sum_{k=1}^{K} \|\mathbf{T}_k\|_*$$ ( )

$$\text{s.t.}\quad \bar{\mathbf{S}} + \Delta\mathbf{S}_i \succeq \mathbf{0}_{K \times L}, \quad \mathbf{T}_k = [\Delta\mathbf{S}_1^T(k), \dots, \Delta\mathbf{S}_M^T(k)] \in \mathbb{R}^{L \times M}, \quad k = 1, \dots, K,$$

where $P_{\Omega_i}(\mathbf{x}_i) \in \mathbb{R}^L$ denotes a vector with the entries in $\Omega_i$ left alone and all other entries set to zero. the workflow of our proposed -fold cross-validation strategy is:

(1) randomly split all entries into folds;
(2) remove one fold of entries and use the remaining folds of entries to solve ( ) with different $\lambda$ values $[\lambda_1, \lambda_2, \dots]$;
(3) use the estimated $\Delta\mathbf{S}_i(\lambda_\theta)$, $i = 1, \dots, M$, $\theta = 1, 2, \dots$, together with the fixed $\mathbf{A}$ and $\bar{\mathbf{S}}$ matrices to reconstruct the $\mathbf{X}$ matrix, and record the reconstructed values only for the removed entries in $\mathbf{X}$;
(4) repeat steps (2)-(3) to obtain a reconstructed $\tilde{\mathbf{X}}(\lambda_\theta)$ matrix in which every entry value is reconstructed while its original value is absent from the optimization with $\lambda = \lambda_\theta$;
(5) calculate the root mean square error (rmse) by

$$RMSE(\lambda_\theta) = \sqrt{\frac{1}{ML}\sum_{i=1}^{M}\sum_{j=1}^{L} \left(\mathbf{X}_{ij} - \tilde{\mathbf{X}}_{ij}(\lambda_\theta)\right)^2}$$ ( )

(6) choose the $\lambda_\theta$ yielding the minimum rmse.

a warm start can be used in step (2) with decreasing parameters $\lambda_1 > \lambda_2 > \cdots$, which uses the estimate obtained with $\lambda_\theta$ as the initialization of the next optimization with $\lambda_{\theta+1}$.
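the entry-wise cross-validation workflow above can be sketched in a few lines. the sketch below is only a skeleton under stated assumptions: `reconstruct` is a hypothetical callable standing in for a swcam fit at one fixed $\lambda$, the column-mean model is a toy placeholder rather than swcam, and the fold count and data are invented.

```python
import math
import random


def assign_folds(n_rows, n_cols, n_folds, seed=0):
    """Randomly assign every matrix entry to one of n_folds near-equal folds."""
    labels = [i % n_folds for i in range(n_rows * n_cols)]
    random.Random(seed).shuffle(labels)
    return [labels[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


def cross_validated_rmse(X, n_folds, reconstruct, seed=0):
    """Hide one fold of entries at a time, reconstruct them, and pool the errors."""
    n_rows, n_cols = len(X), len(X[0])
    folds = assign_folds(n_rows, n_cols, n_folds, seed)
    squared_error = 0.0
    for fold in range(n_folds):
        observed = [[folds[i][j] != fold for j in range(n_cols)]
                    for i in range(n_rows)]
        X_hat = reconstruct(X, observed)   # must not read hidden entries of X
        squared_error += sum((X[i][j] - X_hat[i][j]) ** 2
                             for i in range(n_rows) for j in range(n_cols)
                             if not observed[i][j])
    return math.sqrt(squared_error / (n_rows * n_cols))


def column_mean_reconstruct(X, observed):
    """Toy stand-in for the model: fill each entry with its column's observed mean."""
    n_rows, n_cols = len(X), len(X[0])
    seen = [X[i][j] for i in range(n_rows) for j in range(n_cols) if observed[i][j]]
    overall = sum(seen) / len(seen)
    X_hat = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        col = [X[i][j] for i in range(n_rows) if observed[i][j]]
        fill = sum(col) / len(col) if col else overall
        for i in range(n_rows):
            X_hat[i][j] = fill
    return X_hat


X = [[1.0, 2.0, 3.0], [1.5, 2.5, 3.5], [0.5, 1.5, 2.5], [1.0, 2.0, 3.0]]
rmse = cross_validated_rmse(X, n_folds=4, reconstruct=column_mean_reconstruct)
print(rmse)
```

to tune $\lambda$, one would call `cross_validated_rmse` once per candidate value with a `reconstruct` that fits the model at that value, keep the same seed so all candidates see the same folds, and select the candidate with the smallest result.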
the optimization problem ( ) can be solved using an admm algorithm similar to ( - ), which solved ( ). the only modification is that ( ) becomes

$$\mathbf{u}_i^{q+1} \in \arg\min_{\mathbf{u}_i \in \mathbb{R}^{KL}} \frac{1}{2}\|P'_{\Omega_i}(\mathbf{H}_i\mathbf{u}_i) - P'_{\Omega_i}(\mathbf{v}_i)\|_2^2 + \frac{\gamma}{2}\|\mathbf{C}_1\mathbf{u}_i + \mathbf{y}_i^q\|_2^2$$ ( )

where $P'_{\Omega_i}(\cdot) = [\mathbf{1}_K^T \otimes P_{\Omega_i}(\cdot)^T]^T \in \mathbb{R}^{KL}$ makes all variables related to excluded entries be optimized only through the second term, which is still an unconstrained quadratic problem that can be solved easily. the remaining variables, unrelated to excluded entries, can still be optimized following ( - ).

sparsity regularization

in addition to the low-rank assumption, we could also reasonably assume that only a limited number of genes are involved in functional modules, and thus impose a row-sparsity regularization by $\ell_{2,1}$-norm minimization. the alternative swcam formulation will be:

$$\min_{\mathbf{A},\ \{\Delta\mathbf{S}_i\}_{i=1}^{M}} \frac{1}{2}\sum_{i=1}^{M} \|\mathbf{x}_i - \mathbf{a}_i(\bar{\mathbf{S}} + \Delta\mathbf{S}_i)\|_2^2 + \lambda \sum_{k=1}^{K} \|\mathbf{T}_k\|_* + \delta\|\mathbf{T}\|_{2,1}$$ ( )

where $\delta > 0$ is the regularization parameter of the $\ell_{2,1}$ norm of $\mathbf{T}$, defined as

$$\|\mathbf{T}\|_{2,1} \triangleq \sum_{i=1}^{KL} \|\mathbf{t}_i\|_2$$

accounting for the row-sparsity of $\mathbf{T}$. if necessary, the parameter $\delta$ can be varied for different rows based on the character of each gene, such as the mean-variance trend. the supplementary information gives more details on the optimization of ( ) by the admm method. $\ell_0$- or $\ell_1$-norm minimization, as commonly used sparsity regularization methods, could impose entry sparsity in the $\mathbf{T}$ matrix. we also provide admm optimization for sample-specific deconvolution with $\ell_0$- or $\ell_1$-norm minimization, which could be useful in other sbss problems.

results

as swcam focuses on subtype-specific variation estimation, simulating biological variance within each subtype and technical variance for each observation is important for validating swcam performance.
we conduct two sets of simulations. the first is an ideal scenario in which the variance is not related to the mean value. the second is more realistic: genes with larger means usually have larger variances.

validation on ideal simulations

in the first simulation, we design twelve function modules, with four in each of three subtypes. the observations for genes in samples were simulated with the subtype-specific expression baseline, $\bar{\mathbf{S}}$, sampled from the purified cell populations in the real benchmark microarray gene expression data gse (kuhn, thu et al. ). $\mathbf{a}_i$, $i = 1, \dots, M$, are drawn randomly from a flat dirichlet distribution. the between-sample variation, $\Delta\mathbf{S}_i(k, j)$, $i = 1, \dots, M$, for the kth subtype and jth gene was drawn from the normal distribution $\mathcal{N}(0, \sigma_{kj}^{(s)})$ if the jth gene was involved in a function module in the kth subtype, and was zero otherwise (fig. a). the genes in the same function module have pairwise correlation coefficients equal to one, thus generating a highly correlated gene set in each module. $\sigma_{kj}^{(s)}$ are drawn from the uniform distribution $U[ , ]$. the technical noise, $\mathbf{n}_i$, $i = 1, \dots, M$, was drawn from a zero-mean normal distribution with variance $\sigma_{ij}^{(n)} = $ .

the twelve functional modules can be recognized in the variation matrix from swcam when $\lambda$ falls into a certain range (fig. b~ i). increasing the penalty parameter of the nuclear norm filters more noise, but at the risk of missing the true variation signal. the rmse derived by the -fold cross-validation strategy is relatively small when $\lambda$ = ~ and reaches its minimum at $\lambda$ = (fig. a). the estimated variation matrix looks quite similar when ≤ $\lambda$ ≤ (fig. e~ g), with clear patterns and some artifacts.
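the trade-off between filtering noise and losing signal as the nuclear-norm penalty grows can be reproduced on a toy matrix with singular value thresholding, the operation behind the closed-form $\mathbf{T}_k$ update (where the threshold is $\lambda/\gamma$). the sketch below is an illustration only: the matrix sizes, noise level, thresholds and the single rank-one "module" are invented and are not the simulation settings described above.

```python
import numpy as np


def singular_value_threshold(Y, tau):
    """Proximal operator of tau * nuclear norm: soft-threshold the singular values of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    # Scale the columns of U by the shrunken singular values, then recombine.
    return (U * s_shrunk) @ Vt, int(np.count_nonzero(s_shrunk))


rng = np.random.default_rng(0)
n_genes, n_samples = 60, 40                     # hypothetical sizes
signal = np.outer(rng.normal(size=n_genes), rng.normal(size=n_samples))  # rank-one pattern
noisy = signal + 0.1 * rng.normal(size=(n_genes, n_samples))

ranks, errors = {}, {}
for tau in (0.0, 2.0, 1000.0):
    estimate, rank = singular_value_threshold(noisy, tau)
    ranks[tau] = rank
    errors[tau] = float(np.linalg.norm(estimate - signal))
    print(f"tau={tau}: rank={rank}, error={errors[tau]:.2f}")
```

a zero threshold keeps all of the noise, a moderate threshold removes the small noise-driven singular values while shrinking the dominant one, and a very large threshold removes the signal together with the noise.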
the artifacts are formed when the signal variation in one subtype spreads to other subtypes for the same genes; they are much weaker than the detected true signals unless $\lambda$ is extremely small. (as shown in the supplementary information, nuclear norm minimization for each subtype's variation matrix is a good option for reducing artifacts compared to other regularization terms.) interestingly, $\lambda$ = is also the point where both primal and dual residuals surge in the admm algorithm (fig. c~ f). this is because a larger $\lambda$ tends to train an over-simplified model, so that admm approaches the optimal solution more easily.

the recovery of sample-specific signals in a subtype is also affected by the mixing proportion of this subtype within the sample. when a subtype accounts for a very small portion of a certain sample, its true signal in this sample will be very weak and thus underestimated (green points in fig. ). in contrast, the major subtype in a sample can be estimated very well by cam-ss (red points in fig. ).

fig. . heatmap of the estimated $\mathbf{T}$ matrix with varied $\lambda$ parameters compared to ground truth in the ideal simulation. increasing the penalty parameter of the nuclear norm filters more noise, but at the risk of missing true signal variation.

fig. . -fold cross-validation results under different $\lambda$ parameters in the ideal simulation.
(a) rmse; (c) residuals for the primal feasibility condition; (e) residuals for the dual feasibility condition; (b), (d), (f) are zoomed curves of (a), (c), (e).

fig. . estimated $\mathbf{T}$ matrix versus ground truth when $\lambda$= in the ideal simulation. the estimated entries are colored by their associated mixing proportions to show that sample-specific expressions are estimated more accurately for high-proportion subtypes than for low-proportion ones.

validation on realistic simulations

a mean-variance trend is widespread in molecular expression data. in our second simulation, all settings are the same as above except that the variance of subtype-specific expression, $\sigma_{kj}^{(s)}$, and the technical variance of observations, $\sigma_{ij}^{(n)}$, are proportional to the subtype-specific expression mean and the mixed expression level, respectively. the coefficient of variation (cv), the ratio of the standard deviation to the mean, is drawn from the uniform distributions $U[ . , . ]$ and $U[ . , . ]$, respectively. the -fold cross-validation strategy still obtains the minimum rmse at $\lambda$ = (fig. a~ b), where both primal and dual residuals also surge (fig. c~ f). however, the variation matrix estimated by swcam is blurred by artifacts trained from noise (fig. ). some highly expressed genes have relatively large variance, which could be falsely modeled as subtype-specific signal variation. as shown in fig. , the entries with zero value in the ground truth variation matrix could be overestimated.

although the absolute expression values estimated by swcam could deviate from the ground truth, we can still clearly detect functional modules defined by weighted gene correlation network analysis (wgcna) (zhang and horvath , langfelder and horvath ) on the estimated sample-specific expressions (fig. ). wgcna constructs weighted networks based on correlation patterns among genes across samples and thus detects function modules of highly correlated gene sets. in fig. , exactly the four true modules are found in the second and third subtypes, with very few genes missed. the first subtype yields an extra false module, but this is a less significant pattern than the other modules and becomes undetectable with a stricter tree height cut threshold. more importantly, without swcam-based deconvolution (fig. d), wgcna on the mixture expression profiles finds none of the true modules, but rather three false modules that are related to the mixing process of the three subtypes.

incorporation of ℓ2,1-norm regularization

in the above simulations, the deconvoluted sample-specific signals contain artifacts trained from signals of other subtypes and artifacts trained from noise (fig. and fig. ). we can use an $\ell_{2,1}$-norm regularization to enforce sparsity of the genes that have signal variation across samples. this is expected to reduce artifacts, and it also follows the assumption that only a limited number of genes contribute to source variation in hidden modules. figure shows the alleviated artifacts with $\lambda$ = and $\delta$ = , , or . . the true function modules are correctly detected with $\lambda$ = and $\delta$ = or . , and the false module in the first subtype is suppressed when $\delta$ = (fig. ). increasing the penalty parameter $\delta$ forces more genes to have zero variance, which suppresses the artifacts and false function modules but brings the risk of missing true signals. it is critical to propose a parameter tuning method for $\delta$.
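the effect of $\delta$ described above can be seen in the proximal operator of the row-wise $\ell_{2,1}$ penalty, which shrinks each row towards zero and removes rows whose norm falls below $\delta$. the sketch below shows only this standard operator on invented numbers; it is an assumption that such a step resembles what the admm scheme in the supplementary information uses, and the function names are hypothetical.

```python
import math


def row_shrink(T, delta):
    """Proximal operator of delta * sum_i ||t_i||_2 applied to the rows of T."""
    shrunk = []
    for row in T:
        norm = math.sqrt(sum(v * v for v in row))
        # Rows with norm <= delta are set to zero; longer rows are scaled down.
        scale = max(1.0 - delta / norm, 0.0) if norm > 0 else 0.0
        shrunk.append([scale * v for v in row])
    return shrunk


def count_zero_rows(T):
    """Number of rows that are entirely zero."""
    return sum(all(v == 0.0 for v in row) for row in T)


# Invented example: one strong row, one weak row and one empty row.
T = [[3.0, 4.0], [0.3, 0.4], [0.0, 0.0]]
for delta in (0.1, 1.0, 10.0):
    print(delta, count_zero_rows(row_shrink(T, delta)))
```

with a small $\delta$ only the already empty row is zero, a moderate $\delta$ also removes the weak row, and a very large $\delta$ removes every row, which mirrors the risk of missing true signals noted above.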
however, the cross-validation strategy that randomly excludes entries for tuning the parameter $\lambda$ is based on the low-rank assumption, under which the hidden low-rank patterns can be trained from a part of the entries and then used to reconstruct the remaining entries. this strategy is not applicable to the selection of $\delta$, which needs further study.

fig. . heatmap of the estimated $\mathbf{T}$ matrix scaled by associated means compared to ground truth in the realistic simulation with varied $\lambda$ parameters. increasing the penalty parameter of the nuclear norm filters more noise, but at the risk of missing true variation signal.

fig. . -fold cross-validation results under different $\lambda$ parameters in the realistic simulation. (a) rmse; (c) residuals for the primal feasibility condition; (e) residuals for the dual feasibility condition; (b), (d), (f) are zoomed curves of (a), (c), (e).

fig. . estimated $\mathbf{T}$ matrix scaled by associated means versus ground truth in the realistic simulation ($\lambda$= ). the estimated entries are colored by their associated mixing proportions to show that sample-specific expressions are estimated more accurately for high-proportion subtypes than for low-proportion ones.

fig. . gene co-expressed function modules detected by wgcna on swcam-estimated sample-specific expression for each subtype (a~c) or on the originally observed expressions without deconvolution (d). (network interconnectedness is measured by topological overlap; cutheight = . ; minsize = .)

fig. . heatmap of the estimated $\mathbf{T}$ matrix scaled by associated means compared to ground truth in the realistic simulation with $\lambda$ = and varied $\delta$. increasing the penalty of the ℓ2,1 norm enforces more zero columns in the $\Delta\mathbf{S}_k$ matrix.

fig. . gene co-expressed function modules detected by wgcna on swcam-estimated sample-specific expression for each subtype with λ= and δ= or . . (network interconnectedness is measured by topological overlap; cutheight = . ; minsize = .)

discussion

most existing tissue deconvolution methods ignore the expression variability of subtypes across individual samples.
swcam significantly expands the utility of cam by producing subtype-specific expression profiles in each sample. the success of swcam depends on the low-rank assumption, which takes advantage of the biologically expected cooperation among genes and thus sheds light on solving the seemingly underdetermined sample-specific deconvolution problem. the low-rank assumption holds naturally in molecular expression data when activated functional modules are required by particular biological processes or pathways in different subtypes. the detection of such subtype-specific associations or networks is one of the major targets in the analysis of molecular expression profiles. after sample-specific deconvolution by swcam, conventional network analysis methods can be applied directly to the estimated sample-subtype-specific signals to construct subtype-specific networks, e.g. weighted correlation network analysis (wgcna (zhang and horvath , langfelder and horvath )) and differential dependency network analysis (ddn (zhang, li et al. , zhang, tian et al. , tian, zhang et al. , tian, zhang et al. )).

the cross-validation strategy of excluding entries randomly is inspired by similar ideas in matrix imputation methods, which commonly assume that the matrix to be recovered has a low rank. our results consistently show a u-curve over the parameter $\lambda$, demonstrating the feasibility of the proposed cross-validation strategy. meanwhile, swcam is not sensitive to the choice of $\lambda$, as the u-curve has a wide plateau over which the recovered sample-subtype-specific signals are similar and the detected modules are close. it is also reasonable to assume that the genes involved in biological associations or networks are sparse. therefore, the use of $\ell_{2,1}$-norm regularization for reducing artifacts and improving function module detection deserves further study.

when group information is available, we can also apply the basic cam algorithm to each group to obtain group-wise expression profiles of subtypes.
compared to sample-specific deconvolution, group-specific deconvolution aims at a lower resolution of the underlying subtype signals and thus could obtain more robust results. if the grouping is fine enough, group-specific deconvolution can also capture signal variation in each subtype and thus help detect function modules and construct biological networks.

although swcam can theoretically solve a seemingly underdetermined problem based on a low-rank assumption, it still needs improvement and validation.

first, the improvement of swcam by sparsity regularization. the sparsity assumption is practically reasonable, and we have already shown some preliminary results after imposing $\ell_{2,1}$ norm regularization. however, introducing one more regularization term increases the difficulty of parameter tuning. besides, the current cross-validation strategy with matrix entry sampling is not applicable to selecting the coefficient of the $\ell_{2,1}$ norm term. therefore, the integration of sparsity regularization still needs further study.

second, the improvement of function module detection based on swcam-estimated sample-specific signals in each subtype. recovering the exact values of sample-specific signals is impossible without stronger assumptions. fortunately, our goal is to detect function modules or networks from the between-sample variations in each subtype. thus, increasing the accuracy of the estimated intercorrelations among molecules can be regarded as the target of further efforts.

third, the validation of swcam in real data analysis. we have demonstrated the capacity of swcam to estimate sample-specific signals in each subtype using simulations in which the between-sample variation matrices are low-rank.
validation of swcam on real molecular expression data would be difficult, as benchmark datasets with true subtype-specific signals are unavailable. one possible direction is to verify the constructed subtype-specific networks through biological experiments.

conclusion

we propose a sample-specific deconvolution algorithm to estimate sample-specific molecular expressions for each subtype, from which between-sample variation can be used to detect biological associations and construct networks in each subtype. the contributions of this work include: we formulate the objective function for swcam with a penalty term that minimizes the nuclear norm of the between-sample variation matrix in each subtype, based on our expectation of the existence of subtype-specific networks. we design an efficient method based on admm to solve swcam's optimization problem in large-scale biological data. we design a -fold cross-validation strategy to select the coefficient of the nuclear norm term, and demonstrate its feasibility in simulations where a u-curve of rmse is obtained to determine the optimal selection. we validate swcam in simulations to demonstrate that sample-specific signals can be well estimated when the low-rank assumption holds. even though artificial signal variances exist in swcam estimations, the intercorrelations among genes are still well preserved for function module detection and biological network construction. we propose to use an extra $\ell_{2,1}$ norm regularization to enforce the sparsity of genes involved in networks and thus reduce the artifacts trained from noise or from signals of other subtypes.
acknowledgments

this work has been supported by the national institutes of health under grants hl - a , hl , ns - , and the department of defence under grant w xwh- - - (bc p ).

competing financial interests

the authors declare no competing financial interests.

analysis of next- and third-generation rna-seq data reveals the structures of alternative transcription units in bacterial genomes

qi wang , zhaoqian liu , , bo yan , wen-chi chou , laurence ettwiller , qin ma ,†, and bingqiang liu ,†

school of mathematics, shandong university, jinan , china.
department of biomedical informatics, college of medicine, the ohio state university, columbus, oh , usa.
new england biolabs inc., ipswich, ma , usa.
infectious disease and microbiome program, broad institute of mit and harvard, cambridge, ma , usa.
†corresponding author. email: bingqiang@sdu.edu.cn (b.l.); qin.ma@osumc.edu (q.m.)

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license (http://creativecommons.org/licenses/by-nc-nd/ . /).

abstract

alternative transcription units (atus) are dynamically encoded under different conditions or environmental stimuli in bacterial genomes, and genome-scale identification of atus is essential for studying the emergence of human diseases caused by bacterial organisms. however, it is unrealistic to identify all atus using experimental techniques, due to the complexity and dynamic nature of atus. here we present the first-of-its-kind computational framework, named seqatu, for genome-scale atu prediction based on next-generation rna-seq data. the framework utilizes a convex quadratic programming model to seek an optimum expression combination of all of the to-be-identified atus. the predicted atus in e. coli reached a precision of . / . and a recall of . / . in the two rna-sequencing datasets compared with the benchmarked atus from third-generation rna-seq data. we believe that the atus identified by seqatu can provide fundamental knowledge to guide the reconstruction of transcriptional regulatory networks in bacterial genomes.

introduction

an operon in bacterial genomes is defined as a group of consecutive genes regulated by a common promoter that all share the same terminator ( ).
genes in the same operon generally encode proteins with related or similar biological functions; e.g., lacz, lacy, and laca in the lac operon encode proteins that help cells use lactose ( , ). with decades of research on bacterial transcriptional regulation, the operon model has been found to involve complex mechanisms that control expression ( - ). multiple studies have shown that bacterial genes are dynamically transcribed under different triggering conditions, leading to shared genes among different mrna transcripts ( - ). this dynamic architecture can be redefined by the full set of alternative transcription units (atus) ( , ); more details can be found in fig. s .

atu identification is of fundamental importance for understanding the transcriptional regulatory mechanisms of bacteria, and these dynamic structures have been demonstrated to be associated with human diseases ( - ). for example, bhat et al. studied the alr-groel operon, which is essential for the survival or virulence of m. tuberculosis ( , ), the causative agent of tuberculosis (tb), and found that the regulation of the sub-operon is distinct from that of the main operon (alr-groel operon) under stress, especially during heat shock, ph, and sds stresses ( ). another example is helicobacter pylori, a gastric pathogen that is the primary known risk factor for gastric cancer ( ). sharma et al. found an acid-induced sub-operon cag - transcribed from the primary cag - operon in the cag pathogenicity island of the h. pylori genome under acid stress ( ).
understanding the mechanism of the complex atu structure in these pathogenic bacteria can help us to study the emergence of human diseases caused by bacterial organisms.

several newly developed techniques have provided a comprehensive view of the e. coli transcriptome by identifying full-length primary transcripts ( - ). for example, smrt-cappable-seq ( ) combines the isolation of the full-length bacterial primary transcriptome with pacbio smrt (single molecule, real-time) sequencing ( ), and simultaneous ’ and ’ end sequencing (send-seq) ( ) captures both transcription start sites (tsss) and transcription termination sites (ttss) via circularization of transcripts ( ). despite the great progress in experimental techniques, some deficiencies remain. on the one hand, the read depth and error rate of the third-generation sequencing used in smrt-cappable-seq affect atu prediction compared with illumina-based rna-seq ( , ). on the other hand, these experimental techniques are time-consuming, laborious, and costly, which makes them impractical for general atu prediction in bacteria under specific conditions. thus, novel and robust computational methods for atu identification in bacterial genomes based on rna-seq are urgently needed.

fortunately, many computational studies have laid preliminary groundwork for atu prediction in bacteria. several public databases, such as regulondb ( ), dbtbs ( ), microbesonline ( ), door ( , ), operomedb ( ), dminda . ( ), and proopdb ( ), provide various levels of operon information and small amounts of atu information. however, these databases cannot provide genome-scale atu information under specific conditions. some computational studies, including rockhopper ( ), seqtu ( , ), bac-browser ( ), rseqtu ( ), and operon-mapper ( ), utilize machine learning and model integration methods based on genomic information and gene expression profiles to identify bacterial transcription architecture. however, these works still cannot resolve the dynamic patterns and overlapping nature of atus.

here, we present seqatu, a novel computational method for genome-scale atu prediction by analyzing next- and third-generation rna-seq data (fig. and table s ). seqatu utilizes a convex quadratic programming (cqp) model and aims to provide the optimum expression combination of all of the to-be-identified atus. specifically, cqp minimizes the squared error between the predicted expression levels of atus and the actual expression levels in genetic and intergenic regions. it is noteworthy that seqatu also uses the bias rate function, which models the non-uniform read distribution, as the linear constraints of cqp to profile the complexity of the atu architecture. overall, seqatu provides a generalized framework for the inference of atus based on next-generation rna-seq data collected under multiple conditions and can be easily applied to any bacterial organism to identify the atu architecture and construct a transcriptional regulatory network.

please place fig. here.
materials and methods

data collection

the two cappable rna-seq datasets used in this study, m enrich_seq and rienrich_seq, were obtained from e. coli grown under two different conditions: m minimal medium and rich medium, respectively ( ). the full-length primary transcripts were enriched as described in ( ), with modifications to adapt the protocol to illumina sequencing. capping and polya tailing were performed as described in ( ). the capped rna was enriched using hydrophilic streptavidin magnetic beads (new england biolabs) and eluted with biotin under the same conditions ( ). in contrast to that protocol, the eluted rna was enriched once more using streptavidin beads to further remove processed rna (e.g., rrna). subsequently, the eluted rna was used for library preparation with the nebnext ultra ii directional rna library prep kit (e ). sequencing was performed on the illumina miseq system (paired-end, bp). all reads were mapped to the e. coli genome using burrows-wheeler aligner (bwa) with the default parameters ( ). read alignment and other computational analyses were carried out using the e. coli genome nc_ . , and the corresponding gene annotations (gcf_ . _asm v _genomic.gff) were downloaded from ncbi.

two experimentally verified atu datasets, smrt_m enrich and smrt_rienrich, were used as the benchmark data to evaluate the predicted atus; they were generated by smrt-cappable-seq under the same conditions as the illumina datasets m enrich_seq and rienrich_seq, respectively ( ).
in addition, the atus defined by regulondb ( ) and send-seq ( ) were also used as additional evaluation data in our study.

calculation of the expression values of genetic and intergenic regions

after the rna-seq reads in m enrich_seq and rienrich_seq were mapped to the e. coli genome using bwa, we determined the number of reads $h(p)$ covering each genomic position $p$. suppose that $g_i$ and $g_{i+1}$ are two consecutive genes on the same strand; we denote the expression value of $g_i$ as $e_i$ and the expression value of the intergenic region $g_{i,i+1}$ between genes $g_i$ and $g_{i+1}$ as $e_{i,i+1}$. then, $e_i$ and $e_{i,i+1}$ are given by:

$$e_i = \frac{\sum_{p \in g_i} h(p)}{|g_i|} \qquad ( )$$

$$e_{i,i+1} = \frac{\sum_{p \in g_{i,i+1}} h(p)}{|g_{i,i+1}|} \qquad ( )$$

where $p \in g_i$ denotes that genomic position $p$ is on the gene $g_i$ and $|g_i|$ denotes the genomic length of $g_i$.

modeling non-uniform read distribution along mrna transcripts

we introduced the bias rate function, which is similar to the bias curves in the work of wu et al. ( ), to address the non-uniform distribution of the rna-seq reads along mrna transcripts ( - ). the bias function reflects the relative read distribution bias from the ’ end to the ’ end of an mrna transcript.

we assumed that the maximum read coverage over all the genomic positions of an mrna transcript is the expression level without bias. it is noteworthy that a single gene mrna transcript with no shared gene among different mrna transcripts can serve as the ideal template for modeling non-uniform read distribution along mrna transcripts.
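a minimal python sketch of the two expression-value formulas above, together with the unbiased expression level (maximum coverage) assumed for a transcript. the coverage list, the 0-based inclusive coordinates, the handling of empty regions, and all function names are illustrative assumptions and are not taken from seqatu.

```python
from typing import List, Tuple


def region_expression(coverage: List[int], start: int, end: int) -> float:
    """Mean per-base read coverage over the inclusive region [start, end]."""
    if end < start:
        # Assumption: an empty region (e.g. adjacent or overlapping genes)
        # is given an expression value of 0.
        return 0.0
    return sum(coverage[start:end + 1]) / (end - start + 1)


def intergenic_expression(coverage: List[int],
                          gene_a: Tuple[int, int],
                          gene_b: Tuple[int, int]) -> float:
    """Mean coverage strictly between two consecutive same-strand genes."""
    return region_expression(coverage, gene_a[1] + 1, gene_b[0] - 1)


def unbiased_level(coverage: List[int], start: int, end: int) -> int:
    """Maximum coverage over a transcript, taken as its level without bias."""
    return max(coverage[start:end + 1])


# Toy per-position read counts for positions 0..9 (hypothetical values).
coverage = [0, 2, 4, 4, 2, 0, 1, 1, 6, 6]
gene_1, gene_2 = (1, 4), (8, 9)

e_1 = region_expression(coverage, *gene_1)               # gene expression value
e_12 = intergenic_expression(coverage, gene_1, gene_2)   # intergenic expression value
h_max = unbiased_level(coverage, 1, 9)                   # transcript spanning both genes
```

averaging over the region length keeps long and short regions comparable, which matters later when gene and intergenic values enter the same objective.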
the specific steps of modeling non-uniform read distribution are detailed as follows:

step : single gene mrna transcript selection. we selected single gene mrna transcripts from the evaluation data and plotted their expression distributions. specifically, groups of single gene mrna transcripts with lengths ranging from to , bp were selected from the evaluation data (more details are given in method s ), and each group had ten randomly chosen mrna transcripts. apparent declining trends appeared in the long single gene mrna transcripts, with lengths ranging from , to , bp (fig. s ). a possible reason for this phenomenon is that incomplete transcription and ’ end degradation or processing enrich the signal at the ’ end of long mrna transcripts ( , ). finally, we plotted the expression distribution of single gene mrna transcripts with lengths ranging from , to , bp.

step : acquiring the bias rate function. we applied nonlinear regression to the expression distribution of the selected single gene mrna transcripts and acquired the hypothetical function �(�). specifically, the x axis and y axis of the expression distribution were converted to the distance from the ’ end of an mrna transcript and the bias rate of read distribution, respectively. to apply nonlinear regression to single gene mrna transcripts with different lengths, normalization was also implemented on x.
here, � = (��, ��, … , ��) and � = (��, ��, … , ��) are defined by:

�� = ⎩ ⎨ ⎧ �� − �� − �� × �, �� − �� − �� × �, �� ( )

�� = ⎩ ⎪ ⎨ ⎪ ⎧ �(��) �� , �� (�� ) �� , �� ( )

where � denotes the number of genomic positions on an mrna transcript; � = (��, ��, … , ��) denotes the genomic positions on an mrna transcript; �� = ��; �(�� ) denotes the expression level of the genomic position �� , i.e., the number of reads covering the genomic position �� ; and �� denotes the expression level without bias in an mrna transcript, which is calculated as �� {�(�� )}, ≤ � ≤ �. we used the function nls in r to acquire the hypothetical function �(�).

step : constructing bias rate vectors. we constructed a genetic or intergenic region bias rate vector for each mrna transcript by calculating the bias rate of all of its component genetic or intergenic regions. the bias rate of a genetic or an intergenic region is the average bias rate of all the genomic positions that it contains. considering an mrna transcript � and its component gene set {��, ��, … , ��} (the details of the gene labels are described in method s ), we denoted the genetic region bias rate vector as � = (��, ��, … , �� ), which was calculated using the formula:

�� = ⎩ ⎪ ⎨ ⎪ ⎧ ∑ �(�� ) �� − �� + , �� ∑ �(�� ) �� − �� + , �� ( )

where � denotes the number of genomic positions on �; �� denotes the bias rate of �� for �; and �� = (�� , �� , �� , �� , … , �� , �� ) is the range of the genomic positions of {��, ��, … , ��}, while the range of the genomic positions of �� is [�� , �� ], ≤ � ≤ �.
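the rule stated in the step above (the bias rate of a region is the average bias rate of the positions it contains) can be sketched in python as follows. this is an illustration under stated assumptions, not the seqatu implementation: the exponential form a * exp(b * x), the coefficient values, the use of raw base-pair distance from the transcript start as x (no normalization), and the forward-strand orientation are all hypothetical choices made here.

```python
import math
from typing import List, Tuple


def bias_rate(x: float, a: float, b: float) -> float:
    """Assumed exponential bias rate at distance x: a * exp(b * x)."""
    return a * math.exp(b * x)


def region_bias_rates(transcript_start: int,
                      regions: List[Tuple[int, int]],
                      a: float,
                      b: float) -> List[float]:
    """Average bias rate over every position of each inclusive region.

    x is the distance in bp from transcript_start (forward strand assumed).
    """
    rates = []
    for start, end in regions:
        values = [bias_rate(p - transcript_start, a, b)
                  for p in range(start, end + 1)]
        rates.append(sum(values) / len(values))
    return rates


# Hypothetical coefficients and gene coordinates, chosen only for illustration.
A_COEF, B_COEF = 1.0, -0.001
genes = [(1000, 1499), (1600, 2099)]
alpha = region_bias_rates(1000, genes, A_COEF, B_COEF)
```

with a negative exponent coefficient, regions further from the transcript start receive a smaller average bias rate, which is the behaviour the declining coverage trend suggests.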
the intergenic region bias rate vector � = (��, ��, … , ��) is calculated analogously; the details are provided in method s .

modification of maximal atu clusters

a maximal atu cluster is defined as a maximal consecutive gene set such that each pair of its consecutive genes can be covered by at least one atu. similar to atus, maximal atu clusters are also dynamically composed under different conditions or environmental stimuli in bacterial genomes ( , ). such a maximal atu cluster can be used as an independent genomic region for atu prediction, which alleviates the difficulty in computationally predicting atus at the genome scale. the output of our in-house tool rseqtu can serve as the maximal atu cluster data, which lays a solid foundation for atu prediction ( ). we modified the maximal atu clusters from rseqtu: (i) two maximal atu clusters separated by less than bp were combined into one cluster and (ii) a maximal atu cluster was split at the intergenic region where the opposite-strand genes were located. in addition, we selected the maximal atu clusters with expression values over ten (see the details in method s ), according to the study of ettwiller et al. ( ).

the mathematical programming model for atu prediction

the predicted atu expression profile should be consistent with the observed expression profiles of the genetic and intergenic regions.
therefore, the prediction of the atu profiles can be modeled as an optimization problem that seeks an optimum expression combination of all of the to-be-identified atus, minimizing the gap between the predicted atus and the observed genetic and intergenic region expression profiles. here, a convex quadratic programming model was built to solve this optimization problem.

we denoted a maximal atu cluster as $C$, assuming that it contains the consecutive genes $\{g_1, \dots, g_n\}$, and the intergenic regions of these genes are $\{g_{1,2}, \dots, g_{n-1,n}\}$. the size of $C$ is defined as the number of its component genes, $n$. theoretically, there are $n(n+1)/2$ atus for $C$; an atu with consecutive genes $\{g_i, g_{i+1}, \dots, g_j\}$ is denoted as $A_{i,j}$, and the corresponding expression value is $x_{i,j}$, $1 \le i \le j \le n$. for the component gene $g_k$ of $C$, the gap between the gene expression value $e_k$ and the sum of the expression levels of the atus containing it is denoted as $\varepsilon_k$, which provides the first $n$ equality constraints in our mathematical programming model, $k = 1, 2, \dots, n$. similarly, for the intergenic region $g_{k,k+1}$ of $C$, the gap between the intergenic region expression value $e_{k,k+1}$ and the sum of the expression levels of the atus containing it is denoted as $\varepsilon_{n+k}$, providing the last $n - 1$ equality constraints in our mathematical programming model, $k = 1, 2, \dots, n - 1$. the goal of our mathematical programming model is to minimize the square of $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n, \varepsilon_{n+1}, \dots, \varepsilon_{2n-1})$, as the combination of $x_{i,j}$ with a minimal value of $\varepsilon^2$ corresponds to an optimum expression combination of all atus $A_{i,j}$ for $C$, $1 \le i \le j \le n$. additionally, to control the number of optimal solutions and reduce the false-positive errors, we added an $\ell_1$ regularization $\lambda\lVert x\rVert_1$ to $\varepsilon^2$; with $x_{i,j} \ge 0$, this term is a linear function. because the expression level varies between maximal atu clusters, we used the expression value of $C$ as $\lambda$. in total, the convex quadratic programming model with unknown variables $(x, \varepsilon)$ is shown as follows:

$$
\begin{aligned}
\min_{x,\,\varepsilon}\quad & \varepsilon^{2} + \lambda \lVert x \rVert_{1} \\
\text{s.t.}\quad & \sum_{i=1}^{k}\sum_{j=k}^{n} \alpha_{i,j}^{k}\, x_{i,j} = e_k + \varepsilon_k, && k = 1, 2, \dots, n \\
& \sum_{i=1}^{k}\sum_{j=k+1}^{n} \beta_{i,j}^{k}\, x_{i,j} = e_{k,k+1} + \varepsilon_{n+k}, && k = 1, 2, \dots, n-1 \\
& x = (x_{i,j}), \quad x_{i,j} \ge 0, && 1 \le i \le j \le n \\
& \varepsilon = (\varepsilon_1, \dots, \varepsilon_n, \varepsilon_{n+1}, \dots, \varepsilon_{2n-1})
\end{aligned}
\qquad ( )
$$

where $\alpha = (\alpha_{i,j}^{k})$ is the genetic region bias rate vector for $C$, $\alpha_{i,j}^{k}$ is the bias rate of gene $g_k$ for atu $A_{i,j}$, $1 \le i \le j \le n$, $i \le k \le j$; $\beta = (\beta_{i,j}^{k})$ is the intergenic region bias rate vector for $C$, and $\beta_{i,j}^{k}$ is the bias rate of the intergenic region $g_{k,k+1}$ for atu $A_{i,j}$, $1 \le i < j \le n$, $i \le k < j$ (see the details in method s ).

two evaluation methods for atu prediction

in the first evaluation method, precision and recall were defined based on perfect matching (eqs. ). perfect matching of two atus means that all of their component genes are the same. here, the true positives ($TP$) are the number of predicted atus with the same component genes as an atu in the evaluation data; the false positives ($FP$) are the number of predicted atus that do not exist in the evaluation data; the false negatives ($FN$) are the number of atus that appear in the evaluation data but not in the prediction results of seqatu.
$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN} \qquad ( )$$

in the second evaluation method, precision and recall were defined based on relaxed matching, which is measured by the similarity of two atus. assuming that an atu $A$ is in one of two datasets (the predicted atu dataset and the evaluated atu dataset), the definition and calculation of the similarity of $A$ are shown in the following three cases:

case : if $A$ shares boundary genes at both ends of an atu in the other dataset, i.e., all component genes of $A$ are the same as one in the other dataset, then $\mathrm{sim}(A) = 1$.

case : if $A$ shares exactly one boundary gene of atus in the other dataset, then we denote $S_1$ as the atus in the other dataset that share the ’-end gene with $A$ and $S_2$ as the atus in the other dataset that share the ’-end gene with $A$, with $S_1 \cap S_2 = \emptyset$; one of $S_1$ and $S_2$ can be empty. then,

$$\mathrm{sim}(A) = \max_{A_i \in S_1} \frac{n(A_i)}{m(A_i)} + \max_{A_j \in S_2} \frac{n(A_j)}{m(A_j)} \qquad ( )$$

where $n(A_i)$ is the number of shared genes of $A$ and $A_i$ and $m(A_i)$ is the maximal size of $A$ and $A_i$.

case : if $A$ shares no boundary genes at both ends of the atus in the other dataset, then $\mathrm{sim}(A) = 0$.

finally, the precision and recall based on relaxed matching are calculated by the following formula:

$$\text{precision} = \frac{\sum_{A \in S_P} \mathrm{sim}(A)}{N_P}, \qquad \text{recall} = \frac{\sum_{A \in S_E} \mathrm{sim}(A)}{N_E} \qquad ( )$$

where $S_P$ is the set of predicted atus, $N_P$ is the number of predicted atus, $S_E$ is the set of evaluated atus, and $N_E$ is the number of evaluated atus.
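the first (perfect-matching) evaluation can be sketched in python as below. representing each atu as a tuple of gene identifiers, the gene names, and the convention of returning 0.0 when a denominator is zero are assumptions made for this illustration only; the relaxed-matching similarity is not implemented here.

```python
from typing import Iterable, Sequence, Tuple


def perfect_match_metrics(predicted: Iterable[Sequence[str]],
                          reference: Iterable[Sequence[str]]) -> Tuple[float, float]:
    """Precision and recall where two ATUs match only if all genes are identical."""
    pred = {tuple(atu) for atu in predicted}
    ref = {tuple(atu) for atu in reference}
    tp = len(pred & ref)   # predicted ATUs present in the evaluation data
    fp = len(pred - ref)   # predicted ATUs absent from the evaluation data
    fn = len(ref - pred)   # evaluation ATUs that were not predicted
    # Assumption: report 0.0 instead of dividing by zero for empty inputs.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical gene identifiers for a small cluster.
predicted = [("g1", "g2"), ("g2",), ("g3", "g4")]
reference = [("g1", "g2"), ("g3",), ("g3", "g4"), ("g4",)]
precision, recall = perfect_match_metrics(predicted, reference)
```

in this toy case two of the three predicted atus appear in the reference set of four, so precision is two thirds and recall is one half.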
results

a reliable bias rate function is acquired in modeling non-uniform read distribution along mrna transcripts

to ensure the reliability of the bias rate function in modeling non-uniform read distribution, we selected four single gene mrna transcript datasets randomly from the two evaluation datasets (smrt_m enrich and smrt_rienrich), named m enrich_ , m enrich_ , rienrich_ , and rienrich_ . four bias rate functions, all exponential, were generated by nonlinear regression on the mrna transcripts in these four datasets (fig. ). these bias rate functions were similar ($R^2$ > . ) when evaluated with the r-square statistic (for more details, see method s and table s ). the similarity of the four bias rate functions indicated that the selection of the single gene mrna transcript datasets had little impact on modeling non-uniform read distribution along mrna transcripts, implying a common non-uniform read distribution across different mrna transcripts of e. coli. specifically, we used the average of the four sets of coefficients as the final coefficients of the exponential function, which was �(�) = �� with � = . and � = . .

please place fig. here.

atus predicted by seqatu reach precision and recall over .

the performance evaluation was conducted by comparing the predicted atus with the atus in smrt_m enrich and smrt_rienrich, which were generated based on third-generation sequencing and are not sensitive to transcripts with low expression levels.
for a more accurate and fair evaluation, only the maximal atu clusters that passed pre-selection were retained in the subsequent evaluations (more details about the pre-selection of maximal atu clusters can be seen in method s and fig. s ). the precision and recall of the predicted atus were calculated for each maximal atu cluster. by considering only perfect matching, the average precision and recall were . and . for m enrich_seq and . and . for rienrich_seq, respectively. when using relaxed matching, the average precision and recall increased to . and . for m enrich_seq and . and . for rienrich_seq, respectively. the statistics for precision and recall on maximal atu clusters with different sizes are shown in fig. a and fig. s a. these results showed that the average precision and recall decreased with the increasing size of maximal atu clusters (other than several large ones, of which there were only a few). the results also indicated that the evaluation results based on relaxed matching were significantly higher than those based on perfect matching across different sizes. this result implied that the atus incorrectly predicted by seqatu under perfect matching tended to have strong similarities with the atus in the evaluation data. in addition, we also found that more than a quarter of the atus incorrectly predicted by seqatu under perfect matching ( %/ % for m enrich_seq/rienrich_seq) matched the transcription units in regulondb ( ).
the two evaluation datasets (smrt_m enrich and smrt_rienrich) were both from smrt- cappable-seq, while one of the processing steps of the technique filtered rna reads smaller than , bp ( ), which indicated that the atus in these two evaluation datasets were not comprehensive. to address this issue, we enriched the evaluation data by adding the atus defined by send-seq ( ), as send-seq did not introduce any filtering based on rna size. when we used the new evaluation data, the atus predicted by seqatu improved by % ( . ) and % ( . ) in terms of the average precision based on perfect matching for m enrich_seq and rienrich_seq, respectively, and by % ( . ) and % ( . ) based on relaxed matching. the statistics for precision across different sizes of the maximal atu clusters are shown in fig. b and fig. s b, showing that the values of precision based on perfect matching were significantly improved across different sizes of maximal atu clusters by using the evaluated atus from smrt-cappable-seq and send-seq. this result suggested that the atus we predicted, which were not in smrt_m enrich and smrt_rienrich, may be due to the rna length selection of smrt-cappable-seq. we enriched the evaluation data by adding the atus in regulondb ( ) and also found the improvement of precision across different sizes of maximal atu clusters for m enrich_seq and rienrich_seq (fig. s c). furthermore, to facilitate the understanding of the performance of seqatu and to measure the influence of the maximal atu clusters from rseqtu on our atu prediction method, smrt maximal atu clusters collected from smrt_m enrich and smrt_rienrich (for more details, see method s ) .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . 
were applied for the CQP in two conditions (M9 minimal medium and rich medium). We found that precision and recall increased to . and . for M9Enrich_seq, respectively, and . and . for RiEnrich_seq based on perfect matching (Fig. S D). Additionally, when using relaxed matching, precision and recall significantly increased to . and . for M9Enrich_seq, respectively, and . and . for RiEnrich_seq (Fig. S D). The significantly improved results verified the ability of SeqATU to accurately predict ATUs when given more accurate maximal ATU clusters. In addition, we found that the number of predicted ATUs and the number of evaluated ATUs under maximal ATU clusters of the same size were similar except for the maximal size (Fig. C), and both were far smaller than the theoretical number, which indicated that SeqATU can effectively exclude most of the incorrect ATUs.

Please place Fig. here.

The bias rate constraints efficiently improve the ability of SeqATU to predict ATUs

We used SeqATU without bias rate constraints to predict the ATUs of E. coli and found that its performance decreased significantly compared with SeqATU (Fig. and Fig. S ). Specifically, the F-score of SeqATU without bias rate constraints was . / . based on perfect matching for M9Enrich_seq/RiEnrich_seq, compared with . / . for SeqATU. When using relaxed matching, the F-score of SeqATU without bias rate constraints was . / . for M9Enrich_seq/RiEnrich_seq, compared with . / . for SeqATU. This result suggested that the bias rate constraints of SeqATU could capture useful information about the non-uniform distribution of the RNA-seq reads along the
mRNA transcripts ( - ) and thereby efficiently improve the ability of the model to predict complex ATUs.

Please place Fig. here.

ATUs predicted by SeqATU display a dynamic composition and overlapping nature

A total of , distinct ATUs were identified in M9 minimal medium, and , were identified in rich medium. Among them, there were , / , distinct ATUs on the forward strand and , / , on the reverse strand for M9Enrich_seq/RiEnrich_seq. Each predicted ATU comprised an average of . genes, with the largest ATU containing genes across the two conditions. The distribution of the size of the predicted ATUs is shown in Fig. A, from which we can see that the majority of ATUs (more than %) contained fewer than five genes in M9 minimal medium and rich medium. Approximately % of the genes in E. coli were contained in more than one ATU for M9Enrich_seq, compared to % of genes for RiEnrich_seq, suggesting that the ATUs in a maximal ATU cluster generally overlapped with each other (Fig. B). In addition, there were , maximal ATU clusters for M9Enrich_seq and , maximal ATU clusters for RiEnrich_seq. SeqATU identified a total of , identical ATUs under the two conditions, whereas there were , distinct ATUs. Among the distinct ATUs across the two conditions, ATUs were from the same maximal ATU clusters in the two maximal ATU cluster datasets, and the rest were from different maximal ATU clusters. The fact that there were distinct ATUs under the two conditions suggests that ATUs are dynamically responsive to
different conditions or environmental stimuli (for more real examples of ATUs under different conditions, see Fig. S ). The dynamic composition of the ATUs predicted by SeqATU is of great significance for understanding the interactions inside polymicrobial communities. For example, chronic airway infection by Pseudomonas aeruginosa contributes considerably to lung tissue destruction and impairment of pulmonary function in cystic fibrosis (CF) patients ( ). Marie et al. found that the presence of E. coli complemented the growth defect of a P. aeruginosa bioA-disrupted mutant that is unable to grow on rich medium, and can be beneficial to P. aeruginosa when the biotin supply is limited ( ). An ATU with a high expression level that contains the uvrB gene was identified by SeqATU in rich medium, whereas it was absent in M9 minimal medium (Fig. ). We predicted the uvrB gene to be involved in the biotin metabolism pathway, as the bioB, bioF, bioC, and bioD genes contained in the same ATU are known members of the biotin metabolism KEGG pathway. Therefore, the observation by Marie et al. can be explained by the ATUs containing the uvrB gene of E. coli providing the biotin supply for P. aeruginosa in rich medium. This result showed that SeqATU could increase our understanding of interspecies competition and cooperation, which play an important role in shaping the composition and structure of polymicrobial bacterial populations.

Please place Fig. here.

Please place Fig. here.
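The bookkeeping behind the comparison of ATUs across the two conditions can be illustrated with a short Python sketch. It assumes that each ATU is a tuple of gene names and that two ATUs are identical only when their gene tuples are equal. The function names and toy data are hypothetical, and the overlap fraction is computed over genes covered by at least one ATU, which may differ from the denominator used in the study.

```python
from collections import Counter


def compare_conditions(atus_a, atus_b):
    # Identical ATUs appear in both conditions; distinct ATUs appear in only one.
    a, b = set(atus_a), set(atus_b)
    return a & b, a ^ b


def fraction_genes_in_multiple_atus(atus):
    # Count, for every gene, the number of different ATUs that contain it.
    counts = Counter(gene for atu in set(atus) for gene in atu)
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n > 1) / len(counts)


# Toy data with hypothetical gene names (not real E. coli ATUs).
atus_minimal = [("g1", "g2"), ("g1", "g2", "g3"), ("g4",)]
atus_rich = [("g1", "g2"), ("g2", "g3"), ("g4",)]

identical, distinct = compare_conditions(atus_minimal, atus_rich)
print("identical:", sorted(identical))
print("distinct:", sorted(distinct))
print("overlap fraction:", fraction_genes_in_multiple_atus(atus_minimal))
```

Here g1 and g2 each lie in two ATUs of the first condition, so half of the covered genes belong to more than one ATU, which is the kind of overlap the text reports for maximal ATU clusters.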
ATUs predicted by SeqATU are verified by experimental TSSs and TTSs

An experimental TSS dataset of E. coli from SEnd-seq ( ) and a TF binding site dataset of E. coli from the experimental dataset of RegulonDB ( ) were used to further verify the reliability of SeqATU and were named dataset and dataset , respectively. There were , experimental TSSs in dataset and , experimental TF binding sites in dataset . We considered the 5'-end genes and no 5'-end genes of the ATUs predicted by SeqATU. A gene that is not the 5'-end gene of any predicted ATU is named a no 5'-end gene. We identified , / , 5'-end genes and , / , no 5'-end genes of the predicted ATUs for M9Enrich_seq/RiEnrich_seq. A gene is considered validated by experimental TSSs or TF binding sites if it is the immediate downstream gene of an experimental TSS or TF binding site. As a result, the proportion of 5'-end genes of the predicted ATUs that were validated by experimental TSSs or TF binding sites was over . times greater than that of the no 5'-end genes (Table ). Specifically, the proportion of 5'-end genes ( %/ % for M9Enrich_seq/RiEnrich_seq) validated by experimental TF binding sites was over three times greater than that of the no 5'-end genes ( . %/ . % for M9Enrich_seq/RiEnrich_seq). These results further verified the reliability of the ATUs predicted by SeqATU at the TSS level. In addition, four other experimental TSS or promoter datasets from RegulonDB ( ), dRNA-seq ( ), and Cappable-seq ( ) were also examined. The results are shown in Table S ; again, a higher proportion of 5'-end genes than of no 5'-end genes of the predicted ATUs was validated by experimental TSSs or promoters. We also used two experimental TTS datasets of E. coli from SEnd-seq ( ) and RegulonDB ( ) to
verify the reliability of the ATUs predicted by SeqATU at the TTS level. These two experimental TTS datasets were named dataset and dataset , respectively. There were , experimental TTSs in dataset and experimental TTSs in dataset . We considered the 3'-end genes and no 3'-end genes of the ATUs predicted by SeqATU. A gene that is not the 3'-end gene of any predicted ATU is named a no 3'-end gene. We identified , / , 3'-end genes and , / no 3'-end genes of the predicted ATUs for M9Enrich_seq/RiEnrich_seq. A gene is considered validated by experimental TTSs if it is the immediate upstream gene of an experimental TTS. As a result, the proportion of 3'-end genes of the predicted ATUs that were validated by experimental TTSs was over two times greater than that of no 3'-end genes (Table ). Specifically, the proportion of 3'-end genes ( %/ % for M9Enrich_seq/RiEnrich_seq) validated by experimental TTSs from SEnd-seq was over three times greater than that of no 3'-end genes ( %/ % for M9Enrich_seq/RiEnrich_seq). These results further verified the reliability of the ATUs predicted by SeqATU at the TTS level. In addition, two other computationally predicted TTS datasets from the works by Nadiras et al. ( ) and Kingsford et al. ( ) were also examined. The results are shown in Table S ; the proportion of 3'-end genes ( %/ % for M9Enrich_seq/RiEnrich_seq) validated by computationally predicted Rho-independent TTSs was over two times greater than that of no 3'-end genes ( %/ % for M9Enrich_seq/RiEnrich_seq).

Please place Table here.

Please place Table here.
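The validation rule used in this section (a gene is validated when it is the immediate downstream gene of an experimental TSS) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: genes are given as (left, right, strand) coordinates, each ATU is a tuple of gene names listed in transcription order so that its first element is the 5'-end gene, and no distance cutoff between a site and its downstream gene is applied. All names and coordinates are hypothetical. The TTS case is symmetric, using the last gene of each ATU and the immediate upstream gene of each site.

```python
from bisect import bisect_left, bisect_right


def immediate_downstream_genes(genes, sites):
    # genes: name -> (left, right, strand); sites: list of (position, strand).
    # On the forward strand a gene's 5' end is its left coordinate;
    # on the reverse strand it is its right coordinate.
    fwd = sorted((left, name) for name, (left, right, strand) in genes.items() if strand == "+")
    rev = sorted((right, name) for name, (left, right, strand) in genes.items() if strand == "-")
    fwd_pos = [pos for pos, _ in fwd]
    rev_pos = [pos for pos, _ in rev]
    hit = set()
    for pos, strand in sites:
        if strand == "+":
            i = bisect_left(fwd_pos, pos)       # first gene starting at or after the site
            if i < len(fwd):
                hit.add(fwd[i][1])
        else:
            i = bisect_right(rev_pos, pos) - 1  # nearest gene whose 5' end is at or below the site
            if i >= 0:
                hit.add(rev[i][1])
    return hit


def validated_fractions(atus, genes, sites):
    # Compare 5'-end genes of predicted ATUs with all remaining genes.
    five_prime = {atu[0] for atu in atus}
    others = set(genes) - five_prime
    hit = immediate_downstream_genes(genes, sites)

    def fraction(group):
        return len(group & hit) / len(group) if group else 0.0

    return fraction(five_prime), fraction(others)


# Toy annotation and sites (hypothetical coordinates).
genes = {
    "g1": (100, 200, "+"), "g2": (250, 400, "+"), "g3": (450, 600, "+"),
    "g4": (700, 900, "-"), "g5": (950, 1200, "-"),
}
atus = [("g1", "g2", "g3"), ("g2", "g3"), ("g5", "g4")]
sites = [(80, "+"), (230, "+"), (430, "+"), (1250, "-")]

print(validated_fractions(atus, genes, sites))
```

A ratio of the two returned fractions well above one is the pattern reported in the text for 5'-end genes versus no 5'-end genes.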
The gene pairs frequently encoded in the same ATUs are more functionally related than those that can belong to two distinct ATUs

Functional analysis was conducted by integrating GO terms from the Gene Ontology (GO) database ( ). In detail, we measured the level of functional relatedness for two types of consecutive gene pairs, following a definition similar to that in the work by Mao et al. ( ). The two types of consecutive gene pairs were (i) gene pairs each consisting of a 5'-end gene of an ATU and its immediate upstream gene on the same strand and (ii) all the other gene pairs inside an ATU (Fig. A). In addition, we used the scoring scheme of Wu et al. ( ) to measure the GO-based functional similarity between a pair of genes. That study developed a GO similarity score and showed that the larger the score, the more likely that two genes are functionally related. In brief, the GO similarity score of a gene pair $g_i$ and $g_j$ is denoted as $S(g_i, g_j)$:

$$S(g_i, g_j) = \max_{t_i \in T(g_i),\ t_j \in T(g_j)} s(t_i, t_j)$$

where $T(g_i)$ and $T(g_j)$ are the GO terms assigned to $g_i$ and $g_j$, respectively; $s(t_i, t_j)$ is the maximal number of common terms between paths in the two GO graphs induced by the GO terms $t_i$ and $t_j$. As a result, the mean GO similarity score was higher for type-II gene pairs than for type-I gene pairs ( . versus . for M9Enrich_seq and . versus . for RiEnrich_seq). A total of / type-II gene pairs had GO similarity scores greater than four ( %/ % of a total of / ), while only / type-I gene pairs had GO similarity scores greater than four ( %/ % of a total of , / , ) for M9Enrich_seq/RiEnrich_seq. We also applied a χ²-test ( ) to determine whether the
distribution of $S(g_i, g_j)$ differed between the type-I gene pairs and the type-II gene pairs. The χ² statistic corresponded to a p-value less than , which revealed that the distribution of $S(g_i, g_j)$ for the type-II gene pairs was significantly different from that for the type-I gene pairs. Fig. B shows the distribution of $S(g_i, g_j)$ for the type-I gene pairs and the type-II gene pairs. These results strongly indicated that the type-II gene pairs had a higher degree of GO similarity than the type-I gene pairs, suggesting that the gene pairs frequently encoded in the same ATUs (type-II gene pairs) are more functionally related than those that can belong to two distinct ATUs (type-I gene pairs). We also carried out a similar analysis of the two types of gene pairs based on KEGG enrichment analysis (see more details in Method S ) and found that the proportion of type-II gene pairs whose two genes were contained in the same KEGG pathway ( %/ % for M9Enrich_seq/RiEnrich_seq) was higher than the proportion of type-I gene pairs ( %/ % for M9Enrich_seq/RiEnrich_seq) (Fig. C). The distribution of the KEGG similarity scores of the two types of gene pairs is shown in Fig. D, suggesting that genes of type-II gene pairs have a higher probability of participating in the same KEGG pathway than those of type-I gene pairs.

Please place Fig. here.

Discussion

We developed SeqATU, the first computational method for genome-scale ATU prediction by analyzing next- and third-generation RNA-seq data, using a CQP model. Linear constraints provided by the bias
rate of read distribution were, for the first time, integrated into the CQP model. Positional bias refers to the non-uniform distribution of reads over different positions of a transcript ( , ), which is handled by learning non-uniform read distributions from given RNA-seq reads ( ) or modeling the RNA degradation ( ). The bias rate function we proposed can address the non-uniform read distribution along mRNA transcripts and is also suitable for standard next-generation RNA-seq data that involve more degraded mRNAs, as the exponential function has been used to model the degradation of mRNA transcripts ( ). As a result, a total of , distinct ATUs for M9Enrich_seq and , distinct ATUs for RiEnrich_seq were identified by SeqATU. The precision and recall reached . / . and . / . , respectively, based on perfect matching and . / . and . / . , respectively, based on relaxed matching for M9Enrich_seq/RiEnrich_seq. We further validated the predicted ATUs using experimental transcription factor binding sites or transcription termination sites from RegulonDB and SEnd-seq. In addition, the proportion of the 5'- or 3'-end genes of predicted ATUs that were validated by experimental transcription factor binding sites and transcription termination sites was over three times greater than that of no 5'- or 3'-end genes, demonstrating the high reliability of the predicted ATUs. Gene pairs frequently encoded in the same ATUs were more functionally related than those that can belong to two distinct ATUs according to GO and KEGG enrichment analyses.
These results demonstrated the reliability and accuracy of our predicted ATUs, implying the ability of SeqATU to reveal the transcriptional architecture of the bacterial genome. In fact, the ATU architecture of bacteria is much more complex than that determined with currently used experimental techniques. We investigated the 5'-end genes and no 5'-end genes of the experimental ATUs identified by SMRT-Cappable-seq ( ) using a combination of experimental TSSs from RegulonDB ( ), dRNA-seq ( ), Cappable-seq ( ), and SEnd-seq ( ). We found that the proportion of 5'-end genes ( %) validated by experimental TSSs was not significantly different from that of no 5'-end genes ( %). The high percentage of no 5'-end genes validated by experimental TSSs implied that the ATUs identified by experimental techniques are only a small proportion of the comprehensive ATUs in bacterial organisms due to the dynamic mechanisms of ATUs. These results further verified the necessity of developing robust computational methods for ATU identification. SeqATU not only provides a powerful tool to understand the transcription mechanism of bacteria but also provides a fundamental tool to guide the reconstruction of a genome-scale transcriptional regulatory network. First, the ATU structure can help us to make new functional predictions, as genes in an ATU tend to have related functions. Second, ATUs can elucidate condition-specific uses of alternative sigma factors ( , ). For example, the thrLABC operon is regulated by transcriptional attenuation. Totsuka et al.
found that under the log phase growth condition, thrLABC is the only transcript, while two transcripts, thrLABC and thrBC, are found under the stationary phase growth condition. As validated experimentally, σS can regulate the additional promoter located in front of thrB under the stationary phase growth condition and then separately regulate thrBC, which elucidates the condition-specific uses of σS ( ). Third, understanding the ATU structure is of great help in constructing transcriptional and translational regulatory networks, such as the σ-TUG (σ-factor-transcription unit gene) network ( ). The transcription regulatory network consists of nodes (ATUs and regulatory proteins) and links (interactions) ( ), and the comprehensive ATU structure can provide a nearly complete set of nodes, which can improve the accuracy of regulatory prediction. Although SeqATU has obtained satisfactory prediction results, there are still several challenges regarding the computational prediction of ATUs. On the one hand, due to the influence of the 5' untranslated region (UTR) and 3' untranslated region (UTR) in the intergenic regions, the expression value of intergenic regions cannot be reproduced perfectly by the same calculation used for the expression value of genetic regions. Without accurate reproduction, it is difficult to obtain the best expression combination of ATUs by the programming model based on the expression value of genetic and intergenic regions.
On the other hand, due to the lack of strand-specific RNA-seq data, it is difficult to distinguish whether the expression level of the intergenic region between two consecutive genes on the same strand is derived from ATUs containing these two genes or from antisense RNAs (asRNAs) ( , ). All of these challenges and the great significance of ATU prediction inspire and encourage us to discover more information to determine the ATU structure in bacteria. For example, we plan to add high-confidence TSS and TTS information to our programming model in the future. Additionally, since the microbiome is increasingly recognized as a critical component in human diseases, such as inflammatory bowel disease ( ), antibiotic-associated diarrhoea ( ), neurological disorders ( ), and cancer ( ) ( ), predicting new ATUs of uncultured species from metagenomic and metatranscriptomic data is of great significance in uncovering new regulatory pathways and metabolic products during the development of diseases ( ). However, because the majority of species within a microbial community have unknown genomes or genome annotations, ATU prediction from metagenomic and metatranscriptomic data is still a challenging task that deserves more attention.

References

F. Jacob, D. Perrin, C. Sanchez, J. Monod, Operon: a group of genes with the expression coordinated by an operator. C R Hebd. Seances. Acad. Sci , - ( ).
F. Jacob, J. Monod, Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol. , - ( ).
Z. Liu, J. Feng, B. Yu, Q. Ma, B.
Liu, The functional determinants in the organization of bacterial genomes. Brief. Bioinform., doi.org/ . /bib/bbaa ( ).
W.-C. Chou, Q. Ma, S. Yang, S. Cao, D. M. Klingeman, S. D. Brown, Y. Xu, Analysis of strand-specific RNA-seq data using machine learning reveals the structures of transcription units in Clostridium thermocellum. Nucleic Acids Res. , e -e ( ).
S.-Y. Niu, B. Liu, Q. Ma, W.-C. Chou, rSeqTU - a machine-learning based R package for prediction of bacterial transcription units. Frontiers in Genetics , ( ).
B. Yan, M. Boitano, T. A. Clark, L. Ettwiller, SMRT-Cappable-seq reveals complex operon variants in bacteria. Nat. Commun. , ( ).
X. Ju, D. Li, S. Liu, Full-length RNA profiling reveals pervasive bidirectional transcription terminators in bacteria. Nature Microbiology , - ( ).
K. Totsuka, K. Totsuka, The transcription unit architecture of the Escherichia coli genome. Nat. Biotechnol. , - ( ).
A. H. Bhat, D. Pathak, A. Rao, The alr-groEL operon in Mycobacterium tuberculosis: an interplay of multiple regulatory elements. Scientific Reports , ( ).
C. M. Sharma, S. Hoffmann, F. Darfeuille, J. Reignier, S. Findeiß, A. Sittka, S. Chabas, K. Reiche, J. Hackermüller, R. Reinhardt, The primary transcriptome of the major human pathogen Helicobacter pylori. Nature , - ( ).
J. M. Durand, G. R. Bjork, Putrescine or a combination of methionine and arginine restores virulence gene expression in a tRNA modification-deficient mutant of Shigella flexneri: a possible role in adaptation of virulence. Mol. Microbiol. , - ( ).
L. E. Wroblewski, R. M. Peek, K. T.
Wilson, Helicobacter pylori and gastric cancer: factors that modulate disease risk. Clin. Microbiol. Rev. , - ( ).
L. Ettwiller, J. Buswell, E. Yigit, I. Schildkraut, A novel enrichment strategy reveals unprecedented number of novel transcription start sites at single base resolution in a model prokaryote and the gut microbiome. BMC Genomics , - ( ).
M. K. Thomason, T. Bischler, S. K. Eisenbart, K. U. Forstner, A. Zhang, A. Herbig, K. Nieselt, C. M. Sharma, G. Storz, Global transcriptional start site mapping using differential RNA sequencing reveals novel antisense RNAs in Escherichia coli. J. Bacteriol. , - ( ).
T. Bischler, H. S. Tan, K. Nieselt, C. M. Sharma, Differential RNA-seq (dRNA-seq) for annotation of transcriptional start sites and small RNAs in Helicobacter pylori. Methods , - ( ).
D. Dar, M. Shamir, J. Mellin, M. Koutero, N. Stern-Ginossar, P. Cossart, R. Sorek, Term-seq reveals abundant ribo-regulation of antibiotics resistance in bacteria. Science , ( ).
J. Clauwaert, G. Menschaert, W. Waegeman, An in-depth evaluation of annotated transcription start sites in E. coli using deep learning. bioRxiv, doi: https://doi.org/ . / . . . , November , pre-print: not peer-reviewed. ( ).
S. Goodwin, J. D. McPherson, W. R. McCombie, Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. , - ( ).
A. Santos-Zavaleta, H. Salgado, S. Gama-Castro, M. Sánchez-Pérez, L. Gómez-Romero, D. Ledezma-Tejeida, J. S. García-Sotelo, K. Alquicira-Hernández, L. J. Muñiz-Rascado, P. Peña-Loredo, RegulonDB v .
: tackling challenges to unify classic and high throughput knowledge of gene regulation in E. coli K- . Nucleic Acids Res. , D -D ( ).
N. Sierro, Y. Makita, M. J. L. de Hoon, K. Nakai, DBTBS: a database of transcriptional regulation in Bacillus subtilis containing upstream intergenic conservation information. Nucleic Acids Res. , - ( ).
P. S. Dehal, M. P. Joachimiak, M. N. Price, J. T. Bates, J. K. Baumohl, C. Dylan, G. D. Friedland, K. H. Huang, K. Keith, P. S. Novichkov, MicrobesOnline: an integrated portal for comparative and functional genomics. Nucleic Acids Res. , D -D ( ).
H. Cao, Q. Ma, X. Chen, Y. Xu, DOOR: a prokaryotic operon database for genome analyses and functional inference. Brief. Bioinform. , - ( ).
X. Mao, Q. Ma, C. Zhou, X. Chen, H. Zhang, J. Yang, F. Mao, W. Lai, Y. Xu, DOOR . : presenting operons and their functions through dynamic and integrated views. Nucleic Acids Res. , D -D ( ).
K. Chetal, S. C. Janga, OperomeDB: a database of condition-specific transcription units in prokaryotic genomes. BioMed Research International , - ( ).
J. Yang, X. Chen, A. McDermaid, Q. Ma, DMINDA . : integrated and systematic views of regulatory DNA motif identification and analyses. Bioinformatics , - ( ).
T. Blanca, C. Ricardo, C. E. Martinez-Guerrero, M. Enrique, ProOpDB: prokaryotic operon database. Nucleic Acids Res. , D -D ( ).
R. McClure, D. Balasubramanian, Y. Sun, M. Bobrovskyy, P. Sumby, C. A. Genco, C. K. Vanderpool, B. Tjaden, Computational analysis of bacterial RNA-seq data. Nucleic Acids Res. ,
e -e ( ).
X. Chen, W. Chou, Q. Ma, Y. Xu, SeqTU: a web server for identification of bacterial transcription units. Scientific Reports , ( ).
I. A. Garanina, G. Y. Fisunov, V. M. Govorun, BAC-BROWSER: the tool for visualization and analysis of prokaryotic genomes. Frontiers in Microbiology , ( ).
B. Taboada, K. Estrada, R. Ciria, E. Merino, Operon-mapper: a web server for precise operon identification in bacterial and archaeal genomes. Bioinformatics , - ( ).
H. Li, R. Durbin, Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics , - ( ).
Z. Wu, X. Wang, X. Zhang, Using non-uniform read distribution models to improve isoform expression inference in RNA-seq. Bioinformatics , - ( ).
A. Roberts, C. Trapnell, J. Donaghey, J. L. Rinn, L. Pachter, Improving RNA-seq expression estimates by correcting for fragment bias. Genome Biol. , - ( ).
R. Bohnert, G. Rätsch, rQuant.web: a tool for RNA-seq-based transcript quantitation. Nucleic Acids Res. , W -W ( ).
W. Li, T. Jiang, Transcriptome assembly and isoform expression level estimation from biased RNA-seq reads. Bioinformatics , - ( ).
B. Xiong, Y. Yang, F. R. Fineis, J.-P.
Wang, DegNorm: normalization of generalized transcript degradation improves accuracy in RNA-seq analysis. Genome Biol. , ( ).
J. Chaitanya, Degradation of mRNA in Escherichia coli. IUBMB Life , - ( ).
X. Mao, Q. Ma, B. Liu, X. Chen, H. Zhang, Y. Xu, Revisiting operons: an analysis of the landscape of transcriptional units in E. coli. BMC Bioinformatics , ( ).
B. Marie, K. H. Thilo, F. Thierry, T. Mikael, R. Adriana, V. D. Christian, Metabolic pathways of Pseudomonas aeruginosa involved in competition with respiratory bacterial pathogens. Frontiers in Microbiology , ( ).
C. Nadiras, E. Eveno, A. Schwartz, N. Figueroa-Bossi, M. Boudvillain, A multivariate prediction model for Rho-dependent termination of transcription. Nucleic Acids Res. , - ( ).
C. L. Kingsford, K. Ayanbule, S. L. Salzberg, Rapid, accurate, computational discovery of Rho-independent transcription terminators illuminates their relationship to DNA uptake. Genome Biol. , R ( ).
M. Ashburner, S. Lewis, On ontologies for biologists: the Gene Ontology - untangling the web. Novartis Found. Symp. , - ; discussion - , - , - ( ).
H. Wu, Z. Su, F. Mao, V. Olman, Y. Xu, Prediction of functional modules based on comparative genome analysis and Gene Ontology application. Nucleic Acids Res. , - ( ).
S. A. Teukolsky, B. P. Flannery, W. Press, W. Vetterling, Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge ( ).
L. Wan, X. Yan, T. Chen, F. Sun, Modeling RNA degradation for RNA-seq with applications. Biostatistics , - ( ).
C.
Yanofsky, Attenuation in the control of expression of bacterial operons. Nature , ( ).
B. K. Cho, D. Kim, E. M. Knight, K. Zengler, B. O. Palsson, Genome-scale reconstruction of the sigma factor network in Escherichia coli: topology and functional states. BMC Biol. , - ( ).
B.-K. Cho, P. Charusanti, M. J. Herrgård, Microbial regulatory and metabolic networks. Curr. Opin. Biotechnol. , - ( ).
A. Toledo-Arana, O. Dussurget, G. Nikitas, N. Sesto, H. Guet-Revillet, D. Balestrino, E. Loh, J. Gripenland, T. Tiensuu, K. Vaitkevicius, The Listeria transcriptional landscape from saprophytism to virulence. Nature , - ( ).
B. Yue, X. Luo, Z. Yu, S. Mani, Z. Wang, W. Dou, Inflammatory bowel disease: a potential result from the collusion between gut microbiota and mucosal immune system. Microorganisms , ( ).
B. H. Mullish, H. R. Williams, Clostridium difficile infection and antibiotic-associated diarrhoea. Clin. Med. , ( ).
M. Maguire, G. Maguire, Gut dysbiosis, leaky gut, and intestinal epithelial proliferation in neurological disorders: towards the development of a new therapeutic using amino acids, prebiotics, probiotics, and postbiotics. Rev. Neurosci. , - ( ).
S. Vivarelli, R. Salemi, S. Candido, L. Falzone, M. Santagati, S. Stefani, F. Torino, G. L. Banna, G. Tonini, M. Libra, Gut microbiota and cancer: from pathogenesis to therapy. Cancers , ( ).
G. Cammarota, G. Ianiro, A. Ahern, C. Carbone, A. Temko, M. J. Claesson, A. Gasbarrini, G. Tortora, Gut microbiome, big data and machine learning to promote precision medicine for cancer.
nature reviews gastroenterology & hepatology , - ( ).
s. s. a. zaidi, x. zhang, computational operon prediction in whole-genomes and metagenomes. briefings in functional genomics , - ( ).

acknowledgements

funding: this work was supported by the national natural science foundation of china (nsfc) [ to b.l., to b.l.]; interdisciplinary science innovation group project of shandong university ( ); and the innovation method fund of china [ im to b.l.]. the authors would like to thank yang li for his assistance in language polishing.

authors’ contributions: b.l., q.m. and w.c. conceived the basic idea and designed the overall analyses. q.w. carried out most of the computational analysis and data interpretation. all the authors wrote the manuscript.

competing interests: the authors declare that they have no competing interests.

data and materials availability: the raw data and source code of seqatu and a detailed tutorial can be found at https://github.com/osu-bmbl/seqatu.

figures and tables

table . results of predicted atus verified by experimental tsss or tf binding sites. overview of the experimental tss and tf binding site datasets (dataset and dataset ) and the proportion of ’-end genes and no ’-end genes of the predicted atus by seqatu for m enrich_seq and rienrich_seq, which were validated by experimental tsss or tf binding sites.

dataset dataset source ju et al. ( ) regulondb tf binding sites technique send-seq collection tsss/tf binding sites , , m enrich_seq ’-end genes % % no ’-end genes % . % rienrich_seq ’-end genes % % no ’-end genes % . %

table .
results of predicted atus verified by experimental ttss. overview of the experimental tts datasets (dataset and dataset ) and the proportion of ’-end genes and no ’-end genes of the predicted atus by seqatu for m enrich_seq and rienrich_seq, which were validated by experimental ttss.

dataset dataset source ju et al. ( ) regulondb ttss technique send-seq collection ttss , , m enrich_seq ’-end genes % % no ’-end genes % . % rienrich_seq ’-end genes % % no ’-end genes % . %

fig. . schematic overview of seqatu. the blue arrow and orange line denote a gene and an rna-seq read, respectively. the preprocessing stage requires rna-seq data in the fastq format, the reference genome sequence in the fasta format, and gene annotations in the gff format, generating linear constraints for the next convex quadratic programming (cqp) stage. there are two steps in the preprocessing stage: (i) calculating the expression values of the genetic and intergenic regions and (ii) modelling non-uniform read distribution along mrna transcripts; specifically, we acquired a bias rate function using nonlinear regression and then constructed genetic or intergenic region bias rate vectors.
the maximal atu cluster data determined by rseqtu and the linear constraints from preprocessing are both taken as inputs of cqp. cqp seeks the optimum expression combination of all of the to-be-identified atus to minimize the gap between the predicted atu expression profile and the genetic and intergenic region expression profile. finally, the output of cqp is the set of predicted atus.

fig. . results of modelling non-uniform read distribution along mrna transcripts. the four bias rate functions fitted by nonlinear regression had similar coefficients across the four datasets m enrich_ , m enrich_ , rienrich_ and rienrich_ .

fig. . overall evaluation results of seqatu. (a) precision and recall based on perfect matching and relaxed matching for m enrich_seq (left) and rienrich_seq (right) using evaluated atus from smrt-cappable-seq. (b) average precision based on perfect matching for m enrich_seq (left) and rienrich_seq (right) using evaluated atus from smrt-cappable-seq (black) and evaluated atus from smrt-cappable-seq and send-seq (red). the size of each point denotes the number of maximal atu clusters with the same size. (c) average number of atus across different sizes of smrt maximal atu clusters for m enrich_seq (left) and rienrich_seq (right).

fig. . comparative analysis of the performance between seqatu and seqatu without the bias rate constraints for smrt maximal atu clusters. (a) precision, recall and f-score based on perfect matching for m enrich_seq and rienrich_seq. (b) precision, recall and f-score based on relaxed matching for m enrich_seq and rienrich_seq.

fig. . comprehensive analysis of the predicted atus by seqatu. (a) number of atus across different sizes. the size of an atu is the number of its component genes. (b) distribution of the number of atus per gene.
fig. . integrative genomics viewer (igv) representation of the mapping and atus. mapping and atus of m enrich_seq (orange) and rienrich_seq (blue) are shown for the maximal atu cluster containing the biob, biof, bioc, biod and uvrb genes.

fig. . interpretation and results of the functional relatedness of different gene pairs based on go and kegg enrichment analyses. (a) illustration of two different gene pairs i and ii. (b) functional relatedness results based on go enrichment analysis for m enrich_seq (left) and rienrich_seq (right). (c) the proportion of two different gene pairs whose genes are contained in the same kegg pathway for m enrich_seq (left) and rienrich_seq (right). (d) the functional relatedness results based on kegg enrichment analysis for m enrich_seq (left) and rienrich_seq (right).
integrated cross-study datasets of genetic dependencies in cancer

clare pacini, joshua m. dempster, isabella boyle, emanuel gonçalves, hanna najgebauer, emre karakoc, dieudonne van der meer, andrew barthorpe, howard lightfoot, patricia jaaks, james m. mcfarland, mathew j. garnett, aviad tsherniak, francesco iorio*

wellcome sanger institute, wellcome genome campus, hinxton, cambridge, cb sa, uk
open targets, wellcome genome campus, hinxton, cambridge, cb sa, uk
broad institute of mit and harvard, main street, cambridge, ma , usa
european molecular biology laboratory, european bioinformatics institute, wellcome genome campus, cambridge cb sa, uk
human technopole, via cristina belgioioso , milano - italy

* corresponding author: francesco.iorio@sanger.ac.uk

abstract

crispr-cas viability screens are increasingly performed at a genome-wide scale across large panels of cell lines to identify new therapeutic targets for precision cancer therapy. integrating the datasets resulting from these studies is necessary to adequately represent the heterogeneity of human cancers and to assemble a comprehensive map of cancer genetic vulnerabilities. here, we integrated the two largest public independent crispr-cas screens performed to date (at the broad and sanger institutes) by assessing, comparing, and selecting methods for correcting biases due to heterogeneous single guide rna efficiency, gene-independent responses to crispr-cas targeting originating from copy number alterations, and experimental batch effects. our integrated datasets recapitulate findings from the individual datasets, provide greater statistical power to cancer- and subtype-specific analyses, unveil additional biomarkers of gene dependency, and improve the detection of common essential genes.
we provide the largest integrated resources of crispr-cas screens to date and the basis for harmonizing existing and future functional genetics datasets.

biorxiv preprint. the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd international license. this version posted january , .

cancer is a complex disease that can arise from multiple different genetic alterations. the alternative mechanisms by which cancer can evolve result in considerable heterogeneity between patients, with the vast majority of them not benefiting from approved targeted therapies. in order to identify and prioritize new potential therapeutic targets for precision cancer therapy, analyses of cancer vulnerabilities are increasingly performed at a genome-wide scale and across large panels of in vitro cancer models. this has been facilitated by recent advances in genome editing technologies allowing unprecedented precision and scale via crispr-cas screens. of particular note are two large pan-cancer crispr-cas screens that have been independently performed by the broad and sanger institutes. the two institutes have also joined forces with the aim of assembling a joint comprehensive map of all the intracellular genetic dependencies and vulnerabilities of cancer: the cancer dependency map (depmap). the two generated datasets collectively contain data from over , screens of more than cell lines. however, it has been estimated that the analysis of thousands of cancer models will be required to detect cancer dependencies across all cancer types.
consequently, the integration of these two datasets will be key for the depmap and other projects aiming at systematically probing cancer dependencies. these integrated datasets will provide a more comprehensive representation of heterogeneous cancer types and form the basis for the development of effective new therapies with associated biomarkers for patient stratification. further, designing robust standards and computational protocols for the integration of these types of datasets will mean that future releases of data from crispr-cas screens can be integrated and analyzed together, paving the way to even larger cancer dependency resources.

we have previously shown that the pan-cancer crispr-cas datasets independently generated at the broad and sanger institutes are consistent on the domain of commonly screened cell lines. the reproducibility of these crispr screens holds despite extensive differences in the experimental pipelines underlying the two datasets, including distinct crispr-cas sgrna libraries. here we investigate the integrability of the full broad/sanger gene dependency datasets, yielding the most comprehensive cancer dependency resource to date, encompassing dependency profiles of , genes across different cell lines that span tissues and different cancer types. we compare different state-of-the-art data processing methods to account for heterogeneous single-guide rna (sgrna) on-target efficiency, and to correct for gene-independent responses to
crispr-cas targeting, evaluating their performance on common use cases for crispr-cas screens (figure a, b and c).

figure : schematic of the integration strategy. a. broad and sanger gene dependency datasets (raw count data of single-guide rnas) are downloaded from respective web-portals. b. the datasets from each institute are pre-processed with three different methods, accounting for gene-independent responses to crispr-cas targeting (arising from copy number amplifications) and heterogeneous sgrna efficiency, providing gene-level corrected depletion fold changes. then, four different batch-correction pipelines are applied to the gene-level fold changes across the two institute datasets for each of the pre-processing methods. c. twelve different integrated datasets resulting from applying three different pre-processing methods (as indicated by the border colors) and four different batch-correction pipelines (as indicated by the fill colors) are benchmarked. d. advantages provided by the final integrated datasets and conservation of analytical outcomes from the individual ones are investigated.

we show that our integration strategy accounts for and corrects technical biases whilst preserving gene dependency heterogeneity, and recapitulates established associations between molecular features and gene dependencies.
we highlight the benefits of the integrated dataset over the two individual ones in terms of improved coverage of the genomic heterogeneity across different cancer types, identification of new biomarker/dependency associations, and increased reliability of human core-fitness/common-essential genes (figure d). finally, we estimate the minimal size (in terms of the number of screened cell lines) required in order to effectively correct batch effects when integrating a new dataset. collectively, this study presents a robustly benchmarked framework for integrating independently generated crispr-cas datasets, providing the most comprehensive resource for the exploration of cancer dependencies and the identification of new oncology therapeutic targets.

results

overview of the integrated crispr-cas screens

the sanger’s project score crispr-cas dataset (part of the sanger depmap) and the broad’s q depmap dataset contain data for and cell lines, respectively. overall, these represent screens for unique cell lines (figure a, supplementary table ). together these cell lines spanned different tissues (figure b) and for of these the number of cell lines covered increased when considering both datasets together. similarly, the integrated dataset provided richer coverage of specific cancer types and clinically relevant subtypes (figure c). these preliminary observations highlight the first benefit of combining these resources to increase statistical power for tissue-specific as well as pooled pan-cancer analyses.
between the two datasets, there was an overlap of cell lines screened by both institutes, encompassing different tissue types (median = , min for soft tissue, biliary tract and kidney, max for lung, figure a and b). the set of overlapping cell lines enabled the estimation of batch effects due to differences in the experimental protocols underlying the two datasets, without biasing the correction toward specific cell line lineages.

figure . overview of crispr-cas screened cancer cell lines. a. number of cell lines screened by the broad and the sanger institutes and their overlap. b. overview of the number of cell lines screened for each tissue type across the two datasets. c. number of screened lung cancer and breast cancer cell lines split according to cancer types and pam subtypes, respectively, across the two datasets.

data pre-processing

known biases in crispr screens arise due to nonspecific cutting toxicity that increases with copy number amplifications (cnas), and heterogeneous levels of on-target efficiency across sgrnas targeting the same gene. multiple methods exist to correct for these biases.
here, we evaluate three: crisprcleanr, an unsupervised nonparametric cna effect correction method for individual genome-wide screens; ccr-jacks, which combines crisprcleanr with jacks, a bayesian method accounting for differences in guide on-target efficacy through joint analysis of multiple screens; and ceres, a method that simultaneously corrects for cna effects and accounts for differences in guide efficacy, also analyzing screens jointly.

batch effect correction

technical differences in screening protocols, reagents and experimental settings can cause batch effects between datasets. these batch effects can arise from factors that vary within institute screens (for example, differences in control batches and cas activity levels) as well as between institutes (such as differences in assay lengths and employed sgrna libraries). when focusing on the set of cell lines screened at both institutes, a principal component analysis (pca) of the cell line dependency profiles across genes (dpgs) highlighted a clear batch effect determined by the origin of the screen, irrespective of the pre-processing method, consistent with previous results (figure a). we quantile-normalized each cell line dpg and adjusted for differences in screen quality in the individual broad/sanger data sets. the combined broad/sanger dataset was then batch corrected using combat (methods).
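the quantile normalization step mentioned above can be illustrated with a minimal sketch. it assumes each cell line dpg is a list of gene-level scores over the same genes, takes the mean of the sorted values at each rank as the shared reference distribution, and breaks ties by position; the function name `quantile_normalize` and the toy values are hypothetical, and the text does not state which implementation was actually used.

```python
from statistics import mean


def quantile_normalize(profiles):
    """Map every profile onto one shared distribution (a sketch, not the authors' code).

    `profiles` is a list of equally long lists: one list of gene scores per cell line.
    The reference distribution is the mean of the sorted scores at each rank.
    """
    n_genes = len(profiles[0])
    if any(len(p) != n_genes for p in profiles):
        raise ValueError("all profiles must cover the same genes")

    # Reference value for rank r = mean of the r-th smallest score of every profile.
    sorted_profiles = [sorted(p) for p in profiles]
    reference = [mean(column) for column in zip(*sorted_profiles)]

    normalized = []
    for profile in profiles:
        # Gene indices ordered by increasing score (ties keep their original order).
        order = sorted(range(n_genes), key=lambda i: profile[i])
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = reference[rank]
        normalized.append(out)
    return normalized


# Toy example: two cell lines, three genes (values are illustrative only).
toy = [[-2.0, 0.1, -0.5], [-1.0, 0.3, -0.2]]
normalized = quantile_normalize(toy)
```

after this step every profile contains exactly the same set of values, and only the assignment of values to genes differs between cell lines, which is what makes the profiles comparable before batch correction.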
following combat correction, the combined datasets on the overlapping cell lines showed reduced yet persistent residual batch effects clearly visible along the two first principal components (supplementary figure ). analysis of the first two principal components (using msigdb gene signatures and all cell lines, methods), showed enrichment for metabolic processes (phosphorus metabolic process q-value = . e- , protein metabolic process q-value = . e- , hypergeometric test) in the first principal component. the enrichment of metabolic processes is consistent with differences identified across these datasets due to different media conditions employed in the underlying experimental pipelines. the second principal component contained significant enrichments for protein complex organisation and assembly (q-value = . e- and . e- respectively, hypergeometric test) (supplementary table ), which have no obvious associations with technical biases found in crispr-cas screens.

based on these results, we considered four different batch correction pipelines and evaluated their use in our integrative strategy. in the first pipeline, we processed the combined broad/sanger dpg dataset using combat alone (combat). in the second, we applied a second round of quantile normalization following combat correction (combat+qn) to account for different phenotype intensities across experiments, resulting in different ranges of gene dependency effects. in the third and fourth pipelines we also removed the first one or two principal components, respectively (combat+qn+pc and combat+qn+pc - ). the final datasets contained data from unique screens of cell lines using each of the three pre-processing methods and four different batch correction pipelines as outlined in the previous section.
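removing the leading principal components, as in the third and fourth pipelines, can be sketched as follows. this is a minimal standard-library illustration under stated assumptions: rows are screens, columns are genes, components are found by power iteration with deflation (in practice an svd routine from a numerical library would normally be used), and gene means are added back after deflation. the function names and the toy matrix are hypothetical; the text does not describe the implementation.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _leading_direction(rows, n_iter=200):
    """Approximate the first principal axis of centered `rows` by power iteration."""
    n_features = len(rows[0])
    # Deterministic, non-uniform start vector (an arbitrary choice for this sketch).
    v = [float(i + 1) for i in range(n_features)]
    norm = math.sqrt(_dot(v, v))
    v = [x / norm for x in v]
    for _ in range(n_iter):
        scores = [_dot(row, v) for row in rows]  # X v
        w = [sum(s * row[j] for s, row in zip(scores, rows))
             for j in range(n_features)]         # X^T (X v)
        norm = math.sqrt(_dot(w, w))
        if norm == 0.0:  # nothing left to explain along this direction
            return None
        v = [x / norm for x in w]
    return v


def remove_top_components(profiles, n_components):
    """Return `profiles` with the first `n_components` principal components removed."""
    n_rows, n_features = len(profiles), len(profiles[0])
    means = [sum(row[j] for row in profiles) / n_rows for j in range(n_features)]
    residual = [[row[j] - means[j] for j in range(n_features)] for row in profiles]
    for _ in range(n_components):
        v = _leading_direction(residual)
        if v is None:
            break
        deflated = []
        for row in residual:
            score = _dot(row, v)  # coordinate of this screen on the component
            deflated.append([x - score * vj for x, vj in zip(row, v)])
        residual = deflated
    # Add gene means back so scores stay on their original scale (a choice made here).
    return [[x + m for x, m in zip(row, means)] for row in residual]


# Toy matrix whose only variation lies along one direction (illustrative values).
toy = [[1.0, 0.0, 2.0], [3.0, 0.0, 6.0], [5.0, 0.0, 10.0]]
corrected = remove_top_components(toy, 1)
```

in the toy matrix all variation lies along a single axis, so removing one component leaves every screen at the gene means. with real data the removed components would also carry some biological signal, which is why the pipelines with and without this step are benchmarked against each other.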
to assess the performance of different batch correction pipelines we estimated, using the overlapping cell lines, the extent to which each cell line dpg from one study matched that of its counterpart (derived from the same cell line) from the other study following batch correction. to quantify the agreement, we calculated for each dpg its similarity to all other screen dpgs using a weighted pearson’s (wpearson) correlation (methods). we then calculated the proximity of a cell line to its counterpart compared to all other cell lines using the wpearson as a metric (recall of cell line identity) (figure b). the best performances were obtained when removing either the first or the first two principal components following combat and quantile normalization, i.e. combat+qn+pc or combat+qn+pc - . across pre-processing methods, ceres performed best with ( %) of the cell lines being closest to their counterpart from the other study (k = ) followed by crisprcleanr with cell lines ( %) and ccr-jacks with ( %). the recall of cell line identity was high for each integration pipeline with normalized area under the curve (nauc) values of . for ccr-jacks and . for crisprcleanr and ceres when considering the best performing combat+qn+pc - batch correction method.
figure : batch effect assessment and correction. a. principal component plots of the dependency profile across genes (dpgs) for cell lines screened in both broad and sanger studies and pre-processing methods. screens are colored by the institute of origin. b. percentages of cell line dpgs that have the corresponding (same cell line) dpg screened at the other institute among their k most correlated dpgs (the k-neighborhood). results are shown across different pre-processing methods (in different plots) and different batch correction pipelines (as indicated by the different colors). correlations between dpgs are computed using a weighted pearson correlation metric. genes with higher selectivity have a larger weight in the correlation calculation. as a measure of selectivity we used the average (across the two individual datasets) skewness of a gene’s dependency profile across cell lines. the proportion of cell lines closest to their counterpart from the other study (k = ) is shown and the normalised areas under the curves (nauc) are shown in brackets. the x-axis values are restricted to between - to highlight the range over which performance differences are visible between datasets.
performance of the integration pipelines

we evaluated the performance of each of the integrated datasets, containing cell lines, under four use-cases: the identification of i) essential and non-essential genes ii) lineage subtypes iii) biomarkers of selective dependencies and iv) functional relationships.

identification of essential and non-essential genes

a cell line dpg with a large separation of dependency scores (ds) of common essential and non-essential genes should yield lower misclassification rates when identifying dependencies specific to that cell line. for each cell line we measured the separation of ds between known common essential and non-essential genes across all integrated datasets. as a measure of separation we used the null-normalized mean difference (nnmd), defined as the difference between the mean ds of the common essential genes and non-essential genes divided by the standard deviation of the dss of the non-essential genes. by analysing multiple screens jointly, ceres and jacks borrow essentiality signal information across screens. as a consequence, these methods better identify consistent signals across cell line dpgs (i.e. for common essential and non-essential genes), especially for dpgs derived from lower quality experiments, or reporting weaker depletion phenotypes. consistently, ceres (median nnmd range [- . , - . ]) showed better nnmd values than crisprcleanr (median nnmd range [- . , - . ], wilcoxon test (wt) p-value < . e- ) and ccr-jacks (median nnmd range [- . , - . ], wt p-value < . e- ), and similarly ccr-jacks had better nnmd values than crisprcleanr (largest wt p-value < . ) (figure a). comparing the batch correction methods, combat+qn+pc - had marginally better performance across all pre-processing methods.

next, we evaluated the gene dependency false-positive rates across all integrated datasets.
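the nnmd separation measure defined earlier in this subsection can be written out as a short sketch. the gene identifiers and scores are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say whether the sample or population form was used.

```python
from statistics import mean, stdev


def nnmd(scores, essential, nonessential):
    """Null-normalized mean difference for one cell line's dependency profile.

    `scores` maps gene -> dependency score; `essential` and `nonessential`
    are the reference gene sets. More negative values mean better separation.
    """
    ess = [scores[g] for g in essential if g in scores]
    non = [scores[g] for g in nonessential if g in scores]
    if not ess or len(non) < 2:
        raise ValueError("need at least one essential and two non-essential genes")
    # (mean of essentials - mean of non-essentials) / sd of non-essentials
    return (mean(ess) - mean(non)) / stdev(non)


# Hypothetical toy profile: two essential and three non-essential genes.
toy_scores = {"E1": -1.0, "E2": -1.2, "N1": 0.1, "N2": -0.1, "N3": 0.0}
separation = nnmd(toy_scores, ["E1", "E2"], ["N1", "N2", "N3"])
```

because essential genes are depleted, their mean score is lower than that of non-essential genes, so a well-separated profile gives a strongly negative nnmd.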
for each cell line dpg, we defined a set of putative negative controls composed of genes not expressed at the basal level in that cell line (methods). false positives were calculated as the sum of negative controls identified as significant dependencies (in the top % most depleted genes) normalized by their total number across the dpg. there was little difference in false-positive rates across the four different batch correction pipelines, with a slight improvement when two principal components were removed (figure b). ceres outperformed ccr-jacks significantly for all batch correction methods (largest chi-squared contingency table p-value . x - , n= . x ) and ccr-jacks outperformed crisprcleanr (p-value below machine precision). comparing the correction methods, the differences between combat and combat+qn and between combat+qn+pc and combat+qn+pc - were generally not significant across preprocessing methods, while the differences between either combat or combat+qn and either combat+qn+pc or combat+qn+pc - were generally significant (largest p-value . x - ).

as a final test of control separation, we used the unexpressed genes as an empirical null distribution for each dpg to estimate p-values for all ds and thus false discovery rates (fdrs) within each dpg. we calculated the recall of a reference set of common essential genes at % fdr (figure c).
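the empirical-null recall described above can be sketched as follows. several choices here are assumptions that the text does not specify: the p-value uses a +1 pseudo-count, fdrs are obtained with the benjamini-hochberg procedure, and the threshold is left as a parameter (the value 0.05 in the toy call is illustrative only). all gene names and scores are hypothetical.

```python
from bisect import bisect_right


def empirical_pvalues(scores, null_genes):
    """P-value of each score = share of null scores at least as depleted (pseudo-count +1)."""
    null = sorted(scores[g] for g in null_genes if g in scores)
    if not null:
        raise ValueError("no null genes found in the profile")
    n = len(null)
    return {g: (bisect_right(null, s) + 1) / (n + 1) for g, s in scores.items()}


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (an assumed choice of FDR method)."""
    genes = sorted(pvalues, key=pvalues.get)
    m = len(genes)
    qvalues = {}
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        gene = genes[rank - 1]
        running_min = min(running_min, pvalues[gene] * m / rank)
        qvalues[gene] = running_min
    return qvalues


def essential_recall(scores, null_genes, essential_genes, fdr_threshold):
    """Fraction of reference essential genes called at the given FDR threshold."""
    q = benjamini_hochberg(empirical_pvalues(scores, null_genes))
    present = [g for g in essential_genes if g in q]
    if not present:
        raise ValueError("no reference essential genes found in the profile")
    return sum(q[g] <= fdr_threshold for g in present) / len(present)


# Toy profile: 100 unexpressed (null) genes, 50 clearly depleted essentials,
# and 10 reference essentials that look like the null (illustrative values).
scores = {}
null_genes = [f"NULL{i}" for i in range(100)]
for i, gene in enumerate(null_genes):
    scores[gene] = -0.5 + 0.01 * i
depleted = [f"ESS{i}" for i in range(50)]
for i, gene in enumerate(depleted):
    scores[gene] = -2.0 - 0.01 * i
weak = [f"WEAK{i}" for i in range(10)]
for gene in weak:
    scores[gene] = 0.005
recall = essential_recall(scores, null_genes, depleted + weak, fdr_threshold=0.05)
```

the smallest attainable p-value is 1 / (number of null genes + 1), so the resolution of this approach depends on how many unexpressed genes are available in each cell line.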
again ceres outperformed ccr-jacks, which outperformed crisprcleanr, and increasing the number of steps in the batch correction pipeline monotonically improved essential recall for all preprocessing methods. all differences between preprocessing methods and batch correction methods were significant, with the largest observed t-test (related) p-value . x - (n = ).

figure : use case recall of essential genes and lineage identification. a. null-normalized mean difference (nnmd, a measure of separation between dependency scores of prior-known essential and non-essential genes): defined as the difference in means between dependency scores of essential and non-essential genes divided by the standard deviation of dependency scores of the non-essential genes. lower values of nnmd indicate better separation of essential genes and non-essential genes. b. false positive rates across all pre-processing methods and batch-correction pipelines. in the gene dependency profile of a given cell line, a significant dependency gene was called a false positive if that gene was not expressed in that cell line. c. recall of known essential genes across all pre-processing methods and batch-correction pipelines at % fdr. d. agreement between cell line clusters based on dpg correlation and tissue lineage labels of corresponding cell lines, across pre-processing methods and batch-correction pipelines. e. agreement of lung crispr-cas fitness profiles according to the lung cancer subtypes. for each query lung cancer cell line in turn we computed correlation scores to all other lung cancer cell lines (responses). we then ranked the response cell lines according to these correlations. for each query cell line, the rank position k of the most correlated response cell line from the same cancer subtype (matching response) was identified. a rank of k = indicates that the query cell line was closest to another cell line from the same cancer subtype. the curves show the ratio of query cell lines with a matching response within a given rank position. the proportion of query cell lines with a matching response in k = is also shown as a percentage for each dataset. the normalised area under the curve (nauc) for each dataset is shown in brackets. the figure shows the x-axis zoomed in to between and .

identification of lineage subtypes

many dependencies are context specific, reducing cellular fitness in a subset of lineages, and can be used to elucidate gene function and identify cancer type specific vulnerabilities. to evaluate the ability of the integrated datasets to recapitulate tissue lineages and clinical subtypes we first estimated the extent of conserved similarity between screens of cell lines derived from the same tissue lineage. we evaluated the tendency of screens of cell lines from the same lineage to yield similar results by comparing unsupervised clusterings of the batch-corrected cell line dpgs to the lineage labels of the cell lines. to this aim, we performed one hundred k-means clusterings of each of the datasets, with k equal to the number of tissue lineages screened in at least one study.
we then calculated the adjusted mutual information (ami, methods) between each dpg clustering and the partition of the cell lines induced by their lineage labels. we observed higher than chance ami between the obtained k clusters and the tissue lineages of the cell line dpgs, regardless of the starting batch corrected dataset (largest single-sample t-test p-value of . x - , n = , figure d). under each pre-processing method the removal of one or two principal components resulted in an increased ami between cell line dpg clusters and tissue lineages.

we next measured the ability of each of the integrated datasets to separate cell lines according to lineage subtypes. the integrated datasets contain over lung cell lines. these cell lines can further be stratified into subtypes such as small cell lung carcinoma and mesothelioma, whilst clinical subtypes such as pam classifications are available for the breast cancer cell lines (figure c). to quantify the clustering of cell lines by subtype we calculated the correlation between all cell line dpgs and, for a given query cell line, the rank of the most correlated dpg from the same subtype (k-rank). for the lung cancer cell lines, the percentage of cell lines whose closest neighbour was from the same subtype (k = ) was greatest for ceres ( - % across batch correction methods) followed by crisprcleanr ( - %) and ccr-jacks ( - %), with slight improvement with the removal of or principal components (figure e).
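the k-rank computation described above can be sketched as follows, assuming a precomputed correlation matrix between cell line dpgs and one subtype label per cell line. the function names are hypothetical, and the nauc here is taken as the mean of the cumulative curve over the first ranks, which is an assumption because the exact normalisation is not given in the text.

```python
import numpy as np

def matching_ranks(corr, subtypes):
    """Rank of the most correlated other cell line sharing the query's subtype."""
    n = corr.shape[0]
    ranks = np.full(n, np.nan)
    for q in range(n):
        others = np.array([j for j in range(n) if j != q])
        # responses sorted by decreasing correlation with the query
        order = others[np.argsort(-corr[q, others], kind="stable")]
        hits = np.flatnonzero(subtypes[order] == subtypes[q])
        if hits.size:
            ranks[q] = hits[0] + 1
    return ranks

def normalised_auc(ranks, max_rank):
    """Mean height of the cumulative curve over ranks 1..max_rank (assumed definition)."""
    ks = np.arange(1, max_rank + 1)
    curve = np.array([(ranks <= k).mean() for k in ks])
    return curve.mean()

# toy example: two subtypes, four cell lines
subtypes = np.array(["a", "a", "b", "b"])
corr = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
])
ranks = matching_ranks(corr, subtypes)
share_rank_one = (ranks == 1).mean()
nauc = normalised_auc(ranks, max_rank=3)
```

here the two "a" cell lines are each other's nearest neighbours, while the two "b" cell lines only find their match at the third position.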
the normalised area under the curve (nauc) values showed little variation across batch correction methods and were broadly similar between the pre-processing methods: ceres (lung = . , breast = . - . ), ccr-jacks (lung = . - . , breast = . - . ), crisprcleanr (lung = . - . , breast = . - . ) (supplementary figure ).

identification of biomarkers

interesting potential novel therapeutic targets are genes that show a pattern of selective dependency, i.e. exerting a strong reduction of viability upon crispr-cas targeting in a subset of cell lines. furthermore, these selective dependencies are often associated with molecular features that may explain their dependency profiles (biomarkers). we investigated each of the integrated datasets' ability to reveal tissue-specific biomarkers of dependencies. as potential biomarkers we used a set of clinically relevant cancer functional events (cfes), across different tissue types. the cfes encompass mutations in cancer driver genes, amplifications/deletions of chromosomal segments recurrently altered in cancer, hypermethylated gene promoters and microsatellite instability status. for each cfe and tissue type, we performed a student's t-test for each selective gene dependency (sgd, methods) contrasting two groups of cell lines based on the status of the cfe under consideration (present/absent), for a total number of , , biomarker/dependency pairs tested. the total number of significant biomarker/dependency associations showed little variation across batch-correction methods at % fdr. however, a significantly larger number of biomarker/dependency associations were identified when using crisprcleanr compared to ccr-jacks (largest p-value . e- , proportion test) or ceres (largest p-value . e- , proportion test), whilst little significant difference was found between ccr-jacks and ceres (smallest p-value . , proportion test) (figure a, supplementary table ).
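the per-cfe association test described above can be sketched with scipy as below. the matrix layout (genes in rows, cell lines in columns), the function name and the toy values are assumptions for this example; multiple-testing correction across all biomarker/dependency pairs would follow as a separate step.

```python
import numpy as np
from scipy import stats

def biomarker_tests(dep, cfe_present):
    """Student's t-test per gene, contrasting cell lines with and without a CFE.

    dep: genes x cell lines array of dependency scores (hypothetical layout).
    cfe_present: boolean array, one entry per cell line.
    """
    with_cfe = dep[:, cfe_present]
    without_cfe = dep[:, ~cfe_present]
    # equal_var=True gives the classical Student's t-test
    return stats.ttest_ind(with_cfe, without_cfe, axis=1, equal_var=True)

# toy data: gene 0 is more depleted when the CFE is present, gene 1 is not
dep = np.array([
    [-1.0, -1.1, -0.9, 0.0, 0.1, -0.1],
    [0.0, 0.1, -0.1, 0.0, 0.1, -0.1],
])
cfe_present = np.array([True, True, True, False, False, False])
t_stat, p_val = biomarker_tests(dep, cfe_present)
```

a negative t statistic for gene 0 indicates increased dependency in the cell lines carrying the cfe.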
similar results were seen when the cfes were split according to whether the biomarker was a mutation, recurrent copy number alteration or hypermethylated region (supplementary figure ).

we next examined the ability of each dataset to recover known selective dependencies in individual cell lines. we downloaded a set of oncogenic gene alterations from oncokb. after filtering for genes that tend to be common essentials (mean dependency score lower than - . in the crisprcleanr-combat dataset, where - is the median of scores of known common essentials), we considered the oncogenes as positive controls in cell lines where they had indicated oncogenic or likely-oncogenic gain of function alterations, and negative controls in all others. for each oncogene, we measured the nnmd between positive and negative cell lines (figure b). we found little difference in median performance by either preprocessing method or batch correction method. we then collected the dependency scores of all oncogenes in cell lines with a corresponding oncogenic alteration and measured the receiver operating characteristic (roc) auc between them and the dependency scores of the same genes in cell lines without oncogenic alterations (figure c). by this measure, crisprcleanr outperformed ceres by . % and ccr-jacks by . %, with minimal variation across batch correction methods.
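the roc auc comparison above can be computed directly from ranks, as in this minimal sketch. it assumes that more negative scores mean stronger dependency, and the function name and toy scores are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata

def roc_auc_depletion(pos_scores, neg_scores):
    """Probability that a positive-control score is more depleted than a negative one."""
    # negate so that stronger depletion corresponds to a higher value
    values = -np.concatenate([pos_scores, neg_scores])
    ranks = rankdata(values)  # ties receive average ranks
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    # Mann-Whitney U statistic for the positive group
    u = ranks[:n_pos].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

# oncogene scores in altered (positive) and unaltered (negative) cell lines
pos_scores = np.array([-1.0, -0.8, -0.2])
neg_scores = np.array([-0.5, 0.0, 0.1, 0.2])
auc = roc_auc_depletion(pos_scores, neg_scores)
```

of the twelve positive/negative pairs in this toy example, eleven are ordered correctly, giving an auc of 11/12.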
recovery of functional relationships

we tested the ability of each dataset to identify expected dependency relations between paralogs, gene pairs coding for interacting proteins, or members of the same complex using gene pair annotations from publicly available databases (methods). for each pair of genes known to have a functional relationship, we selected a random pair of genes with similar mean dependency scores across cell lines to serve as null examples. we calculated the false discovery rate for the known pairs using the absolute pearson correlation of their dependency profiles versus those of the null examples. recovery of known relationships was unsurprisingly low, since many genes with known functional relationships do not exhibit selective viability phenotypes. combat+qn+pc or pc - recovered the greatest number of expected gene dependency relations at % fdr (figure d).

figure : use case biomarkers and functional relationships. a. for each tissue, pairs of cancer functional events (cfes) and dependencies were tested for significant associations between the gene dependency and the absence/presence of a biomarker (cfe). the bar chart shows the total number of significant associations at % fdr across tissue types for each of the integrated datasets. b. the per-oncogene nnmd between cell lines with and without an indicated oncogenic gain-of-function alteration (more negative is better). c.
for all identified oncogenes collectively, the receiver operating characteristic (roc) auc between oncogene scores in cell lines where they have an indicated gain-of-function mutation and cell lines where they do not. d. for each dataset, the number of known gene-gene relationships recovered at % fdr.

final selection of pre-processing methods and batch-correction pipelines

comparing the performance of batch correction methods across the use-cases we found that combat+qn outperformed combat alone, and that removing one or two principal components gave similar or noticeably better performance compared to combat+qn. the principal component analysis indicated that combat+qn+pc corrected for linear and non-linear effects of technical confounders including assay length, guide library and media conditions. removing the first two principal components offered little improvement over removing the first principal component alone and we found no attributable technical bias in the gene sets enriched in the second principal component. overall, we selected combat+qn+pc as the batch correction pipeline as it had good performance over all metrics and a reduced impact on the data with respect to combat+qn+pc - , whilst still correcting for multiple technical biases.
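the principal component removal step discussed above can be illustrated with a small numpy sketch. the orientation (cell lines in rows, genes in columns), the per-gene centring and the function name are assumptions for this example; the mean imputation of missing entries and their restoration afterwards follow the description given in the methods.

```python
import numpy as np

def remove_top_pcs(x, n_remove=1):
    """Remove the leading principal components from a cell lines x genes matrix."""
    missing = np.isnan(x)
    col_means = np.nanmean(x, axis=0)
    # mean-impute missing entries so the decomposition is defined
    filled = np.where(missing, col_means, x)
    centered = filled - col_means
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    s_kept = s.copy()
    s_kept[:n_remove] = 0.0  # drop the leading components
    cleaned = (u * s_kept) @ vt + col_means
    cleaned[missing] = np.nan  # restore missing entries
    return cleaned

# toy matrix whose only variation lies along a single component
a = np.array([-1.0, 0.0, 1.0])
b = np.array([1.0, 2.0, 3.0])
gene_means = np.array([10.0, 20.0, 30.0])
x = np.outer(a, b) + gene_means
cleaned = remove_top_pcs(x, n_remove=1)
```

because all variation in this toy matrix sits on the first component, removing it leaves only the per-gene means.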
comparing the pre-processing methods we found that ceres outperformed the other methods while identifying essential genes and lineage subtypes, that crisprcleanr showed higher performance in the biomarker association use case, and that these two methods performed comparably and better than ccr-jacks in identifying known gene-gene relationships. as a conclusion, we selected both ceres and crisprcleanr as processing methods and considered the two corresponding integrated datasets as the final results of our pipeline.

advantages of the integrated datasets over the individual ones

in line with the results from all the use-cases, we estimated the benefits of the integrated datasets with respect to the individual ones, in terms of increased capacity to unveil reliable sets of common essential genes (using ceres), as well as increased diversity of genetic dependencies and biomarker associations (using crisprcleanr). to evaluate the increased coverage of molecular diversity and genetic dependencies in the integrated dataset we first estimated the increase in the number of detected gene dependencies with respect to the two individual datasets. to this aim, using the crisprcleanr processed dataset we quantified the number of genes significantly depleted in n cell lines (at % fdr, methods) for a fixed number of cell lines n (with n = , , or n ≥ ) of the integrated dataset, as well as in the individual broad and sanger datasets. the integrated dataset identified more dependencies, indicating greater coverage of molecular features and dependencies than in the individual datasets (supplementary figure a).

we then evaluated the ability of the ceres processed integrated dataset to predict common essential genes and its performance when compared to the individual datasets and two existing sets of common essential genes from recent publications: behan and hart. we predicted common essential genes using two methods: the th-percentile method and the adaptive daisy model (adam). the majority of genes called common essentials according to one of the adam or th-percentile methods were also identified by the other ( , out of , , supplementary figure b). we assigned to each of the , common essential genes a tier based on the amount of supporting evidence of their common essentiality. tier , the highest confidence set, comprised the , genes found by both methods. tier had genes found by only one method (supplementary table ). for each predicted set of common essential genes, we calculated recall rates of known essential gene sets obtained from kegg and reactome pathways. these pathways included ribosomal protein genes, genes involved in dna replication and components of the spliceosome (methods). the integrated set of common essentials (tier and ) showed greater recall of known essential genes compared to behan and hart, and increased recall over the individual datasets for out of the gene sets (figure a).

we next generated a set of genes that were never expressed across the panel of cell lines, to serve as high confidence negative controls (methods). we calculated the proportion of negative controls in each set of common essential genes. the best performance was for the hart gene set ( %) followed by the integrated data set ( . %) (figure b). as the positive and negative controls did not cover all genes we further investigated the genes predicted to be common essentials.
the integrated dataset predicted the largest number of common essentials, with genes found in the integrated data set alone. the genes were enriched for cell cycle genes (fdr . e- ) and mitochondrial gene expression (fdr . e- ), indicative of essential cellular processes. similar results were observed for the , genes in the integrated set of common essentials but neither of the existing datasets (behan and hart) (supplementary table ).

we next asked whether the crisprcleanr processed integrated dataset was able to unveil additional significant gene dependencies and cfe/gene-dependency statistical interactions compared to either one of the broad or sanger (individual) datasets. performing systematic biomarker analysis using cfes on cell lines from individual tissue lineages unveiled additional significant associations in the integrated dataset (when considering only cfe/gene-dependency pairs testable in the individual datasets at % fdr) with respect to those using the sanger dataset alone, and with respect to the broad dataset (supplementary table ). examples included decreased dependency on mdm in tp mutant lung cell lines for the sanger dataset, and increased dependency on stag in stag mutated central nervous system cancer cell lines for the broad dataset (figure c). furthermore, tissue-specific significant associations identified in the integrated dataset were tested but not found significant in either the broad or the sanger dataset (figure d).
figure : advantages of an integrated dataset. a. recall of essential gene sets for the integrated dataset, across different tiers, compared to two previously published gene sets (behan and hart). b. proportion of genes in the common essential gene sets that are constitutively not expressed across the panel of cell lines and therefore likely to be false positive results. c. examples of significant associations between genes and features, found in the integrated dataset compared to the individual dataset. d. examples of significant associations found in the integrated dataset that were not significant in either of the individual datasets. e. the boxplots contain random samples of between % and % of the overlapping cell lines (number of cell lines in each sample indicated on the x-axis). for each sample the pearson correlation of the dpgs following combat correction compared to the integrated dataset was calculated for each pre-processing method. f. the average silhouette width (asw) for each downsampled dataset was calculated using the institute of origin as the cluster label. an asw close to zero indicates near random performance of the clustering, meaning the samples do not cluster by the origin of the screen and batch effects have been removed.

sample size requirements for efficient data integration

to further increase the coverage of a cancer dependency map, new crispr-cas screens should be integrated into the existing datasets as they are generated. to aid in this integration we estimated the minimum number of overlapping cell lines that should be screened to efficiently calculate and correct batch effects. we performed a downsampling analysis on the cell lines screened at both sanger and broad, ranging from % to %, and used the obtained subset of cell lines to estimate and correct batch effects using combat. following this, for each cell line dpg generated at either institute, we computed the pearson correlation between its batch-corrected profile and the profile obtained when batch correction used all overlapping cell lines (figure e). we found a high degree of correlation between datasets at all levels of downsampling, with the minimum of samples still reducing batch effects when compared to no batch correction (n = ) (supplementary figure c).

we next evaluated the batch correction using the average silhouette width (asw) of the clustering induced by the institute of origin of the cell lines as a measure of the extent to which cell lines from the same institute clustered together. as expected, as the number of samples used to estimate and correct the batch effect decreases, the dpgs increasingly cluster by the batch of origin (figure f). the asw and pearson correlation metrics both showed clear convergence with increasing sample size and at the same rate. given the convergence of these metrics, the results showed that the overlapping cell lines used were sufficient to maximise the batch correction using combat. further, the downsampling analysis showed that convergence was reached at cell lines and that between and cell lines would be sufficient to provide a batch corrected dataset that is highly correlated (over . ) with that obtained when estimating and correcting batch effects using more than cell lines.

the overlapping cell lines contained cell lines from different lineages. to investigate the impact of lineage composition of the cell lines on the batch correction we also used a single lineage to estimate the batch effects. in the overlapping cell lines the lung lineage had the most cell lines ( in total). we subsampled the lung cell lines to include , or cell lines (supplementary figure de) and found little difference in performance between using a single lineage and a mixture of lineages, indicating that this is not a major factor for estimating batch effects.

discussion

the integration of data from different high-throughput functional genomics screens is becoming increasingly important in oncology research to adequately represent the diversity of human cancers. integrating crispr-cas screens performed independently and/or using distinct experimental protocols requires correction and benchmarking strategies to account for technical biases, batch effects and differences in data-processing methods. here, we proposed a strategy for the integration of crispr-cas screens and evaluated methods accounting for biases within and between two dependency datasets generated at the broad and sanger institutes. our results show that established batch correction methods can be used to adjust for linear and non-linear study-specific biases. our analyses and assessment yielded two final integrated datasets of cancer dependencies across cell lines.
in contrast to existing databases of crispr-cas screens, our integrated datasets are corrected for batch effects, allowing for their joint analysis. following integration, dependency profiles of cell lines from the same tissue lineage and cancer specific subtypes show good concordance. our integrated datasets cover a greater number of genetic dependencies, and the increased diversity of screened models allows additional associations between biomarkers and dependencies to be identified.

the integrated datasets were the output of two orthogonal pre-processing methods, crisprcleanr and ceres. the use-case analysis showed that ceres (which borrows information across screens) yields a final dataset better able to identify prior known essential and non-essential genes and clustering of cell lines by lineage. in contrast, crisprcleanr (a per sample method) was better able to detect associations between selective dependencies and potential biomarkers, and had better recall of known oncogenic addictions. therefore, results from both processing methods provide the best overall data-driven functional cancer dependency map.

the data integration strategies and sample size guidelines outlined here can be used with future and additional crispr-cas datasets to increase coverage of cancer dependencies. this will be important for oncological functional genomics, for the identification of novel cancer therapeutic targets, and for the definition of a global cancer dependency map.
further, as library design improves, we would expect the coverage and accuracy of the integrated datasets to also improve.

data availability

the final integrated datasets are available for download at https://figshare.com/projects/integrated_crispr/ . the data will also be made accessible through the depmap (https://depmap.org) and score (https://score.depmap.sanger.ac.uk) web portals in early .

code availability

scripts and software packages implementing the integration pipeline described in this manuscript and needed to reproduce results and figures are available on github at https://github.com/depmap-analytics/integratedcrispr with data sources available on figshare: https://figshare.com/projects/integrated_crispr/ .

acknowledgments

this work was partially funded by open targets [project otar ] and by the wellcome trust [grant ]. we thank leo parts for a number of insightful discussions.

author contributions

cp conceived the study, designed, implemented and performed analyses, assembled figures, curated data, wrote the manuscript. jmd conceived the study, designed, implemented and performed analyses, assembled figures, and contributed to manuscript writing. ib contributed to pipeline implementation. eg performed analyses, assembled figures, revised the manuscript. hn assembled figures, revised the manuscript. ek, dvdm, ab, hl, pj contributed to data curation. jmm, mjg, and at revised the manuscript and contributed to study supervision.
fi conceived the study, designed analyses, contributed to figure production, wrote the manuscript, acquired funds and supervised the study.

competing interests

mjg and fi receive funding from open targets, a public-private initiative involving academia and industry. mjg receives funding from astrazeneca and performs consultancy for sanofi. fi performs consultancy for the joint cruk - astrazeneca functional genomics centre. at is a consultant for tango therapeutics and cedilla therapeutics. jmd, jm and at receive funding from the cancer dependency map consortium, but no consortium member was involved in or influenced this study. all the other authors declare no competing interests.

methods

preprocessing data

sanger data processed with crisprcleanr were obtained from the score website (https://score.depmap.sanger.ac.uk/). the crisprcleanr corrected counts were used as input into jacks, for the ccr-jacks processing method. raw counts and copy number profiles downloaded for the sanger dataset were processed with ceres. the broad data processed with ceres (unscaled gene effect) version q scores were downloaded from the broad depmap portal. the raw counts for broad data q were processed with crisprcleanr and the crisprcleanr corrected counts processed with jacks. gene names were matched across the broad and sanger datasets by updating both to the current version of hugo gene symbols from the hgnc website.
missing entries were mean imputed for the principal component removal and then re-assigned as na in the final matrix. cell lines processed by both ceres and crisprcleanr were used for analysis. tissue annotations for each cell line were obtained from the cell model passports (https://cellmodelpassports.sanger.ac.uk/).

batch correction pipelines

the dependency profiles across genes (dpgs) for overlapping cell lines from each institute were first quantile normalized using the preprocesscore package in r. screen quality adjustments were made by fitting a spline to the average gene fold change across cell line dpgs. each dpg was then adjusted to remove the difference between the fitted spline and the diagonal. the overlapping cell lines were then batch corrected using three different methods. a standard least squares model was fitted in r. the combat correction was performed using the sva package in r.

batch correction pipelines’ assessment and weighted pearson correlation metric

cell lines’ rank neighborhoods were based on a weighted pearson correlation (wpearson) metric. the weight for each gene was defined as the absolute mean (over the broad and sanger datasets) of the skewness of that gene’s dependency signal across the overlapping cell lines. using skewness upweights genes with a variable and sufficiently selective fitness profile whilst downweighting those that show weak/no-signal or unselective dependencies. then, for each query dpg, we ranked all the other dpgs in decreasing order of their similarity to the query, according to the wpearson scores. for each position k in the resulting ranking we then defined a k-neighborhood of the query dpg, composed of all the other dpgs whose rank position was ≤ k. finally, we determined the number of cell line dpgs that had the dpg derived from screening the same cell line in the other dataset (a matching dpg) in their k-neighborhood. the final rank for each cell line was defined as the minimum of the ranks obtained when the broad dpg for that cell line was compared to all dpgs and when the sanger dpg for that cell line was compared to all dpgs.

analysis of principal components

the first two principal components (pcs) were extracted from the combat corrected crisprcleanr data using the prcomp function in r. the top genes (according to the absolute value of their pc loadings) were selected for enrichment analysis. the gene lists were used as input into the gsea website (https://www.gsea-msigdb.org/) and were tested against the gene ontology biological processes, hallmark and canonical pathway databases. the top significantly enriched (q-value < . ) gene sets were downloaded from the website.

batch correction extended to cell lines

the combat estimates (pooled mean, variance and empirical bayes adjustments, i.e. mean and standard deviation, for each batch) were computed from the analysis of cell lines common to both initial datasets. the combat correction using these estimates was then applied to all screens, i.e. the union of the two initial datasets. in particular, each individual cell line dpg was shifted and scaled gene-wise using the batch correction vectors output by combat.
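as an illustration of the gene-wise shift and scale step, the following python sketch applies per-gene batch vectors to a single cell line dpg. it is a simplified stand-in, not the sva implementation: the function name apply_batch_vectors and the arguments shift and scale are hypothetical, and the sketch assumes the correction takes the form (value - shift) / scale for each gene, with missing entries (none) left untouched.

```python
def apply_batch_vectors(dpg, shift, scale):
    """Shift and scale one cell line profile gene by gene.

    dpg, shift and scale are equal-length lists ordered by gene.
    Missing scores (None) are returned unchanged.
    """
    if not (len(dpg) == len(shift) == len(scale)):
        raise ValueError("dpg, shift and scale must have one entry per gene")
    corrected = []
    for value, gene_shift, gene_scale in zip(dpg, shift, scale):
        if value is None:
            corrected.append(None)  # keep missing entries missing
        else:
            corrected.append((value - gene_shift) / gene_scale)
    return corrected


# Toy example (made-up numbers): three genes, one missing in this screen.
profile = [-1.0, None, 0.5]
print(apply_batch_vectors(profile, shift=[0.2, 0.0, -0.1], scale=[2.0, 1.0, 0.5]))
```

because the vectors are estimated once on the overlapping cell lines, the same function can be reused for screens that exist in only one dataset. combat as usually described also standardizes each gene before estimating the batch terms; that step is omitted here.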
further adjustments were then applied to all screens, including quantile normalization and the removal of either the first principal component of the joint datasets or the first two. finally, dpgs for overlapping cell lines passing a similarity threshold (detailed below) were averaged.

across the three pre-processing methods, the proportion of cell lines that matched their counterparts exactly after combat correction ranged from % - % (figure b), suggesting that under all pre-processing methods there remained cell lines whose dpgs diverged between studies. for each of the cell lines that matched their counterpart as the first neighbor we considered their distances ( -wpearson) as a measure of the variability in distance profiles between dpgs of the same cell line across institutes. we called divergent those dpgs with a distance greater than the th percentile of distances from matching cell lines. for cell lines with divergent dpgs across all three processing methods we selected the dpg from the screen with the highest quality to be included in the integrated datasets. as a quality metric we used the null-normalized mean difference (nnmd, defined in the main text) and took its consensual value across the three datasets (resulting from applying ceres, ccr-jacks and crisprcleanr).
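the distance-based flagging of divergent dpgs can be sketched in python as follows. this is a minimal sketch under stated assumptions: the distance is taken to be one minus the weighted pearson correlation, the percentile cutoff q is left as a parameter, and the helper names (weighted_pearson, percentile, divergent_lines) are hypothetical rather than taken from the authors' code.

```python
import math


def weighted_pearson(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w.

    Raises ZeroDivisionError if either profile has zero weighted variance.
    """
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(var_x * var_y)


def percentile(values, q):
    """Linear-interpolation percentile (q between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    low, high = math.floor(position), math.ceil(position)
    return ordered[low] + (ordered[high] - ordered[low]) * (position - low)


def divergent_lines(distances, reference_distances, q):
    """Names of cell lines whose distance exceeds the q-th percentile
    of the reference distances (those from well-matched cell lines)."""
    cutoff = percentile(reference_distances, q)
    return sorted(name for name, d in distances.items() if d > cutoff)


# Toy example (made-up numbers): distance between two profiles of one cell line.
broad = [-1.0, -0.2, 0.1, 0.0]
sanger = [-0.8, -0.1, 0.0, 0.1]
weights = [2.0, 0.5, 0.1, 0.1]
print(1 - weighted_pearson(broad, sanger, weights))
```

keeping the cutoff as a parameter makes it easy to check how sensitive the set of divergent cell lines is to the chosen percentile.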
agreement between dependency profile clusterings and cell line tissue labels

we selected genes with the highest variance in the ceres combat integrated dataset and performed repeated k-means clusterings of cell lines using these high variance genes for each pre-processing and batch-correction method. for each clustering, we calculated the adjusted mutual information between the obtained clusters and the cell line tissue labels, as specified in the annotation provided by the sample_info file of the depmap_public_ q dataset, using the adjusted_mutual_info_score function from the python package scikit-learn (https://scikit-learn.org/stable/).

recall of known gene relationships

we assembled a set of functionally related gene pairs using paralogs identified by ensemblcompara, protein-protein interactions identified by li et al, and corum complex comemberships. for a given dataset, for each pair of related genes, we calculated the pearson correlation coefficient between those genes’ dependency scores across cell lines. we then binned each gene that appeared in the list of known gene relationships according to its mean gene score using equally spaced bins. for each pair in the set of related gene pairs, we chose one gene as the query gene and replaced its related partner with another randomly selected gene of similar gene mean, i.e. belonging to the same bin, excluding genes known to be related to the query gene. we calculated pearson’s correlation coefficients between these randomly selected gene pairs to generate a null distribution, from which we calculated empirical p-values and benjamini-hochberg fdrs for known related gene pairs. ensuring that the pairs of genes used in the null distribution have similar distributions of mean gene effect as the pairs of known related genes is necessary because variable screen quality can produce a high but artifactual correlation between any pair of common essential genes, and corum is highly biased towards common essentials.
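the last step of this procedure, turning a null distribution of correlations into empirical p-values and benjamini-hochberg fdrs, can be sketched as follows. this is an illustrative python sketch rather than the authors' code: the function names are hypothetical, the test is assumed to be one-sided (larger correlation is more extreme), and the add-one correction that keeps p-values above zero is an assumption not stated in the text.

```python
def empirical_p_values(observed, null):
    """One-sided empirical p-values: share of null correlations that are at
    least as large as each observed correlation.

    One is added to numerator and denominator (an assumption of this sketch)
    so that no p-value is exactly zero.
    """
    n = len(null)
    return [(1 + sum(1 for v in null if v >= obs)) / (n + 1) for obs in observed]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy example (made-up numbers): correlations for known pairs versus a
# null built from mean-matched random pairs.
known_pair_correlations = [0.62, 0.05, 0.31]
null_correlations = [0.02, -0.10, 0.15, 0.08, 0.40, -0.03, 0.11, 0.01, 0.22]
p_values = empirical_p_values(known_pair_correlations, null_correlations)
print(benjamini_hochberg(p_values))
```

in practice the null would contain many more random pairs than this toy list, since the smallest attainable empirical p-value is limited by the size of the null.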
this is discussed further in the comparisons of batch corrections in dempster et al.

unexpressed false positives

we defined a gene as unexpressed in a cell line if the log (transcripts per million + ) of its depmap expression was less than . . any score of an unexpressed gene in a cell line was called a false positive if it fell in the bottom % of gene scores for that cell line.

identifying selective dependencies

normlrt and the likelihood of the normal distribution were calculated in r using the mass package. for the skew t-distribution, the st.mple function from the sn package was used to calculate the likelihood. if the fitting procedure failed, different degrees of freedom were used iteratively until a solution was found. the degrees of freedom used, in order, were , , , , and .

systematic association test between molecular features and gene dependencies

we performed a systematic two-sample unpaired student’s t-test (with the assumption of equal variance between compared populations) to assess the differential essentiality of each gene across a dichotomy of cell lines defined by the status (present/absent) of each cfe in turn. we tested genes whose normlrt values were greater than in any integrated dataset.
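for a single cfe/gene pair, the equal-variance two-sample t statistic can be computed as in the python sketch below. the function name pooled_t_statistic and the toy scores are hypothetical; the sketch returns only the statistic and its degrees of freedom, since converting them to a p-value requires the t distribution, which is not in the python standard library (scipy.stats.ttest_ind with equal_var=True is one existing option).

```python
import math
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Student's two-sample t statistic assuming equal variances.

    Returns (t, degrees_of_freedom). Each group needs at least two values.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    df = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; the statistic is undefined")
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Toy example (made-up numbers): dependency scores of one gene in cell lines
# with and without a given feature.
with_feature = [-1.2, -0.9, -1.5, -1.1]
without_feature = [-0.2, 0.1, -0.3, 0.0, -0.1]
print(pooled_t_statistic(with_feature, without_feature))
```

a negative statistic here means the gene is more depleted, on average, in cell lines carrying the feature.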
from these tests, we obtained p-values against the null hypothesis that the two compared populations had an equal mean, with the alternative hypothesis indicating an association between the tested cfe/gene-dependency pair. p-values were corrected for multiple hypothesis testing using benjamini–hochberg (method ‘fdr’ using the p.adjust function in r). we also estimated the effect size of each tested association using cohen’s delta (Δfc), i.e. the difference in population means divided by their pooled standard deviation.

evaluating known selective dependencies

a table of all annotated oncogene variants was downloaded from oncokb. the table was filtered first for genes that were (likely) oncogenic and alterations that were (likely) gain-of-function or switch-of-function. for each alteration, the depmap public q mutation and fusion calls were used to identify which cell lines had the alteration. these cell lines were treated as positive controls for the gene in question, with all other cell lines treated as negative controls. only oncogenes with at least one positive cell line were retained. for each integrated dataset, we calculated the roc auc between all positive oncogene-cell line pairs and negative pairs. then, for each oncogene with at least two positive cell lines, we calculated the nnmd between its positive and negative cell lines.

identification of common essential genes via the th percentile method

the th percentile method finds, for each gene, the cell line on the boundary of its th percentile least dependent cell lines. it then calculates the rank of that gene in that cell line, by sorting all the genes based on their dependency score in increasing order. a mixture of two normal distributions is then fitted to the rank positions of all genes. those genes with ranks below the crossover point of these two distributions are labeled as common essentials.

adam method

binary depletion matrices for the integrated datasets were calculated as outlined in the next section and used with the adam method as described in behan et al. the adam method determines the number of cell lines dependent on a gene required to call that gene a common essential. this number is calculated by maximizing the tradeoff between the true positive rate (using a set of known prior essential genes) and the deviance from the null expected rate (calculated using random permutations of the binary depletion matrix). common essential genes were identified for each tissue separately (according to the cell line annotation from the cell model passports) and were then used as input into adam to determine pan-cancer common essential genes.

binary depletion calls

binary depletion calls were computed by considering each cell line dpg as a rank-based classifier of essential/non-essential genes (with gene rank positions determined by their fitness effect, i.e. the average depletion fold-change of the abundance of targeting single guide rnas at the end of the assay with respect to plasmid counts). the fitness effect threshold was then fixed as that corresponding to the largest rank position r guaranteeing a false discovery rate (fdr) < %, when the predicted essential genes are those with a rank position ≤ r. this allowed us to assign to each gene in each cell line, in each of the two datasets, a binary dependency score.
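a rank-based threshold of this kind can be sketched in python as follows. this is a minimal sketch, not the authors' implementation: it assumes the fdr at a rank is estimated as one minus the precision measured on reference sets of known essential and non-essential genes, it accepts a rank when that precision is at least 1 - max_fdr, and the names depletion_threshold, binary_calls and max_fdr are hypothetical.

```python
def depletion_threshold(scores, essential, nonessential, max_fdr):
    """Find the largest rank whose estimated precision is >= 1 - max_fdr.

    scores maps gene -> fitness effect (more negative = more depleted).
    Returns (rank, score_at_rank), or None if no rank qualifies.
    """
    ranked = sorted(scores, key=scores.get)  # most depleted gene first
    n_essential = 0
    n_reference = 0
    best = None
    for position, gene in enumerate(ranked, start=1):
        if gene in essential:
            n_essential += 1
            n_reference += 1
        elif gene in nonessential:
            n_reference += 1
        # Precision is only defined once a reference gene has been seen.
        if n_reference and n_essential / n_reference >= 1 - max_fdr:
            best = (position, scores[gene])
    return best


def binary_calls(scores, rank):
    """1 for genes at or above the threshold rank, 0 otherwise."""
    ranked = sorted(scores, key=scores.get)
    return {gene: int(position <= rank) for position, gene in enumerate(ranked, start=1)}


# Toy example (made-up numbers and gene labels).
toy_scores = {"e1": -3.0, "e2": -2.5, "x": -2.0, "n1": -1.0, "e3": -0.5, "n2": 0.1}
found = depletion_threshold(toy_scores, {"e1", "e2", "e3"}, {"n1", "n2"}, max_fdr=0.25)
if found is not None:
    print(found, binary_calls(toy_scores, found[0]))
```

searching for the largest qualifying rank, rather than stopping at the first failure, matters because precision is not monotone along the ranking: it can dip below the target and recover further down the list.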
to identify significantly depleted genes for a given cell line at a % fdr, we ranked all the genes in the cell line dpg in increasing order based on their depletion log fold-changes. we used the ranked list to calculate the precision curve using sets of prior known essential (e) and non-essential (n) genes, respectively, derived from hart et al. to estimate the rank position corresponding to the % fdr threshold we calculated, for each rank position k, a set of predicted essential genes p(k) = {s ∈ e ∪ n: r(s) ≤ k}, with r(s) indicating the rank position of s, and the corresponding positive predictive value (or precision) ppv(k) as:

ppv(k) = |p(k) ∩ e| / |p(k)|

we then determined the largest rank position k* with ppv(k*) ≥ . (equivalent to a fdr ≤ . ). the % fdr logfc threshold f* was defined as the logfc of the gene s such that r(s) = k*. we called all genes with a logfc < f* significantly depleted at % fdr. binary dependency matrices were defined as gene by cell line matrices with non-null entries corresponding to significant dependency genes at % fdr, for each cell line, i.e. column.

positive controls for common essentials

to generate sets of prior known common essential genes we downloaded gene sets from msigdb (v . ) using the r package qusage. the gene sets used from kegg were kegg_spliceosome, kegg_ribosome, kegg_proteasome, kegg_rna_polymerase and kegg_dna_replication. for the histones gene set we combined two reactome gene sets, reactome_hats_acetylate_histones and reactome_hdacs_deacetylate_histones, as well as the curated histones gene set from .

negative controls for common essentials

we compiled a set of negative controls for the common essential genes as those genes that were not expressed across all cell lines. we defined a gene as unexpressed across the panel of cell lines using the log (transcripts per million + ) of its ccle expression and the th percentile method (the input into the adam package, available at https://github.com/depmap-analytics/adam, performing the th percentile method was - *log (tpm+ ) to ensure correct ranking). a gene defined as constitutively unexpressed was therefore one that was still lowly expressed in its highly ranked ( th percentile) most expressed cell line.

downsampling for batch correction sample sizes

we downsampled times the overlapping cell lines at different levels between % and %. random samples were generated using probabilities of selecting a cell line based on the relative proportions of each cell line lineage in the overlapping data set. using the downsampled set of overlapping cell lines, combat was used to calculate the batch adjustment vectors. the batch adjustment vectors were then applied to all , cell lines.
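one way to draw such lineage-weighted random samples is sketched below in python. it is an interpretation rather than the authors' procedure: each cell line is given a selection weight equal to the share of its lineage among the overlapping cell lines, and sampling without replacement uses the efraimidis-spirakis weighted-key method, a swapped-in technique not named in the text. the function name downsample_by_lineage, the fraction argument and the toy lineage labels are all hypothetical.

```python
import random
from collections import Counter


def downsample_by_lineage(lineage_of, fraction, seed=None):
    """Sample a fraction of cell lines without replacement.

    lineage_of maps cell line -> lineage. A cell line's selection weight is
    the share of its lineage in the full set (an assumption of this sketch).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    counts = Counter(lineage_of.values())
    total = len(lineage_of)
    n_keep = max(1, round(fraction * total))
    keyed = []
    for line in sorted(lineage_of):  # sorted for reproducibility with a seed
        weight = counts[lineage_of[line]] / total
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids zero
        # Efraimidis-Spirakis key: larger weights tend to give larger keys.
        keyed.append((u ** (1.0 / weight), line))
    keyed.sort(reverse=True)
    return sorted(line for _, line in keyed[:n_keep])


# Toy example (made-up cell lines and lineages).
toy = {"a": "lung", "b": "lung", "c": "lung", "d": "skin", "e": "bone", "f": "bone"}
print(downsample_by_lineage(toy, fraction=0.5, seed=1))
```

fixing the seed makes each downsampled set reproducible, which helps when the same subsets need to be compared across batch correction methods.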
the correlation between a cell line’s fold changes batch corrected using the downsampled datasets and those batch corrected using the full set of overlapping cell lines was calculated and compared to the correlation with no batch correction. to evaluate the batch correction we also used the average silhouette width as a measure of clustering. we calculated the average silhouette width for each batch corrected data set (using samples of the overlapping cell lines) using the institute of origin as the cluster label. the average silhouette width is for perfect clustering (or complete separation of cell lines by the institute of origin), with indicating random performance of the clusters.

references

prasad, v. perspective: the precision-oncology illusion. nature , s ( ).
behan, f. m. et al. prioritization of cancer therapeutic targets using crispr-cas screens. nature , – ( ).
tsherniak, a. et al. defining a cancer dependency map. cell , – .e ( ).
mcdonald, e. r., rd et al. project drive: a compendium of cancer dependencies and synthetic lethal relationships uncovered by large-scale, deep rnai screening. cell , – .e ( ).
shalem, o. et al. genome-scale crispr-cas knockout screening in human cells. science , – ( ).
koike-yusa, h., li, y., tan, e.-p., velasco-herrera, m. d. c. & yusa, k. genome-wide recessive genetic screening in mammalian cells with a lentiviral crispr-guide rna library. nat. biotechnol. , – ( ).
wang, t., wei, j. j., sabatini, d. m. & lander, e. s. genetic screens in human cells using the crispr-cas system. science , – ( ).
steinhart, z. et al. genome-wide crispr screens reveal a wnt-fzd signaling circuit as a druggable vulnerability of rnf -mutant pancreatic tumors. nat. med. , – ( ).
shi, j. et al. discovery of cancer drug targets by crispr-cas screening of protein domains. nat. biotechnol. , – ( ).
tzelepis, k. et al. a crispr dropout screen identifies genetic vulnerabilities and therapeutic targets in acute myeloid leukemia. cell rep. , – ( ).
hart, t. et al. high-resolution crispr screens reveal fitness genes and genotype-specific cancer liabilities. cell , – ( ).
meyers, r. m., bryan, j. g., mcfarland, j. m. & weir, b. a. computational correction of copy number effect improves specificity of crispr–cas essentiality screens in cancer cells. nature ( ).
wellcome sanger institute. cancer dependency map. https://depmap.sanger.ac.uk/.
broad institute of harvard and mit. cancer dependency map. https://depmap.org/.
feng, f. y. & gilbert, l. a. lethal clues to cancer-cell vulnerability. nature vol. – ( ).
dempster, j. et al. agreement between two large pan-cancer genome-scale crispr knock-out datasets. nature communications, in press.
iorio, f. et al. unsupervised correction of gene-independent cell responses to crispr-cas targeting. bmc genomics , ( ).
allen, f. et al. jacks: joint analysis of crispr/cas knockout screens. genome res. , – ( ).
project score. https://score.depmap.sanger.ac.uk/.
depmap, b. depmap q public. ( ) doi: . /m .figshare. .v .
project achilles. https://figshare.com/articles/depmap_ q _public/ .
aguirre, a. j. et al. genomic copy number dictates a gene-independent cell response to crispr/cas targeting. cancer discov. , – ( ).
gonçalves, e. et al. structural rearrangements generate cell-specific, gene-independent crispr-cas loss of fitness effects. genome biol. , ( ).
doench, j. g. et al. rational design of highly active sgrnas for crispr-cas -mediated gene inactivation. nat. biotechnol. , – ( ).
leek, j. t., johnson, w. e., parker, h. s., jaffe, a. e. & storey, j. d. the sva package for removing batch effects and other unwanted variation in high-throughput experiments. bioinformatics , – ( ).
liberzon, a. et al. molecular signatures database (msigdb) . . bioinformatics , – ( ).
dempster, j. m. et al. agreement between two large pan-cancer crispr-cas gene dependency data sets. nat. commun. , ( ).
lagziel, s., lee, w. d. & shlomi, t. inferring cancer dependencies on metabolic genes from large-scale genetic screens. bmc biol. , ( ).
dempster, j. m., rossen, j., kazachkova, m. & pan, j. extracting biological insights from the project achilles genome-scale crispr screens in cancer cell lines. biorxiv ( ).
iorio, f. et al. a landscape of pharmacogenomic interactions in cancer. cell , – ( ).
chakravarty, d. et al. oncokb: a precision oncology knowledge base. jco precis oncol , ( ).
oncokb. all annotated variants. oncokb.org http://oncokb.org/api/v /utils/allannotatedvariants ( ).
aken, b. l. et al. ensembl . nucleic acids res. , d –d ( ).
li, t. et al. a scored human protein-protein interaction network to catalyze genomic interpretation. nat. methods , – ( ).
ruepp, a. et al. corum: the comprehensive resource of mammalian protein complexes-- . nucleic acids res. , d – ( ).
hart, t. et al. evaluation and design of genome-wide crispr/spcas knockout screens. g , – ( ).
kanehisa, m. et al. kegg for linking genomes to life and the environment. nucleic acids res. , d – ( ).
fabregat, a. et al. the reactome pathway knowledgebase. nucleic acids res. , d –d ( ).
lenoir, w. f., lim, t. l. & hart, t. pickles: the database of pooled in-vitro crispr knockout library essentiality screens. nucleic acids res. , d –d ( ).
rauscher, b., heigwer, f., breinig, m., winter, j. & boutros, m. genomecrispr - a database for high-throughput crispr/cas screens. nucleic acids research vol. d –d ( ).
gonçalves, e., thomas, m., behan, f. m., picco, g. & pacini, c. minimal genome-wide human crispr-cas library. biorxiv ( ).
elmentaite, r., noell, g., turner, g., iyer, v. & parts, l. minimized double guide rna libraries enable scale-limited crispr/cas screens. biorxiv ( ).
van der meer, d. et al. cell model passports - a hub for clinical, genetic and functional datasets of preclinical cancer models. nucleic acids res. , d –d ( ).
bolstad, b. m. preprocesscore: a collection of pre-processing functions. r package version .
leek, j. t. et al. sva: surrogate variable analysis. r package version . .
depmap, b. depmap q public. ( ) doi: . /m .figshare. .v .
ripley, b. et al. package ‘mass’. cran r , ( ).
identification and design of vinyl sulfone inhibitors against cryptopain- – a cysteine protease from cryptosporidiosis-causing cryptosporidium parvum

arpita banerjee

author contributions: designed the computational experiments: ab; performed the computational experiments: ab; analyzed the data: ab; wrote the paper: ab

correspondence: arpita. @gmail.com

biorxiv preprint doi: https://doi.org/ . / ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license (http://creativecommons.org/licenses/by-nc-nd/ . /).

abstract: cryptosporidiosis, a disease marked by diarrhea in adults and stunted growth in children, is associated with the unicellular protozoan pathogen cryptosporidium, often the species parvum. cryptopain- , a cysteine protease characterized in the genome of cryptosporidium parvum, was earlier shown to be inhibited by a vinyl sulfone compound called k (or k- ). cysteine proteases have long been established as valid drug targets, which can be covalently and selectively inhibited by vinyl sulfones. this computational study was initiated to identify purchasable vinyl sulfone compounds that could possibly inhibit cryptopain- with higher efficacy than k . docking simulations identified a number of such possibly better inhibitors.
the work was then extended to probe the enzymatic pocket of cryptopain- , through in-silico mutations, to derive a map of receptor-ligand interactions in the docked complexes. the idea was to provide clues to aid the design of inhibitors that would bind the protease well by making favorable interactions with important residues of the enzyme. the analyses favored placement of ligands towards the front of the enzymatic cleft, and disfavored interactions deep within. the s ’ and s subsites of the enzyme preferred to be occupied by polar ligand subgroups. reasonably distanced ring systems and polar backbones of ligands were desired across the cleft. large as well as inflexible subgroups were not tolerated; double-ringed systems such as substituted naphthalene, especially in s , were exceptions. the s subsite, which is typically a specificity determinant in papain (c ) family cysteine proteases such as cathepsin l-like cryptopain- , can possibly accommodate polar and hydrophobic ligand subgroups alike.

keywords: vinyl sulfone inhibitors, cryptopain- , cysteine protease, molecular modeling, covalent docking, in-silico mutational analysis, drug design.

running title: identification and design of vinyl sulfone inhibitors against cryptopain-

introduction:

cryptosporidiosis is an intestinal disease that is clinically manifested by diarrhea in adults [ ] and stunted growth in children [ ]. the infection can persist indefinitely in immunocompromised individuals such as hiv patients, and can be fatal in the form of life-threatening diarrhea [ ].
the disease is caused by the unicellular protozoan parasite cryptosporidium, which infects humans and animals [ ] through consumption of contaminated water and/or ingestion of contaminated food products [ ]. the majority of infections are caused by cryptosporidium species hominis and parvum [ ] [ ]. a cysteine protease named cryptopain- , characterized in the genome of cryptosporidium parvum [ ], most likely facilitates host cell invasion and nutritional uptake (through proteolytic degradation) [ ] [ ] [ ]. the pathogenic enzyme, being cathepsin l-like, belongs to the papain-like or clan ca (family c ) cysteine proteases, which in general have been of particular use as therapeutic targets against parasitic infections [ ]. the catalytic triad of such enzymes is constituted by cys, his and asn residues [ ], [ ]. proteases orthologous to cryptopain- have been validated as drug targets, including cruzain (from the chagas’ disease agent trypanosoma cruzi), rhodesain (from sleeping sickness-causing trypanosoma brucei), falcipain- (from the malarial parasite plasmodium falciparum) and smcb (from intestinal schistosomiasis-causing schistosoma mansoni) [ ] [ ]. vinyl sulfone compounds have been particularly effective inhibitors of such parasitic cysteine proteases [ ] [ ] [ ] [ ] [ ]. these inhibitors form a covalent bond with the active site cys thiol to bind the proteases, thereby irreversibly blocking the enzymatic pocket. such inhibition interferes with the pathogenic activity of the proteases, which would otherwise participate in a general acid-base reaction for hydrolysis of host-protein peptide bonds [ ]. molecular modeling studies had previously shown that, unlike in serine proteases (which also cleave peptide bonds and have ser in their active site), the catalytic his in cysteine proteases remains protonated to act as a general acid [ ].
hydrogen bonding between the protonated his and the sulfone oxygen of a vinyl sulfone compound polarizes the vinyl group of the ligand, imparting a positive charge on its beta carbon that promotes nucleophilic attack by the negatively charged cys thiolate of the protease’s active site. the vinyl sulfone class of inhibitors is preferred over other covalent inhibitors because of its selectivity for cysteine proteases over serine proteases, its relative inertness in the absence of the target protease [ ] [ ], and its safe pharmacokinetic profile [ ] [ ]. the peptidyl vinyl sulfones that have been co-crystallized with cysteine proteases so far reveal that the –co-nh- backbones of the pharmacologically active compounds fit snugly in the enzymatic cleft, with the ligand sidechains (or subgroups) protruding into the different subsites of the proteases. the subgroup near the vinyl carbon that undergoes nucleophilic attack is equivalent to p in the inhibitor/substrate [ ]. therefore, ligand sidegroups starting from the vinyl side are designated as p , p … that interact with the s , s … protease subsites. the ligand subgroups beyond the sulfonyl are referred to as p ’, p ’… and they occupy the s ’, s ’… subsites on the prime side of the enzyme (figure ). typically, the p -s interaction is the key specificity determinant in papain (c ) family cysteine proteases [ ] [ ] like cryptopain- .
k (or k- ), a vinyl sulfone that targets cryptopain- according to inhibitor competition experiments with an active-site probe of the recombinant protease, has been demonstrated to arrest cryptosporidium parvum growth in human cell lines at physiologically achievable concentrations [ ]. the cryptopain- structure, however, by itself or in complex with k , has not been solved to date. k -bound co-crystals of orthologous cysteine proteases such as cruzain, rhodesain and smcb [ ] [ ] showed the orientation of the inhibitor depicted in figure . the earlier mentioned study on cryptopain- had simulated the binding of k within the active site of the enzyme homology model [ ], and, mimicking nature, the inhibitor was put in the orientation illustrated in figure .

the present computational study was initiated to explore other (purchasable) vinyl sulfones that could better bind the active site of the cryptopain- enzyme, with possibly higher efficacy than k . the study was extended to probe the enzymatic pocket of cryptopain- to work out the preferential binding of certain ligand chemical groups at the subsites, for the purpose of providing clues for drug design against the pathogenic cysteine protease.

materials and methods:

homology model building of enzyme

the sequence of cryptopain- , with the accession number aba . , belonging to cryptosporidium parvum was retrieved from genbank [ ]. the protein sequence was downloaded in fasta format. the homology model template search for cryptopain- (cathepsin l-like) through ncbi blast against the pdb database [ ] led to f , which is the activated toxoplasma gondii cathepsin l (tgcpl) in complex with its propeptide. the template shared % sequence identity with the sequence to be modeled. the homology model of cryptopain- was built within the full refinement module of icm [ ].
the structure-guided sequence alignment between the template and the model was generated using the default matrix with a gap opening penalty of . and a gap extension penalty of . . loops were sampled for the alignment gaps where the template did not have co-ordinates for the model. the loop refinement parameters were used according to default settings. the acceptance ratio for the simulation process was . . the generated homology model, amino acids in length, was then validated in the procheck [ ] and prosa [ ] webservers.

ligand structures from chemical compound database

k (or k- ) was downloaded from pubchem [ ] in sdf format. the vinyl sulfone substructure of k was then searched in pubchem, with the additional option of ‘ring systems not embedded’ so as to filter out those structures where the vinyl bonds would extend into ring systems. the search, which was not restricted to peptidyl vinyl sulfones, led to , hits (as of april , ). the purchasable compounds among the hits were downloaded in sdf format and checked for redundancy. from the non-redundant vinyl sulfone compounds, cyanide compounds were discarded due to the usual high toxicity profile of such compounds, and the remaining were saved to be used as ligands for docking into cryptopain- .

docking simulation of covalent inhibition of enzyme

the n-terminal propeptide (which is not part of the active enzyme and acts as a self-inhibitory peptide for regulatory purposes) of the cryptopain- homology model was deleted.
the residues were then renumbered in the enzyme model, with position allocated to the beginning of the mature protease. the pdb file of the edited cryptopain- model was then prepared as a receptor in icm with the addition of protons and optimization of his, pro, asn, gln and cys residues. the protonation step was crucial for mimicking the reaction (and hence bonds) between a vinyl sulfone and the cysteine protease. the active site residues of the binding pocket had been derived from the structural alignment of the cryptopain- homology model with the orthologous cruzain bound to k (pdb id: oz ), followed by mapping of the residues around k in cruzain onto the cryptopain- sequence. the pre-determined pocket residues were selected (except the catalytic cys or c ) on the prepared cryptopain- in the gui of icm, and the relevant box size was created on the receptor to define the area for ligand docking. further, c was selected to specify the covalent docking site. from the set of preloaded reactions in icm, the alpha, beta-unsaturated sulfone/sulfonamide/cysteine reaction was selected, which specified the simulation of covalent bond formation between the presumed thiolate (c of the protease) and the beta carbon atom (of the vinyl group of the ligand). the receptor maps were finally made for grid generation. k , downloaded from pubchem in sdf format, was read in as a chemical table in the gui of icm, and was specified for docking into the prepared cryptopain- receptor. a thoroughness of . was set in the docking protocol, and twenty conformations of the ligand in the receptor were generated. following k , covalent docking of a total of non-cyanide vinyl sulfone compounds into the cryptopain- homology model was attempted, using the same protocol as described above.

in-silico mutation of enzyme residues for assessing binding

for the purpose of evaluating the contribution of the individual residues to the binding of the ligands, mutational analysis was undertaken.
the protein-ligand stability was measured by in-silico mutation of the contact residues in the complexes. k -docked cryptopain- and the best-scored complexes (with a score of - or lower) were read in separately, and for each of them the ligand-subgroup contacting residues were selected one at a time in the workspace panel and mutated to alanine. the outputs of the calculations were displayed in several columns. the dgwt column held the dg (gibbs free energy) value for the wild-type complex (without mutation), the dgmut column held the dg value for the mutated complex (where the residue was mutated to ala), and the ddgbind column (dgmut - dgwt) showed the binding free energy change (in kcal/mol) upon mutation. ddgbind essentially predicted the stability of the native complex, thereby hinting at the contribution of the residue in question towards binding the ligand. positive values of ddgbind implied the mutation to be less favorable, indicating a greater contribution of the wild-type residue towards binding. hence, with more positive ddgbind, better binding of the ligand by the residue could be expected. negative values, on the other hand, implied the mutated form to be more stable, thereby indicating the native residue’s involvement in unfavorable interactions with the ligand. the residues that were found to make a high number of favorable ligand interactions in thirty-two of the complexes (k -cryptopain- plus the thirty-one best-scored ones) were subjected to a fresh round of mutations in the updated version of the icm software.
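the ddgbind bookkeeping described above can be illustrated with a minimal sketch. it is not the icm implementation: the class and function names, the residue labels and the energies are all hypothetical, and only the definition ddgbind = dgmut - dgwt and its sign-based reading are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlanineScanResult:
    """One in-silico alanine mutation of a ligand-contacting residue."""
    residue: str    # hypothetical residue label, not a real cryptopain residue
    dg_wt: float    # dG of the wild-type complex (kcal/mol)
    dg_mut: float   # dG of the complex with the residue mutated to Ala

    @property
    def ddg_bind(self) -> float:
        # ddGbind = dGmut - dGwt, as defined in the text
        return self.dg_mut - self.dg_wt


def interpret(result: AlanineScanResult) -> str:
    """Sign-based reading of ddGbind (assumed cut at zero)."""
    if result.ddg_bind > 0:
        return "favorable"      # wild-type residue contributes to binding
    if result.ddg_bind < 0:
        return "unfavorable"    # mutated complex is the more stable one
    return "neutral"            # no change upon mutation


# Made-up energies, for illustration only.
scans = [
    AlanineScanResult("RES_A", dg_wt=-10.0, dg_mut=-8.5),
    AlanineScanResult("RES_B", dg_wt=-10.0, dg_mut=-11.0),
]
for scan in scans:
    print(scan.residue, scan.ddg_bind, interpret(scan))
```

keeping the two dg values and deriving ddgbind from them, rather than storing ddgbind directly, makes it easy to recompute the difference when the mutations are rerun.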
the recalculated ddgbind values were then tallied with the placement and orientation of ligand subgroups around the residues to decipher the preference of chemical groups across the enzymatic cleft of cryptopain- .

[the gui of icm was used to make the enzyme/complex structure figures. illustration and compilation of figures were done in inkscape, an open-source vector graphics editor.]

results and discussion:

validation of theoretical enzyme structure

the ramachandran plot for the cryptopain- homology model showed % of the residues to lie in the allowed region, and the remaining % to be within the generously allowed region of the plot (supplementary figure a). the prosa z-score for the cryptopain- model was - . , better than the - . z-score of its crystal structure template (supplementary figure b).

screening of docked compounds

besides k , a total of purchasable, non-redundant and non-cyanide vinyl sulfone compounds were docked and scored in the cryptopain- homology model ( symmetric molecules could not be docked using icm). after docking, the conformation of k in which the ligand p ’ group (beyond the sulfonyl) was oriented across the enzyme s ’ and its p ..p groups (beyond the vinyl) were placed across the s ..s subsites (as in figure ), and which had the lowest score in that category, was chosen as the reference for the analysis. such an orientation appeared first in the eighteenth pose (conformation) of k docked into cryptopain- , with a score of - . . the conformations of other docked vinyl sulfone compounds that had a similar orientation, with the ligand subgroups beyond the sulfonyl placed across s ’ or beyond, and with lowest scores <= - . (and hence possibly better binders than k ), were included in the study for further detailed analysis.
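the pose selection described above can be sketched as follows. this is an illustration under stated assumptions, not the procedure run in icm: the `reference_like` flag is a hypothetical stand-in for the visual check of pose orientation, and the compound labels and scores are made up.

```python
from typing import Dict, List, NamedTuple, Optional


class Pose(NamedTuple):
    compound: str         # hypothetical compound label
    rank: int             # position of the pose in the docking output
    score: float          # docking score (lower is better)
    reference_like: bool  # True if the pose has the reference orientation


def best_reference_like_pose(poses: List[Pose]) -> Optional[Pose]:
    """Lowest-scoring pose among those with the reference orientation."""
    candidates = [p for p in poses if p.reference_like]
    return min(candidates, key=lambda p: p.score) if candidates else None


def screen(poses_by_compound: Dict[str, List[Pose]],
           reference_score: float) -> Dict[str, Pose]:
    """Keep compounds whose best reference-like pose scores <= the reference."""
    selected = {}
    for compound, poses in poses_by_compound.items():
        best = best_reference_like_pose(poses)
        if best is not None and best.score <= reference_score:
            selected[compound] = best
    return selected


# Made-up poses and scores, for illustration only.
reference = best_reference_like_pose([
    Pose("REF", 1, -30.0, False),   # better score, wrong orientation
    Pose("REF", 2, -20.0, True),
    Pose("REF", 3, -18.0, True),
])
library = {
    "CMPD_1": [Pose("CMPD_1", 1, -25.0, True), Pose("CMPD_1", 2, -28.0, False)],
    "CMPD_2": [Pose("CMPD_2", 1, -15.0, True)],
    "CMPD_3": [Pose("CMPD_3", 1, -40.0, False)],
}
hits = screen(library, reference.score)
print(sorted(hits))
```

filtering on orientation before score mirrors the text: a pose with a better score but a different orientation is not comparable to the reference and is ignored.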
[the chemical structures of k and the thirty-one best-scored vinyl sulfones are provided in supplementary figure , as pubchem ids associated with (some) chemical compounds change due to frequent updates to the database. the ids mentioned throughout the text, tables and figures are from the current pubchem records as of may , .]

ligand binding to preferential enzyme residues

the residues within Å of the ligand subgroups were noted for each complex. k -docked cryptopain- was taken as the reference, as k had been shown experimentally (on the bench) to bind cryptopain- . the protease subsite residues were thus primarily derived from this complex. figure shows the chosen conformation of k docked into cryptopain- with the derived subsites colored differently. for the other best-scored complexes, the additional contact residues that showed up were assigned to subsites according to their proximity to the already derived subsite residues in the three-dimensional structure of cryptopain- . figure shows all the residues that were contacted by ligand subgroups across the enzymatic cleft, in one or more of the complexes. panels a, b, c and d of figure show the selected conformations of the other vinyl sulfones in cryptopain- , amidst the subsites derived from the reference complex. the ligand subgroup-contacting residues in each complex were mutated to alanine, one at a time, to identify the favorable interactions based on the ddgbind values. the interactions that showed ddgbind values worse than - (less than - ) were not taken into account.
the residues that corresponded to the remaining ddgbind values (greater than - ) were considered to contribute to favorable interactions with the ligand. supplementary table lists the ddgbind interactions in terms of residue versus ligand (represented by pubchem ids). the columns hold all the residues that were favorably contacted in one or more of the complexes, and the rows hold the compounds whose subgroups showed favorable interactions with the corresponding column residues. table lists the scores, contact residues, h-bonding residues and the favorably interacting subsite residues (derived from supplementary table ) in the complexes. the tables also feature the additional subsite residues that showed up in the other best-scored complexes, which included ligands that, unlike k , were not typical peptidyl vinyl sulfones. thirteen of the favorably interacting cryptopain- residues emerged as heavily contacted by ligand subgroups in the complexes (see supplementary table ). the number of times each of the residues was shown to make favorable interactions ranged from to . with a threshold of , q , k , g , c , w , g , t , a , v , n , h , g , and w turned out to be the most frequently contacted of the favorably interacting residues. the derived residues were then subjected to ddgbind recalculations (barring a ). the results from the calculations were studied with respect to the orientation and positioning of the ligand subgroups near the mentioned residues in the complexes. the ddgbind values for the interaction of the frequently contacted residues with the ligands are listed in table . the purpose was to deduce the contributing factors for binding and to shed light on the enzymatic-pocket preference for accommodating certain ligand groups, which could ultimately be useful for designing a potent vinyl sulfone inhibitor (better than k ) to target cryptopain- .
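the tally of frequently contacted residues can be sketched as a simple count over a residue-versus-ligand table. the function name, the ligand and residue labels and the threshold value below are hypothetical, and treating the threshold as inclusive is an assumption, since the text does not state it.

```python
from collections import Counter
from typing import Dict, Iterable


def frequently_contacted(favorable_by_ligand: Dict[str, Iterable[str]],
                         min_count: int) -> Dict[str, int]:
    """Count, per residue, the ligands it contacts favorably; keep the frequent ones."""
    counts = Counter(
        residue
        for residues in favorable_by_ligand.values()
        for residue in set(residues)   # count each residue once per ligand
    )
    # Assumption: the threshold is inclusive (count >= min_count).
    return {res: n for res, n in counts.items() if n >= min_count}


# Made-up table: ligand -> residues with favorable ddGbind in that complex.
favorable_by_ligand = {
    "LIG_1": ["RES_A", "RES_B", "RES_C"],
    "LIG_2": ["RES_A", "RES_C"],
    "LIG_3": ["RES_A", "RES_D"],
}
frequent = frequently_contacted(favorable_by_ligand, min_count=2)
print(frequent)
```

counting each residue once per ligand matches a table with one row per compound and one column per residue.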
interactions: enzyme subsite residues - ligand subgroups

unlike k , which occupied the central part of the pocket and was spread equally amongst all the subsites (figure ), the best-scored vinyl sulfones more often occupied the upper part of the cleft and tended to position themselves on the right, making contacts mostly with s ’ and s . ligands that lacked p ’, p ’ etc. were sometimes exceptions and were placed at the lower end of the cleft, heavily contacting s . the positioning of the ligand-contacting residues in the three-dimensional structure of the enzyme can be seen in figure , and the placement of the other vinyl sulfone ligands therein is visible in figure . the accommodation of various ligand subgroups of the best-scored vinyl sulfones across the enzymatic cleft is described as follows.

s ’ enzyme subsite

the s ’ subsite residues f and w , in the uppermost part of the pocket, were not amongst the frequently contacted residues, and hence they were excluded from detailed analysis.

s ’ enzyme subsite

the derived s ’ residues n , h and w were frequently contacted by the other vinyl sulfones, along with an additional g (placed between n and h ). q and k also featured as additional contacts which, though positioned on the opposite side in the structure, made interactions with p ’ of the ligands. thus these residues were categorized as part of s ’. the upper part of the heavily occupied enzymatic pocket region is constituted by the s ’ residues w on the right, and q and k on the left.
w , which made most of the hydrophobic interactions with the ligand ring systems on the right side of the pocket, showed highly positive ddgbind values for the thiophene group in particular. the residue seemed to prefer pi stacking with ligand ring systems, as it showed favorable ddgbind values for in-plane ring interactions. the ligands with an ethenyl group, as well as the ones that did not place any subgroups near the residue, showed moderately favorable interactions. the ligands whose rings were out of plane with the residue’s six-membered ring, and the ones that had groups like bromopyridine near the residue, showed unfavorable interactions. for q , which is situated at the back of the cleft wall, the compounds’ covalent moiety with its sulfonyl group and/or benzyl/phenyl ring(s), when placed near the lower end of the residue, resulted in favorable interactions. large halide-containing subgroups such as bromopyridine resulted in unfavorable interactions. k , positioned at the front of the cleft, showed favorable interactions with reasonably distanced polar substituents. interactions were favorable even when no substituent was close to the residue. understandably, unfavorable interactions were observed when the non-polar moiety of the residue’s sidechain was near polar ligand atoms, and interactions of the non-polar ethenyl group of the ligand with the polar end of the residue also led to highly negative ddgbind values.

the mid-region of the highly occupied cleft is constituted by n , h and g (s ’ residues) on the right.
these frequently contacted residues were actually within the contact range of both p ’ and p of k . however, the proximity of the ligand’s p ’ to the sidechains of n and h in the reference complex led to the residues’ allocation to s ’, which therefore extends into the middle of the cleft. n showed favorable interactions with halide-containing substituents, including bromopyridine, which otherwise had unfavorable interactions with the other residues. the ligands that had their benzyl/phenyl rings at a comfortable distance from the residue showed favorable interactions. closely spaced ligand ring systems led to clashes. h , which is situated at the back (compared to n ) of the enzyme’s mid-pocket, favored interactions with the ligands’ sulfonyl or backbone. the residue often, if not always, showed favorable interactions even when no ligand group was placed near it. favorable ring interactions were observed when the ligands’ ring systems were mostly tilted towards w . unfavorable ddgbind values were observed for inflexible ethenyl groups in ligands. g , which is buried in the mid-pocket, made interactions primarily with the covalent-bond forming moieties of the ligands. the residue showed favorable interactions with reasonably distanced ring systems. interactions were unfavorable for closely spaced rings and inflexible groups such as ethenyl. overall, the arrangement of the mentioned residues suggests that substituted benzene/naphthalene ring systems could be accommodated in the upper region of the subsite, where the ligand rings can engage in hydrophobic interaction with w , and the polar substituents on those rings could interact with q and k to the left of the pocket. however, large (polar) halide-substituted rings such as bromopyridine could lead to clashes. the s ’ in the mid-pocket shows a preference for reasonably distanced ring systems and halide-substituted ligand subgroups.
the subsite is not likely to tolerate inflexible groups such as diazospiro and ethenyl.

s enzyme subsite

the frequently contacted (derived) s residues g and c were positioned on the left side of the mid-pocket. w , which emerged as an additional frequent contact, was placed close to g and c on the left, and formed part of s . g favored interactions with double ring systems such as substituted naphthalene or two separate benzyl/phenyl rings placed near the residue. it also showed favorable interactions with groups like sulfonyl and/or polar backbone atoms. ring as well as polar interactions showed the most favorable ddgbind values. the interactions became unfavorable when no ligand group was in the vicinity of the residue. bromopyridine showed unfavorable interactions with this residue too. c , the catalytic triad residue that formed the covalent bond with the vinyl sulfones, preferred the ligands to be placed away from it and towards the front of the cleft. the favorably interacting compounds were positioned to the right and at the bottom of the residue. the compounds that were tilted towards the inside of the cleft showed moderately unfavorable interactions, and so did the ones that did not place any ring system near the residue. unfavorable interactions for the residue were observed with the close proximity of the ligands’ polar substituents or backbone. again, bromopyridine made unfavorable interactions with this residue as well.
unlike other residues, c had far fewer borderline interactions, and the individual ddgbind values mostly fell clearly on either the favorable or the unfavorable side. w made favorable interactions with the ring systems of the ligands that were placed away from it, towards the right side of the pocket. the interactions were better with a greater number of rings. the highest ddgbind value was obtained for the compound that had four ring systems. however, close interactions with either the ligand backbone or a side chain resulted in unfavorable interactions. inflexible groups such as diazospiro, even if placed away from the residue, gave negative ddgbind values. taken together, inflexible groups such as diazospiro and ethenyl would not be tolerated by s . the subsite can accommodate multiple ring systems. the mid-pocket would have a preference for polar backbones of ligands that are positioned towards the front. the catalytic c of s too dictates that the compounds be placed not too deep inside the cleft. large halide-containing subgroups such as bromopyridine will not be favored in the subsite. the site shows a propensity towards closely packed ring interactions.

s enzyme subsite

the lowest part of the heavily occupied pocket comprises the frequently contacted (derived) s subsite residues g , t , a and v . the s residues are distributed on both sides of the cleft: g and t are on the left, and a and v are on the right. g , placed above t , engaged mostly in h-bond interactions with the backbone of the ligands, rather than favorably accommodating their side chains. the residue showed favorable ddgbind values for ring systems of ligands spaced slightly away from it. the most unfavorable interactions were shown for the compound containing bromopyridine. for t , the highest positive ddgbind value was observed for a halide-substituted ligand subgroup (a fluoro-triazinyl group) with its polar ring and polar backbone near the residue.
t preferred reasonably distanced ring interactions (polar and non-polar). however, with no ligand group placed near the residue, the interactions were unfavorable. with large subgroups like bromopyridine, the interactions were again unfavorable. a had to be excluded from the mutational analysis, as the ddgbind value for an ala to ala mutation is zero and could not have provided any useful clue to the type of interactions. v , despite being mostly hydrophobic, showed favorable interactions with comfortably distanced polar subgroups of ligands, including the fluoro-triazinyl group-containing compound that showed the best ddgbind value. such polar groups were presumably stabilized by the long-range electrostatic effect of other s residues (see tables). summing up, s can certainly accommodate polar subgroups/backbones of ligands. the subsite, however, like the other subsites, does not favorably accommodate large polar subgroups like bromopyridine.

orientation and placement of ligands across the enzymatic cleft

the best-scored vinyl sulfones tended to occupy the s ’, s ’, s and s subsites. unlike k , the other compounds showed optimal interactions mostly with the prime site residues of the enzyme. the s ’ residues made half of the frequently contacted favorable interactions with the ligands. the remaining half of such interactions were accounted for by s and s members. with respect to the entire enzymatic cleft of cryptopain- , it can be deduced that placement of the ligands towards the front of the cleft would be preferred to deep-seated interactions.
polar backbones of ligands (even if not peptidyl) would be desired. s ’ and s like to be occupied, and are prone to make favorable interactions with polar subgroups of ligands. large halide-containing subgroups are not well tolerated, presumably because of their size. reasonably distanced ring interactions would be preferred all across the cleft. unlike inflexible groups like substituted naphthalene, which could be favorably accommodated in s , the strain arising from the inflexibility of ethenyl and/or diazospiro groups is not likely to be tolerated, especially in the s ’ and s subsites, as per the computational mutational analysis. quite relevantly, the compound that showed the maximum number of favorable interactions with the frequently contacted residues, (see table ), had all the preferred attributes and lacked the undesirable ones. the ligand-bound protease showed a very good score of - . . some other compounds that showed slightly better scores than were (score: - . ), (score: - . ), and (score: - . ). and were placed deep inside the cleft, which led to clashes with the covalent bond-forming c . the ligands’ polar backbones, in addition to the occupation of the enzymatic s ’ site with polar subgroups, somewhat mitigated the unfavorable interactions overall. the compounds also had the undesirable ethenyl near s ’, which contributed to unfavorable interactions with k in case of (where the ethenyl was placed much closer to the residue). however, the overall scoring algorithm did not penalize ethenyl’s presence as much as the individual ddgbind calculations did. , which showed the best score, too had an ethenyl group (albeit not close to k ). this compound, however, was placed towards the front of the cleft, thereby avoiding unfavorable interactions with c . also, the ligand had ring systems in abundance (six) for favorable interactions. rings comprised its (polar) backbone as well as subgroups.
the ligand desirably occupied the s ’ and s subsites, though not with many polar subgroups.

conclusion: the efficacy of the thirty-one best-scored compounds as drug candidates within physiological limits remains to be tested at the bench. the information garnered through this study on the substrate/ligand-binding cleft of the enzyme and its interaction with the chemical groups of the docked compounds could ultimately guide the design of potent vinyl sulfone inhibitors. and that shared most of the preferred ligand-subgroup attributes can serve as model compounds, based on which effective inhibitors against cryptopain- could be designed. figure provides the chemical structures of the reference (k ) and the model compounds. unlike the other two mentioned compounds ( and ), the subgroups of the model ligands extended into s , typically the key specificity determinant in cathepsin l-like cysteine proteases such as cryptopain- . placed a polar subgroup at s in contrast to the hydrophobic subgroup put by . polar ligand subgroups (as in ) at the enzyme’s s are likely to be stabilized via polar/electrostatic interactions by residues like t , m , t , k and e . hydrophobic subgroups too (as in ) could be accommodated by virtue of s residues like a and v . thus, the study attempted to identify purchasable vinyl sulfone compounds that can possibly inhibit cryptopain- , and it provided crucial information on receptor-ligand interactions to help the future design of other vinyl sulfones, which could prove effective in curbing cryptosporidiosis.
acknowledgement: the author would like to thank prof. ruben abagyan of university of california san diego, for providing computational resources.
[table : columns are ligands (pubchem ids), score, contact residues, h-bond residues, and favorable residues per subsite (two prime-side and three non-prime-side columns). the row entries (ligand ids, scores and residue numbers) did not survive text extraction and are not reproduced here.]

table : the contact residues around k in cryptopain- are color-coded as per subsites. the residues around the p ’ sidegroup of k (s ’ subsite) are in orange. the s site is in green, s in pink and s in red. the residues that made favorable contacts with k are shown in bold in the subsequent columns. the residues around the ligand subgroups of the best-scored vinyl sulfone compounds (pubchem ids in ligands column) are listed. the favorable interactions (including additional contact residues, which do not appear for k ) are shown in bold and colored as per subsites. the additional s ’ subsite is shown in mauve. the scores and the h-bonding residues for the individual complexes are also listed.

[table : matrix of ddgbind values with one column per residue (q, k, g, c, w, g, t, v, n, h, g, w) and one row per ligand, starting with k . the numerical values did not survive text extraction and are not reproduced here.]

table : the ddgbind values for the interaction of k and the best-scored ligands with the important residues of cryptopain- are tabulated. the residues that had shown a high number of favorable interactions (supplementary table ) were taken into consideration for the second round of calculations to chart this table. the values for the most favorable interactions are shown in purple, moderately favorable interactions in brown, slightly unfavorable in aquamarine and unfavorable in blue. the scale for demarcation varies for each residue, depending on the range and type of its interactions.

figure : illustration of the typical binding of vinyl sulfone inhibitors to cysteine protease enzymes. colored spheres represent the different subsites of the enzyme, and the ligand sidechain/subgroups of the vinyl sulfone inhibitor are in violet rectangles. spatial distribution of the subsites in three-dimensional protease structures differs from the linear arrangement that has been shown here for simplicity. the backbones of the enzyme and inhibitor are not shown.
the site of covalent bond formation at c has been marked in red. the positioning/denotation of the ligand subgroups within the different subsites of the enzyme is according to their placement near the vinyl warhead, depicting what has been observed so far in the solved structures of peptidyl vinyl sulfone-bound cysteine proteases. the ligand sidegroup nearest the beta carbon of vinyl is p that fits into s . the following ligand subgroups are p , p etc. the groups beyond the sulfonyl are p ’, p ’ etc. which interact with the prime side subsites of the enzyme.

figure : k or k- (pubchem id: ) docked into the three-dimensional (homology) model of cryptopain- . the selected conformation (score: - . ) shown here conforms to the arrangement of the ligand subgroups (p ’, p , p , p ) in the different enzyme subsites as depicted in figure , and so does the color code that demarcates the subsites.

figure : all the residues that are contacted by one or more ligands in the docked complexes of k and the best-scored (score <= - . ) vinyl sulfones are labeled and shown in spacefill representation (colored as per hydrophobicity) in the three dimensional structure (homology model) of cryptopain- .
the enzymatic triad residue c , the site of covalent attachment, is in yellow.

figure : panels a, b, c, d show the orientation and placement of the best-scored (score <= - . ) compounds docked into the cryptopain- theoretical structure. the ligands are shown with respect to the enzyme subsites that have been derived from the k -cryptopain- reference complex.
figure : the chemical structures (along with the pubchem identifiers) of the reference ligand k or k- , and the two model compounds, which showed optimum interactions with the enzymatic cleft of cryptopain- and thereby could aid the design of effective inhibitors to target the protease.

hla-spread: a comprehensive resource for hla associated diseases, drug reactions and snps across populations

dhwani dholakia , *#, ankit kalra #, uma kanga , mitali mukerji , *

institute of genomics and integrative biology-council of scientific and industrial research, new delhi- , india.
academy of scientific and innovative research, ghaziabad- , india.
netaji subhas university of technology, new delhi- , india.
all india institute of medical sciences, new delhi- , india.

* correspondence: mitali mukerji; email: mitali@igib.res.in dhwani dholakia; email: dhwani.dholakia@igib.in
# equal contribution

keywords: hla associations, natural language processing, adverse drug reactions, hla biomarker, transplantation, hla alleles

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission.

abstract

extreme complexity in the hla system and its nomenclature makes it difficult to interpret and integrate relevant information on hla associations with diseases, adverse drug reactions (adr) and transplantation. a pubmed search returns ~ , studies on human leukocyte antigens (hla) reported from diverse locations and multiple populations, and the ipd-imgt/hla database houses data on , hla alleles to date. we developed an automated pipeline with a unified graphical user interface, hla-spread, that provides structured information on snps, populations, resources, adrs and diseases. information on hla was extracted from ~ million pubmed abstracts using natural language processing (nlp). python scripts were used to mine and curate information on diseases, filter false positives and categorize diseases into tree-hierarchical groups, and named entity recognition (ner) algorithms and semantic analysis were used to infer hla association(s). this resource from countries and ethnic groups provides interesting insights on: markers associated with allelic/haplotypic association in autoimmune, cancer, viral and skin diseases, transplantation outcome and adrs for hypersensitivity. summary information on clinically relevant biomarkers related to hla disease associations with mapped susceptible/risk alleles is readily retrievable from hla-spread. this resource is the first of its kind and can help uncover novel patterns in hla gene-disease associations.
introduction

the human leukocyte antigen (hla) locus consists of six classical genes (hla-a, -b, -c, -dp, -dq and -dr) that play an important role in eliciting immune response against pathogens ( ) and three non-classical genes (hla-e, -f and -g) that interact with natural killer cells to regulate virus-infected and malignant cells ( ). hla genes harbour a large number of mutations. as of september , there are , hla alleles reported in the ipd-imgt/hla database. these variations mostly arise to generate defensive mechanisms against pathogens. however, some variations also confer risk of autoimmune diseases such as rheumatoid arthritis, multiple sclerosis, type diabetes and graves’ disease. more than different autoimmune diseases, infectious diseases and adverse drug reactions have been reported to be associated with hla genes ( – ). these alleles have clinical utility as diagnostic markers, for example in rheumatoid arthritis and ankylosing spondylitis ( – ). they are also used in genetic screening, e.g. hla-b* : in the caucasian population for abacavir hypersensitivity, and hla-b* : in chinese and asians for carbamazepine-induced life-threatening conditions like stevens-johnson syndrome (sjs) and toxic epidermal necrolysis (ten), and also for sjs due to carbamazepine and other drug combinations ( , ). in the context of transplantation, mismatch of hla alleles between donor and recipient impacts the outcomes of solid organ and hematopoietic stem cell transplantation ( ). in addition, mismatching for certain hla loci is also reported to provide benefit in terms of graft versus leukemia effect ( ). each of the reported studies is unique in itself as they describe the molecular basis of disease associations, hla matching and anti-hla antibody formation that are relevant for transplantation. besides, studies also report some relevant and associated clinical information, e.g. different hla-b subtypes are reported to be associated with clinical categories under spondyloarthropathies ( ).
there are other studies that implicate hla allele association with the composition of the gut microbiome and diseases ( – ). the expanse of this information is immense as there is wide genetic variability and heterogeneity among populations ( ). although advancements in hla typing technologies have been beneficial in identifying novel hla sequences ( ), they have also led to the same hla allelic variant being reported under different hla nomenclature. with the rapid increase in biomedical data, hla alleles and their associations in multiple diseases, it becomes imperative to create a platform with structured information to query and retrieve relevant information. current knowledge about hla is limited to individual papers that can be searched through pubmed, or to reviews in which a subset of studies has been summarised. hitherto, there exists no database that compiles the existing hla-related information in an organised framework. in the absence of such a repository, and with gaps in meta-information, resource sharing among researchers and clinicians becomes a big challenge. the integration of computer sciences with biomedical research has accelerated progress, both in terms of novel discoveries and data structuring. natural language processing (nlp) is a method to extract relevant information from unstructured data ( ). a simple nlp pipeline contains the following components: data assembly, pre-processing and normalization, named entity recognition (ner) and relation extraction (re). the output of nlp algorithms, i.e. a structured dataset, can be used to generate insights via direct interpretation or through downstream analyses. in recent times, nlp methods have started gaining popularity in biological sciences.
for instance, rakhi et al. ( ) reported a text mining pipeline to study spice-disease associations and link phytochemicals from different spices/herbs to diseases. another report by lee et al. highlights biobert, a pre-trained biomedical language representation model that can be used for various text mining tasks like named entity recognition (ner), relation extraction (re) and question answering, specifically on biomedical datasets. similarly, pubtator central ( ) is an open access tool available via ncbi that uses text mining algorithms for assisted bio-curation of entities in literature. the tool uses ner to identify and highlight six bio-entities, viz. gene, disease, chemical, mutation, cell line and species, from abstracts and open access articles available on pubmed. another interesting report by kuleshov et al. ( ) presents a machine-compiled database for studying genotype-phenotype associations, generated by applying text mining to genome-wide association studies (gwas). all these resources work on similar text mining algorithms, but each has a different set of applications and tasks to perform. using these resources as such for hla research often overlooks the extent of variability of the hla complex and the parameters involved in this domain. for instance, pubtator central is able to mine gene names from literature, but would not pick up hla allele information, e.g. hla-drb * : when hla-drb is the search query. conventional processes to individually mine the large amount of unstructured literature available on hla research require both manpower and resources. for understanding and integrating the observations from hla studies we require knowledge of genomic datasets, i.e. diseases, snps, drugs, populations and ethnic groups, along with an understanding of the relationships between them. nlp-based text mining is an ideal approach to handle the complexity of this process and to create structured information.
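to illustrate why allele-level mentions need their own matcher, the following is a minimal python sketch, not the hla-spread implementation, of a regular expression for alleles written in the colon-delimited nomenclature (gene name, asterisk, one to four numeric fields, optional expression suffix). the pattern, the function name find_alleles and the example alleles are assumptions made purely for illustration.

```python
import re

# Assumed pattern (not taken from the HLA-SPREAD code base): matches alleles
# written as HLA-<gene>*<field>[:<field>...][suffix], e.g. "HLA-A*02:01".
HLA_ALLELE = re.compile(
    r"\bHLA-"
    r"(?P<gene>[A-Z]{1,4}\d?)"                  # A, B, C, DRB1, DQA1, ...
    r"\*(?P<fields>\d{2,3}(?::\d{2,3}){0,3})"   # one to four numeric fields
    r"(?P<suffix>[NLSCAQ])?"                    # optional expression suffix
    r"(?![\w:])",                               # do not stop in the middle of a token
    re.IGNORECASE,                              # the corpus may be lower-cased
)


def find_alleles(text):
    """Return allele mentions in reading order, normalised to upper case."""
    alleles = []
    for match in HLA_ALLELE.finditer(text):
        suffix = match.group("suffix") or ""
        alleles.append(
            f"HLA-{match.group('gene')}*{match.group('fields')}{suffix}".upper()
        )
    return alleles


if __name__ == "__main__":
    # Hypothetical sentence; the alleles are illustrative examples only.
    sentence = "hla-a*02:01 and HLA-DQB1*06:02:01:01N were typed in the hla-drb1 region"
    print(find_alleles(sentence))
```

a gene-level mention without an asterisk is deliberately not returned by this sketch, and spellings that omit the colons would need additional patterns or a curated dictionary.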
we provide hla-spread (figure ) as a platform for integrated hla resources, developed using nlp to understand the complexity of this locus. the resource provides a platform to summarize hla related genomics knowledge as well as to design and develop new hypotheses. in this study, we have used ~ million publicly available peer-reviewed abstracts. we extracted biomedical entities including hla alleles, diseases, snps, drugs and geographical locations. we also attempted to assign positive and negative relationships between diseases and alleles. this hla connectivity was then used to address biologically and clinically relevant objectives such as hla biomarkers and risk and protective alleles for various diseases.

material and methods

data retrieval

medline, which comprises more than million peer-reviewed articles from over scholarly journals, was used as the source of biomedical literature. bulk data was downloaded from the ftp server in xml format. hla alleles with nomenclature were downloaded from the ipd-imgt/hla database ( ). to maintain uniformity in disease names and their ids, we used mesh keywords from umls (unified medical language system). drugs associated with side effects were obtained from sider . and the allele frequency net database (afnd) ( , ). allele frequencies of hla alleles were also taken from afnd. extensive pre-processing was done on all the datasets before they were used in the pipeline.

pre-processing and keywords dictionary

pubmed parsing: a modified version of pubmed parser was used to extract the pmid, title, abstract, publication date, journal, article type and authors’ information from the medline biomedical literature dataset ( ).
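the record extraction step can be illustrated with a minimal sketch. it does not reproduce the modified pubmed parser used in the study; it swaps in python's standard xml.etree.ElementTree module, and the sample record, the function name parse_records and the choice of fields are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed MEDLINE-style input used only for illustration.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <ArticleTitle>HLA alleles and an example disease</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">Background sentence.</AbstractText>
          <AbstractText Label="RESULTS">Result sentence.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000002</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <ArticleTitle>Record without an abstract</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_records(xml_text):
    """Return one dict per article that has a PMID, a title and an abstract."""
    records = []
    root = ET.fromstring(xml_text)
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        title = article.findtext("MedlineCitation/Article/ArticleTitle")
        journal = article.findtext("MedlineCitation/Article/Journal/Title")
        # Only the element text is joined, so section labels stored in the
        # Label attribute (BACKGROUND, RESULTS, ...) are left out.
        parts = ["".join(node.itertext()).strip()
                 for node in article.iter("AbstractText")]
        abstract = " ".join(p for p in parts if p)
        # Incomplete records are skipped, mirroring the filtering in the text.
        if pmid and title and abstract:
            records.append({"pmid": pmid, "title": title,
                            "journal": journal, "abstract": abstract})
    return records
```

in this sketch the second sample record is dropped because it has no abstract. subheadings that appear inline in the abstract text itself would still need the separate removal step described in the methods.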
only records with the above information were considered for further analysis and stored in a tabular format. all the subheadings in the abstract, viz. background, introduction, objective, method, experimental design, result, discussion, importance, setting, design, study objective, patients, participants and conclusion, were removed.

disease dictionary: mentions of disease keywords were identified using a dictionary created from umls mrconso.rrf ( ). umls is a set of biomedical vocabularies that includes data from omim, gene ontology, clinical repositories, medical subject headings (mesh) and ncbi taxonomy. in this study, we used mesh descriptors including entry term (et), main heading (mh), preferred entry term (pep), descriptor sort version (dsv) and machine permutation (pm). descriptor entry version (dev) was excluded as keywords belonging to this category were incomplete, e.g. abdominal injury was reported as abdominal inj. each descriptor is assigned a unique mesh id, which is stored in a hierarchical format with head categories along with a unique descriptor id. for our analysis, we termed the root form of the disease as level-zero and top-level diseases as level-one. multiple forms of a disease, such as diabetes insipidus, diabetes mellitus, type diabetes, juvenile-onset diabetes and others, are assigned the same mesh id. this dataset was also supplemented with keyword variants such as plural and lemmatised forms to increase the search space.

hla dictionary: keywords for hla alleles and their nomenclature were fetched from the centralized repository of the international immunogenetics project (imgt) database. imgt is updated quarterly with the submission or deletion of alleles and their nomenclature, and currently houses , alleles. many reports do not follow the conventional hla allele nomenclature, which makes mapping a strenuous task.
to capture as many hla alleles as possible, we created a dataset comprising all possible keywords, including forms with special characters removed wherever required. we have also attempted mapping all the old nomenclature to the current allele names. this dictionary also includes a few generic hla keywords like hla class i, hla class ii, hla linked and hla associated. a few alleles based on old nomenclature belong to more than one antigenic group, hence they were put under the “broad antigen” category. a few haplotypes that were a combination of more than one hla allele were grouped in the “haplotype” category.

named entity recognition

keyword matching across abstracts

a python-based ner pipeline was implemented to filter abstracts based on a dictionary matching approach using parallel multiprocessing. disease and hla allele keyword dictionaries were used for initial screening. abstracts were converted to lower case with special characters removed and, if a match was found in either the title or the text, the abstract was sentence tokenized using the sentence tokenizer of the python natural language toolkit (nltk). we encountered a great extent of variability in disease keywords. most of it arose from special characters like (-) and (‘) in the keyword, or from plural and singular forms. to deal with the former, we also kept instances of sentences where special characters were not removed; this increased the search space and enabled the capture of keywords such as stevens-johnson syndrome (stevens-johnson syndrome), graves' disease (graves disease). to tackle the latter, our disease dictionary was already enriched with plural and lemmatized forms of keywords.
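the dictionary matching step above can be sketched as follows. this is a simplified illustration, not the authors' pipeline: the two small dictionaries are hypothetical, a naive regex splitter stands in for the nltk sentence tokenizer, and multiprocessing is omitted. each text is searched in two forms, with and without special characters, which is how the sketch mirrors the handling of keywords such as stevens-johnson syndrome.

```python
import re

# Hypothetical miniature dictionaries; the real ones are built from
# UMLS/MeSH and IPD-IMGT/HLA as described in the text.
DISEASE_TERMS = {"stevens-johnson syndrome", "stevens johnson syndrome",
                 "graves' disease", "graves disease"}
HLA_TERMS = {"hla-b", "hla class i"}


def normalise(text, keep_special):
    """Lower-case the text; optionally replace special characters by spaces."""
    text = text.lower()
    if not keep_special:
        text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def find_terms(text, terms):
    """Return dictionary terms found in the raw or the stripped form of text."""
    hits = set()
    for variant in (normalise(text, True), normalise(text, False)):
        for term in terms:
            if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", variant):
                hits.add(term)
    return hits


def split_sentences(text):
    # Naive splitter used here in place of NLTK's sentence tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def screen_abstract(title, abstract):
    """Keep sentences that mention at least one disease and one HLA term."""
    full_text = title + " " + abstract
    if not (find_terms(full_text, DISEASE_TERMS)
            and find_terms(full_text, HLA_TERMS)):
        return []
    kept = []
    for sentence in split_sentences(abstract):
        diseases = find_terms(sentence, DISEASE_TERMS)
        alleles = find_terms(sentence, HLA_TERMS)
        if diseases and alleles:
            kept.append((sentence, sorted(diseases), sorted(alleles)))
    return kept
```

a sentence is kept only when both a disease term and an hla term occur in it, which corresponds to the co-mention criterion used for the later steps.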
for hla allele keywords, word boundary-based regex matching was implemented to search for alleles in the sentences. sentences with at least a single mention of both hla allele and disease keywords were considered for further steps.

identification of tags: populations, drugs and snps

populations: the filtered abstracts were processed using the spacy nlp tagging algorithm (model: en_core_web_md) to search for mentions of populations in the text. of the two output tags, i.e. gpe (geo-political entities) and norp (nationalities or religious groups), we selected the keywords carrying the latter, as the gpe tag often reported scientific names of organisms as populations when applied to biomedical data, e.g. chlamydia spp. and chlamydomonas spp. were reported under gpe tags. the output was classified into countries and ethnic groups for further analysis with the help of an expert anthropologist. the obtained list was also manually curated to remove plural and inappropriate entries.

drugs: information on drugs with side effects was taken from the sider database (sider . ). we also added drugs from afnd whose information was missing in sider. the list of drugs was mapped across the dataset to check for occurrences in the selected hla related abstracts. there were many instances where drug names were a subpart of disease keywords, e.g. “insulin” was obtained as a false match wherever it was present as part of the disease name “insulin dependent diabetes mellitus”. a small python snippet was written to remove such false positives.

snps: snp ids were mapped across abstracts of the hla dataset using the regex module of python. the algorithm iteratively searched for all instances of rsids using the regular expression “[rr][ss][ - ]{ ,}”.
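the three regex-based steps in this subsection (allele matching, removal of drug names that are part of a disease keyword, and rsid extraction) can be sketched as below. the function names and example keywords are hypothetical. the rsid pattern is an assumption: the digit class 0-9, the minimum of one digit and the added word boundaries are choices made for this sketch, since the quantifier in the expression quoted above is not legible.

```python
import re

# Assumption: digits 0-9, at least one digit, and word boundaries around
# the identifier; only the "rs" prefix is certain from the text.
RSID_PATTERN = re.compile(r"\b[Rr][Ss][0-9]{1,}\b")


def bounded(keyword):
    """Build a pattern that matches keyword only as a whole token."""
    return r"(?<!\w)" + re.escape(keyword.lower()) + r"(?!\w)"


def find_rsids(sentence):
    """Return the distinct rsIDs in a sentence, lower-cased and sorted."""
    return sorted({m.group(0).lower() for m in RSID_PATTERN.finditer(sentence)})


def find_alleles(sentence, allele_keywords):
    """Return allele keywords that occur as whole tokens in the sentence."""
    lowered = sentence.lower()
    return [kw for kw in allele_keywords if re.search(bounded(kw), lowered)]


def find_drugs(sentence, drug_names, disease_hits):
    """Return drug names that occur outside any matched disease keyword."""
    lowered = sentence.lower()
    # Blank out matched disease keywords (longest first) so that a drug name
    # embedded in a disease name cannot produce a false match.
    for disease in sorted(disease_hits, key=len, reverse=True):
        lowered = lowered.replace(disease.lower(), " " * len(disease))
    return [d for d in drug_names if re.search(bounded(d), lowered)]
```

blanking the disease span rather than deleting it keeps character offsets stable, which is convenient if match positions are recorded later; whether the original snippet worked this way is not stated.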
all the tags captured in the various sentences of abstracts were stored as lists of strings along with their respective pmids to facilitate future access.

semantic assessment

n-gram evaluation and manual labelling

an n-gram is a contiguous sequence of n items (which can be syllables, letters or words) in a text, used to determine the context of those items in a sentence or paragraph. we used the nltk functions wordnetlemmatizer, wordpuncttokenizer and collocationfinder to create a corpus of n-grams (n= , and ) from the abstract dataset. after removal of stop words, which do not add significant meaning to the context, a subset consisting of all reported verb/adverb (n= ) and adverb-verb (n= , ) combinations above a frequency cut-off was filtered out using the part of speech (pos) tags of the tokenised words. we observed that n-grams for negative labels often gave misleading information, e.g. “hla-b negative” refers to the absence of the allele rather than a negative association between entities. hence, we used very stringent criteria for choosing negative labels. manual annotation of positive and negative labels was then carried out on this dataset, and a total of labels (supplementary table ) were categorised ( positive and negative) for labelling the sentences. we assert a positive label where the hla allele is positively associated with the disease, so that its presence makes individuals susceptible to the disease, whereas in negative statements the hla allele is negatively associated with the disease and hence protective against it. we also considered negation words like “not, none, no”, which, if present, can reverse the actual meaning of a sentence. instances of the above-mentioned three keyword sets (positive, negative and negation) were iteratively searched for in all the sentences. further, a coding scheme was constructed using a binary layout to label sentences as positive, negative, complex or ambiguous.
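the binary coding scheme can be sketched as a lookup over three flags (positive, negative, negation). the cue lists below are hypothetical stand-ins for the curated labels in the supplementary table. two rows of the table are assumptions not stated in the text: negative plus negation is treated as ambiguous, and the case where all three flags are set is treated as complex.

```python
import re

# Hypothetical cue lists, for illustration only.
POSITIVE = {"susceptibility", "risk factor", "predisposes"}
NEGATIVE = {"protective", "protection", "resistance"}
NEGATION = {"not", "no", "none"}

# Binary layout: (positive, negative, negation) -> label.
CODING = {
    (True, False, False): "positive",
    (False, True, False): "negative",
    (True, True, False): "complex",
    (True, True, True): "complex",     # assumption
    (True, False, True): "ambiguous",
    (False, True, True): "ambiguous",  # assumption
}


def contains(sentence, phrases):
    """True if any phrase occurs as a whole word or phrase in the sentence."""
    lowered = sentence.lower()
    return any(re.search(r"\b" + re.escape(p) + r"\b", lowered) for p in phrases)


def label_sentence(sentence):
    flags = (contains(sentence, POSITIVE),
             contains(sentence, NEGATIVE),
             contains(sentence, NEGATION))
    # Anything not covered by the table, including no match at all,
    # falls back to "others".
    return CODING.get(flags, "others")
```

keeping the scheme as an explicit table makes every combination of flags visible, so the two assumed rows can be changed without touching the matching code.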
sentences having no match from any of the categories were labelled as others.

root-verb and associated adverbs using dependency parsing

dependency parsing refers to the formation of a tree layout based on the grammatical structure of a sentence, where the root node is represented by a verb that relates the different entities of that sentence. the allele and disease keywords present in each sentence were replaced with @gene and @disease tags, and a parse tree was generated using the stanfordcorenlp python module (stanford-corenlp-full- - - package). the list of verbs obtained from the root nodes of all the sentences in the dataset was manually curated under positive and negative labels. we also added a category “studied/investigatory” for sentences that do not convey any positive or negative context but mention both entities together, e.g. “to investigate the association of hla-a, b, and drb alleles with leukaemia in the han population in hunan province”.

sentence annotation

we termed our approach for labelling sentences a “hybrid approach”, as annotation was done using both n-gram labels and the type of root verb. if a sentence had a positive n-gram label and a positive root verb, implying that the relationship between the entities is one of association or linkage, the sentence was labelled as positive. the same approach was used for negative labelling. finally, the sentence labels were grouped into different categories: positive; negative; both positive and negative, referred to as complex sentences; positive+negation, referred to as the ambiguous group; and investigatory.

database and web server

the hla-spread database is built for quick and easy retrieval of information related to hla genes.
the web interface was coded in html , css , bootstrap and es . we used d .js for data visualization and jquery datatables for table integration. the server was hosted using apache http server. the database uses a flat file system with data stored in an excel file. javascript handles the search queries and data visualizations.

results

mining medline literature for hla association

nlp based text mining of million publicly available biomedical abstracts yielded abstracts with one or more sentences that describe the relationship between hla alleles and diseases. to understand the distribution of article types among the filtered abstracts, we studied the article type per year trend from to (figure ). we found research journal, comparative study and review articles to have the highest numbers every year. in addition, there were papers corresponding to phase i, ii, iii and iv clinical trials and observational studies, highlighting the importance of this locus in translational studies.

hla genes, alleles and their distribution

there are , alleles, and we hypothesize that not all of them would be associated with a disease or pathological condition. while collating and analysing data on hla alleles, we observed a great extent of variability in the names within articles, e.g. hla-b* : , a risk factor for dapsone hypersensitivity syndrome in multiple populations, was written as hla-b* : , hla-b* , b* , b(*) and b in different papers. in such instances, a user searching for an allele and its related information must be aware of all possible formats of writing the allele, encompassing its current and previous nomenclature.
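mapping such spelling variants onto one standard name can be sketched as below. the sketch assumes that variants differ only in the hla- prefix, the asterisk style and the colon between fields, which is narrower than the variation found in real text. the gene list, the function names and the example allele used in the usage note are illustrative choices, not values taken from the study.

```python
import re

# Illustrative subset of HLA gene names accepted by this sketch.
GENES = "a|b|c|e|f|g|drb[1-5]|dqa1|dqb1|dpa1|dpb1"

# Optional "hla-" prefix, gene, optional "*" or "(*)", a two-digit first
# field and an optional two-digit second field with or without a colon.
ALLELE_PATTERN = re.compile(
    r"^(?:hla-)?(?P<gene>" + GENES + r")(?:\*|\(\*\))?"
    r"(?P<f1>\d{2})(?::?(?P<f2>\d{2}))?$"
)


def standardise(raw):
    """Return 'HLA-GENE*ff:ff' for a recognised variant, otherwise None."""
    match = ALLELE_PATTERN.match(raw.strip().lower().replace(" ", ""))
    if match is None:
        return None
    name = "HLA-" + match.group("gene").upper() + "*" + match.group("f1")
    if match.group("f2"):
        name += ":" + match.group("f2")
    return name


def two_digit(standard_name):
    """Collapse a standard name to its first field (the two-digit level)."""
    return standard_name.split(":")[0]
```

for example, the hypothetical spellings HLA-B*15:02, HLA-B*1502, B*1502, B(*)1502 and B1502 all map to the same standard name under these assumptions, and two_digit reduces that name to its first field, as used for the summary graph in the results.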
based on this, we converted all existing hla keywords to a standard allele name. we identified only ~ % of the total alleles to be associated with conditions like diseases, graft survival or drug reactions. to represent these alleles in the form of a graph, we collapsed the nomenclature to the two-digit level (figure ). the majority of the studies were on the hla-drb locus, followed by hla-b and hla-a, while fewer studies were on the hla-c locus. each hla allele, collapsed to its two-digit form, is linked to the afnd server, which reports its allele frequency. the focus of our present study was also to understand the semantics between alleles and diseases, wherein we noted that some alleles were reported as protective and some as risk alleles, e.g. some reports indicated that hla-drb * was protective for hiv and diabetes, whereas other studies reported it as a risk allele for pulmonary tuberculosis. we were also interested in exploring the effects of multiple alleles individually on a single disease. to address this, we listed out articles (supplementary table ) highlighting the fact that, for a single disease, different alleles can have contrasting effects, e.g. hla-dqa * : and hla-dqb * : can be protective in artemisia pollen-induced allergic rhinitis while hla-dqa * : can be a risk factor ( ).

exploring diseases, their associated categories and other relevant information

the hla studies were divided into four broad categories to study the information systematically: diseases, transplantations, sign and symptoms, and therapeutics/adrs. this grouping was done based on the mesh keywords identified in the abstracts.
there is a total of categories for diseases in mesh, ranging from c to c , and transplantation procedures are listed under e . keywords falling under c were grouped as “sign and symptoms”, and c . (gvhd) and e were grouped as “transplantations”. for “therapeutics/adrs”, we selected only those sentences that mentioned drug keywords, an allele name and disease names together. we then kept them if they satisfied any of three conditions: the disease belongs to the drug adverse reactions category; the sentence mentions keywords such as reactions or -induced (carbamazepine-induced); or the disease keyword contains -induced (drug-induced liver injury). the remaining sentences were grouped as “diseases”. table shows the number of articles under each category. to study the association with diseases, we analysed data from both the “diseases” and “transplantation” categories. inconsistency in writing disease names increases the effort needed to search for a specific query. to reduce this variability, the mesh id was used to summarise the obtained information, e.g. diseases like tumour, cancer, malignancy and neoplasm (malignant and benign) were mapped to a single entity, malignancy (d ). collapsing a large number of similar keywords to a single id reduces the complexity of searching for articles related to particular diseases. we observed a total of different disease terms mapping to unique mesh ids. figure represents a snapshot of common hla associated diseases. to examine the disease associations, we mapped them to level-one (level-zero) terms. diabetes mellitus type , rheumatoid arthritis, multiple sclerosis (autoimmune disease), melanoma and leukemia (neoplasms by histologic type), psoriasis (skin disease) and celiac disease (metabolic) were the topmost hla associated diseases.
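the collapsing of disease name variants onto a single mesh id, described earlier in this section, can be sketched with a small lookup table. the ids below are placeholders and not real mesh identifiers, and the function name collapse is hypothetical.

```python
from collections import defaultdict

# Hypothetical term-to-ID table; "D_EXAMPLE_1" and "D_EXAMPLE_2" are
# placeholders, not real MeSH IDs.
TERM_TO_ID = {
    "tumour": "D_EXAMPLE_1",
    "cancer": "D_EXAMPLE_1",
    "malignancy": "D_EXAMPLE_1",
    "neoplasm": "D_EXAMPLE_1",
    "rheumatoid arthritis": "D_EXAMPLE_2",
}


def collapse(mentions):
    """Group disease mentions by ID; mentions without an ID are dropped."""
    grouped = defaultdict(set)
    for term in mentions:
        mesh_id = TERM_TO_ID.get(term.lower())
        if mesh_id is not None:
            grouped[mesh_id].add(term.lower())
    return dict(grouped)
```

counting the keys of the returned dictionary gives the number of unique ids, and the sizes of the sets show how many name variants each id absorbs.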
in the analysed abstracts, the list of hla associated diseases/conditions indicates that some diseases were very frequently reported, whereas others, like down syndrome, guillain-barre syndrome and polymyalgia rheumatica, were infrequently or rarely reported. supplementary table represents the distribution of both common and less explored hla associated diseases. to get an overall perspective of genes and diseases, we considered the diseases at level-one along with the hla gene. we observed the majority of reported associations with hla-drb , followed by hla-b and hla-a (figure ). we also listed details of individual allele-disease pairs for more information (supplementary table ). hla-drb was reported to be linked with conditions like rheumatoid arthritis, type diabetes, multiple sclerosis, melanoma and other diseases. hla-b was reported to be associated with spondylitis, infections, hypersensitivities, psoriasis, drug allergies and other diseases, and hla-a with melanoma, leukemia, influenza, haemochromatosis and other diseases. the analysis also takes into consideration diseases that require transplantation, and includes the associated complications both pre- and post-transplantation. as anticipated, we observed that individuals suffering from beta thalassemia and sickle cell anaemia (genetic and congenital disorders), multiple myeloma (an immunoproliferative disorder) and liver injury underwent transplantations of bone marrow, hematopoietic stem cells and renal tissue. however, additional details were included with the transplantation data, such as the disease history of patients before undergoing transplantation, e.g.
psoriasis, graves’ disease and diabetic neuropathy, and post-transplantation complications, e.g. ischemia, necrosis, fibrosis and haemorrhage. such collated information under one platform may be of interest to a clinician for designing therapy modules. supplementary table presents details of transplantation related studies.

snps and hla diseases

hla loci have a repertoire of genetic variations, a large number of which have been linked to multiple diseases via genome-wide association studies (gwas). though gwas list information about snps in or associated with hla genes, a number of genetic variation studies go unnoticed, either because they are small cohort analyses or because they are not compiled in a single resource for systematic study. thus, to include the overlooked studies and missing information, this analysis reports information from all kinds of studies and includes abstracts mainly from journal articles, reviews, meta-analyses, letters and clinical trials. to acquire robust data, we retained only those hla variations that are present in sentences along with the disease and allele keywords. we identified unique snp mentions, and their details are compiled in supplementary table . the majority of snps mapped to intronic variants, followed by missense and intergenic variants. figure represents the genomic distribution of the mapped snps. a substantial number of variations also mapped to genes other than hla, indicating that they may be in linkage disequilibrium (ld) or frequently occur in conditions such as transplantation success or adrs. we observed top hits of snps mapping to infectious diseases like hiv and hepatitis, inflammatory conditions like psoriasis, complex diseases like asthma and diabetes, and hypersensitivity largely attributed to drug adrs. snp association studies are also often based on a proxy snp, which can be in ld with the causal variant, and ld values vary from one population to another.
to address this, we also added population information for the studies whenever it was available in the abstract. the most studied snp, rs , associated with hepatitis b virus, has been studied across a large number of populations from asian and central asian countries like china, japan, asia, turkey, korea and indonesia.

geographical spread of hla literature across various ethnic groups and populations

genetic differences in hla genes across populations, and their link with biological conditions, make it imperative to consider geographical information while studying hla association with a particular condition. we assumed that the name of the population or ethnic group might not be present in the same sentence that mentions hla and disease, so we used a flexible approach here and fetched the names of geographical locations present anywhere in the abstracts. in total, we reported norp tags, mapping to unique geographical entities. these unique tags were binned into country-based populations and ethnic groups. figure represents the frequency distribution of the matched populations belonging to these countries and ethnic groups. japan, china, usa, india and italy are the major countries where hla gene-disease association studies have been reported with disease groups, as shown in supplementary table . along with this, the european subcontinent has been extensively studied ( unique reports) as a major ethnic group. apart from frequently studied areas, we also observed locations like new zealand, armenia and sri lanka that have a low number of reported studies.
this type of analysis can help researchers understand not only the extent of allele-disease associations among populations in the context of these immune players, but also the scope of research in their selected geographical location while planning their hypothesis.

response to therapeutics

hla genes are known to be associated with various hypersensitivities and drug reactions, a few of which, like stevens-johnson syndrome, can be life-threatening. these hypersensitivities vary because of allele differences at the individual and population level, and thus studying these pharmacogenetic markers together with population information becomes important. for instance, we observed from our data that hla-a* : is associated with carbamazepine induced stevens-johnson syndrome in the european population, while hla-b* : is associated with it in chinese and indian populations. a meta resource like hla-spread can help understand such population-wise differences, which obstruct the design of therapy modules for adrs/hypersensitivities. to be more specific, this analysis focuses on drugs that are present in sentences along with the disease and allele keywords. we observed a total of abstracts mentioning unique drugs, of which mapped to the adr category. details of the drugs and related information are listed in supplementary table . we also validated our results against afnd, a manually curated database that has information about adrs. out of drugs present in afnd, we were able to find in common. one of the drugs mentioned in afnd, “valproic acid”, was not present in the actual cited article. the remaining drugs could not be captured because of the stringent criteria of drug mapping, i.e. the drug name should be present in the sentence along with the disease and allele keywords. figure lists the frequency-based distribution of the top drugs fetched from our analysis. interestingly, we also observed drugs that are not mentioned in the afnd database, e.g.
the hla-b* : : allele was found to predict carbimazole/methimazole induced agranulocytosis, and hla-drb was associated with azathioprine induced pancreatitis in ibd patients. this analysis highlights how information can be missed in manual curation, in addition to its time and manpower intensive nature.

insights from hla-spread: biomarker analysis

we demonstrate the usability of the database to address clinically relevant queries. multiple questions on the identification of hla alleles and diseases linked with hypersensitivity, allergy, genetic markers, prognosis and diagnosis can be addressed using hla-spread. as an example, we present an analysis to identify biomarkers in hla studies. to address this question, we used an n-gram based approach to identify the keywords most frequently occurring with “marker” in the sentences. supplementary table lists the most common keywords identified. we checked the details of such sentences and compiled the information (supplementary table ). a few of them, like abacavir hypersensitivity and sjs, were present in multiple papers. hla-g and hla-e were also reported to be markers for conditions like tumour, transplantation and heart diseases.

discussion

hla alleles are known to be associated with a large number of diseases. there is no existing repository that summarises this information in a systematic manner. manual curation is a cumbersome process, and one might also miss a lot of important information. the need for such a user-friendly platform increases significantly since hla alleles have been found to be clinically associated with a large number of conditions. nlp based text mining offers a way to fetch this information pragmatically.
nlp is instrumental in extracting information from unstructured data. this method has started assuming immense importance in the biomedical domain. a few papers like gwaskb and snp literature have used it for extracting information such as snps and related knowledge from biomedical data, whereas the monarch initiative has used it for studying phenotype information ( ). extracting information from hla related literature is very difficult owing to the large number of studies and the complex nomenclature. this project is an attempt to consolidate all the hla relevant information, such as snps, populations studied, adrs and associated diseases, into a structured database. this resource is also handy for user-specific advanced hla searches, like looking for biomarkers for toxicity-based studies and disease progression. there were a few drawbacks of this analysis worth highlighting, primarily arising from the different formats of various journals. the initial tokenised data used in the analysis was based on english stop words. however, we observed that in a small set of papers the authors omitted full stops or spaces, which led to the fusion of two sentences. the subheadings were present in different cases and often followed by different special characters, which complicated their removal. also, prefixes of keywords like settings, study design, etc. were observed in a few sentences, as those papers did not follow standard headlines. apart from these, a few other parameters, like abbreviations at the end of sentences, the presence of roman letters in sentences, and different bracket and quote styles in titles, caused errors during the tokenisation process. similarly, it was observed that when abstracts were updated in new releases, the previous incorrect entries were not removed, which led to duplication of information.
since hla-spread has catalogued information from diverse resources, in many instances it provides information that is more informative and exhaustive. for instance, besides information retrieved from databases like disgenet and omim (mendelian), which report information on a few diseases, we also used mesh, which is more comprehensive as it houses variant disease terms mapping to diseases. we also reduced the high variability in the way disease names are mentioned in various articles. on average, a disease has around names with one id, showing the wide spectrum of the disease dictionary required to capture all possible disease terms. in order to capture hla and adrs, we selected a list of drugs from sider . . however, not all drugs present in a side effect database will be associated with adrs. to get a more specific answer, we selected drugs from categories such as adverse drug reactions, hypersensitivity and toxicity. we were able to fetch a large number of studies and observed that the afnd database has missed quite a few drugs in the adr analysis. we thus added information from both afnd and sider to get heuristic information for a set of different drugs. there were a few unique aspects that we could capture because of our approach. for instance, in transplantation studies, in addition to listing the different kinds of transplantations, we also observed the most common diseases that required transplantation and the drugs given during the process, along with a few side effects. another unique aspect we added was a category called signs and symptoms, to simplify user searches.
for instance, some users may also be interested in knowing the context of hla alleles with conditions like inflammation, relapse, hypoxia, septic shock, diarrhoea, etc. we aim to add a few features in future updates, for example mapping the variants reported in dbsnp, omim and clinvar to the hla alleles. this would help in the seamless integration of high-throughput variation data with the wealth of hla information in the literature and the hla alleles reported in the imgt database. to summarise, this is a one-of-its-kind effort to integrate the diversity of hla information into a structured format for ease of query and analysis. it could also provide an informative resource for non-hla specialists initiating new studies in populations and diseases.

acknowledgements

the authors acknowledge coe m/o ayush grant mlp- to mm and dd, the srf fellowship to dd from the department of biotechnology (dbt), and dr. yatender kumar (nsit) for permitting ak to work on this project. we also acknowledge mr praveen sinha for designing and developing the webpage of hla-spread, dr. debasis dash, csir-igib, for critical reviewing of the work, dr. ganesh bagler and rudransh tunwani from iiitd for nlp discussion, dr. ganganath jha from hazaribagh university for qc of population curation and malika seth for qc of semantic annotations. the authors would also like to acknowledge mr. raghunandanan mv and mr. amit khulve at csir-igib for it support.

authors contributions

mm and dd designed the study and co-wrote the manuscript. dd and ak executed the entire work. uk helped in hla analysis, interpretation and manuscript writing.

list of figures

figure . workflow of hla-spread: an automated pipeline developed to extract information from ~ , hla related studies retrieved from over million abstracts. structured information from these abstracts was created using natural language processing methods and developed into a database, hla-spread. the various resources used at each step are indicated.

figure . nature and trends of hla related publications in pubmed annually from onwards: the stacked bar plot shows the distribution of pubmed articles in different categories. a) diverse studies including clinical trials are reported, with maximum numbers represented in the “journal article” category. b) a subplot of (a) after removing the most frequent “journal article” type to visualise the trends in other categories.

figure . the topmost reported hla alleles associated with diseases: all the hla alleles indicated have been grouped to their second digit and represented in the pie chart. hla-a, hla-b and hla-drb are the most studied amongst the hla genes.

figure . diseases/conditions associated with hla genes: the graph represents a three level hierarchy of diseases. each colour represents a level. there are major categories, represented in green, which are further divided into subcategories. each disease name is matched to its mesh id and a normalised mesh keyword. autoimmune, neoplasms and joint disease are the topmost associated diseases. as anticipated, significant numbers of studies related to transplantation are also observed.
figure . heatmap of hla disease associations: the gradient heat map represents the number of diseases associated with hla genes. the first column represents generic “hla” studies where specific gene information is not mentioned. a large number of associations were also observed with non-classical (hla-e, f, g) genes.
figure . genomic distribution of snps: pie chart representing the number of variations in genic regions, with the majority of them mapping to introns.
figure . geographical spread of hla studies: identified geographical locations are binned to the nearest a) country b) ethnic group. the color gradient represents the count of various hla alleles with respect to disease or adr studies. china, japan and the usa report maximum studies and european, asian and african are the most studied ethnic groups.
figure . statistics of drug related hla studies: this bar plot includes the most common top drugs associated with adrs identified using hla-spread.
list of tables
table : number of articles in broad categories (columns: categories, number of pubmed abstracts; rows: diseases, transplantation, signs and symptoms, adr).
supplementary tables: https://doi.org/ . /zenodo.
adroit: an accurate and robust method to infer complex transcriptome composition
tao yang , nicole alessandri-haber , wen fury , michael schaner , robert breese , michael lacroix-fralish , jinrang kim , christina adler , lynn e. macdonald , gurinder s. atwal , yu bai , * affiliations . regeneron pharmaceuticals, inc., tarrytown ny . cellular longevity, inc., san francisco, ca *corresponding author
abstract rna sequencing technology promises an unprecedented opportunity in learning disease mechanisms and discovering new treatment targets. recent spatial transcriptomics methods further enable the transcriptome profiling at spatially resolved spots in a tissue section. in controlled experiments, it is often of immense importance to know the cell composition in different samples. understanding the cell type content in each tissue spot is also crucial to the spatial transcriptome data interpretation. though single cell rna-seq has the power to reveal cell type composition and expression heterogeneity in different cells, it remains costly and sometimes infeasible when live cells cannot be obtained or sufficiently dissociated. to computationally resolve the cell composition in rna-seq data of mixed cells, we present adroit, an accurate and robust method to infer transcriptome composition. the method estimates the proportions of each cell type in the compound rna-seq data using known single cell data of relevant cell types.
it uniquely uses an adaptive learning approach to correct, gene by gene, the bias due to the difference in sequencing techniques. adroit also utilizes cell type specific genes while controlling for their cross-sample variability. our systematic benchmarking, spanning from simple to complex tissues, shows that adroit has superior sensitivity and specificity compared to other existing methods. its performance holds for multiple single cell and compound rna-seq platforms. in addition, adroit is computationally efficient and runs one to two orders of magnitude faster than some of the state-of-the-art methods.
biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission.
introduction rna sequencing is a powerful tool to address the transcriptomic perturbations in disease tissues and help understand the underlying mechanism to develop treatments. due to the presence of heterogeneous cell populations, the bulk tissue transcriptome only characterizes the averaged expression of genes over a mixture of different types of cells. the identity of individual cell types and their prevalence remain unelucidated in the bulk data. however, knowledge of the cell type composition and gene expression perturbation at the cell type level is often critical to identifying disease-manifesting cells and designing targeted therapies. for instance, the constitution of stromal and immune cells sculpts the tumor microenvironment that is essential in cancer progression and control. excessive expression of cytokines in particular leukocyte types underlies the etiology of many chronic inflammatory diseases. such information cannot be directly read out from the bulk rna-seq.
recent breakthroughs in spatial transcriptomics methods enable transcriptome-wide characterization of gene expression at spatially resolved locations in a tissue section. however, it remains challenging to reach single cell resolution while measuring tens of thousands of genes transcriptome-wide. some widely used technologies can achieve a resolution of - μm, equivalent to – cells depending on the tissue type. the transcripts therein may originate from one or more cell types. unlike bulk rna-seq, the profiling data at each spot contain substantial dropouts as merely a few cells are sequenced, imposing additional challenges in resolving the cell type content. we refer to bulk rna-seq and spatial transcriptomics data at the multi-cell resolution as compound rna-seq data hereafter. the rapid development of single-cell rna-seq (scrna-seq) technologies has allowed for cell-type specific transcriptome profiling. it provides the information missing from the compound rna-seq data. nevertheless, the technologies have low sensitivity and substantial noise due to the high dropout rate and the cell-to-cell variability. consequently, scrna-seq technologies require a large number of cells (thousands to tens of thousands) to ensure statistical significance in the results. in addition, the cells must remain viable during capture. these requirements render the scrna-seq technologies costly, prohibiting their application in clinical studies that involve many subjects or cannot allow real time tissue dissociation and cell capture.
furthermore, scrna-seq technologies may not be well suited to characterizing cell-type proportions in solid tissues because the dissociation and capture steps can be ineffective for certain cell types. as sequencing at the single cell level is not always feasible, in silico approaches have been developed to infer cell type proportions from compound rna-seq data. the most common strategy is to conduct a statistical inference through maximum likelihood estimation (mle) or maximum a posteriori estimation (map) on a constrained linear regression framework, wherein the unobserved mixing proportions of a finite number of cell types are part of the latent variables to be optimized. the deconvolution methods are often applied to dissect the immune cell compositions in blood samples. however, their performance in more complex tissues, such as the nervous, ocular, respiratory and gastrointestinal organs, remains unclear. these tissues often contain many cell types ( - ) and the difference among related cells can be subtle, rendering the deconvolution a challenging task. for example, a recent study on the mouse nervous system reported more than cell clusters, many of which are highly similar neuronal subtypes. earlier works often utilized the transcriptome profiling of purified cell populations to estimate the gene expression per cell type (e.g. cibersort). more recently, acquiring cell type specific expression from scrna-seq data was shown to be an intriguing alternative. though it provides higher throughput by measuring multiple cell types in one experiment, profiling at the single cell level is substantially noisy.
deconvolution using scrna-seq data as reference can be biased by noise unrelated to cell identities if not treated properly. moreover, the platform difference between the compound data and the single cell data cannot be ignored. to overcome these challenges, additional information from the data may be considered. a recent method that weighs genes according to their expression variances across samples greatly improved the accuracy, highlighting the importance of gene variability in inferring cell type composition. some other methods and applications have pointed out the importance of cell type specific genes. in these works, the cell type specific expression was only used to select the input genes (e.g., markers). nonetheless, it measures how informative a gene is in distinguishing cell types and thus can be incorporated as a part of the model. to address the platform difference between the compound data and the single cell data, it is usually assumed that there exists a single scaling factor or a linearly scaled bias for all genes that can be learned and corrected accordingly. this assumption rarely holds because the platform difference affects each gene differently. though learning a uniform scaling factor would correct the difference in the majority of genes, a few genes that remain significantly biased can easily confound the estimation, especially under a linear model framework. thus, a gene-wise correction should be considered. in this work, we present a new deconvolution method, adroit, a unified framework that jointly models the gene-wise technology bias, genes’ cell type specificity and cross-sample variability.
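the argument that a single scaling factor can be thrown off by a few strongly biased genes can be illustrated with a small python sketch. this is not adroit's adaptive learning procedure; it only contrasts a least-squares uniform factor with plain per-gene ratios, and the function names and toy expression values are hypothetical.

```python
def uniform_factor(sc_expr, bulk_expr):
    """One scaling factor for all genes: least-squares slope through the origin."""
    numerator = sum(s * b for s, b in zip(sc_expr, bulk_expr))
    denominator = sum(s * s for s in sc_expr)
    return numerator / denominator


def gene_wise_factors(sc_expr, bulk_expr):
    """One scaling factor per gene: the simple ratio of the two platforms."""
    return [b / s for s, b in zip(sc_expr, bulk_expr)]


# Hypothetical toy data: three genes differ two-fold between platforms,
# the fourth gene is ten-fold biased.
sc_expr = [10.0, 20.0, 30.0, 40.0]
bulk_expr = [20.0, 40.0, 60.0, 400.0]

shared = uniform_factor(sc_expr, bulk_expr)
per_gene = gene_wise_factors(sc_expr, bulk_expr)

# Residual bias left in each gene after applying the shared factor.
residual = [b - shared * s for s, b in zip(sc_expr, bulk_expr)]
print(shared, per_gene, residual)
```

in this toy case the single outlier gene pulls the shared factor well away from the two-fold difference seen in the other genes, so every gene is left with a residual bias, whereas the per-gene factors absorb the difference exactly.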
the method estimates the cell type constitution in the compound rna-seq samples using relevant single cell data as a training source. genes used for deconvolution are automatically selected from the single cell data based on their information richness. uniquely, it uses an adaptive learning approach to estimate gene-wise scaling factors, addressing the issue that different platforms impact genes differently. the model of adroit is further regularized to avoid collinearity among closely related cell subtypes that are common in complex tissues. across comprehensive benchmarking data sets of varying cell composition complexity, adroit showed superior sensitivity and specificity to other existing methods. applications to real bulk rna-seq data and spatial transcriptomics data revealed strong and expected biologically relevant information. we believe adroit offers an accurate and robust tool for cell type deconvolution and will promote the value of bulk rna-seq and spatial transcriptomics profiling. results overview of the adroit framework adroit estimates the proportions of cell types from compound transcriptome data including but not limited to bulk rna-seq and spatial transcriptome data. it directly models the raw reads without normalization, preserving the difference in total amounts of rna transcript in different cell types. the method utilizes as reference the relevant pre-existing single cell rna-seq data with cell identity annotation. it selects informative genes, estimates the mean and dispersion of the expression of selected genes per cell type, and constructs a weighted regularized linear model to infer the percent composition (fig. a).
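as a rough illustration of what a weighted regularized linear model for deconvolution can look like, the python sketch below fits non-negative coefficients by projected gradient descent and rescales them to proportions. it is not adroit's objective function, which is defined in the methods; the ridge penalty, the gene weights, the function name `deconvolve` and the toy reference matrix are all assumptions made for illustration.

```python
def deconvolve(reference, bulk, weights, lam=0.0, steps=2000):
    """Weighted ridge regression with non-negative coefficients.

    reference[g][k]: mean expression of gene g in cell type k
    bulk[g]:         expression of gene g in the compound sample
    weights[g]:      non-negative weight of gene g (hypothetical)
    lam:             ridge penalty strength (hypothetical)
    """
    n_genes, n_types = len(reference), len(reference[0])
    # Step size from a trace bound on the curvature, so the iteration is stable.
    trace = sum(weights[g] * sum(x * x for x in reference[g])
                for g in range(n_genes))
    step = 1.0 / (2.0 * trace + 2.0 * lam)
    beta = [1.0 / n_types] * n_types
    for _ in range(steps):
        residual = [sum(reference[g][k] * beta[k] for k in range(n_types)) - bulk[g]
                    for g in range(n_genes)]
        gradient = [2.0 * sum(weights[g] * residual[g] * reference[g][k]
                              for g in range(n_genes)) + 2.0 * lam * beta[k]
                    for k in range(n_types)]
        # Gradient step followed by projection onto the non-negative orthant.
        beta = [max(0.0, b - step * d) for b, d in zip(beta, gradient)]
    total = sum(beta)
    return [b / total for b in beta] if total > 0 else beta


# Hypothetical toy problem: 4 genes, 2 cell types, true mixture 0.7 / 0.3.
reference = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 0.2]]
bulk = [0.7, 0.3, 0.5, 0.76]
print(deconvolve(reference, bulk, weights=[1.0] * 4))
```

raising `lam` shrinks the coefficients toward each other, which is one generic way to stabilise estimates when reference columns are nearly collinear; the weights let informative genes count more than noisy ones.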
because sequencing platform bias impacts genes differently, a uniform scaling factor for all genes does not sufficiently eliminate such bias. a key innovation of adroit is that it adopts an adaptive learning approach, where the bias is first estimated for each gene, then adjusted such that a more biased gene is corrected with a larger scaling factor (fig. b). we also attribute the success of adroit to the consideration of a comprehensive set of other relevant factors including genes’ cross-sample variability, cell type specificity and collinearity of expression profiles among closely related cell types. the cross-sample variability of a gene is confounded with its biological expression variability due to the variety of cell types. the latter is referred to as the cell type specific expression that helps identify the cell type. adroit down-weights genes with high cross-sample variability and up-weights those with expression highly specific to certain cell types. the definition of cross-sample variability and cell type specificity also accounts for the overdispersed nature of count data. lastly, adroit adopts a linear model to ensure the interpretability of the coefficients. at the same time, adroit includes a regularization term to minimize the impact of statistical collinearity. each of the factors contributes an indispensable part to adroit, leading to an accurate and robust deconvolution method for inferring complex cell compositions. to evaluate the performance, we compared adroit with music and nnls for bulk data deconvolution, and with stereoscope for spatial transcriptomics data deconvolution.
when evaluating the algorithms, a common practice is to pool the single cell data to synthesize a “bulk” sample with the known ground truth of the cell composition. we measured the performance by comparing the estimated cell proportions with true proportions using four metrics: mean absolute difference (mad), root mean squared deviation (rmsd) and two correlation statistics (i.e., pearson and spearman). both correlations are included because pearson reflects linearity, while spearman avoids the artificially high scores driven by outliers when the majority of estimates are tiny. good estimations feature low mad and rmsd along with high correlations. when estimating cell proportions for a synthetic sample, cells from this sample are excluded from the input single cell reference (i.e., leave-one-out) to avoid overfitting. we further applied adroit to real bulk rna-seq data and validated the results by available rna fluorescence in-situ hybridization (rna-fish) data. the estimates were further confirmed by relevant biological knowledge of human pancreatic islets. we also used adroit to map cell types on spatial spots, and the accuracy was verified by in-situ hybridization (ish) images from the allen mouse brain atlas. adroit excels in datasets with both simple and complex cell constitutions we started with a simple human pancreatic islets dataset that contains cells and four distinct endocrine cell types (i.e., alpha, beta, delta, and pp cells) (extended data fig. a; supplementary table ). the synthesized bulk data were constructed by mixing the single cell data at known proportions.
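the four evaluation metrics named above can be written down in a few lines of python. the sketch uses only the standard library; the function names and the example proportions are hypothetical, and ties in the spearman ranks are handled with average ranks, which is one common convention rather than something the text specifies.

```python
import math


def mad(estimated, true):
    """Mean absolute difference between estimated and true proportions."""
    return sum(abs(e - t) for e, t in zip(estimated, true)) / len(true)


def rmsd(estimated, true):
    """Root mean squared deviation between estimated and true proportions."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimated, true)) / len(true))


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical example with four cell types.
true_props = [0.50, 0.30, 0.15, 0.05]
est_props = [0.45, 0.35, 0.15, 0.05]
print(mad(est_props, true_props), rmsd(est_props, true_props),
      pearson(est_props, true_props), spearman(est_props, true_props))
```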
though all three methods achieved satisfactory performance according to the evaluation metrics, adroit had slightly better performance as reflected by scatterplots of estimated proportion vs. true proportion (extended data fig. b, supplementary table ). it had moderately lower mad ( . vs. . for music and . for nnls) and rmsd ( . vs. . for music and . for nnls), and comparable correlations (pearson: . vs . for music and . for nnls; spearman: . vs . for music and . for nnls) (extended data fig. c). this performance was expected because there were only four cell types with very distinct transcriptome profiles. deconvoluting such data was a relatively easy task for all three methods. we then tested the methods on a couple of complex tissues that are more challenging to deconvolute. one is the human trabecular meshwork (tm) tissue. we acquired published single cell data that contain cells and cell types from donors. the data include similar types of endothelial cells, types of schwann cells and types of tm cells (supplementary fig. ; supplementary table ). cells from each donor were pooled as a synthetic bulk sample. the cell type proportions vary from < % to %. these proportions were the ground truth cell composition and were compared head-to-head with the estimated proportions inferred by adroit, music and nnls. for each synthetic bulk sample, estimations were performed using a reference built from cells of other donors (i.e., leave-one-out). in each of the samples, the estimates made by adroit best approximated the true proportions. in particular, adroit had significantly lower mad ( . ) and rmsd ( . ), and higher correlations (pearson = . ; spearman = . ), compared to music (mad = . ; rmsd = . 
; pearson = . ; spearman = . ) and nnls (mad = . ; rmsd = . ; pearson = . ; spearman = . ) (fig. a). we further assessed the deviation of the estimates from the true proportions for each cell type. adroit consistently had the lowest deviations from the true proportions for all cell types, as well as the lowest variation among samples (fig. b, blue dots), indicating a higher robustness over various cell types and samples. notably, adroit only missed one rare cell type (true proportion = . %) out of cell types in one sample, while music missed to cell types in of the samples, and nnls missed to cell types in all samples (supplementary fig. , supplementary table ). adroit has better sensitivity and specificity we next systematically addressed the sensitivity and specificity of these algorithms. in the context of cell type deconvolution, a false negative occurs when the proportion of an existing cell type is predicted to be zero (or below a given threshold). conversely, a non-zero prediction (or one above a given threshold) for an absent cell type results in a false positive. false negatives and false positives measure the sensitivity and specificity of a deconvolution algorithm, respectively. both quantities are crucial to establish the utility of the algorithm. particularly, in real world applications, it is often difficult to know a priori what cell types exist in a bulk sample, so users may inform the algorithm to consider more possible cell types than are actually in the sample. false positive predictions in this situation would make the algorithm unusable. we designed a simulation to test the sensitivity and specificity.
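the definitions of false negatives and false positives given above translate directly into a small python function. this is a minimal sketch of those definitions only; the function name, the dictionary layout and the example proportions are hypothetical, and a cell type is treated as detected when its estimate is strictly above the threshold.

```python
def sensitivity_specificity(true_props, est_props, threshold=0.0):
    """Score one deconvolution result.

    true_props / est_props: dicts mapping cell type -> proportion.
    A cell type is 'present' if its true proportion is above zero and
    'detected' if its estimated proportion is above the threshold.
    """
    tp = fn = tn = fp = 0
    for cell_type, true_p in true_props.items():
        detected = est_props.get(cell_type, 0.0) > threshold
        if true_p > 0:
            if detected:
                tp += 1
            else:
                fn += 1  # present but predicted absent: false negative
        else:
            if detected:
                fp += 1  # absent but predicted present: false positive
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Hypothetical example: two present and two absent cell types.
truth = {"a": 0.6, "b": 0.4, "c": 0.0, "d": 0.0}
estimate = {"a": 0.58, "b": 0.40, "c": 0.02, "d": 0.0}
print(sensitivity_specificity(truth, estimate, threshold=0.0))
print(sensitivity_specificity(truth, estimate, threshold=0.05))
```

evaluating the same estimates at several thresholds shows how the trade-off between the two quantities moves with the detection cutoff.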
we selected out of the cell types, i.e., schwann-cell like cell, tm , smooth muscle cell, melanocyte, macrophage and pericyte, from each donor sample and pooled them within that sample to synthesize new bulk samples. the unselected cell types are considered absent in the bulk samples. some of the present cell types are highly similar to absent ones, challenging the programs to pinpoint the right cell type present in the bulk among similar candidates. we provided the full list of single cell types as reference to the programs to estimate the cell type proportions. nnls was excluded from this evaluation due to its low benchmarking performance observed earlier (fig. a, b). consistently across samples, adroit had very accurate estimates for the present cell types, and zero or close-to-zero estimated values for the absent cell types in the synthetic bulk data. music was notably less accurate on the selected cell types, and it had many non-negligible values (> % for out estimates) for the cell types excluded from the synthetic samples (fig. c, supplementary table ). for example, smooth muscle cells accounted for ~ % in donor but were largely missed (~ . %) by music. we noted that tm had false non-zero estimates from both methods though it was not included. this is because tm is easily mistaken for tm due to their high similarity. nonetheless, adroit’s estimates of tm were consistently small across samples (< % for out of estimates), while music had significantly larger estimates of tm that occasionally even exceeded the tm estimates (donors and in fig. c right).
for a systematic comparison, we constructed the receiver operating characteristic (roc) curve by varying the threshold of detection (i.e., a cutoff below which the cell type was deemed undetected) (fig. d). adroit had a significantly higher area under the curve (auc) than music ( . vs. . ), implying better sensitivity and specificity across thresholds. adroit outperforms in deconvoluting closely related subtypes to further evaluate adroit when multiple cell subtypes are present in a complex tissue, we performed a scrna-seq experiment on mouse lumbar dorsal root ganglion (drg) from five mice. following the standard analysis pipeline (methods), we obtained single cells after quality control procedures. after clustering and annotation, we discovered cell types including multiple subtypes of neuronal cells (fig. a, supplementary table ). the heatmap of the top marker genes showed distinct patterns of the major cell types as well as similar patterns of the subtypes (extended data fig. a), and the cell type proportions varied from . % to . % (extended data fig. b). these cell types include subtypes of neurofilament containing neurons (i.e., nf_calb , nf_pvalb, nf_ntrk .necab ), subtypes of non-peptidergic neurons (i.e., np_nts, np_mrgpra , np_mrgprd), and subtypes of peptidergic neurons (i.e., pep _dcn, pep _s a .tagln , pep _slc a .sstr , pep _htr a.sema a, pep _trpm ). also discovered were tyrosine hydroxylase containing neurons (th), satellite glia and endothelial cells. such complex compositions formed a challenging testing ground for evaluating the ability to distinguish closely related cell types. we again did the leave-one-out deconvolution on five synthesized bulk samples.
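the leave-one-out scheme used throughout these benchmarks can be sketched in python as follows. this is a minimal illustration under stated assumptions: each cell is a hypothetical `(donor, cell_type, counts)` tuple, the synthetic bulk is the sum of raw counts over the held-out donor's cells, the ground truth is the fraction of cells per type, and the reference is a simple per-type mean over the remaining donors. the paper's own reference construction (gene selection, dispersion estimates) is described in its methods and is not reproduced here.

```python
from collections import defaultdict


def leave_one_out(cells):
    """Yield (held_out_donor, bulk, truth, reference) for every donor.

    cells: list of (donor, cell_type, counts), counts being per-gene reads.
    """
    donors = sorted({donor for donor, _, _ in cells})
    for held_out in donors:
        test = [c for c in cells if c[0] == held_out]
        train = [c for c in cells if c[0] != held_out]

        # Synthetic bulk: pool (sum) the counts of all held-out cells.
        bulk = [sum(column) for column in zip(*(counts for _, _, counts in test))]

        # Ground truth: fraction of held-out cells belonging to each type.
        n_by_type = defaultdict(int)
        for _, cell_type, _ in test:
            n_by_type[cell_type] += 1
        truth = {ct: n / len(test) for ct, n in n_by_type.items()}

        # Reference: mean expression per cell type from the other donors only.
        grouped = defaultdict(list)
        for _, cell_type, counts in train:
            grouped[cell_type].append(counts)
        reference = {ct: [sum(column) / len(rows) for column in zip(*rows)]
                     for ct, rows in grouped.items()}

        yield held_out, bulk, truth, reference


# Hypothetical toy data: two donors, two cell types, two genes.
cells = [
    ("d1", "alpha", [2, 0]), ("d1", "beta", [0, 4]),
    ("d2", "alpha", [4, 0]), ("d2", "alpha", [2, 2]), ("d2", "beta", [0, 2]),
]
for task in leave_one_out(cells):
    print(task)
```

because the reference for each synthetic sample never sees that donor's cells, a method cannot score well simply by memorising the sample it is asked to deconvolute.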
adroit had highly accurate estimates for all cell types across samples (fig. b). it is worth mentioning that, for the rare cell types that account for less than %, adroit’s estimates were still fairly close to the true proportions and it never missed a single cell type, showing that adroit is very robust on rare cell types. for example, . % endothelial cells were predicted to be . %, and . % nf_ntrk .necab cells were predicted to be . % (supplementary fig. , supplementary table ). on the contrary, music and nnls were notably less accurate, especially for the cell types below %, and missed multiple cell types, including some large cell clusters accounting for ~ % (pep _slc a .sstr cells of sample ). we further examined how much the estimates varied within each individual sample. we computed the metrics to evaluate the performance on each of the synthetic samples and compared them head-to-head among the algorithms. this fine-grained comparison showed that adroit significantly outperformed music and nnls on every sample (fig. c). further, the performance metrics of adroit were highly consistent across samples, with the lowest variability among the three methods.

adroit excels on simulated spatial transcriptomics data

given the promising performance on complex tissues, we continued to test adroit’s applicability to spatial transcriptomics data. spatial transcriptomics data differ from bulk rna-seq data in that each spot only contains transcripts from a handful of cells ( - ). some of the spots contain multiple cells of the same type, while others may have mixtures of cell types at varying mixing percentages (e.g., spatial spots at the boundary of different cell types).
also, because the mixture is a pool of only a few cells, the variations across spatial spots are expected to be greater than in bulk samples. we simulated a large number of spatial spots ( in total) by using sampled cells from the drg single cell data above (methods), then compared adroit with stereoscope over a range of simulation scenarios. we first tested whether the methods could correctly infer a single cell type when the spots contain cells from that same type. for each of the cell types from drg, we sampled cells and pooled them to form a spatial spot. we repeated the simulation times for robust testing, then used the full set of cell types as reference to deconvolute the simulated spots. both methods were able to identify the correct cell types with indistinguishable accuracy on the simulated cell types (i.e., estimates close to ) and comparably low estimated values (i.e., estimates close to zero) for other cell types not included when simulating the spots (extended data fig. ).

we then moved on to a more difficult scenario where we sampled cells from the pep subtypes and mixed them. we created three simulation schemes for a comprehensive evaluation: ) pep subtypes had same percent of . ; ) pep _dcn was . and the other were . ; ) pep _s a .tagln and pep _dcn were . , pep _htr a.sema a and pep _slc a .sstr were . , and pep _trpm was . . again, each simulation scheme was repeated times. under each scheme, the estimates by adroit consistently centered around the true proportions and the other cell types had very low estimated values (close to zero) (fig. a, supplementary table ).
in comparison, though the estimates for the other cell types were also generally close to zero, the estimates of the pep cells by stereoscope systematically deviated from the true proportions in all three simulated schemes, except for pep _s a .tagln . we further expanded the simulated spatial spots to the mixture of np cell types and the mixture of nf cell types. in addition, we sampled np_mrgpra cells and mixed them with other distinct cell types (i.e., th, satellite glia and endothelial), as well as nf_calb cells mixed with other distinct cell types, and pep _trpm mixed with other distinct cell types. for all these simulated spatial spots, adroit’s estimates were consistently centered at the true proportions, whereas stereoscope’s estimates deviated in almost all simulated schemes (extended data fig. , supplementary table ). we speculate that the main reason stereoscope underperformed on these simulated spots is that it normalizes the total umi counts to the same number for all cells. in the real world, a spatial spot is unlikely to be a pool of cells that have the same total rna transcripts sampled, especially when a spot contains different cell types (e.g., immune cells have about -fold less total umis than the neuronal cells or subtypes of neuronal cells). our simulation pooled the sampled cells by adding up the raw umi counts per gene, which we believe best mimics the real data.

next, we asked how sensitive the methods are in detecting rare cell populations. we simulated mixtures of pep subtypes (i.e., pep _slc a .sstr , pep _htr a.sema a, pep _trpm ) with a series of low percentages of pep _trpm (from . to . by . ), with the other two cell types sharing the remaining percentage equally (methods).
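a minimal python sketch of such a spot simulation is given below. it is an illustration rather than the procedure used in the paper: the function name `simulate_spot`, the number of cells per spot and the toy counts are hypothetical, and we assume that the cell type of each pooled cell is drawn at random according to the target fractions. the only property taken from the text is that raw umi counts are added up per gene without per-cell normalization.

```python
import random


def simulate_spot(cells_by_type, fractions, n_cells, rng):
    """Pool raw UMI counts of cells drawn at the given mixing fractions.

    cells_by_type: dict mapping cell type -> list of per-cell count vectors.
    fractions: dict mapping cell type -> target fraction (must sum to 1).
    n_cells: number of cells pooled into one spot (hypothetical parameter).
    rng: a random.Random instance, so that simulations are reproducible.
    Returns (spot_profile, realized_fractions).
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    types = list(fractions)
    # Draw the cell type of each pooled cell according to the target fractions.
    drawn = rng.choices(types, weights=[fractions[t] for t in types], k=n_cells)
    n_genes = len(next(iter(cells_by_type.values()))[0])
    spot = [0] * n_genes
    for cell_type in drawn:
        cell = rng.choice(cells_by_type[cell_type])
        for g, umi in enumerate(cell):
            spot[g] += umi  # raw counts are summed, no per-cell normalization
    realized = {t: drawn.count(t) / n_cells for t in types}
    return spot, realized


# Toy input: one rare and one abundant cell type, two genes.
cells = {"rare": [[5, 0], [4, 1]], "abundant": [[0, 9], [1, 8]]}
spot, realized = simulate_spot(
    cells, {"rare": 0.1, "abundant": 0.9}, n_cells=10, rng=random.Random(0)
)
```

because the draw is random, the realized fractions of a single spot can differ from the targets, which is one reason to repeat the simulation many times at each percentage.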
at each given percent, the simulation was repeated times. we then checked how accurately the percent of pep _trpm cells was estimated. the medians of adroit’s estimates were always close to the true proportions (fig. b, red lines), whereas those of stereoscope’s estimates were largely lower than the true proportions. stereoscope also missed the majority of the pep _trpm cell type when the simulated proportion was below . . this comparison implies that adroit is more advantageous in detecting low percent cells. for a complete comparison, we also simulated other types of cell mixtures in the same way. at each given low percent, we computed how many times out of the low percent cell component was detected (estimates > . ). adroit had systematically higher detection rates, as well as higher consistency across different cell mixtures (fig. c, supplementary table ). notably, at a simulated percent of %, adroit achieved a detection rate of > %, making it a powerful tool for detecting rare cells.

though music was not designed for deconvoluting spatial spots, in theory it can also be applied to spatial transcriptomics data. we thus also compared adroit to music on the same sets of simulation data above. we observed that adroit was also significantly more accurate over all simulation scenarios of spatial spots (fig. a, extended data fig. and , supplementary fig. ), and more sensitive when detecting low percent cells (fig. b, c, supplementary fig. ).

application to real bulk rna-seq data of human pancreatic islets

though using synthetic bulk data based on mixing of single cells is a useful benchmarking strategy, bulk and single cell rna-seq often use distinct rna library preparation and sequencing protocols.
the capability of a method to deconvolute real bulk samples must therefore be assessed to ensure it is useful in real-world applications. we acquired real human pancreatic islets bulk samples from published studies (supplementary table ) and used single cell data of the same tissue as reference to infer the percentages of endocrine cell types (i.e., alpha, beta, delta, pp). the bulk samples were collected from distinct donors, including healthy donors and donors with type diabetes (t d). each donor contributed to replicated bulk rna samples. replicates from the same donor are expected to have similar compositions and thus were used to assess the reproducibility of the estimates from adroit. for all cell types, adroit had highly consistent estimates for the same donors (fig. a, supplementary table ). the average standard deviations did not exceed % for all cell types (i.e., alpha: . ; beta: . ; delta: . ; pp: . ). to seek an independent validation, we obtained cell sorting results by rna-fish for of the donors (supplementary table ). the estimated cell proportions of the were highly consistent with the percentages measured by rna-fish (fig. b), and the consistency held for both the major cells (alpha and beta) and the minor cells (delta and pp). reproducibility and independent validation showed that adroit is reliable in deconvoluting real bulk rna-seq data.

we then asked if adroit can detect known biological differences between healthy and t d donors. loss of functional insulin-producing beta cells is a prominent characteristic of t d, typically reflected by an elevated level of hemoglobin a1c (hba1c).
among the healthy donors, the majority of beta cell proportions estimated by adroit ranged from % to % (fig. c), in agreement with the known percent range of beta cells in human islet tissue. a significant decrease of the estimated beta cell proportions was seen in t d patients (p value = . e- ). further, a linear regression of estimated beta cell proportions on hba1c levels showed a statistically significant negative association (p value = . e- ). adroit adequately reflected the cell composition difference between healthy donors and t d patients.

application to mouse brain spatial transcriptomics

we lastly demonstrated an application to real spatial transcriptomics data. given that the molecular architecture of brain tissue has been well studied, we chose mouse brain spatial transcriptomics data generated by 10x genomics, containing spatial spots (methods). the reference single cell data were acquired from an independent study which contains a comprehensive set of nervous cell types in the brain. we curated the cell types by merging highly similar clusters and arrived at a consolidated set of distinct brain cell types (methods, supplementary table ).

the cell contents inferred by adroit per spot appear to accurately match the expected cell types at that location (extended data fig. , supplementary table ). for example, the three subtypes of cortex excitatory neurons each occupied a sub-area in the cerebral cortex region. as another example, the shape of the hippocampal region was delineated by the estimated percentages of dentate gyrus granule/excitatory neurons.
for an independent validation, we checked the consistency between the estimated cell types and the in-situ hybridization (ish) images from the allen mouse brain atlas. we chose genes highly expressed in brain regions respectively, i.e., spink for hippocampal field ca , c ql for dentate gyrus, clic for choroid plexus, and synpo for thalamus. the spots enriched with the cell types (i.e., hippocampal ca excitatory neuron type , dentate gyrus granule neuron type , choroid plexus cell, thalamus excitatory neuron type ), as mapped by adroit, precisely co-localized with the strong signals of the marker genes on the ish images respectively (fig. d). this agreement confirmed that the spatial mapping of cell types by adroit is reliable.

computational efficiency

besides accuracy and robustness, another major advantage of adroit is its magnitude-higher computational efficiency. adroit uses a two-step procedure to do the inference. the first step prepares the reference from single cell data, where per-gene means and dispersions are estimated and cell type specificity is subsequently computed. the built reference can be saved and reused. we tested the running time of the reference building using the aforementioned mouse brain single cell dataset containing ~ , cells. it took about . minutes on a cpu that has cores ( used for parallel computing). the second step takes the built reference and the target compound data as input and does the estimation. deconvoluting ~ compound rna-seq samples took around minutes. therefore, adroit in total took less than minutes and ~ gb memory usage on a regular cpu. as a comparison, music took about hour and minutes on the same data using the same cpu.
stereoscope ran about hours continuously with the published parameter setting (-scb -sce -topn_genes -ste -lr . -stb -scb ) on a powerful v gpu with cores and g memory, which is prohibitive when a quick turnaround is needed.

discussion

in this work we have demonstrated that adroit is capable of deconvoluting the cell compositions from compound rna-seq data with leading accuracy, measured by the consistency between the true and predicted cell proportions. its advantage over the existing state-of-the-art methods was verified over a wide range of use cases. in particular, adroit excelled in complex tissues composed of more than ten different cell types with a wide range of cell proportions (e.g., trabecular meshwork, dorsal root ganglion). in both cases, adroit performed significantly better than the comparators music and nnls on deconvoluting bulk rna-seq data. adroit is also more accurate and sensitive than stereoscope in deconvoluting spatial transcriptomics spots, especially in detecting low percent cells.

previous benchmarking often assumed that the cell types in the synthetic bulk data exactly match the cell types collected in the reference, and thus the only unknown was the proportion of each cell type. this assumption may not hold. missing existing cell types or falsely predicting non-existing ones can hinder the utility of an algorithm. thus, besides the overall accuracy, we also examined the sensitivity and specificity of the algorithms. we observed superior sensitivity and specificity in adroit, an important advantage for its usage in practice.
the reference single cell data used by adroit came from different platforms, such as the 10x genomics chromium instrument (the mouse dorsal root ganglion), and the fluidigm c1 system (the human pancreatic islets data). adroit consistently exhibited excellent performance across all benchmarking datasets independent of their single cell sequencing technology platforms. more importantly, this statement holds not only for deconvoluting the synthesized bulk data, but also for the real bulk rna-seq data. the latter typically does not apply unique molecular barcoding and requires a significantly different cdna amplification procedure from what is used in single cell rna-seq (methods). besides, the sequencing depth, read mapping and gene expression quantification are dissimilar as well. the fact that adroit accurately dissected the cell compositions in the real bulk samples based on the single cell reference data further supports its cross-platform applicability.

we attribute the power of adroit to its comprehensive modeling of relevant factors. firstly, we think a common rescaling factor is not sufficient to correct the platform difference between single cells and the compound data. rather, the impact of the platform difference differs considerably from gene to gene and is hardly linearly scaled. correcting such differences entails rescaling factors specifically tailored to each gene. adroit uses an adaptive learning approach to estimate such gene-wise correcting factors and does the correction in a unified model. in addition, the
contribution of a gene in a cell type to the loss function is jointly weighted by its specificity and variability in a cell type, where specificity and variability are defined in a way that accounts for the overdispersion property of count data. our observations over the multiple benchmarking datasets also show that the coexistence of similar cell types may have induced a collinearity condition that negatively impacted the regression-based methods developed by others. being able to alleviate this problem gives adroit an edge. all these factors help adroit to distinguish similar cell clusters while remaining sensitive enough to detect rare cell types.

technically, the input profiles of individual cell types to adroit do not necessarily have to come from single cell rna-seq. bulk rna-seq profiles of individual isolated cell types can be used as well. nevertheless, using single cell rna-seq data as the reference has a few key advantages. it is a high throughput approach wherein multiple cell types can be interrogated simultaneously. prior knowledge of the cell types present as well as their specific gene markers is not required, which allows novel cell types to be identified. although detection of lowly expressed genes has been a challenge for single cell rna-seq, significant enhancements have been demonstrated. for example, the number of detectable genes currently can reach an order of , per cell and keeps improving. as adroit focuses on the informative genes whose expression is generally high, the detection limit of single cell rna-seq does not impose a significant drawback. indeed, given the single cell reference profiles, adroit successfully deconvoluted the real bulk rna-seq data and spatial transcriptomics data. the results suggest that, besides enriching our understanding of bulk transcriptome data, adroit can also leverage the vast and continuously growing amount of single cell data.
adroit is a reference-based deconvolution algorithm. a comprehensive collection of the possible cell components is important. however, completeness may not always be guaranteed. even with single cell acquisition that is independent of prior knowledge, rare and/or fragile cell types may not survive the capture procedure and hence are excluded. it is also difficult to generate a solid reference profile for cells that vary from sample to sample (e.g., tumor cells). currently adroit deals implicitly with the components unknown to the reference. if an unknown cell type resembles one of the referenced ones, it may be considered as part of the known cell type and their joint population is predicted. such an outcome is acceptable, as treating two similar cell types as one is still biologically meaningful, although the resolution of the system may be compromised. if the unknown component is dissimilar to all the known ones, it will be ignored by adroit because its representative markers are unlikely to be among the top weighted genes associated with the known components. at the same time, the distinct component is expected to have a unique gene expression pattern and is thus unlikely to interfere significantly with the gene expression from the known cell types. therefore, adroit essentially deconvolutes the relative populations among the known cell components. for example, adroit was able to correctly uncover the populations of endocrine cell types from the human islet bulk data despite the absence of many other cell types such as macrophages, schwann cells and endothelial cells in the input single cell reference.
although the absolute percentages of the cells remain obscure under such a circumstance, we expect that their relative proportions can still be studied and are valuable. a future improvement is to explicitly model the unknown cell types and estimate their percentages from the signals in the compound data that cannot be explained by the contribution from the known components.

methods

gene selection

adroit selects genes that contain information about cell type identity, excluding non-informative genes that potentially introduce noise. there are two ways of selecting such genes: ) the union of the genes whose expression is enriched in one or more cell types in the single cell umi count matrix, referred to as marker genes; ) the union of the genes that vary the most across all the cells in the single cell umi count matrix, referred to as the highly variable genes. for marker genes, we recommend selecting the top ~ genes (p value < . ), ranked by fold change, from each cell type for resolving complex compound transcriptome data. considering that some genes may mark more than one cell type, we further require that selected markers be present in no more than cell types to ensure specificity. we also suggest selecting a minimum of total unique genes for an accurate estimation. if this is not satisfied, one may consider expanding the number of top genes and/or loosening the p value cutoff. adroit also offers the option to use highly variable genes.
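the marker gene route described above can be sketched as follows. this is a hedged illustration, not the adroit implementation: the function name `select_markers`, its parameters `top_n`, `p_cutoff` and `max_types`, and the toy differential expression table are all hypothetical, and the concrete thresholds must be supplied by the user.

```python
from collections import Counter


def select_markers(de_results, top_n, p_cutoff, max_types):
    """Pick top markers per cell type and drop markers shared too widely.

    de_results: dict mapping cell type -> list of (gene, fold_change, p_value).
    top_n: number of top genes kept per cell type, ranked by fold change.
    p_cutoff: genes must have p_value < p_cutoff to be considered.
    max_types: a gene is kept only if it marks at most this many cell types.
    Returns the sorted union of the retained marker genes.
    """
    per_type = {}
    for cell_type, rows in de_results.items():
        passed = [row for row in rows if row[2] < p_cutoff]
        passed.sort(key=lambda row: row[1], reverse=True)
        per_type[cell_type] = [gene for gene, _, _ in passed[:top_n]]
    # Count in how many cell types each gene was picked as a marker.
    usage = Counter(g for genes in per_type.values() for g in set(genes))
    return sorted(g for g, n in usage.items() if n <= max_types)


# Toy differential expression results for two hypothetical cell types.
de_results = {
    "type_a": [("g1", 5.0, 0.001), ("g2", 3.0, 0.001), ("g3", 9.0, 0.5)],
    "type_b": [("g1", 4.0, 0.001), ("g4", 2.0, 0.01)],
}
markers = select_markers(de_results, top_n=2, p_cutoff=0.05, max_types=1)
```

if the returned union is smaller than the desired minimum number of unique genes, the same function can simply be called again with a larger `top_n` or a looser `p_cutoff`.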
to avoid the selected highly variable genes being dominated by large cell clusters while small clusters are underrepresented, adroit first balances the cell types in the single cell umi count matrix by finding the median size among all cell clusters, then sampling cells from each cluster to make them equal to this size. next, adroit computes the variance of each gene across the cells in the balanced single cell umi matrix. due to the well-known dispersion effect in rna-seq data, directly computing variances from the count matrix can result in overestimation. we thus compute variances on data normalized by the variance-stabilizing transformation (vst). genes with top large variances are then selected. in both ways, mitochondrial genes were excluded as their expression does not carry information about cell identity. the results shown in the current paper were based on the marker genes as described above, but we also demonstrated that using the balanced highly variable genes yields comparably accurate estimations (supplementary fig. ).

estimate gene mean and dispersion per cell type

modeling single cell rna-seq data is challenging due to cellular heterogeneity, technical sensitivity, and noise. while the expression of some genes can go undetected by chance, other genes may be found to be highly dispersed. these factors can lead to excessive variability even within the same cell type. adroit combats high noise and computational complexity by building models with the estimated mean and dispersion per cell type. this strategy reduces the data complexity while preserving the cell type specific information.
although typical analyses of rna-seq data start with normalization, adroit does not normalize prior to the mean estimation. performing a normalization across all cell types forces every cell type to have the same amount of rna transcripts, measured by the total unique molecular identifier (umi) counts per cell. however, different cell types can have dramatically different amounts of transcripts. for example, the amount of rna transcripts in neuronal cells is about times that in glial cells. thus, normalization can falsely alter the relative abundance of cell types, misleading the estimation of cell type percentages. to avoid this problem, adroit models the means using the raw umi counts. studies have shown that umi counts follow a negative binomial distribution. we therefore fit negative binomial distributions to the single cells of each cell type and build the model based on the estimated means and dispersions from the selected genes.

more specifically, let $X_{ik}$ be the set of single cell umi counts of gene $i \in 1, \dots, I$ for all cells in cell type $k \in 1, \dots, K$, where $I$ is the number of selected genes and $K$ denotes the number of cell types in the single cell reference. $X_{ik}$ follows a negative binomial distribution,

$$X_{ik} \sim NB(\lambda_{ik}, p_{ik}), \tag{1}$$

where $\lambda_{ik}$ is the dispersion parameter of gene $i$ in cell type $k$, and $p_{ik}$ is the success probability, i.e., the probability of gene $i$ in cell type $k$ getting one umi. the two parameters are estimated by maximum likelihood estimation (mle). the likelihood function is

$$LH(\lambda_{ik}, p_{ik} \mid X_{ik}) = \prod_{c=1}^{n_k} f(X_{ikc} \mid \lambda_{ik}, p_{ik}), \tag{2}$$

where $X_{ikc}$ is the count in cell $c$, $n_k$ is the number of cells in cell type $k$, and $f$ is the probability mass function of the negative binomial distribution.
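to make the likelihood concrete, the sketch below evaluates the negative binomial log-likelihood and maximizes it for one gene in one cell type. it is a stand-in written with the python standard library and not the ‘fitdist’ routine: the function names `nb_loglik` and `fit_nb`, the grid range and the toy counts are assumptions. the parameterization is the one in which the mean equals $\lambda p / (1 - p)$; for a fixed $\lambda$ the likelihood is then maximized in closed form at $p = \bar{x} / (\bar{x} + \lambda)$, so only $\lambda$ needs to be searched.

```python
import math


def nb_loglik(counts, lam, p):
    """Log-likelihood of NB(lam, p) with pmf
    f(x) = Gamma(x + lam) / (Gamma(lam) * x!) * (1 - p)**lam * p**x."""
    return sum(
        math.lgamma(x + lam) - math.lgamma(lam) - math.lgamma(x + 1)
        + lam * math.log(1.0 - p) + x * math.log(p)
        for x in counts
    )


def fit_nb(counts, grid_size=400):
    """Approximate MLE of (lam, p) by a profile-likelihood grid search.

    Assumption: a log-spaced grid for lam between 1e-3 and 1e3 is wide enough.
    """
    mean = sum(counts) / len(counts)
    if mean <= 0:
        raise ValueError("need at least one non-zero count")
    best = None
    for step in range(grid_size):
        lam = 10 ** (-3 + 6 * step / (grid_size - 1))
        # For fixed lam the likelihood is maximized at p = mean / (mean + lam).
        p = mean / (mean + lam)
        loglik = nb_loglik(counts, lam, p)
        if best is None or loglik > best[0]:
            best = (loglik, lam, p)
    return best[1], best[2]


# Toy UMI counts of one gene across the cells of one cell type.
counts = [0, 1, 2, 3, 10, 0, 4, 7]
lam_hat, p_hat = fit_nb(counts)
fitted_mean = lam_hat * p_hat / (1.0 - p_hat)
fitted_var = lam_hat * p_hat / (1.0 - p_hat) ** 2
```

with this parameterization the fitted mean coincides with the sample mean of the raw counts, which is consistent with modeling the means on unnormalized umi counts.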
the mle estimates are then given by

$$(\hat{\lambda}_{ik}, \hat{p}_{ik}) = \underset{\lambda_{ik},\, p_{ik}}{\arg\max}\; LH(\lambda_{ik}, p_{ik} \mid X_{ik}). \tag{3}$$

once the success probability and dispersion are estimated, the mean and variance estimates can be computed numerically according to the properties of the negative binomial distribution,

$$\mu_{ik} = \frac{\hat{\lambda}_{ik} \cdot \hat{p}_{ik}}{1 - \hat{p}_{ik}}, \tag{4}$$

$$\sigma_{ik}^{2} = \frac{\hat{\lambda}_{ik} \cdot \hat{p}_{ik}}{(1 - \hat{p}_{ik})^{2}}. \tag{5}$$

mle has been implemented in many r packages. we chose the ‘fitdist’ function from the ‘fitdistrplus’ package for its fast computation speed and flexibility in selecting distributions. estimations are done for each selected gene in each cell type, resulting in an $I \times K$ matrix of cell type means.

cell type specificity of genes

genes with cell-type specific expression patterns better represent cell types, and are thus more important when used for resolving cell type composition. in line with this property, adroit weights genes with high specificity more than less specific ones. highly specific genes usually have consistently high expression and thus relatively low variance among cells within a cell type. to compute the cell type specificity of a gene, we first identify the cell type in which the gene has the highest expression (i.e., the most specifically expressed cell type), then define the specificity of this gene as the mean-to-variance ratio within that cell type. a high ratio gives the gene a high weight in the model. we use the estimated means and variances from the negative binomial fitting ($\mu_{ik}$ and $\sigma_{ik}^{2}$ in eq. 4 and 5). let $k^{s}$ be the index of the cell type that has the highest mean expression of gene $i$,

$$k^{s} = \underset{k}{\arg\max}\, \{\mu_{ik} \mid k \in 1, \dots, K\}, \tag{6}$$

then the cell type specificity weight for gene $i$, denoted $w_i^{s}$, is given by

$$w_i^{s} = \frac{\mu_{ik^{s}}}{\sigma_{ik^{s}}^{2}}, \tag{7}$$

and it is computed for each gene in the set of selected genes.

cross-sample gene variability

the variability of a gene reflects how stable the gene is across samples. the idea of weighting genes based on variability across samples was first explored by wang et al, where variability was defined as the cross-sample variance. by down-weighting the highly variable genes, the authors achieved a great advantage over the traditional unweighted method. genes with low cross-sample variability better represent the population, and hence are more trustworthy for learning the cell composition. adroit incorporates the same notion to weight the importance of genes, but defines the variability in a more sophisticated way. similar to how we define the cell type specificity, adroit utilizes the mean and variance, and computes the variance-to-mean ratio (vmr) to represent cross-sample gene variability. but here the mean and variance are computed across samples. the vmr is better scaled than the simple variance: it avoids underweighting genes that have low expression, while also avoiding overweighting hugely dispersed genes. in addition, adroit extends the method to fit the case where multiple samples are not available.

we propose three ways to compute the vmr, depending on whether multi-sample data are available. typically, the compound transcriptome data to be deconvolved have multiple samples. in bulk rna-seq data, multiple samples are usually included to control for biological variability. in spatial transcriptome data, the spatial spots can be seen as multiple samples. therefore, we first consider computing the cross-sample gene variability from the compound transcriptome data. in case multiple samples are not available for the compound data, adroit utilizes the single cell reference and synthesizes compound samples by pooling all cells belonging to the same sample. if multiple samples are not available for either data, adroit subsamples single cells and pools them to make pseudo samples. let $Y_i^{j}$ denote the counts of sequences for gene $i$ in sample $j \in 1, \dots, J$, then

$$Y_i^{j} \sim NB(\lambda_i^{j}, p_i^{j}), \tag{8}$$

where $\lambda_i^{j}$ is the dispersion parameter of gene $i$ across samples, and $p_i^{j}$ is the success probability. again, we use mle to get the estimates $\hat{\lambda}_i^{j}$ and $\hat{p}_i^{j}$, following which the cross-sample mean and variance can be numerically computed:

$$\mu_i^{j} = \frac{\hat{\lambda}_i^{j} \cdot \hat{p}_i^{j}}{1 - \hat{p}_i^{j}}, \tag{9}$$

$$(\sigma_i^{2})^{j} = \frac{\hat{\lambda}_i^{j} \cdot \hat{p}_i^{j}}{(1 - \hat{p}_i^{j})^{2}}, \tag{10}$$

and the cross-sample variability for gene $i$ is then defined as

$$VMR_i^{j} = \frac{(\sigma_i^{2})^{j}}{\mu_i^{j}} = \frac{1}{w_i^{j}}, \tag{11}$$

where $w_i^{j}$ is the cross-sample variability weight that is later used in the model. it is computed for each gene in the set of selected genes.

gene-wise scaling factor to correct platform bias

when linking the compound data to the single cell data, a rescaling factor is often used to account for the library size and platform difference. the existing methods adopt a single rescaling factor for each sample, i.e., all genes of a single sample are multiplied by the same factor. this operation is based on the strong assumption that the impact of the platform difference on every gene is the same and linearly scaled among different cell types, which is hardly true.
in addition, because estimates from a linear model are easily affected by outliers, the estimated cell proportions can be steered away from the truth by genes with extremely high expression. therefore, applying a uniform scaling factor to all genes is inappropriate. to overcome this problem, adroit instead estimates gene-wise scaling factors via an adaptive learning strategy and rescales each gene with its own scaling factor.

to proceed, we first take the mean gene expression from the compound samples (the cross-sample mean $\mu_i$) and the estimated mean of each cell type from the single cell data ($\mu_{ik}$), then apply a traditional non-negative least squares (nnls) regression to get a rough estimate of the proportion of each cell type, denoted $\tau_k$. for each gene, a predicted mean expression ($\sum_{k=1}^{K} \hat{\tau}_k \mu_{ik}$) is computed as the weighted sum of the cell type means, wherein the weights are the roughly estimated proportions. the regression equation is given by

$$\mu_i = A \cdot \Big( \sum_{k=1}^{K} \tau_k \mu_{ik} + \varepsilon \Big), \quad 0 < \tau_k, \quad \sum_{k=1}^{K} \tau_k = 1,$$

where $A$ is a constant that ensures the $\tau_k$'s sum to 1 and $\varepsilon$ is the error term. we use the ‘nnls’ function in the ‘nnls’ package to estimate the $\tau_k$'s. next, we calculate the ratio between the mean expression from the compound samples and the predicted mean, and define the gene-wise rescaling factor as the logarithm of the ratio plus 1,

$$r_i = \log \Big( \frac{\mu_i}{\sum_{k=1}^{K} \hat{\tau}_k \mu_{ik}} + 1 \Big).$$

given the dispersion property of count data, the logarithm of the ratio is a more appropriate statistic, as it results in relatively stable scaling factors. adding 1 avoids taking the logarithm of zero. after multiplying by the flexible gene-wise rescaling factor, the “outlier” genes will be
pushed toward the true regression line, while the genes around the true regression line are less affected (fig. b).

weighted and regularized model

we next designed a model that incorporates all these factors to perform the actual estimation of cell type proportions. adroit builds upon the non-negative least squares regression model. it gives high weights to genes with high cell type specificity and low cross-sample variability. this is done by optimizing a weighted sum-of-squares loss function $L$, where the weights consist of two components (the cross-sample variability weight $w_i^{V}$ and the cell type specificity weight $w_i^{S}$). the gene-wise scaling factor tailored to each gene ($r_i$) effectively corrects the bias due to the technology difference between the compound samples and the single cell data. in complex tissues (e.g., neural tissues) where many highly similar subtypes are common, closely related subtypes can have strong collinearity, leading to overestimation of some cell types and underestimation or omission of others. adroit handles this problem by including an l2 norm of the estimates as the regularization component. denote $\beta_k$ as the unscaled coefficient for cell type k. for a compound transcriptome sample j, the loss function is given by

$$L(\beta_1,\dots,\beta_K \mid y_{ij}, w_i^{V}, w_i^{S}, r_i, \hat{\mu}_{ik}) = \sum_{i} w_i^{V} \cdot w_i^{S} \cdot \Big( y_{ij} - r_i \cdot \sum_{k=1}^{K} \beta_k \hat{\mu}_{ik} \Big)^2 + \sum_{k=1}^{K} \beta_k^2,$$

where the first sum runs over the selected genes i. the coefficients $\beta_k$ are then estimated by minimizing the loss function under the constraint $\beta_1,\dots,\beta_K \ge 0$,

$$\hat{\beta}_1,\dots,\hat{\beta}_K = \underset{\beta_1,\dots,\beta_K \ge 0}{\arg\min}\; L.$$

the estimation is done by the gradient projection method of byrd et al. we derive the gradient function by taking the partial derivative of the loss function with respect to $\beta_k$,

$$G_k = \nabla_{\beta_k} L = -2 \sum_{i} r_i \cdot \hat{\mu}_{ik} \cdot w_i^{V} \cdot w_i^{S} \cdot \Big( y_{ij} - r_i \cdot \sum_{k'=1}^{K} \beta_{k'} \hat{\mu}_{ik'} \Big) + 2\beta_k.$$
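the loss and gradient above can be illustrated with a minimal pure-python sketch for a single compound sample. this is not adroit's implementation: adroit calls r's ‘optim’, whereas the sketch swaps in a plain projected gradient descent (take a gradient step, then clip negative coefficients to zero). the function names, the argument names `w_v` and `w_s`, and the `step` and `n_iter` defaults are all hypothetical choices made for illustration.

```python
def loss(beta, y, mu, w_v, w_s, r):
    """Weighted squared loss plus L2 penalty for one compound sample.

    beta: K coefficients; y: observed counts for I genes;
    mu: I x K matrix of cell type means; w_v, w_s: per-gene weights;
    r: per-gene rescaling factors. All names are illustrative.
    """
    total = 0.0
    for i in range(len(y)):
        fitted = r[i] * sum(b * m for b, m in zip(beta, mu[i]))
        total += w_v[i] * w_s[i] * (y[i] - fitted) ** 2
    return total + sum(b * b for b in beta)


def gradient(beta, y, mu, w_v, w_s, r):
    """Analytic partial derivatives of `loss` with respect to each beta_k."""
    grad = [2.0 * b for b in beta]  # derivative of the L2 penalty
    for i in range(len(y)):
        resid = y[i] - r[i] * sum(b * m for b, m in zip(beta, mu[i]))
        for k in range(len(beta)):
            grad[k] -= 2.0 * r[i] * mu[i][k] * w_v[i] * w_s[i] * resid
    return grad


def fit_proportions(y, mu, w_v, w_s, r, step=1e-3, n_iter=5000):
    """Projected gradient descent (assumed stand-in for optim)."""
    k = len(mu[0])
    beta = [1.0 / k] * k  # start from equal coefficients
    for _ in range(n_iter):
        g = gradient(beta, y, mu, w_v, w_s, r)
        # gradient step followed by projection onto beta_k >= 0
        beta = [max(0.0, b - step * gk) for b, gk in zip(beta, g)]
    total = sum(beta)
    # rescale the coefficients so that the proportions sum to 1
    return [b / total for b in beta] if total > 0 else beta
```

with a fixed step size the iteration only converges when the step is small relative to the scale of the data, so real count data would need a tuned or adaptive step; a quasi-newton solver with box constraints avoids that tuning.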
adroit uses the function ‘optim’ from the r package ‘stats’ to do the estimation, providing the loss function and the gradient given above. to get the final estimates of cell type proportions, we rescale the coefficients $\hat{\beta}_k$ so that they sum to 1,

$$\theta_k = \frac{\hat{\beta}_k}{\sum_{k=1}^{K} \hat{\beta}_k}.$$

each compound sample j is estimated independently by the model described above.

simulation of bulk rna-seq and spatial transcriptomics data

bulk rna-seq data used for benchmarking are synthesized by adding up the raw umi reads per gene from all single cells of a sample, regardless of cell type. denote $t_k$ as a cell in cell type k, with $t_k \in 1,\dots,T_k$, where $T_k$ is the number of cells in cell type k. let $Y_{ij}^{\mathrm{bulk}}$ be the read count of gene i in a synthesized bulk sample j, and $X_{i t_k}$ be the umi count of the gene in cell $t_k$, then

$$Y_{ij}^{\mathrm{bulk}} = \sum_{k=1}^{K} \sum_{t_k=1}^{T_k} X_{i t_k}.$$

the true proportion of cell type k is given by

$$\theta_k^{0} = \frac{T_k}{\sum_{k=1}^{K} T_k}.$$

to simulate spatial transcriptomic spots, we first sample cells without replacement from each cell type and add them up, then mix them with designed proportions. for example, to simulate a spot with $p_k$ percent of cell type k, the read count $Y_{ij}^{\mathrm{spot}}$ of gene i in a spatial spot j is given by

$$Y_{ij}^{\mathrm{spot}} = \sum_{k=1}^{K} p_k \sum_{n} X_{ikn},$$

where $X_{ikn}$ is the umi count of gene i in a sampled cell n of cell type k. for each mixing scheme, the simulation is repeated times.

evaluation statistics

we compared the estimated cell type proportions with the ground truth by calculating statistics. the mad and rmsd are given by

$$mAD = \frac{\sum_{k=1}^{K} |\theta_k - \theta_k^{0}|}{K}, \qquad RMSD = \sqrt{\frac{\sum_{k=1}^{K} (\theta_k - \theta_k^{0})^2}{K}}.$$

pearson correlation coefficient is computed as

$$\rho_P = \frac{\sum_{k=1}^{K} (\theta_k - \bar{\theta})(\theta_k^{0} - \bar{\theta}^{0})}{\sqrt{\sum_{k=1}^{K} (\theta_k - \bar{\theta})^2} \sqrt{\sum_{k=1}^{K} (\theta_k^{0} - \bar{\theta}^{0})^2}},$$

where $\bar{\theta}$ and $\bar{\theta}^{0}$ are the means of the estimated proportions and true proportions, respectively. spearman correlation coefficient is given by

$$\rho_S = \frac{\sum_{k=1}^{K} (r_k - \bar{r})(r_k^{0} - \bar{r}^{0})}{\sqrt{\sum_{k=1}^{K} (r_k - \bar{r})^2} \sqrt{\sum_{k=1}^{K} (r_k^{0} - \bar{r}^{0})^2}},$$

where $r_k$ is the rank of $\theta_k$ and $r_k^{0}$ is the rank of $\theta_k^{0}$.

single cell rna sequencing of mouse dorsal root ganglion

as described previously, lumbar drgs were isolated from adult c bl/ mice and transferred to a dissociation buffer (dulbecco's modified eagle's medium supplemented with % heat-inactivated fetal calf serum) (gibco; cat # a - ). to generate a single cell suspension, drgs were subjected to a step-enzymatic dissociation followed by a mechanical dissociation. in brief, drgs were first incubated with . % collagenase p from clostridium histolyticum (roche applied science; cat # ) for minutes in an eppendorf thermomixer c ( °c; intermittent rpm shaking for about sec every minutes). then, drgs were transferred to a hank's balanced salt solution (hbss, mg + and ca + free; invitrogen) supplemented with . % trypsin (worthington biochemical corp.; cat # lsoo ) and . % edta and incubated for minutes at °c in the eppendorf thermomixer c. trypsin was neutralized by the addition of . mg/ml mgso (sigma; cat #m- ) and drgs were triturated with pasteur pipettes. the resulting cell suspension was passed through a µm mesh filter to remove remaining chunks of tissue and centrifuged for minutes at rpm at room temperature. the pellet was resuspended in hbss (ca +, mg + free; invitrogen) and the cell suspension was run on a % percoll plus gradient (sigma ge - - ) to further remove debris. finally, cells were resuspended in pbs supplemented with . % bsa at a concentration of cells/µl and cell viability was determined using the automated cell analyzer nucleocounter® nc- ™. the suspended single cells were loaded on a chromium single cell instrument ( x genomics) with about cells per lane to minimize the presence of doublets. - cells per lane were recovered. rna-seq libraries were constructed using chromium single cell ’ library, gel beads & multiplex kit ( x genomics). single end sequencing was performed on illumina nextseq . read starts with a -bp umi and cell barcode, followed by an -bp i sample index. read contains a -bp transcript read. sample de-multiplexing, alignment, filtering, and umi counting were conducted using the cell ranger single-cell software suite ( x genomics, v . . ). the mouse mm genome assembly and ucsc gene model were used for the alignment.

data preprocessing

drg single cell data

the umi data output from the cell ranger single-cell software suite ( x genomics, v . . ) was analyzed using the seurat package to assess the cell quality and identify cell types, similar to what was described previously. cells with the number of detected genes less than or over , or with a umi ratio of mitochondria encoded genes versus all genes over . were removed. the umi data was normalized by the ‘normalizedata’ method in seurat with default settings. to avoid potential sample-to-sample variation caused by technical variation at various experiment steps, we employed the seurat data integration method. the top variable genes of each of the samples were identified using ‘findvariablefeatures’ with selection.method=‘vst’. based on the union of these variable genes, the anchor cells in each sample were identified by ‘findintegrationanchors’.
all the samples were then integrated by ‘integratedata’. we subsequently scaled the integrated data (‘scaledata’) and performed dimension reduction (‘runpca’). cells were then clustered based on the first principal components by applying ‘findneighbors’ and ‘findclusters’ (resolution= . , algorithm= ). marker genes for each cluster were identified using ‘findallmarkers’. parameters were set such that these genes were expressed in at least % of the cells in the cluster, and were on average -fold higher than in the rest of the cells, with a multiple-testing adjusted wilcoxon test p value of less than . . the specificity of the canonical cell type-specific genes or cell cluster-specific genes was further examined by visualization (extended data fig. ) and used to define the cell type for each cluster. at the end, the original umi data from genes and cells that passed the quality control were organized into a matrix (genes as rows and cell identifiers as columns). this matrix, together with the cell type label for each cell therein, was loaded into adroit as reference profiles.

mouse brain single cell data

the scrna-seq reference data of the mouse brain were obtained from zeisel et al. among all the available data, we only retained , cells that were acquired from the brain regions, had a cell type assigned by the authors and had a minimal total umi of . these cells corresponded to clusters at the finest taxonomy level in the original study. as many of the clusters are highly similar, we decided to merge some of them to simplify the reference landscape.
first, the top cluster-enriched markers were derived using scanpy via the ‘rank_genes_groups’ function (method=‘wilcoxon’), following normalization (‘normalize_per_cell’), log transformation (‘log1p’) and regressing out (‘regress_out’) the variance associated with the total umi and the percentage of mitochondrial chromosome encoded genes per cell. then, the pair-wise overlap p-values among the clusters were calculated using the top marker genes, assuming a hypergeometric null distribution. last, clusters with overlap p-values more significant than e- were merged, and new names were assigned by jointly considering the original annotation, the molecular features and the specificity to certain brain regions. a total of cell types were determined that cover all the brain regions and their important substructures (supplementary table ). to make the reference dataset more manageable in size and more balanced in the representation of cell types, we down-sampled each cluster to no more than cells. a final set of , cells over cell types was used for the deconvolution of the mouse brain spatial transcriptome data.

human islets

we used the high quality human islets single cell data and annotation from xin et al. the rpkm expression table was directly downloaded and used as is. the rna-fish data were also from this study. for the real bulk human pancreatic islets data, the read count tables were deconvolved. only data from donors with hba c level available were included in the regression of beta cell proportion on hba c level (fig. c, supplementary table ).
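returning to the cluster merging step used for the mouse brain reference, the pair-wise overlap test between the marker lists of two clusters can be sketched as below. this is an illustrative assumption of how such a test is commonly set up, not the authors' code: it computes an upper-tail hypergeometric p-value, and the function name `overlap_pvalue` and the choice of `n_genes` (the size of the gene universe the markers were drawn from) are hypothetical.

```python
from math import comb


def overlap_pvalue(markers_a, markers_b, n_genes):
    """Upper-tail hypergeometric p-value for the overlap of two marker sets.

    Treats the markers of cluster A as "successes" in a universe of
    n_genes genes and the markers of cluster B as a random draw.
    """
    a, b = set(markers_a), set(markers_b)
    observed = len(a & b)
    total = comb(n_genes, len(b))
    # P(X >= observed): sum the probabilities of every overlap size
    # at least as large as the one observed
    tail = sum(
        comb(len(a), x) * comb(n_genes - len(a), len(b) - x)
        for x in range(observed, min(len(a), len(b)) + 1)
    )
    return tail / total
```

two clusters would become merge candidates when this p-value falls below the chosen significance cutoff; exact integer arithmetic via `math.comb` keeps the tail sum stable even when the p-value is very small.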
trabecular meshwork

we downloaded the raw sequence data and followed the same analysis procedure as in patel et al for quality control and cell type identification.

mouse brain spatial transcriptomics data by x visium platform

the filtered cell matrix, tissue image and spatial coordinates of a coronal section of an adult c bl/ mouse brain were available for download from x genomics and were used as is.

mouse brain ish images

the ish images were directly downloaded from the allen mouse brain atlas by searching the gene names. the images were used without further editing except for cropping.

data availability

drg single cell data are deposited at ncbi geo (accession number: gse ). the bulk rna-seq and rna-fish data for human pancreatic islets were initially published as aggregated data, and the data processing and experimental procedures were described therein. we acquired the individual sample data from the authors and release them along with the current study (supplementary table and supplementary table ). the other public data analyzed in this study are available from: geo (human pancreatic islets single cell data: gse ); ncbi (human trabecular meshwork single cell data: prjna ; mouse brain single cell data: srp ). mouse brain spatial transcriptomic data were downloaded from the x genomics website (https://support. xgenomics.com/spatial-gene-expression/datasets/ . . /v _adult_mouse_brain_coronal_section).

code availability

adroit’s source code is available on github (https://github.com/taoyang-dev/adroit).

software

the statistical analyses were done with r statistical software (v . . ) and python (v . . ). the packages used include seurat (v . . ), scanpy (v . . ), dplyr (v . . .
), doparallel (v . . ), data.table (v . . ), fitdistrplus (v . - ), nnls (v . ).

references

wang, z., gerstein, m. & snyder, m. rna-seq: a revolutionary tool for transcriptomics. nature reviews genetics ( ) doi: . /nrg .
chu, g. c., kimmelman, a. c., hezel, a. f. & depinho, r. a. stromal biology of pancreatic cancer. journal of cellular biochemistry ( ) doi: . /jcb. .
bussard, k. m., mutkus, l., stumpf, k., gomez-manzano, c. & marini, f. c. tumor-associated stromal cells as key contributors to the tumor microenvironment. breast cancer research ( ) doi: . /s - - - .
munn, d. h. & bronte, v. immune suppressive mechanisms in the tumor microenvironment. current opinion in immunology ( ) doi: . /j.coi. . . .
gonzalez, h., hagerling, c. & werb, z. roles of the immune system in cancer: from tumor initiation to metastatic progression. genes and development ( ) doi: . /gad. . .
garner, h. & de visser, k. e. immune crosstalk in cancer progression and metastatic spread: a complex conversation. nature reviews immunology ( ) doi: . /s - - -z.
singh, u. p. et al. chemokine and cytokine levels in inflammatory bowel disease patients. cytokine ( ) doi: . /j.cyto. . . .
van lint, p. & libert, c. chemokine and cytokine processing by matrix metalloproteinases and its effect on leukocyte migration and inflammation. j. leukoc. biol. ( ) doi: . /jlb. .
zelová, h. & hošek, j. tnf-α signalling and inflammation: interactions between old
acquaintances. inflammation research ( ) doi: . /s - - - .
koelman, l., pivovarova-ramich, o., pfeiffer, a. f. h., grune, t. & aleksandrova, k. cytokines for evaluation of chronic inflammatory status in ageing research: reliability and phenotypic characterisation. immun. ageing ( ) doi: . /s - - - .
landskron, g., de la fuente, m., thuwajit, p., thuwajit, c. & hermoso, m. a. chronic inflammation and cytokines in the tumor microenvironment. journal of immunology research ( ) doi: . / / .
ståhl, p. l. et al. visualization and analysis of gene expression in tissue sections by spatial transcriptomics. science ( ) doi: . /science.aaf .
vickovic, s. et al. high-definition spatial transcriptomics for in situ tissue profiling. nat. methods ( ) doi: . /s - - -y.
tang, f. et al. mrna-seq whole-transcriptome analysis of a single cell. nat. methods ( ) doi: . /nmeth. .
denisenko, e. et al. systematic assessment of tissue dissociation and storage biases in single-cell and single-nucleus rna-seq workflows. genome biol. ( ) doi: . /s - - - .
nguyen, q. h., pervolarakis, n., nee, k. & kessenbrock, k. experimental considerations for single-cell rna sequencing approaches. frontiers in cell and developmental biology ( ) doi: . /fcell. . .
tanay, a. & regev, a. scaling single-cell genomics from phenomenology to mechanism. nature ( ) doi: . /nature .
abbas, a. r., wolslegel, k., seshasayee, d., modrusan, z. & clark, h. f. deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. plos one ( ) doi: . /journal.pone. .
newman, a. m. et al. robust enumeration of cell subsets from tissue expression profiles. nat. methods ( ) doi: . /nmeth. .
baron, m.
et al. a single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. cell syst. ( ) doi: . /j.cels. . . .
tsoucas, d. et al. accurate estimation of cell-type composition from gene expression data. nat. commun. ( ) doi: . /s - - -z.
wang, x., park, j., susztak, k., zhang, n. r. & li, m. bulk tissue cell type deconvolution with multi-subject single-cell expression reference. nat. commun. ( ) doi: . /s - - -x.
andersson, a. et al. single-cell and spatial transcriptomics enables probabilistic inference of cell type topography. commun. biol. , ( ).
newman, a. m. et al. determining cell type abundance and expression from bulk tissues with digital cytometry. nat. biotechnol. ( ) doi: . /s - - - .
myung, i. j. tutorial on maximum likelihood estimation. j. math. psychol. ( ) doi: . /s - ( ) - .
bassett, r. & deride, j. maximum a posteriori estimators as a limit of bayes estimators. math. program. ( ) doi: . /s - - - .
zhao, y. & simon, r. gene expression deconvolution in clinical samples. genome medicine ( ) doi: . /gm .
chiu, y. j., hsieh, y. h. & huang, y. h. improved cell composition deconvolution method of bulk gene expression profiles to quantify subsets of immune cells. bmc med. genomics ( ) doi: . /s - - - .
kang, k. et al. cdseq: a novel complete deconvolution method for dissecting heterogeneous samples using gene expression data. plos comput. biol. ( ) doi: . /journal.pcbi. .
qiao, w. et al. pert: a method for expression deconvolution of human blood samples from varied microenvironmental and developmental conditions. plos comput. biol. ( ) doi: . /journal.pcbi. .
zaitsev, k., bambouskova, m., swain, a. & artyomov, m. n.
complete deconvolution of cellular mixtures based on linearity of transcriptional signatures. nat. commun. ( ) doi: . /s - - - .
zeisel, a. et al. molecular architecture of the mouse nervous system. cell ( ) doi: . /j.cell. . . .
donovan, m. k. r., d’antonio-chronowska, a., d’antonio, m. & frazer, k. a. cellular deconvolution of gtex tissues powers discovery of disease and cell-type associated regulatory variants. nat. commun. ( ) doi: . /s - - - .
phipson, b., zappia, l. & oshlack, a. gene length and detection bias in single cell rna sequencing protocols. f research ( ) doi: . /f research. . .
chen, g., ning, b. & shi, t. single-cell rna-seq technologies and related computational data analysis. frontiers in genetics ( ) doi: . /fgene. . .
chen, d. & plemmons, r. j. nonnegativity constraints in numerical analysis. in the birth of numerical analysis ( ). doi: . / _ .
lein, e. s. et al. genome-wide atlas of gene expression in the adult mouse brain. nature ( ) doi: . /nature .
xin, y. et al. rna sequencing of single human islet cells reveals type diabetes genes. cell metab. ( ) doi: . /j.cmet. . . .
patel, g. et al. molecular taxonomy of human ocular outflow tissues defined by single-cell transcriptomics. proc. natl. acad. sci. , lp – ( ).
xin, y. et al. pseudotime ordering of single human b-cells reveals states of insulin production and unfolded protein response. diabetes ( ) doi: . /db - .
gutierrez, g. d. et al. gene signature of proliferating human pancreatic a cells. endocrinology ( ) doi: . /en. - .
cerf, m. e. beta cell dysfunction and insulin resistance. frontiers in endocrinology ( ) doi: . /fendo. . .
maedler, k. & donath, m. y.
beta-cells in type diabetes: a loss of function and mass. hormone research ( ).
donath, m. y. et al. mechanisms of β-cell death in type diabetes. diabetes ( ) doi: . /diabetes. .suppl_ .s .
calanna, s. et al. alpha- and beta-cell abnormalities in haemoglobin a c-defined prediabetes and type diabetes. acta diabetol. ( ) doi: . /s - - - .
kanat, m. et al. the relationship between β-cell function and glycated hemoglobin. diabetes care , lp – ( ).
nepton, s. beta-cell function and failure. in type diabetes ( ). doi: . / .
dolenšek, j., rupnik, m. s. & stožer, a. structural similarities and differences between the human and the mouse pancreas. islets ( ) doi: . / . . .
lein, e. s. et al. genome-wide atlas of gene expression in the adult mouse brain. nature , – ( ).
vieth, b., parekh, s., ziegenhain, c., enard, w. & hellmann, i. a systematic evaluation of single cell rna-seq analysis pipelines. nat. commun. ( ) doi: . /s - - - .
anders, s. & huber, w. differential expression analysis for sequence count data. genome biol. ( ) doi: . /gb- - - -r .
hafemeister, c. & satija, r. normalization and variance stabilization of single-cell rna-seq data using regularized negative binomial regression. genome biol. ( ) doi: . /s - - - .
svensson, v. droplet scrna-seq is not zero-inflated. nature biotechnology ( ) doi: . /s - - - .
delignette-muller, m. l. & dutang, c. fitdistrplus: an r package for fitting distributions. j. stat. softw. ( ) doi: . /jss.v .i .
mullen, katharine m., i. h. m. van s. nnls: the lawson-hanson algorithm for non-negative least squares (nnls). r packag. version . ( ).
byrd, r. h., lu, p., nocedal, j. & zhu, c.
a limited memory algorithm for bound constrained optimization. siam j. sci. comput. ( ) doi: . / .
the r core team. r: a language and environment for statistical computing. r foundation for statistical computing ( ).
alessandri-haber, n. et al. hypotonicity induces trpv -mediated nociception in rat. neuron ( ) doi: . /s - ( ) - .
zheng, g. x. y. et al. massively parallel digital transcriptional profiling of single cells. nat. commun. ( ) doi: . /ncomms .
stuart, t. et al. comprehensive integration of single-cell data. cell ( ) doi: . /j.cell. . . .
wolf, f. a., angerer, p. & theis, f. j. scanpy: large-scale single-cell gene expression data analysis. genome biol. ( ) doi: . /s - - - .
van rossum, g. & drake, f. l. python reference manual. scotts valley, ca ( ).
wickham, h. & francois, r. dplyr: a grammar of data manipulation. r packag. version . . . ( ).
weston, s., calaway, r. & tenenbaum, d. doparallel: foreach parallel adaptor for the parallel package. cran ( ).
dowle, m. & srinivasan, a. data.table: extension of ‘data.frame’. r package version . . . manual ( ).

acknowledgements

we thank yurong xin for pointing us to the relevant public data resource. we also thank gabor halasz and yuan zhu for their advice on algorithm design.

author contributions

t.y., y.b., w.f., n.a.-h., m.l.-f., l.e.m. and g.s.a. designed the research. t.y., y.b., and w.f. developed the algorithm. t.y., y.b., w.f. and j.k.
participated in the data analysis. m.s. and r.b. performed the drg tissue collection. c.a. performed the single cell library preparation and sequencing experiment. t.y., y.b., n.a.-h. and g.s.a. wrote the manuscript.

competing interests

t.y., y.b., w.f. and g.s.a. have filed a patent application relating to the adroit computational framework. m.l.-f. is an employee of cellular longevity. all other authors are employees and shareholders of regeneron pharmaceuticals, although the manuscript’s subject matter does not have any relationship to any products or services of this corporation.

figure legends

fig. : schematic representation of the adroit computational framework. a, adroit takes as input bulk or spatial rna-seq data, single cell rna-seq data and cell type annotations. it first selects informative genes and estimates their means and dispersions, based on which the cell type specificity of genes is computed. depending on multi-sample availability, cross-sample gene variability is estimated from the compound data or from the single cell samples (dashed arrow). lastly, the gene-wise scaling factors are estimated using both the compound data and the single cell data. these computed quantities are fed to a weighted regularized model to infer the transcriptome composition. b, a mock example to illustrate the role of the gene-wise scaling factor. ideally, an accurate estimate of the slope (i.e., cell proportion) would be the slope of the green line; however, direct fitting would result in the red line due to the impact of the outlier genes. outlier genes can be induced by platform differences affecting genes differently.
adroit adopts an adaptive learning approach that first learns a rough estimate of the slope (red line), then moves the outlier genes toward it such that the more deviated genes are moved more toward the true line (i.e., longer arrows). after the adjustment, the new estimated slope (blue line) is closer to the truth (green line), and is thus a more accurate estimate.

fig. : benchmark on simulated bulk data synthesized from trabecular meshwork (tm) single cell data. a, adroit’s estimates are the closest to the true cell proportions compared with music and nnls. each dot is a cell type from one donor. b, for each cell type in tm, adroit has the smallest differences from the true cell type proportion and the smallest variance of estimates across the donors. for each cell type, a dot on the graph denotes a donor, and the bars represent the . × interquartile ranges. estimation was done using the single cell data as reference, leaving out the donor used for synthesizing the bulk. c, adroit’s estimates are more accurate and specific than music’s estimates on synthetic bulk that contains only part of the cell types. the synthetic bulk was simulated using only out of the cell types per donor, then estimated with the reference of cell types. adroit has notably fewer false positive estimates of the cell types not included, and more accurate estimates of the cell types used for synthesizing the bulk. d, the receiver operating characteristic (roc) curve shows that adroit has a significantly higher auc than music ( . vs . ), meaning better sensitivity and specificity.

fig. : benchmark on scrna-seq data from dorsal root ganglion (drg), where there exist many closely related subtypes of neuronal cells. a, cell types were identified from scrna-seq samples of mice, including multiple subtypes of neurofilament (nf), peptidergic (pep) and non-peptidergic (np) neurons. b, benchmarking with the synthetic data shows that adroit’s estimates of cell type proportions are highly accurate. in particular, adroit achieves reasonably high accuracy when the cells are rare (e.g., < %). each dot represents a cell type from one sample. c, for each individual sample, mad, rmsd, pearson and spearman correlations were computed and compared across three methods. adroit has the lowest mad and rmsd, and the highest pearson and spearman correlations. in addition, adroit’s estimation is also the most stable across samples. each dot on the boxplot is a sample. estimation was done using the single cell reference, leaving out the sample used for synthesizing the bulk.

fig. : adroit is more accurate and sensitive than stereoscope on spatial spots simulated from real drg cells. a, adroit and stereoscope estimates on simulated spatial spots that contain pep neuron subtypes. true mixing proportions are denoted by the red dashed lines. three schemes were simulated: ) the proportions of pep cell types are the same and equal to . ; ) pep _dcn is . and the other are . ; ) pep _dcn and pep _s a .tagln are . , pep _slc a .sstr and pep _htr a.sema a . are . , and pep _trpm is . . in all simulation schemes, adroit’s estimates are more consistently centered around the true proportions than stereoscope’s estimates. b, adroit is more accurate in estimating rare cells in spatial spots. the spots were simulated as mixtures of pep cell types (i.e., pep _slc a .sstr , pep _htr a.sema a and pep _trpm ), with a series of low percentages of the pep _trpm cell type from % to % and the other two cell types sharing the remaining proportion equally. adroit’s estimates are systematically closer to the true simulated proportions than stereoscope’s estimates. c, adroit is consistently more sensitive than stereoscope in detecting low percent cells (estimates > . % deemed as detected) in simulated spots of ) low percent of nf_calb mixed with nf_pvalb and nf _ntrk .necab , ) low percent of np_mrgpra mixed with np_mrgprd and np_nts, ) low percent of pep _trpm mixed with pep _slc a .sstr and pep _htr a.sema a, ) low percent of nf_calb mixed with th, satellite glia and endothelial, ) low percent of np_mrgpra mixed with th, satellite glia and endothelial, and ) low percent of pep_trpm mixed with th, satellite glia and endothelial.

fig. : applications to real bulk human islets rna-seq data and mouse brain spatial transcriptome data. a, adroit’s estimates on real human islets bulk rna-seq data were highly reproducible for the repeated samples from the same donor. b, adroit’s estimated cell type proportions agreed with the rna-fish measurements. c, adroit’s estimated beta cell proportions in type diabetes patients are significantly lower than those in healthy subjects. in addition, the estimated proportions have a significant negative linear association with donors’ hba c level. d, the spatial mapping of mouse brain cell types is consistent with the ish images of the respective marker genes from the allen mouse brain atlas. the genes spink (marker of hippocampal field ca ), c ql (marker of dentate gyrus), clic (marker of choroid plexus) and synpo (marker of thalamus) were identified as markers of the corresponding tissues by zeisel et al.

extended data fig. : benchmark of three methods on human pancreatic islets data. a, human islets single cell data contain cell types from subjects, including two major cell types, alpha and beta cells, and two minor cell types, pp and delta cells. the cell proportions vary across subjects. b, c, adroit achieves leading accuracy when applied to the bulk data synthesized from the single cell data. each dot on the scatterplot is a cell type from one subject. estimation was done using the single cell reference, leaving out the subject used to synthesize the bulk.

extended data fig. : dorsal root ganglion single cell data show cell types including subtypes of neurofilament, subtypes of non-peptidergic neurons, and subtypes of peptidergic neurons. a, heatmap of top markers shows the distinction between cell types as well as the similarity between subtypes. b, the proportion of each cell type varies from . % to . % across different samples.

extended data fig. : comparison of the performance on simulated spatial spots of a pure cell type. a, estimates by adroit and b, estimates by stereoscope are comparably accurate. simulations were done by sampling cells from the same cell type and adding up the read counts per gene. for each of the cell types of the drg tissue, we repeated the simulation times. the results shown are a summary of simulations for each cell type. for both methods, the median estimates of the sampled cell type were close to (red lines), whereas the cell types not sampled have zero or close-to-zero values.

extended data fig. : the comparison of adroit and stereoscope on the simulated spots of additional cell mixing schemes. more types of mixed spatial spots were simulated: ) mixture
of neurofilaments (nf); ) mixture of non-peptidergic (np) cell types; ) nf _ntrk .necab mixing with th, satellite glia and endothelial; ) np_nts mixing with th, satellite glia and endothelial; and ) pep _trpm mixing with th, satellite glia and endothelial. each simulation was repeated times. consistently for all simulation schemes, adroit’s estimates were always closer to the true simulated proportions (red lines), whereas stereoscope’s estimates largely deviated from the true proportions. extended data fig. : spatial mapping of cell types with adroit quantitative depicts the content in each spot. spatial transcriptomics data was downloaded from x genomics (https://support. xgenomics.com/spatial-gene- expression/datasets/ . . /v _adult_mouse_brain_coronal_section). the reference single cells were sampled from zeisel et al and curated into cell types. (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . figures fig. (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . fig. (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . fig. (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . fig. 
learning association for single-cell transcriptomics by integrating profiling of gene expression and alternative polyadenylation

guoli ji, wujing xuan, yibo zhuang, lishan ye, sheng zhu, wenbin ye, xi wang, and xiaohui wu*

department of automation, xiamen university, xiamen , china
xiamen ylz yihui technology co., ltd, xiamen, fujian , china
xiamen health and medical big data center, xiamen, fujian , china
national institute for data science in health and medicine, xiamen university, xiamen, fujian , china

keywords: cell type clustering; alternative polyadenylation; single-cell rna-seq; integrative analysis; software

guoli ji is a professor with the department of automation in xiamen university. his research interests include bioinformatics, advanced control, data mining and information systems. wujing xuan is a graduate student with the department of automation in xiamen university. his research interests are bioinformatics and data mining. yibo zhuang is an employee of xiamen ylz yihui technology company. his research interests are software design, cloud computing and big data. lishan ye is the director of xiamen health and medical big data center. her research interests are cloud computing and healthcare big data. sheng zhu is a ph.d. candidate with the department of automation in xiamen university. his research interests are bioinformatics and healthcare big data. wenbin ye is a ph.d. candidate with the department of automation in xiamen university. her research interests are bioinformatics and mrna processing. xi wang is a graduate student with the department of automation in xiamen university. her research interests are bioinformatics and data mining. xiaohui wu is an associate professor with the department of automation in xiamen university. her research interests are mrna processing, bioinformatics, and data mining.

* corresponding author. e-mail: xhuister@xmu.edu.cn, tel: +

abstract

single-cell rna-sequencing (scrna-seq) has enabled transcriptome-wide profiling of gene expression in individual cells. a myriad of computational methods have been proposed to learn cell-cell similarities and/or cluster cells; however, the high variability and dropout rate inherent in scrna-seq confound reliable quantification of cell-cell associations based on the gene expression profile alone. lately, bioinformatics studies have emerged that capture key transcriptome information on alternative polyadenylation (apa) from standard scrna-seq and reveal apa dynamics among cell types, suggesting the possibility of discerning cell identities with the apa profile. complementary information at both layers of apa isoforms and genes creates great potential to develop cost-efficient approaches to dissect cell types based on multiple modalities derived from existing scrna-seq data without changing experimental technologies. we propose a toolkit called sclapa for learning association for single-cell transcriptomics by combining single-cell profiling of gene expression and alternative polyadenylation derived from the same scrna-seq data. we compared sclapa with seven similarity metrics and five clustering methods using diverse scrna-seq datasets. comparative results showed that sclapa is more effective and robust for learning cell-cell similarities and clustering cell types than competing methods. moreover, with sclapa we found two hidden subpopulations of peripheral blood mononuclear cells that were undetectable using the gene expression data alone.
as a comprehensive toolkit, sclapa provides a unique strategy to learn cell-cell associations, improve cell type clustering and discover novel cell types by augmenting gene expression profiles with polyadenylation information, and it can be incorporated in most existing scrna-seq pipelines. sclapa is available at https://github.com/bmilab/sclapa.

introduction

single-cell rna-sequencing (scrna-seq) has enabled transcriptome-wide profiling of gene expression in individual cells, which has great potential to reveal the cellular composition of tissues, transcriptional heterogeneity among cells and the structure of cell types [ ]. cell-type identification is a critical step in most scrna-seq data analyses, and a myriad of computational methods have emerged to detect novel cell types, previously unappreciated sub-types of cells and rare cells [ ]. fundamentally, these numerous clustering methods rely on cell-cell associations (or similarities) for categorizing individual cells into different clusters [ ]. a wide range of computational tools have been proposed to cluster cells, which implicitly or explicitly rely on a similarity concept [ ]. simlr (single-cell interpretation via multikernel learning) adapts k-means by simultaneously training a similarity measure based on multiple kernel learning [ ]. raceid extends k-means with outlier detection to discover rare cell types [ ]. sc (single-cell consensus clustering) utilizes a consensus approach to combine multiple clustering solutions [ ]. phenograph combines shared nearest-neighbour graphs and louvain community detection to rapidly identify cell clusters [ ].
despite considerable progress, there is no strong consensus on which clustering approach is best for defining cell types in all situations [ , , ]. in particular, the high variability and dropout rate inherent in scrna-seq confound the reliable quantification of lowly and/or moderately expressed genes [ , ], resulting in an extremely sparse gene-cell count matrix. consequently, there might be little satisfactory overlap of observed genes among cells, hindering reliable quantification of cell-cell similarities based on the gene expression profile alone. recently, multi-omics methods that leverage additional aspects of the cell, such as the dna methylome, open chromatin or proteome, are beginning to appear [ ]. seurat v [ ] harmonizes scrna-seq and scatac-seq data from a similar tissue to identify subpopulations of cells that are indistinguishable using the scatac-seq data alone. liger [ ], a method based on integrative non-negative matrix factorization (inmf), was proposed to classify cortical cells profiled from single-cell bisulfite sequencing by integrating scrna-seq data. additional modalities of individual cells provide valuable information about the phenotype and genetic cellular state not manifested by the transcriptome. however, not all scrna-seq data is accompanied by data from different modalities. even though multimodal omics data are gradually becoming available, integrative multimodal analysis is still in its infancy [ ]. it remains a challenge to reconcile the heterogeneity across modalities, as different modalities are normally profiled from cells sampled from the same tissue rather than from the same cells.
although most scrna-seq studies focus on gene expression profiling, key information on transcript isoforms, e.g., alternative splicing (as) and/or alternative polyadenylation (apa), can be obtained, enabling multiple aspects of transcriptome information to be derived from standard scrna-seq without changing experimental technologies [ - ]. lately, several computational methods, such as scapatrap [ ], sierra [ ] and scapa [ ], have been proposed to identify apa sites in single cells from diverse ′ tag-based scrna-seq protocols, e.g., drop-seq [ ], cel-seq [ ] and x genomics [ ]. cell-to-cell heterogeneity in apa site usage was also observed [ - ]. in particular, a previous study [ ] revealed that the apa profile, even that from non-differentially expressed genes, can distinguish mouse cells in different stages of sperm cell differentiation, suggesting the possibility of discerning cell identities with apa usage independent of gene expression. recent efforts have pioneered methods to identify apa sites or explore apa dynamics across different cell types [ - , - ]; however, most studies profiled apa among cells with predefined cell type labels rather than discerning cell types in an unsupervised manner. complementary information at both layers of apa isoforms and genes can be refined from the same cells [ - ], which creates great potential to develop more sophisticated and cost-efficient approaches to dissect cell types based on multiple modalities derived from existing scrna-seq experiments. here we propose a toolkit called sclapa for learning association for single-cell transcriptomics by combining single-cell profiling of gene expression and alternative polyadenylation.
sclapa leverages the resolution and huge abundance of scrna-seq, boosting the gene-level analysis with an additional layer of apa information directly derived from the same scrna-seq data. by employing the strategy of similarity network fusion, sclapa effectively learns highly informative cell-cell associations from expression profiles of both genes and apa isoforms. we compared sclapa with seven similarity metrics and five clustering methods, using diverse scrna-seq data from different experimental technologies and species. comparative results showed that sclapa is more effective and robust for learning cell-cell similarities and clustering cell types than competing methods. moreover, with sclapa we found two hidden subpopulations of cells in peripheral blood mononuclear cells (pbmcs) that were undetectable using the gene expression data alone. as a comprehensive toolkit, sclapa provides a unique strategy to learn cell-cell associations, improve cell type clustering and discover novel cell types by augmenting gene expression profiles with polyadenylation information, and it can be incorporated in many other standard scrna-seq pipelines for single-cell analyses.

materials and methods

scrna-seq datasets

we used five publicly available scrna-seq datasets from animals and plants generated by ′ tag-based scrna-seq protocols (table s ), spanning a wide spectrum of tissues, cell types and species. raw data except for the pbmc data were downloaded from ncbi geo (gene expression omnibus). cell types and cell labels of the amygdala, mammary and root data were obtained from the corresponding studies; cell labels of the hypothalamus data were obtained from panglaodb [ ].
the pbmc k dataset was downloaded from the x genomics website (https://www. xgenomics.com/). for cell type annotation of pbmcs, we followed the tutorial of seurat v [ ] to cluster cells on the basis of the gene-cell expression matrix. specifically, cells with total read counts less than were discarded. the lognormalize method was adopted for normalization. top highly variable features were selected by the vst method. pca (principal component analysis) was used for dimensionality reduction and top principal components were retained. finally, cells were clustered by seurat’s findclusters with argument ‘resolution= . ’. for cell type annotation of cell clusters, known marker genes of pbmcs were compiled from relevant studies (table s ). differentially expressed (de) genes for each cell group were calculated with seurat’s findallmarkers. we also calculated, for each cell cluster, the number of cells in which a de gene is expressed and the mean expression level of a de gene. the cell type was carefully assigned to a cell cluster according to the presence and expression level of marker gene(s).

overview of sclapa

sclapa mainly consists of four modules (figure s ): (i) the input module, (ii) cell-cell distance, (iii) distance fusion, (iv) cell type clustering. the input module prepares the input for sclapa, including a poly(a) site expression matrix (hereinafter referred to as pa-matrix) and a gene expression matrix (hereinafter referred to as ge-matrix). the pa-matrix, generated from raw scrna-seq data with scapatrap [ ], stores expression levels of poly(a) sites, with each row denoting a poly(a) site and each column denoting a cell.
the ge-matrix can be obtained from websites like ncbi geo and x genomics, or generated by various routine scrna-seq analysis tools like cell ranger. in the cell-cell distance module, a cell-cell distance matrix is learned for the pa-matrix (called pa-dist) and the ge-matrix (called ge-dist), respectively. the distance fusion module employs similarity network fusion (snf) [ ] to integrate the two distance matrices (pa-dist and ge-dist) into one cell-cell distance matrix. the cell type clustering module clusters cells based on the fused distance matrix with various clustering methods. sclapa was implemented as an open source r package and is available at https://github.com/bmilab/sclapa. scripts and data used in this study are also available at the github website.

identification of poly(a) sites from scrna-seq

we followed the tutorial provided at the scapatrap website (https://github.com/bmilab/scapatrap) to identify poly(a) sites with scapatrap [ ]. it should be noted that alternative tools, such as sierra [ ] and scapa [ ], can also be used. briefly, raw fastq reads were mapped with cell ranger . . (https://www. xgenomics.com/) and then uniquely mapped reads were obtained with samtools (http://samtools.sourceforge.net/). then umi-tools [ ] was employed to remove polymerase chain reaction (pcr) duplicates and extract unique molecular identifiers (umis). the findtails function in the scapatrap package was used to determine exact locations of poly(a) sites from reads with a/t stretches, and the findpeaks function was adopted to identify all potential peaks of poly(a) sites at the whole genome level.
finally, consensus poly(a) sites supported by both peak and tail evidence were used. the featurecounts function in the subread toolkit [ ] was adopted to quantify the expression level for each poly(a) site. poly(a) site annotation was performed with the movapa package [ ], using the latest genome annotation of the respective species: tair for arabidopsis, mm for mouse and grch for human. briefly, poly(a) sites identified from scapatrap were annotated with rich information, such as genomic regions (i.e., ′ utr, ′ utr, coding sequence (cds), intron, exon and intergenic) and gene id. similar to previous studies [ - ], annotated ′ utrs were extended by a length of bp to recruit intergenic sites that may originate from authentic ′ utrs.

calculation of cell-cell distance

sclapa learns a cell-cell distance matrix for pa-matrix and ge-matrix, respectively. various distance metrics can be chosen, including euclidean distance, pearson correlation, two metrics of proportionality ($\rho_p$ and $\phi_s$) [ ], rafsil (random forest based similarity learning) [ ] and simlr [ ]. euclidean distance and pearson correlation are widely used in both single-cell and bulk transcriptomics. the two measures of proportionality were found to have strong performance according to a comprehensive benchmarking analysis of a large single-cell transcriptome compendium [ ]. rafsil is a random forest based approach that learns cell-cell similarities from scrna-seq data, including two variations, rafsil / . simlr learns a distance metric that fits the structure of the scrna-seq data by combining multiple kernels corresponding to different informative representations of the data.
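as a minimal illustration of the two traditional metrics named above, the following python sketch computes cell-cell distance matrices from a small feature-by-cell matrix. it is not the sclapa implementation, which is an r package: the function names, the toy values and the conversion of pearson correlation into a distance as one minus the correlation are all assumptions made for the example.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two expression vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """One minus the Pearson correlation (assumed conversion to a distance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)


def cell_cell_distance(matrix, metric):
    """Pairwise distances between the columns (cells) of a feature-by-cell matrix."""
    cells = list(zip(*matrix))  # one tuple of feature values per cell
    n = len(cells)
    return [
        [0.0 if i == j else metric(cells[i], cells[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy matrix: 3 features (rows) x 3 cells (columns).
toy_matrix = [
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 0.0],
]
dist_euclidean = cell_cell_distance(toy_matrix, euclidean_distance)
dist_pearson = cell_cell_distance(toy_matrix, pearson_distance)
```

in this toy matrix the first two cells have proportional profiles, so their pearson-based distance is essentially zero while their euclidean distance is not, which shows how the choice of metric changes the learned neighbourhood. a cell whose values are constant across features has zero variance and would cause a division by zero in the pearson branch, so a real pipeline would need to filter or handle such cells.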
euclidean distance and pearson correlation were calculated by the dist and cor functions in the r package stats, respectively; the simlr metric was calculated by the simlr r package with argument ‘cores.ratio= ’; the rafsil metric was calculated by the rafsil r package with arguments ‘nrep= , gene_filter=false’; $\rho_p$ and $\phi_s$ were calculated by the perb and phis functions in the r package propr, respectively. for each distance metric, cell-cell distance matrices, pa-dist and ge-dist, can be learned for pa-matrix and ge-matrix, respectively. pa-dist represents the cell-cell similarity network learned from the apa isoform layer, whereas ge-dist reflects the network learned from the gene layer; each encapsulates complementary information about cell-cell associations that is absent in the other genomic layer.

distance fusion

after learning pa-dist and ge-dist, similarity network fusion (snf) [ ] is utilized to flexibly integrate the two layers of cell-cell similarities into one similarity matrix. first, pa-dist and ge-dist were iteratively and gradually fused into a consensus network, utilizing the non-linear method of message passing theory [ ]. then weak similarities representing potential noise were discarded, and strong similarities were retained. by generating coherent cell-cell similarities from both the apa isoform and gene layers, snf profiles a more comprehensive biological relationship among cells, beyond the scope of methods solely based on the ge-matrix. given a pa-matrix storing expression levels of $m$ poly(a) sites in $n$ cells or a ge-matrix recording expression levels of $m$ genes in $n$ cells, the corresponding cell-cell distance matrix (pa-dist or ge-dist) can be obtained using a selected distance metric.
the distance matrix can also be denoted as a graph $G = \langle V, E, W \rangle$, with vertices $V = \{c_1, \ldots, c_n\}$ corresponding to cells, edges $E$ representing cell-cell links and edge weights $W_{[n \times n]}$ denoting the kernel representation of cell-cell similarities. the weight of an edge linking cells $c_i$ and $c_j$ is determined using a scaled exponential similarity kernel:

$$W_{ij} = \exp\left(-\frac{d_{ij}^{2}}{\mu \beta_{ij}}\right) \qquad (1)$$

here $d_{ij}$ represents the distance between cells $c_i$ and $c_j$ measured by a distance metric (e.g. pearson correlation). $\mu$ is an empirical hyperparameter with a recommended value in a sizable range of [ . , . ] [ ]. $\beta_{ij}$ is a scaling factor defined as follows:

$$\beta_{ij} = \frac{\bar{d}(c_i, N_i) + \bar{d}(c_j, N_j) + d_{ij}}{3} \qquad (2)$$

where $N_i$ are the neighboring cells of $c_i$ and $\bar{d}(c_i, N_i)$ is the average distance of $c_i$ to its neighbors. to obtain a fused network from pa-dist and ge-dist, a full and a sparse kernel on the vertex set $V$ are derived from the weight matrix $W$. the full kernel is a normalized weight matrix $\bar{W}_{[n \times n]}$ which stores the full information of cell-cell similarities. the normalized weight between $c_i$ and $c_j$ is defined as:

$$\bar{W}_{ij} = \begin{cases} \dfrac{W_{ij}}{2 \sum_{k \neq i} W_{ik}} & \text{when } i \neq j \\ 1/2 & \text{when } i = j \end{cases} \qquad (3)$$

another matrix $A_{[n \times n]}$ encodes the local affinity that measures similarities of a cell to its $K$ most similar cells:

$$A_{ij} = \begin{cases} \dfrac{W_{ij}}{\sum_{k \in N_i} W_{ik}} & \text{when } j \in N_i \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

here $N_i$ is the set of cell $c_i$ and its neighbors in the graph $G$. the network fusion initiates from $\bar{W}$, using $A$ as the kernel matrix to capture the local structure of the graph. to fuse the two distance matrices (pa-dist and ge-dist), first $W^{PA}$ and $W^{GE}$ were computed, respectively. then the corresponding initial state matrices $\bar{W}^{PA}$ and $\bar{W}^{GE}$ were derived from the two similarity matrices, and the kernel matrices $A^{PA}$ and $A^{GE}$ were also computed.
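the kernel construction described above can be sketched in a few lines of python. this is an illustration under stated assumptions rather than the sclapa or snftool code: the function names, the toy distance matrix, the neighbourhood size k=2 and the value mu=0.5 are hypothetical choices; the squared distance inside the exponential and the constants 3 and 2 follow the usual similarity network fusion formulation; and the neighbourhood used for the sparse kernel is taken to include the cell itself.

```python
import math


def affinity(dist, k, mu):
    """Scaled exponential similarity kernel from a distance matrix (assumes k <= n - 1)."""
    n = len(dist)
    # Average distance from each cell to its k nearest other cells.
    mean_nn = []
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        mean_nn.append(sum(others[:k]) / k)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            beta = (mean_nn[i] + mean_nn[j] + dist[i][j]) / 3.0
            w[i][j] = math.exp(-dist[i][j] ** 2 / (mu * beta))
    return w


def full_kernel(w):
    """Normalised status matrix: diagonal 1/2, off-diagonal entries of a row sum to 1/2."""
    n = len(w)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        off_diagonal = sum(w[i][k] for k in range(n) if k != i)
        for j in range(n):
            out[i][j] = 0.5 if i == j else w[i][j] / (2.0 * off_diagonal)
    return out


def local_kernel(w, k):
    """Sparse local affinity: keep the k most similar cells per row, normalised to sum to 1."""
    n = len(w)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted(range(n), key=lambda j: w[i][j], reverse=True)[:k]
        total = sum(w[i][j] for j in neighbours)
        for j in neighbours:
            out[i][j] = w[i][j] / total
    return out


# Hypothetical distances among four cells forming two tight pairs.
toy_dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 4.0],
    [4.0, 3.0, 0.0, 1.0],
    [5.0, 4.0, 1.0, 0.0],
]
w = affinity(toy_dist, k=2, mu=0.5)
w_bar = full_kernel(w)
a = local_kernel(w, k=2)
```

with these toy distances each row of the sparse kernel keeps only the cell itself and its closest partner, while the full kernel retains a (small) weight for every other cell. this difference is why the sparse kernel is described as capturing local structure and the full kernel as storing the complete similarity information.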
given the two initial status matrices at $t = 0$, $\bar{W}^{PA}_{t=0}$ and $\bar{W}^{GE}_{t=0}$, the fusion process iteratively updates the respective similarity matrix:

$$\bar{W}^{PA}_{t+1} = A^{PA} \times \bar{W}^{PA}_{t} \times (A^{PA})^{T}, \qquad \bar{W}^{GE}_{t+1} = A^{GE} \times \bar{W}^{GE}_{t} \times (A^{GE})^{T} \qquad (5)$$

then after $t$ iterations, the final status matrix is obtained:

$$\bar{W} = \frac{\bar{W}^{PA}_{t} + \bar{W}^{GE}_{t}}{2} \qquad (6)$$

$\bar{W}$ is the fused cell-cell distance network incorporating cells’ apa isoform and gene expression profiles. the corresponding cell-cell similarity matrix is $1 - \bar{W}$. the distance or similarity matrix can be used for downstream cell type clustering.

single cell clustering

four widely-used clustering methods were provided in sclapa to cluster cells on the basis of the fused cell-cell similarity matrix, including louvain clustering [ ], hierarchical clustering (hc) [ ], spectral clustering (sc) [ ] and k-means. the louvain clustering was implemented by the cluster_louvain function in the r package igraph, with arguments ‘mode=undirected, weighted=true, diag = true’. the spectral clustering was implemented by the spectralclustering function in the r package snftool with default settings [ ]. the hierarchical clustering [ ] was performed by the flashclust function in the r package flashclust with default settings [ ]. the k-means clustering was implemented by the kmeans function of the r package stats with arguments ‘iter.max= e+ , nstart= ’.

performance evaluation

we distinguished two scenarios, similarity learning and clustering, to evaluate our approach. for each scenario, we applied sclapa to four scrna-seq datasets with pre-annotated cell labels, and compared results with other competing approaches.
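returning briefly to the fusion step of the distance fusion section, the iteration can be illustrated with a short python sketch. note the assumption it makes: following the published similarity network fusion method, each layer's status matrix is propagated through the other layer (the pa layer is updated from the ge status matrix and vice versa), which is what lets information flow between the two layers. the function names and toy matrices are hypothetical, and the re-normalization that practical implementations typically apply after every iteration is omitted for brevity.

```python
def matmul(a, b):
    """Plain matrix product of two square matrices given as lists of lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def transpose(a):
    """Transpose of a matrix given as a list of lists."""
    return [list(row) for row in zip(*a)]


def fuse(w_pa, w_ge, a_pa, a_ge, iterations):
    """Cross-diffuse two status matrices through their local kernels, then average them."""
    for _ in range(iterations):
        # Assumed cross-layer update: each layer is refreshed from the other layer.
        next_pa = matmul(matmul(a_pa, w_ge), transpose(a_pa))
        next_ge = matmul(matmul(a_ge, w_pa), transpose(a_ge))
        w_pa, w_ge = next_pa, next_ge
    n = len(w_pa)
    return [[(w_pa[i][j] + w_ge[i][j]) / 2.0 for j in range(n)] for i in range(n)]


# Hypothetical status matrices and row-normalised local kernels for three cells.
w_pa0 = [[0.5, 0.3, 0.2], [0.3, 0.5, 0.2], [0.2, 0.2, 0.6]]
w_ge0 = [[0.5, 0.1, 0.4], [0.1, 0.5, 0.4], [0.4, 0.4, 0.2]]
a_pa = [[0.6, 0.4, 0.0], [0.4, 0.6, 0.0], [0.0, 0.3, 0.7]]
a_ge = [[0.7, 0.0, 0.3], [0.0, 0.8, 0.2], [0.4, 0.0, 0.6]]
fused = fuse(w_pa0, w_ge0, a_pa, a_ge, iterations=2)
```

because every row of a local kernel sums to one, each updated entry is a weighted average of entries of the other layer's status matrix, so the fused values stay within the range of the inputs and a symmetric input yields a symmetric output.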
for the scenario of similarity learning, we compared sclapa with seven similarity measures, including three measures designed for scrna-seq (rafsil / and simlr), two measures of proportionality ($\rho_p$ and $\phi_s$) and two traditional similarity measures (euclidean distance and pearson correlation). each of these measures was applied to a given ge-matrix to learn a cell-cell similarity matrix. for sclapa, we applied each measure to learn two cell-cell similarity matrices from the pa-matrix and the ge-matrix and fused them into one matrix. we also applied different clustering methods, including louvain, hc, sc and k-means, on the similarity matrix learned from each similarity measure to assess different similarity measures in the context of clustering. for the scenario of clustering, we compared sclapa with five state-of-the-art clustering methods for scrna-seq data, including sc [ ], seurat v [ ], sincera [ ], snn-cliq [ ] and the dynamic tree cut method (dynamictreecut) [ ]. none of these approaches provides an explicit similarity learning procedure; instead, they provide cell labels by unsupervised learning on the ge-matrix. each approach was applied to a given ge-matrix for cell clustering and class labels of cells were obtained. for sclapa, we applied each of the four methods (louvain, hc, sc and k-means) on the fused similarity matrix to obtain clustering results. two internal validation metrics, dunn index [ ] and connectivity [ ], were employed for the first scenario to quantitatively assess the goodness of a clustering structure without relying on any clustering methods or knowing external information about class labels.
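as a rough illustration of the first of these internal metrics, the python sketch below computes a dunn index from a distance matrix and a set of cluster labels. it assumes the common definition, the smallest distance between cells in different clusters divided by the largest distance between cells in the same cluster, and is not the clvalid code used in the study; the function name, toy distances and labels are hypothetical.

```python
def dunn_index(dist, labels):
    """Smallest between-cluster distance divided by the largest within-cluster distance."""
    n = len(labels)
    min_between = float("inf")
    max_within = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                max_within = max(max_within, dist[i][j])
            else:
                min_between = min(min_between, dist[i][j])
    # Undefined when every cluster is a singleton (max_within stays 0).
    return min_between / max_within


# Hypothetical distances among four cells forming two tight pairs.
toy_dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 4.0],
    [4.0, 3.0, 0.0, 1.0],
    [5.0, 4.0, 1.0, 0.0],
]
good = dunn_index(toy_dist, [0, 0, 1, 1])  # 3.0: the tight pairs are kept together
poor = dunn_index(toy_dist, [0, 1, 0, 1])  # 0.25: the tight pairs are split
```

the labelling that keeps the two tight pairs together scores much higher than the one that splits them, which matches the reading that a larger dunn value indicates better separation.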
the dunn index [ ] evaluates non-linear combinations of the between-group separation and the within-group compactness. the connectivity reflects the extent to which observations are placed in the same group as their neighbors in the data space. the original value of connectivity ranges from zero to infinity, with a smaller value denoting higher performance. here we used a transform, /log (connectivity + ), to make connectivity consistent with dunn. the larger the score of connectivity or dunn, the better the separation. the r package clvalid [ ] was adopted to calculate the connectivity and dunn index.

additionally, we used three popular metrics to evaluate the performance of sclapa in the context of clustering: the ari (adjusted rand index), jaccard and nmi (normalized mutual information). the value of ari ranges from - to , and values of nmi and jaccard range from to , with a higher value indicating better performance. ari is a widely used metric for measuring the concordance between two clustering results. the jaccard index quantifies the similarity between two datasets. nmi is a variation of mutual information for evaluating clustering results, which corrects the bias of the consistency caused by chance. ari and jaccard were calculated using the adjustedrand function in the r package clues [ ]; nmi was obtained by the compare function in the r package igraph (https://igraph.org/r/).

bioinformatics analyses

umap [ ], which employs a non-linear dimensional reduction technique to group similar cells in low-dimensional space, was adopted for visualization of the distribution of single cells.
umap was implemented by the calculateumap function in the scater r package [ ]. for the analysis of the arabidopsis root data, deseq [ ] was adopted to identify de genes and de poly(a) sites. first, the ge-matrix and pa-matrix were normalized by the median ratio method provided in deseq . then the deseq function was applied for de detection. genes or poly(a) sites with log fold change >= . and adjusted p-value <= . were considered de.

results

single-cell polyadenylation profile distinguishes cells

recently, scrna-seq has emerged as a unique tool to explore cell-specific gene or isoform expression in plants [ - ]. a previous study [ ] utilized root-hair and nonhair cell types as models and revealed the potential of using scrna-seq data for inferring specific cells during the process of cell-type differentiation. here we focused on the epidermal tissue and analyzed differential expression at both the gene and apa levels between root-hair and nonhair cells. a total of root-hair cells and nonhair cells were defined by the previous study [ ]. although both the ge-matrix and the pa-matrix were obtained from the same scrna-seq data, we still found four genes exclusively present in the pa-matrix (figure a). for example, at g , a wrky transcription factor gene, was absent from the single-cell ge-matrix, while it has one poly(a) site (coord: ) with a much higher expression level in nonhair than in hair cells according to the pa-matrix. interestingly, this poly(a) site is an annotated poly(a) site in the extended ′ utr, which was supported by bulk ′-seq data according to plantapadb [ ].
similarly, at g , a hypothetical protein coding gene, is missing from the ge-matrix, while its one poly(a) site (coord: ) is expressed much higher in nonhair cells than in hair cells. this poly(a) site was also annotated as a ′ utr site in plantapadb. moreover, genes possess at least one differentially used poly(a) site, among which genes were not de genes (table s ). for example, at g is a dnaj heat shock family protein expressed in root. although both at g and its one poly(a) site are expressed higher in root hair cells than in nonhair cells, the difference between the two cell types characterized by the poly(a) profile is much more pronounced than that by the gene profile (figure b). further, using only the ge-matrix, a subset of cells is indistinguishable between hair and nonhair cell types (figure c). in contrast, cells from the two cell types were clearly separated on the basis of the pa-matrix, and two potential subpopulations of nonhair cells were observed (figure c). therefore, we anticipate that the poly(a) site expression profile may encode complementary information that is absent or insignificant in the gene expression profile, which could be useful to distinguish cell types. there is great potential to develop integrative approaches for discerning cell identities that properly incorporate single-cell profiling of both gene expression and polyadenylation information.

learning cell-cell similarities with sclapa

we proposed the sclapa toolkit, which can learn cell-cell similarities by taking advantage of the complementarities between the two layers of apa isoforms and genes.
here we compared the performance of the similarity metric learned from sclapa with seven other similarity metrics by analyzing four scrna-seq datasets. two metrics, dunn and connectivity, were adopted to quantitatively measure cell separation independent of clustering methods. generally, sclapa provides performance higher than or comparable to that of other metrics across all four datasets, whereas pearson correlation or euclidean distance has consistently lower performance (figures a and s ). in terms of both dunn and connectivity, sclapa and simlr perform significantly better than three other metrics. particularly, simlr outperforms sclapa on the hypothalamus data, whereas sclapa outperforms simlr on the mammary data. overall, sclapa performs better than at least six out of the seven metrics in all four datasets, never being the worst in any case. according to the dunn index (figure a), even for datasets where the performance of sclapa is not the best, sclapa is always a close match to the best. for example, the dunn score from sclapa on the hypothalamus data is . , which is very close to the best score ( . from simlr).

next we used a radar chart to compare the performance of these similarity metrics more intuitively. sclapa and simlr stand out as universally better than the others, and discrepancies in the performance of the other six metrics across different datasets were observed (figure b). for example, the overall similarity based on the rafsil / metric is much higher on the mammary and hypothalamus data than on the other two datasets, revealing the instability of the performance of rafsil across different datasets.
for all four datasets, both euclidean distance and pearson correlation emerge as the worst similarity metrics. in contrast, sclapa provides a more robust result regardless of dataset.

sclapa is integrative and flexible in that different distance metrics can be chosen to learn cell-cell similarities for distance fusion. next we examined the effect of using different distance metrics in sclapa. the performance of sclapa according to the dunn index is highly robust across all datasets regardless of the distance metrics used in sclapa (figure c). it is widely accepted that it is highly challenging to determine an optimal distance metric for profiling true cell-cell relationships from complex and heterogeneous scrna-seq data [ ]. however, the integrative framework of sclapa provides an effective solution of distance fusion by assembling results from multiple data layers into one ensemble result, which can mitigate limitations in individual similarity metrics or data layers and facilitate generalization and adaptation to different scrna-seq datasets. take the hypothalamus data as an example. the matrix with block structures obtained from sclapa showed higher consistency with true labels than did those from other similarity metrics (figure s ). block structures learned by simlr are indistinguishable from background signatures; block structures learned by pearson correlation, euclidean distance or the two measures of proportionality are also mixed with background signatures; block structures learned from rafsil are generally consistent with true structures, except that cell types with a small number of cells are less distinguishable. overall, sclapa provides more divergent clusters with higher distinction, and individual clusters obtained by sclapa are more compact than those obtained by other similarity metrics. these results demonstrate the ability and robustness of sclapa in improving cell separation across numerous scrna-seq datasets.
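for readers unfamiliar with the dunn index used in this subsection, a minimal python sketch follows. it assumes the common definition (the smallest distance between observations in different clusters divided by the largest distance between observations in the same cluster). the function name and the toy inputs are hypothetical; the analyses reported here used the r package clvalid, not this code.

```python
def dunn_index(dist, labels):
    """Smallest between-cluster distance divided by largest within-cluster distance.

    dist:   square matrix (list of rows) of pairwise cell-cell distances.
    labels: cluster label of each cell, in the same order as the rows of dist.
    """
    n = len(labels)
    min_between = float("inf")
    max_within = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                max_within = max(max_within, dist[i][j])
            else:
                min_between = min(min_between, dist[i][j])
    if max_within == 0.0 or min_between == float("inf"):
        raise ValueError("need two or more clusters and a non-zero within-cluster distance")
    return min_between / max_within


# toy example: four cells on a line forming two tight, well separated groups
positions = [0, 1, 10, 11]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
toy_score = dunn_index(toy_dist, [0, 0, 1, 1])
```

a larger value means that clusters are far apart relative to their own spread, which matches the reading of the dunn score used in the comparisons above.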
cell type clustering with sclapa

cell-cell similarities learned by different similarity metrics can be adapted to other clustering methods that take similarities as inputs. here we performed extensive comparisons of sclapa with seven other similarity metrics by applying different clustering methods for cell clustering. first we applied louvain [ ], a graph-based method for community detection, to different similarity metrics for clustering. according to the ari score, similarities learned by sclapa and simlr significantly outperform similarities obtained from euclidean distance, pearson correlation or rafsil / (figure a). overall, simlr shows performance similar to that of sclapa, although sclapa outperforms simlr in three out of the four datasets. particularly, euclidean distance and pearson correlation present the worst performance in two datasets, mammary and root. similar results were obtained in terms of two other indexes, nmi and jaccard (figure s ).

in addition to louvain clustering, we also investigated three other popular clustering methods, hierarchical clustering [ ], spectral clustering [ ] and k-means [ ], to evaluate the robustness of results when different clustering methods are applied to the same similarity metric (figures s - ). particularly, the performance of sclapa and rafsil / is robust regardless of the clustering method used, whereas sclapa consistently outperforms rafsil. in contrast, simlr, euclidean distance and pearson correlation are very sensitive to the clustering method applied (figure b). surprisingly, although simlr achieves performance comparable with that of sclapa based on louvain clustering (figure a), its performance is the worst using k-means or spectral clustering (figure b).
taking the mammary data as an example, the ari score of simlr drops from . when using louvain clustering to an extremely low median value of . when using k-means. moreover, we noted that ari scores from individual runs of k-means clustering on simlr similarities varied greatly, revealing the relatively poor robustness of simlr with k-means clustering (figure s ). these results demonstrate that the cell-cell similarity matrix learned from sclapa is more effective and robust than competing similarity metrics in clustering cell subpopulations.

during the preparation of this manuscript, we noticed another method, scdapars [ ], which quantifies and recovers apa events in single cells using standard scrna-seq data. the authors also integrated apa information identified by scdapars with imputed gene expression by similarity network fusion to reveal novel cell subpopulations during human embryonic development. unlike scdapars, which employs the (imputed) percentage of distal poly(a) site usage index (pdui) to measure apa usage, sclapa directly utilizes the raw poly(a) expression profile. here we compared the performance of sclapa and scdapars by applying them to the four scrna-seq datasets in our benchmarking analysis. following the process in gao et al. [ ], we calculated pdui based on the pa-matrix and imputed apa profiles using scdapars. then we applied five similarity metrics to the scdapars-imputed apa profile and the ge-matrix to generate scdapars-dist and ge-dist, respectively. after fusing the two distance matrices with snf, we applied louvain clustering to the fused cell-cell similarities to cluster cells.
according to the ari score (figure ), sclapa significantly outperforms scdapars on all four datasets. particularly, ari scores of scdapars with different similarity metrics varied greatly, whereas the performance of sclapa is robust with different similarity metrics (figure vs. figure c), indicating that the poly(a) expression profile used in sclapa is more efficient and robust than the pdui profile used in scdapars for clustering cells.

next we expanded the benchmarking analysis by comparing clustering results of sclapa with those of other single-cell clustering methods that directly take the gene-cell expression matrix as input without an explicit procedure of similarity learning. specifically, we included five popular tools for comparison: sc [ ], seurat v [ ], sincera [ ], snn-cliq [ ] and dynamictreecut [ ]. according to the ari score, sclapa achieves performance generally higher than or comparable to that of other methods, whereas dynamictreecut provides consistently lower performance (figure ). similar results were observed using the jaccard or nmi indexes (figure s ). specifically, sclapa provides the best ari score in three out of the four datasets (figure ). for the hypothalamus data, where sc performs the best, sclapa presents an ari score very close to that of sc (sclapa= . ; sc = . ). particularly, for three datasets (mammary, hypothalamus and root), ari scores of individual sc runs varied greatly, indicating that the performance of sc may be unstable on some kinds of datasets. overall, the performance of sclapa is robust and consistently high across diverse scrna-seq datasets.
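the ari and jaccard scores used throughout this subsection compare a predicted clustering with reference labels by counting pairs of cells that are grouped together. the python sketch below is a minimal illustration under stated assumptions: it uses the standard pair-counting formulas, the function names are hypothetical, and the reported analyses used the adjustedrand function of the r package clues rather than this code.

```python
from collections import Counter
from math import comb


def pair_counts(labels_a, labels_b):
    """Count pairs of cells grouped together in both, in a, in b, and all pairs."""
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    return both, in_a, in_b, comb(len(labels_a), 2)


def adjusted_rand_index(labels_a, labels_b):
    """Rand index corrected for chance agreement (1.0 means identical partitions)."""
    both, in_a, in_b, total = pair_counts(labels_a, labels_b)
    expected = in_a * in_b / total
    max_index = (in_a + in_b) / 2
    if max_index == expected:  # both partitions are trivial; treat as full agreement
        return 1.0
    return (both - expected) / (max_index - expected)


def jaccard_index(labels_a, labels_b):
    """Pairs grouped together in both clusterings over pairs grouped in either."""
    both, in_a, in_b, _ = pair_counts(labels_a, labels_b)
    either = in_a + in_b - both
    if either == 0:  # every cell is a singleton in both clusterings
        return 1.0
    return both / either


# toy example: same grouping under different label names, then a crossed grouping
same_ari = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_ari = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

both scores depend only on which cells are grouped together, so renaming cluster labels does not change them; this is why they can compare clusters from any method against pre-annotated cell labels.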
sclapa identifies hidden subpopulations of cells

we next applied sclapa to the human pbmc k dataset from x genomics for cell type clustering. first we examined the cell type composition of the pbmcs by applying seurat to the gene-cell expression matrix (ge-matrix). ten distinct cell clusters were obtained (figure a). based on the expression of known markers (table s ), nine clusters were annotated. up to , poly(a) sites from genes were identified from the raw rna-seq data with scapatrap. we learned cell-cell similarities with sclapa by jointly considering expression profiles of apa isoforms and genes. after applying louvain clustering to the cell-cell similarity matrix, cell clusters were obtained and clusters were successfully annotated. these clusters covered the nine clusters identified by seurat and contained two new small clusters (figure b). both subclusters were supported by the expression patterns of markers, suggesting that they represented distinct cell types. one subcluster was annotated as regulatory t cells on the basis of elevated expression of three markers, ccr , foxp and il ra (figure s ). depending only on the gene expression profile, regulatory t cells were not well resolved and were indistinguishable from other t cells (figure a). although the gene expression of the marker ccr is sparse and weak among t cells, we could still clearly distinguish regulatory t cells from other t cell types according to the umap visualization of the gene expression profile (figure c). particularly, ccr has four annotated poly(a) sites according to apasdb [ ], whereas only one poly(a) site was identified from scrna-seq data.
this is not unexpected, as the bulk ′-seq data contain more diverse tissue samples than the pbmc data, and scrna-seq data are generally too sparse to identify all poly(a) sites. however, we have shown that even a single poly(a) site can encapsulate useful information beyond the gene expression profile (figure ). the other subcluster, where cell markers such as ppbp and pf are expressed, was annotated as megakaryocyte progenitors (figures d and s ). according to the pa-matrix, ppbp carries three poly(a) sites, and five poly(a) sites of ppbp were annotated in apasdb. these three poly(a) sites were all highly expressed in megakaryocyte progenitors (cluster ) (figure e). these results demonstrate that sclapa facilitates the capture and identification of hidden subpopulations of cells that are unrecognizable based on the gene expression profile alone.

discussion

sclapa is an integrative framework for learning association for single-cell transcriptomics by leveraging expression profiles of genes and apa isoforms in individual cells, which highlights the inclusion of polyadenylation signatures for improving cell type clustering and discovering new cell types. the effectiveness of sclapa for cell-cell similarity learning and cell type clustering is evidenced by comparisons with various similarity metrics and single-cell clustering methods on several scrna-seq datasets.

sclapa has a number of desirable features. first, sclapa incorporates existing tools to extract and quantify poly(a) sites directly from scrna-seq data, which augments the gene-level analysis with an additional layer of apa information without altering the scrna-seq protocol or performing additional sequencing experiments.
second, by employing the strategy of similarity network fusion, sclapa jointly considers expression profiles at both levels of apa isoforms and genes for learning highly informative cell-cell similarities. third, in contrast to many other methods that cluster cells without an explicit similarity learning step, sclapa provides two independent but connected modules for similarity learning and cell clustering, each with various methods for users to choose from. accordingly, users can freely combine different similarity metrics and clustering methods in sclapa to evaluate the clustering results for any given dataset. fourth, the framework of sclapa is highly flexible and can be seamlessly embedded into most existing scrna-seq pipelines or tools for downstream analyses, such as dimension reduction, cell type clustering and differential expression analysis. accordingly, existing tools, such as those designed for dropout imputation, normalization and similarity learning, can also be easily incorporated into sclapa.

with sclapa, distinct cell-cell similarity networks can be effectively learned from profiles of gene expression and polyadenylation separately by various similarity metrics. sclapa then employs the strategy of similarity network fusion for scalable and robust integration of similarity networks learned from different data layers. this strategy has the advantage of exploiting complementarities in distinct data layers to fully profile the spectrum of the underlying data.
moreover, the consensus set of cell-cell interactions and associations from the apa layer and the gene layer can be learned from the given data, mitigating noise and dropouts in the conventional gene-cell expression profile and thus enhancing accuracy for downstream analyses. by combining expression profiles of apa and genes through similarity network fusion, we found two hidden subpopulations of pbmcs that were undetectable using only gene expression data (figure ). moreover, the augmentation of gene expression profiles with polyadenylation information enhances single-cell clustering results and generates more discriminative cell types (figures - ). as a comprehensive toolkit, sclapa provides a unique strategy to improve cell type clustering and discover novel cell types by combining gene expression with polyadenylation information at single-cell resolution.

sclapa consists of three core function modules: learning cell-cell similarities, distance fusion and clustering. currently, numerous methods are available to learn cell-cell similarities or cluster cells with reasonable accuracy [ ]. however, each method has its own strengths and limitations, and it is extremely challenging, if not impossible, to determine an optimal method for all kinds of datasets, as different methods may exploit different characteristics in the data [ ]. moreover, some similarity metrics may be overly dependent on downstream clustering methods, exacerbating difficulties in choosing a universally applicable combination of similarity and clustering methods. for example, based on the ge-matrix alone, similarities learned from simlr provide overall high performance across datasets in terms of internal validation indexes (figure a). however, simlr is highly dependent on downstream clustering methods for single-cell clustering; it achieves high performance with louvain clustering (figure a), whereas its performance drops sharply with k-means or spectral clustering (figure b). in contrast, our benchmarking analyses showed that the performance of sclapa is robust and consistently high across diverse datasets regardless of the distance metrics or clustering methods selected in sclapa (figures - ). the unique strength of sclapa may be that it efficiently fuses the rich structures stored in the ge-matrix and the accompanying pa-matrix, and can thus amplify biological signals and augment cell-cell relationships.

sclapa is an easy-to-use and highly flexible framework. the input of sclapa is the ge- and pa-matrix, without using any prior biological information. even with raw scrna-seq data, it is easy to obtain the prerequisite ge-matrix and/or pa-matrix using various tools, e.g. cell ranger for the ge-matrix, and scapatrap and sierra for the pa-matrix. lately another tool, scdapars [ ], was proposed to quantify and recover apa usage from scrna-seq data; it uses the relative usage of the distal poly(a) site, called pdui, to measure a gene’s apa usage. with scdapars, gao et al. [ ] analyzed cell-type-specific apa regulation and discovered hidden cell subpopulations from cancer and human endoderm differentiation scrna-seq data. in sclapa the input pa-matrix can be replaced with any other gene-cell-like matrix; thus the scdapars-imputed pdui matrix can be used readily in sclapa for downstream cell type clustering.
however, although the scdapars-imputed pdui profile seems to be effective in revealing apa dynamics among cell types in the previous study [ ], we found that, for cell type clustering, the performance with the pdui-matrix is much lower and less robust than that with sclapa’s pa-matrix (figure ). this may be due to several reasons. first, only genes with at least two ′ utr poly(a) sites can be used for scdapars’ pdui calculation; consequently, the pdui-matrix is much sparser than the pa-matrix, and information encoded in genes with a single poly(a) site is lost. second, although the pdui profile can be imputed with scdapars, limited information in the highly sparse pdui-matrix confounds reliable imputation and may lead to propagation of errors or noise during the imputation process. third, unlike sclapa, which is specifically designed for learning cell-cell similarities and cell type clustering, the main function of scdapars is to analyze cell-type-specific apa dynamics and identify novel apa-related cell types. we anticipate that the pa-matrix used in sclapa may contain more comprehensive and reliable information than the pdui-matrix or the imputed pdui-matrix, which can significantly enhance the accuracy of cell type clustering. overall, the pa-matrix is simple but effective and can be easily obtained from scrna-seq data by various tools, making it more convenient to use sclapa for scrna-seq analyses.

for practical purposes, the current version of sclapa implements seven similarity metrics and four clustering methods for users to choose from, which allows users to investigate their own strategies for evaluating the effect of different combinations of distance metrics and clustering methods.
moreover, sclapa is easily expandable in that additional distance metrics or clustering methods can be readily incorporated. meanwhile, scrna-seq preprocessing steps, such as dropout imputation and normalization, can also be easily applied before similarity learning. sclapa can also be used as a plug-in architecture for most existing scrna-seq pipelines for similarity learning and cell clustering.

key points

• we proposed a computational toolkit called sclapa for learning association for single-cell transcriptomics from scrna-seq data.
• sclapa improves cell-cell similarity learning and cell type clustering by integrating single-cell profiling of gene expression and alternative polyadenylation.
• objective benchmarking analyses using diverse scrna-seq datasets demonstrate higher performance and robustness of sclapa than competing methods in cell-cell similarity learning and cell type clustering.
• sclapa discovers hidden subpopulations of cells that are unrecognizable based on the gene expression profile alone.

supplementary data

the file of supplemental materials contains all the supplementary figures and tables.

funding

this work was supported by the national natural science foundation of china (nos. to x.w. and to g.j.) and xiamen ylz yihui technology co., ltd (xdht a).

references

ziegenhain c, vieth b, parekh s et al. comparative analysis of single-cell rna sequencing methods, mol cell ; : - .e .
kiselev vy, andrews ts, hemberg m. challenges in unsupervised clustering of single-cell rna-seq data, nature reviews genetics .
skinnider ma, squair jw, foster lj. evaluating measures of association for single-cell transcriptomics, nature methods ; : - .
wang b, zhu j, pierson e et al. visualization and analysis of single-cell rna-seq data by kernel-based similarity learning, nature methods ; : .
grun d, lyubimova a, kester l et al. single-cell messenger rna sequencing reveals rare intestinal cell types, nature ; : - .
kiselev vy, kirschner k, schaub mt et al. sc : consensus clustering of single-cell rna-seq data, nature methods ; : .
levine jacob h, simonds erin f, bendall sean c et al. data-driven phenotypic dissection of aml reveals progenitor-like cells that correlate with prognosis, cell ; : - .
qi r, ma a, ma q et al. clustering and classification methods for single-cell rna-sequencing data, briefings in bioinformatics ; : - .
petegrosso r, li z, kuang r. machine learning and statistical methods for clustering single-cell rna-sequencing data, briefings in bioinformatics ; : - .
kharchenko pv, silberstein l, scadden dt. bayesian approach to single-cell differential expression analysis, nature methods ; : .
grun d, kester l, van oudenaarden a. validation of noise models for single-cell transcriptomics, nat methods ; : - .
stuart t, satija r. integrative single-cell analysis, nature reviews genetics ; : - .
stuart t, butler a, hoffman p et al. comprehensive integration of single-cell data, cell ; : - .e .
welch jd, kozareva v, ferreira a et al. single-cell multi-omic integration compares and contrasts features of brain cell identity, cell ; : - .e .
wu x, liu t, ye c et al. scapatrap: identification and quantification of alternative polyadenylation sites from single-cell rna-seq data, briefings in bioinformatics .
patrick r, humphreys dt, janbandhu v et al. sierra: discovery of differential transcript usage from polya-captured single-cell rna-seq data, genome biol ; : .
levin m, zalts h, mostov n et al. gene expression dynamics are a proxy for selective pressures on alternatively polyadenylated isoforms, nucleic acids res ; : - .
shulman ed, elkon r. cell-type-specific analysis of alternative polyadenylation using single-cell transcriptomics data, nucleic acids res ; : - .
arzalluz-luque a, conesa a. single-cell rnaseq for the study of isoforms-how is that possible?, genome biology ; : .
song y, botvinnik ob, lovci mt et al. single-cell alternative splicing analysis with expedition reveals splicing dynamics during neuron differentiation, molecular cell ; : - .e .
macosko ez, basu a, satija r et al. highly parallel genome-wide expression profiling of individual cells using nanoliter droplets, cell ; : - .
hashimshony t, wagner f, sher n et al. cel-seq: single-cell rna-seq by multiplexed linear amplification, cell rep ; : - .
zheng gx, terry jm, belgrader p et al. massively parallel digital transcriptional profiling of single cells, nat commun ; : .
ye c, zhou q, hong y et al. role of alternative polyadenylation dynamics in acute myeloid leukaemia at single-cell resolution, rna biology ; : - .
kim n, chung w, eum hh et al. alternative polyadenylation of single cells delineates cell types and serves as a prognostic marker in early stage breast cancer, plos one ; :e .
velten l, anders s, pekowska a et al. single-cell polyadenylation site mapping reveals ' isoform choice variability, molecular systems biology ; : - .
franzén o, gan l-m, björkegren jlm. panglaodb: a web server for exploration of mouse and human single-cell rna sequencing data, database ; .
wang b, mezlini am, demir f et al. similarity network fusion for aggregating data types on a genomic scale, nature methods ; : .
smith t, heger a, sudbery i. umi-tools: modeling sequencing errors in unique molecular identifiers to improve quantification accuracy, genome research ; : - .
liao y, smyth gk, shi w. featurecounts: an efficient general purpose program for assigning sequence reads to genomic features, bioinformatics ; : - .
ye w, liu t, fu h et al. movapa: modeling and visualization of dynamics of alternative polyadenylation across biological samples, bioinformatics .
shen y, ji g, haas bj et al. genome level analysis of rice mrna '-end processing signals and alternative polyadenylation, nucleic acids research ; : - .
wu x, liu m, downie b et al. genome-wide landscape of polyadenylation in arabidopsis provides evidence for extensive alternative polyadenylation, proceedings of the national academy of sciences, usa ; : - .
zhao z, wu x, raj kumar pk et al. bioinformatics analysis of alternative polyadenylation in green alga chlamydomonas reinhardtii using transcriptome sequences from three different sequencing platforms, g : genes|genomes|genetics ; : - .
wu x, gaffney b, hunt a et al. genome-wide determination of poly(a) sites in medicago truncatula: evolutionary conservation of alternative poly(a) site choice, bmc genomics ; : .
pouyan mb, kostka d. random forest based similarity learning for single cell rna sequencing data, bioinformatics ; :i -i .
pearl j. probabilistic reasoning in intelligent systems: networks of plausible inference. morgan kaufmann, .
blondel vd, guillaume j-l, lambiotte r et al. fast unfolding of communities in large networks, journal of statistical mechanics: theory and experiment ; :p .
eisen mb, spellman pt, brown po et al. cluster analysis and display of genome-wide expression patterns, proc natl acad sci u s a ; : - .
ng ay, jordan m, weiss y. on spectral clustering: analysis and an algorithm. advances in neural information processing systems. , – .
langfelder p, horvath s. fast r functions for robust correlations and hierarchical clustering, journal of statistical software ; : - .
guo m, wang h, potter ss et al. sincera: a pipeline for single-cell rna-seq profiling analysis, plos comput biol ; :e -e .
xu c, su z. identification of cell types from single-cell transcriptomes using a novel clustering method, bioinformatics ; : - .
langfelder p, zhang b, horvath s. defining clusters from a hierarchical cluster tree: the dynamic tree cut package for r, bioinformatics ; : - .
guy brock, vasyl pihur, susmita datta et al. clvalid, an r package for cluster validation, journal of statistical software ; : - .
chang f, qiu w, zamar rh et al. clues: an r package for nonparametric clustering based on local shrinking, journal of statistical software ; : .
mcinnes l, healy j, saul n et al. umap: uniform manifold approximation and projection, journal of open source software ; : .
mccarthy dj, campbell kr, lun at et al. scater: pre-processing, quality control, normalization and visualization of single-cell rna-seq data in r, bioinformatics ; : - .
love m, huber w, anders s. moderated estimation of fold change and dispersion for rna-seq data with deseq , genome biology ; : .
jean-baptiste k, mcfaline-figueroa jl, alexandre cm et al. dynamics of gene expression in single root cells of arabidopsis thaliana, the plant cell ; : - .
ryu kh, huang l, kang hm et al. single-cell rna sequencing resolves molecular relationships among individual plant cells, plant physiology ; : - .
shahan r, hsu c-w, nolan tm et al.
a single cell arabidopsis root atlas reveals developmental trajectories in wild type and cell identity mutants. . . shulse cn, cole bj, ciobanu d et al. high-throughput single-cell transcriptome profiling of plant cell types, cell reports ; . . zhang t-q, xu z-g, shang g-d et al. a single-cell rna sequencing profiles the developmental landscape of arabidopsis root, molecular plant ; : - . . zhu s, ye w, ye l et al. plantapadb: a comprehensive database for alternative polyadenylation sites in plants, plant physiology ; : - . . kaufmann l, rousseeuw p. clustering by means of medoids. in: dodge y. (ed) statistical data analysis based on the l -norm and related methods. amsterdam: north-holland, , – . . gao y, li l, amos ci et al. dynamic analysis of alternative polyadenylation from single-cell rna-seq(scdapars) reveals cell subpopulations invisible to gene expression analysis, biorxiv : . . . . . you l, wu j, feng y et al. apasdb: a database describing alternative poly(a) sites and selection of heterogeneous cleavage sites downstream of poly(a) signals, nucleic acids research ; :d -d . . shirkhorshidi as, aghabozorgi s, wah ty. a comparison study on similarity and dissimilarity measures in clustering continuous data, plos one ; :e . (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . figure legends figure . single-cell poly(a) profile in root hair and nonhair cells. (a) genes exclusively present in the pa-matrix. four genes (at g , at g , at g and at g ) were not present in the ge-matrix, whereas they had at least one poly(a) site according to the pa-matrix. for each gene, the violin plot shows expression levels of its poly(a) site in hair and nonhair cells and the umap visualization shows the d embeddings of poly(a) profile. 
(b) two example genes (at g and at g ) that are not differentially expressed (de) but possess at least one de poly(a) site. the upper panel shows the violin plot and umap visualization of the poly(a) profile of the respective gene in hair and nonhair cells. the lower panel shows the gene profile. (c) single-cell poly(a) profile distinguishes root hair and root nonhair cells. the left plot is the umap representation on the basis of genes that are not de but have at least one de poly(a) site; the right plot is the umap representation on the basis of the poly(a) profile of the genes.
figure . benchmarking of similarity learning with sclapa on four published scrna-seq datasets. (a) the internal validation metric of dunn was employed to measure the cell separation. (b) radar chart showing the performance of different similarity metrics across datasets. dataset names are shown near the vertices of the plot. each vertex denotes the dunn score of a metric on the respective dataset. the larger the area of a polygon displayed in a radar chart, the higher the overall performance. (c) radar chart showing the performance of sclapa with different distance metrics for distance fusion. each vertex denotes the dunn score of using different distance metrics on the respective dataset.
figure . benchmarking of similarity learning with sclapa in the context of clustering on four published scrna-seq datasets. (a) ari was employed to measure the concordance between inferred and true cluster labels. louvain clustering was applied on the similarity matrices obtained from different methods. (b) radar charts showing ari scores obtained by applying different clustering methods on cell-cell similarities learned by each similarity metric. each plot represents the results for one dataset. clustering methods are shown near the vertices of the plot. each vertex denotes the ari score obtained by applying a clustering method to the different metrics. the larger the area of a polygon displayed in a radar chart, the higher the overall performance. hc, hierarchical clustering; sc, spectral clustering.
figure . comparison of performance between sclapa and scdapars across four scrna-seq datasets. five similarity metrics were applied on the scdapars-imputed pdui profile and the ge-matrix to generate scdapars-dist and ge-dist, respectively. after fusing the two distance matrices with snf, louvain clustering was applied on the fused cell-cell similarities to cluster cells. we did not include rafsil in this experiment due to its slow calculation speed. for sclapa, pearson correlation was used for similarity learning and louvain was used for clustering.
figure . ari scores from six clustering methods across four scrna-seq datasets. for sclapa, pearson correlation was used for similarity learning and louvain was used for clustering.
figure . sclapa identifies hidden subpopulations of cells from human pbmcs. (a) umap representation of seurat’s clustering results on the basis of ge-matrix. ten clusters were obtained and nine were annotated with known cell types: naive t cell ( ), cd + monocytes ( ), cd + t cell ( ), b cell ( ), cd + memory t ( ), nk cell ( ), cd + monocytes ( ), monocyte derived dendritic ( , ) and plasmacytoid dendritic ( ). (b) umap representation of sclapa’s clustering results on the basis of ge-matrix and pa-matrix.
fourteen clusters were obtained and clusters were annotated with known cell types: regulatory t cell ( ), naive t cell ( , ), plasmacytoid dendritic ( ), cd + memory t ( ), cd + t cell ( ), cd + monocytes ( ), monocyte derived dendritic ( , , ), cd + monocytes ( ), megakaryocyte progenitors ( ), b cell ( ) and nk cell ( ). the two arrows mark two new subpopulations of cells identified by sclapa. (c) gene expression of ccr distinguishes regulatory t cells from other t cell types according to the umap visualization of the gene expression profile. the details in the dashed line box are shown in the solid line box. (d) gene expression of ppbp distinguishes megakaryocyte progenitors from other cell types. (e) three poly(a) sites of ppbp are all highly expressed in megakaryocyte progenitors.
interpretable detection of novel human viruses from genome sequencing data
jakub m. bartoszewicz , , , ∗, anja seidel , and bernhard y. renard , , ∗
bioinformatics (mf ), department of methodology and research infrastructure, robert koch institute, berlin, germany, department of mathematics and computer science, free university of berlin, berlin, germany, data analytics and computational statistics, hasso plattner institute for digital engineering, potsdam, brandenburg, germany and digital engineering faculty, university of potsdam, potsdam, brandenburg, germany. current address: central research institute of ambulatory health care, berlin, germany.
abstract
viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. here, we predict whether a virus can infect humans directly from next-generation sequencing reads. we show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. further, we develop a suite of interpretability tools and show that it can also be applied to other models beyond the host prediction task. we propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision.
nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example the sars-cov- coronavirus, unknown before it caused a covid- pandemic in . all methods presented here are implemented as easy-to-install packages enabling analysis of ngs datasets without requiring any deep learning skills, but also allowing advanced users to easily train and explain new models for genomics.
∗to whom correspondence should be addressed. tel: + ; email: jakub.bartoszewicz@hpi.de, bernhard.renard@hpi.de
introduction
background
within a globally interconnected and densely populated world, pathogens can spread more easily than ever before. as the recent outbreaks of ebola and zika viruses have shown, the risks posed even by these previously known agents remain unpredictable and their expansion hard to control ( ). what is more, it is almost certain that more unknown pathogen species and strains are yet to be discovered, given their constant, extremely fast-paced evolution and unexplored biodiversity, as well as increasing human exposure ( , ). some of those novel pathogens may cause epidemics (similar to the sars and mers coronavirus outbreaks in and ) or even pandemics (e.g. sars-cov- and the “swine flu” h n / strain). many have more than one host or vector, which makes assessing and predicting the risks even more difficult. for example, ebola has its natural reservoir most likely in fruit bats ( ), but causes deadly epidemics in both humans and chimpanzees. as the state-of-the-art approach for the open-view detection of pathogens is genome sequencing ( , ), it is crucial to develop automated pipelines for characterizing the infectious potential of currently unidentifiable sequences. in practice, clinical samples are dominated by host reads and contaminants, with often less than a hundred reads of the pathogenic virus ( ).
metagenomic assembly is challenging, especially in time-critical applications. this creates a need for read-based approaches complementing or substituting assembly where needed. screening against potentially dangerous subsequences before their synthesis may also be used as a way of ensuring responsible research in synthetic biology. while potentially useful in some applications, engineering of viral genomes could also pose a biosecurity and biosafety threat. two controversial studies modified the influenza a/h n ("bird flu") virus to be airborne transmissible in mammals ( , ). the possibility of modifying coronaviruses to enhance their virulence triggered calls for a moratorium on this kind of research ( ). synthesis of an infectious horsepox virus closely related to the smallpox-causing variola virus ( ) caused a public uproar and calls for intensified discussion on risk control in synthetic biology ( ).
© yyyy the author(s). this is an open access article distributed under the terms of the creative commons attribution non-commercial license (http://creativecommons.org/licenses/by-nc/ . /uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nd . international license (http://creativecommons.org/licenses/by-nd/ . /).
current tools for host range prediction
several computational, genome-based methods exist for predicting the host range of a bacteriophage (a bacteria-infecting virus).
a selection of composition-based and alignment-based approaches has been presented in an extensive review by edwards et al. ( ). prediction of eukaryotic host tropism (including humans) based on known protein sequences was shown for the influenza a virus ( ). support-vector machines based on word vec representations were shown to outperform homology searches with blast and hmms in the same task, but lost their advantage when applied to nucleic acid sequences directly ( ). two recent studies employ k-mer based, k-nn classifiers ( ) and deep learning ( ) to predict host range for a small set of three well-studied species directly from viral sequences. while those approaches are limited to those particular species and do not scale to viral host-range prediction in general, the host taxon predictor (htp) ( ) uses logistic regression and support vector machines to predict if a novel virus infects bacteria, plants, vertebrates or arthropods. yet, the authors argue that it is not possible to use htp in a read-based manner; it requires long sequences of at least , nucleotides. this is incompatible with modern metagenomic next-generation sequencing (ngs) workflows, where the dna reads obtained are at least - times shorter. another study used gradient boosting machines to predict reservoir hosts and transmission via arthropod vectors for known human-infecting viruses ( ). zhang et al. ( ) designed several classifiers explicitly predicting whether a new virus can potentially infect humans. their best model, a k-nn classifier, uses k-mer frequencies as features representing the query sequence and can yield predictions for sequences as short as base pairs (bp). it also worked with bp-long reads from real dna sequencing runs, although in this case the reads also originated from viruses present in the training set (and were therefore not "novel").
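the k-mer frequency features mentioned above can be illustrated with a minimal sketch. it is not the implementation of zhang et al.; the function name `kmer_frequencies`, the default `k=4` and the choice to ignore windows containing ambiguous bases are assumptions made only for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return normalized k-mer frequencies over the {A, C, G, T} alphabet.

    Hypothetical helper: k=4 is an illustrative default, not a value
    taken from the text.
    """
    seq = seq.upper()
    # Fixed feature order: all 4**k possible k-mers in lexicographic order.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing other symbols (e.g. N) are not in `kmers`
    # and are therefore ignored when normalizing.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


features = kmer_frequencies("ACGTACGT", k=2)
```

a vector of this kind has a fixed length of 4^k regardless of read length, which is what allows a distance-based classifier such as k-nn to compare short reads with longer sequences.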
deep learning for dna sequences
while dna sequences mapped to a reference genome may be represented as images ( ), most studies use a distributed orthographic representation, where each nucleotide {a,c,g,t} in a sequence is represented by a one-hot encoded vector of length . an "unknown" nucleotide (n) can be represented as an all-zero vector. chaos game representation (cgr) and its extension, the frequency matrix cgr (fcgr), are promising alternatives able to encode an arbitrary sequence in an image-like format. fcgr has been used to encode genomic inputs for deep learning approaches, including full bacterial genomes ( ) and coding sequences of hiv for the drug resistance prediction task ( ). in this study, we use one-hot encoding with ns as zeroes, which was previously shown to perform well for raw ngs reads ( ) and abstract phenotype labels.
cnns and lstms have been successfully used for a variety of dna-based prediction tasks. early works focused mainly on regulation of gene expression in humans ( , , , , ), which is still an area of active research ( , , ). in the field of pathogen genomics, deep learning models trained directly on dna sequences were developed to predict host ranges of three multi-host viral species ( ) and to predict pathogenic potentials of novel bacteria ( ). deepvirfinder ( ) and viraminer ( ) can detect viral sequences in metagenomic samples, but they cannot predict the host and focus on previously known species. for a broader view on deep learning in genomics we refer to a recent review by eraslan et al. ( ).
interpretability and explainability of deep learning models for genomics are crucial for their widespread adoption, as they are necessary for delivering trustworthy and actionable results. convolutional filters can be visualized by forward-passing multiple sequences through the network and extracting the most-activating subsequences ( ) to create a position weight matrix (pwm) which can be visualized as a sequence logo ( , ).
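as an illustration of the encoding and of the filter visualization step described above, the following sketch one-hot encodes sequences with n as an all-zero vector and averages a set of equal-length subsequences into a position frequency matrix. the function names are hypothetical, the subsequences are toy inputs rather than real filter activations, and a position frequency matrix is only the starting point for a pwm or sequence logo (no background correction or log-odds transform is applied here).

```python
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA sequence as a list of length-4 vectors.

    Any symbol outside {A, C, G, T}, including N, becomes all zeros.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in seq.upper():
        row = [0.0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1.0
        rows.append(row)
    return rows


def position_frequency_matrix(subsequences):
    """Per-position nucleotide frequencies of equal-length subsequences."""
    encoded = [one_hot(s) for s in subsequences]
    length = len(encoded[0])
    if any(len(e) != length for e in encoded):
        raise ValueError("all subsequences must have the same length")
    pfm = []
    for pos in range(length):
        counts = [sum(e[pos][j] for e in encoded) for j in range(len(ALPHABET))]
        total = sum(counts)
        # Positions covered only by N contribute an all-zero row.
        pfm.append([c / total if total else 0.0 for c in counts])
    return pfm


# Toy stand-ins for the most-activating subsequences of one filter.
pfm = position_frequency_matrix(["ACG", "ACT", "NCG"])
```

because n is encoded as zeros, it adds nothing to any column, so ambiguous bases lower the effective number of observations at a position without biasing it towards any nucleotide.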
direct optimization of input sequences is problematic, as it results in generating a dense matrix even though the input sequences are one-hot encoded ( , ). this problem can be alleviated with integrated gradients ( , ) or deeplift, which propagates activation differences relative to a selected reference back to the input, reducing the computational overhead of obtaining accurate gradients ( ). if the bias terms are zero and a reference of all-zeros is used, the method is analogous to layer-wise relevance propagation ( ). deeplift is an additive feature attribution method, and may be used to approximate shapley values if the input features are independent ( ). tf-modisco ( ) uses deeplift to discover consolidated, biologically meaningful dna motifs (transcription factor binding sites).
contributions
in this paper, we first improve the performance of read-based predictions of the viral host (human or non-human) from next-generation sequencing reads. we show that reverse-complement (rc) neural networks ( ) significantly outperform both the previous state-of-the-art ( ) and the traditional, alignment-based algorithm – blast ( , ), which constitutes a gold standard in homology-based bioinformatics analyses. we show that defining the negative (non-human) class is non-trivial and compare different ways of constructing the training set. strikingly, a model trained to distinguish between viruses infecting humans and viruses infecting other chordates (a phylum of animals including vertebrates) generalizes well to evolutionarily distant non-human hosts, including even bacteria. this suggests that the host-related signal is strong and the learned decision boundary separates human viruses from other dna sequences surprisingly well. next, we propose a new approach for convolutional filter visualization using partial shapley values to differentiate between simple nucleotide information content and the contribution of each sequence position to the final classification score.
to test the biological plausibility of our models, we generate genome-wide maps of "infectious potential" and nucleotide contributions. we show that those maps can be used to visualize and detect virulence-related regions of interest (e.g. genes) in novel genomes. as a proof of concept, we analyzed one of the viruses randomly assigned to the test set – the taï forest ebolavirus, which has a history of host-switching and can cause a serious disease. to show that the method can also be used for other biological problems, we investigated the networks trained by bartoszewicz et al. ( ) and their predictions on a genome of a pathogenic bacterium staphylococcus aureus. the authors used this particular species to assess the performance of their method on real sequencing data. finally, we studied the sars-cov- coronavirus, which emerged in december , causing the covid- pandemic ( ).
materials and methods
data collection and preprocessing
vhdb dataset
we accessed the virus-host database ( ) on july , and downloaded all the available data. we note that all the reference genomes from ncbi viral genomes are present in vhdb, as well as their curated annotations from refseq. additional, manually curated records in vhdb extend the metadata available in ncbi. more non-reference genomes are available, but considering multiple genomes per virus would skew the classifiers’ performance towards the more frequently resequenced ones.
the original dataset contained , records comprising refseq ids for viral sequences and associated metadata. some viruses are divided into discontiguous segments, which are represented as separate records in vhdb; in those cases the segments were treated as contigs of a single genome in the further analysis. we removed records with unspecified host information and those confusing the highly pathogenic variola virus with a similarly named genus of fish. following zhang et al. ( ), we filtered out viroids and satellites, which are classified as subviral agents and not bona fide viruses ( , ). note that even though they require helper viruses for replication, this step did not affect ubiquitous adeno-associated viruses and large virophages, which are well established within the viral taxonomy in the families parvoviridae and lavidaviridae, respectively. human-infecting viruses were extracted by searching for records containing "homo sapiens" in the "host name" field. note that vhdb contains information about multiple possible hosts for a given virus where appropriate. any virus infecting humans was assigned to the positive class, even if other, non-human hosts exist. in total, the dataset contained , viruses (grouped in species), including , human viruses ( species). we considered both dna and rna viruses; rna sequences were encoded in the dna alphabet, as in refseq.
defining the negative class
while defining a human-infecting class is relatively straightforward, the reference negative class may be conceptualized in a variety of ways. the broadest definition takes all non-human viruses into account, including bacteriophages (bacterial viruses). this is especially important, as most known bacteriophages are dna viruses, while many important human (and animal) viruses are rna viruses. one could expect that the multitude of available bacteriophage genomes dominating the negative class could lower the prediction performance on viruses similar to those infecting humans.
this offers an open-view approach covering a wider part of the sequence space, but may lead to misclassification of potentially dangerous mammalian or avian viruses. as they are often involved in clinically relevant host-switching events, a stricter approach must also be considered. in this case, the negative class comprises only viruses infecting chordata (a group containing vertebrates and closely related taxa). two intermediate approaches consider all eukaryotic viruses (including plant and fungal viruses), or only animal-infecting viruses. this amounts to four nested host sets: "all" ( , non-human viruses, species), "eukaryota" ( , viruses, species), "metazoa" ( , viruses, species) and "chordata" ( , viruses, species). auxiliary sets containing only non-eukaryotic viruses ("non-eukaryota"), non-animal eukaryotic viruses ("non-metazoa eukaryota") etc. can be easily constructed by set subtraction. for the positive class, we randomly generated a training set containing % of the genomes, and validation and test sets with % of the genomes each. importantly, the nested structure was also kept during the training-validation-test split: for example, the species assigned to the smallest test set ("chordata") were also present in all the bigger test sets. the same applied to other taxonomic levels, as well as the training and validation sets wherever applicable.
read simulation
we simulated bp long illumina reads following a modification of a previously described protocol ( ) and using the mason read simulator ( ). first, we only generated the reads from the genomes of human-infecting viruses. then, the same steps were applied to each of the four negative class sets. finally, we also generated a fifth set, "stratified", containing an equal number of reads drawn from genomes of the following disjoint host classes: "chordata" ( %), "non-chordata metazoa" ( %), "non-metazoa eukaryota" ( %) and "non-eukaryota" ( %).
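the relation between the four nested host sets and the disjoint classes used for the "stratified" set can be sketched with plain set operations. the virus identifiers below are hypothetical placeholders, and `auxiliary_sets` is an illustrative helper rather than part of the published pipeline.

```python
# Hypothetical toy identifiers; real sets would hold VHDB virus records.
chordata = {"virus_a", "virus_b"}
metazoa = chordata | {"virus_c"}
eukaryota = metazoa | {"virus_d"}
all_nonhuman = eukaryota | {"virus_e", "virus_f"}


def auxiliary_sets(chordata, metazoa, eukaryota, all_nonhuman):
    """Derive the disjoint host classes from the nested sets by subtraction."""
    if not (chordata <= metazoa <= eukaryota <= all_nonhuman):
        raise ValueError("host sets must be nested")
    return {
        "chordata": set(chordata),
        "non-chordata metazoa": metazoa - chordata,
        "non-metazoa eukaryota": eukaryota - metazoa,
        "non-eukaryota": all_nonhuman - eukaryota,
    }


strata = auxiliary_sets(chordata, metazoa, eukaryota, all_nonhuman)
```

the four resulting classes do not overlap and together cover the "all" set, which is the property that makes it possible to draw an equal share of reads from each of them.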
in each of the evaluated settings, we used a total of million ( %) reads for training, . million ( %) reads for validation and . million ( %) paired reads as the held-out test set. read number per genome was proportional to genome length, keeping the coverage uniform on average. viruses with longer genomes were therefore represented by more reads than shorter viruses. on the other hand, their sequence diversity was covered at a similar level. this length-balancing step was previously shown to work well for bacterial genomes of different lengths ( , ). while the original datasets are heavily imbalanced, we generated the same number of negative and positive data points (reads) regardless of the negative class definition used. this protocol allowed us to test the impact of defining the negative class, while using the exactly same data as representatives of the positive class. we used three training and validation sets ("all", "stratified", and "chordata"), representing the fully open-view setting, a setting more balanced with regard to the host taxonomy, and a setting focused on cases most likely to be clinically relevant. in each setting, the validation set matched the composition of the training set. the evaluation was performed using all five test sets to gain a more detailed insight on the effects of negative class definition on the prediction performance. human blood virome dataset similarily to zhang et al. ( ), we used the human blood dna virome dataset ( ) to test the selected classifiers on real data. we obtained , , reads of bp and searched all of vhdb using blastn (with default parameters) to obtain high-quality reference labels. if .cc-by-nd . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . 
a read’s best hit was a human-infecting virus, we assigned it to the positive class; otherwise, it was assigned to the negative class. this procedure yielded , , "positive" and , "negative" reads.

virus-level and species-level predictions

in this study, we focus on predicting labels for reads originating from novel viruses. what constitutes a "novel" biological entity is an open question – a novel virus does not necessarily belong to a novel species ( ). if a given viral isolate clusters with a known group of isolates, it is considered to be the same virus; if it does not, it may be assigned a distinct name and considered novel ( ). this is separate from its putative taxonomical assignment. assigning a novel virus to a novel or a previously established species follows a wider set of criteria, and the criteria for delineating distinct species differ between viral families ( , , , ). in most cases, species are perceived as human constructs rather than biological entities and host range is often explicitly one of the defining features ( , ), rendering reasoning based on cross-species homology searches inherently difficult. the most prominent example of this problem is the sars-cov- virus, which is a novel virus within a previously known species (severe acute respiratory syndrome–related coronavirus). other members of this species include the human-infecting sars-cov- , but also multiple related bat sarsr-cov viruses (e.g. sarsr-cov ratg or bat sars-like coronavirus wiv ). importantly, sars-cov- is not a strain of sars-cov- ; those two viruses share a common ancestor ( ). this echoes similar problems related to pathogenic potential prediction for novel bacterial pathogens.
a novel bacterium may be defined as a novel strain or a novel species ( ), and the classifiers must be trained according to the desired definition. as the pandemic has shown, different viruses of the same species can differ wildly in their infectious potential and the broader impact on human societies. therefore, threat assessment must be performed for novel viruses, not only novel taxa; different related viruses are non-redundant. at the same time, redundancy below this level (i.e. multiple instances of the same virus) must be eliminated from the dataset to ensure reliability of the trained classifier. vhdb tackles this problem by collecting and annotating reference genomes – each virus in the database is a separate entity with its own id in ncbi taxonomy. this virus-level approach was previously used by zhang et al. ( ). we show that homology-based algorithms underperform in this setting already, suggesting that machine learning is indeed required to accurately predict labels for novel viruses even if other members of the same species are present in the training database. nevertheless, a more difficult alternative – predictions for reads of viruses belonging to completely novel species – is a related and potentially equally important task. for bacterial datasets, species novelty can be modelled by selecting a single representative genome per species ( ). as the sars- cov- example shows, this is often not possible for viruses. to assess our approach in this stricter setup, we re-divided the vhdb dataset into training, validation and test sets ensuring that all viruses of a given species were assigned to only one of those subsets. this effectively models a "novel species" scenario while also reflecting within-species phenotype diversity. we recreated the species-wide versions of the "all" and "chordata" datasets by assigning %, % and % of the species to the training, validation and test datasets, respectively. 
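the species-wise re-division can be illustrated with a grouped split, in which the species (not the individual virus) is the unit that gets assigned to a subset. the sketch below is a hedged illustration: `species_wise_split`, the toy virus and species names, and the default fractions are hypothetical and stand in for the actual vhdb records and percentages.

```python
import random


def species_wise_split(virus_to_species, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split viruses so that all viruses of one species share a subset."""
    species = sorted(set(virus_to_species.values()))
    random.Random(seed).shuffle(species)
    n_train = round(fractions[0] * len(species))
    n_val = round(fractions[1] * len(species))

    # The species, not the virus, is the unit of assignment.
    subset_of = {}
    for rank, name in enumerate(species):
        if rank < n_train:
            subset_of[name] = "train"
        elif rank < n_train + n_val:
            subset_of[name] = "val"
        else:
            subset_of[name] = "test"

    split = {"train": [], "val": [], "test": []}
    for virus in sorted(virus_to_species):
        split[subset_of[virus_to_species[virus]]].append(virus)
    return split


# Toy data: five hypothetical species with two viruses each.
viruses = {f"virus_{s}{i}": f"species_{s}" for s in "abcde" for i in (1, 2)}
split = species_wise_split(viruses)
species_in = {name: {viruses[v] for v in members}
              for name, members in split.items()}
```

with this construction, a test-set virus never shares a species with any training virus, which is what models the "novel species" scenario while still keeping several viruses per species.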
we resimulated the reads as outlined above and compared the performance of the machine learning and homology-based approaches achieving the highest accuracy in the simpler "novel virus" setting (see section prediction performance). training we used the deepac package ( ) to investigate rc-cnn and rc-lstm architectures, which guarantee identical predictions for both forward and reverse-complement orientations of any given nucleotide sequence, and have been previously shown to accurately predict bacterial pathogenicity. here, we employ an rc-cnn with two convolutional layers with filters of size each, average pooling and fully connected layers with units each. the lstm used has units (fig. s ). we use dropout regularization in both cases, together with aggressive input dropout at the rate of . or . (tuned for each model). input dropout may be interpreted as a special case of noise injection, where a fraction of input nucleotides is turned to ns. representations of forward and reverse-complement strands are summed before the fully connected layers. as two mates in a read pair should originate from the same virus, predictions obtained for them can be averaged for a boost in performance. if a contig or genome is available, averaging predictions for constituting reads yields a prediction for the whole sequence. we used tesla p and tesla v gpus for training and an rtx ti for visualizations. we wanted the networks to yield accurate predictions for both bp (our data, modelling a sequencing run of an illumina miseq device) and bp long reads (as in the human blood virome dataset). as shorter reads are padded with zeros, we expected the cnns trained using average pooling to misclassify many of them. therefore, we prepared a modified version of the datasets, in which the last bp of each read were turned to zeros, mocking a shorter sequencing run while preserving the error model. then, we retrained the cnn which had performed best on the original dataset. 
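the effect of summing forward and reverse-complement representations, input dropout and mate averaging can be illustrated with a small numpy sketch. this is not the deepac implementation: the single convolutional layer, the function names (`rc_predict`, `input_dropout`, `predict_pair`) and the random weights are assumptions made purely for illustration, and the actual networks share weights across more layers.

```python
import numpy as np

ALPHABET = "ACGT"  # with this order, reversing the channel axis complements a base


def one_hot(seq):
    """One-hot encode a read; N (or any other symbol) becomes an all-zero row."""
    x = np.zeros((len(seq), 4))
    for pos, base in enumerate(seq.upper()):
        idx = ALPHABET.find(base)
        if idx >= 0:
            x[pos, idx] = 1.0
    return x


def reverse_complement(x):
    """Reverse positions and swap A<->T, C<->G in a one-hot matrix."""
    return np.ascontiguousarray(x[::-1, ::-1])


def conv_features(x, filters):
    """Valid 1D convolution, ReLU and global average pooling."""
    k = filters.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(x, (k, 4))[:, 0]
    activations = np.einsum("lkc,fkc->lf", windows, filters)
    return np.maximum(activations, 0.0).mean(axis=0)


def rc_predict(x, filters, w, b):
    """Sum the representations of both strands before the dense layer."""
    summed = conv_features(x, filters) + conv_features(reverse_complement(x), filters)
    return 1.0 / (1.0 + np.exp(-(summed @ w + b)))


def input_dropout(x, rate, rng):
    """Turn a random fraction of input nucleotides into Ns (all-zero rows)."""
    keep = rng.random(x.shape[0]) >= rate
    return x * keep[:, None]


def predict_pair(x1, x2, filters, w, b):
    """Average the predictions of the two mates of a read pair."""
    return 0.5 * (rc_predict(x1, filters, w, b) + rc_predict(x2, filters, w, b))


rng = np.random.default_rng(0)
filters = rng.normal(size=(3, 5, 4))  # hypothetical: 3 filters of width 5
w, b = rng.normal(size=3), 0.0
read = one_hot("ACGTTGCANNACGTAGG")
forward_score = rc_predict(read, filters, w, b)
reverse_score = rc_predict(reverse_complement(read), filters, w, b)
noisy_read = input_dropout(read, 0.25, rng)
pair_score = predict_pair(read, noisy_read, filters, w, b)
```

because the two strand representations are summed, swapping a read for its reverse complement only swaps the two summands, so `forward_score` and `reverse_score` coincide.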
since in principle, the human blood virome dataset should not contain viruses infecting non-human chordata, a "chordata"-trained classifier was not used in this setting.

benchmarking

we compare our networks to the k-nn classifier proposed by zhang et al. ( ), the only other approach explicitly tested on raw ngs reads and detecting human viruses in a fully open-view setting (not focusing on a limited number of species). we use the real sequencing data that they used ( ) for an unbiased comparison. we trained the classifier on the "all" dataset as described by the authors, i.e. using non-overlapping, bp-long contigs generated from the training genomes (retraining on simulated reads is computationally prohibitive). we also tested the performance of using blast to search against an indexed database of labeled genomes. we constructed the database from the "all" training set and used discontiguous megablast to achieve high inter-species sensitivity. for ngs mappers (bwa-mem ( ) and bowtie ( )), the indices were constructed analogously. kraken ( ) was previously shown to perform worse than both blast and machine learning when faced with read-based pathogenic potential prediction for novel bacterial species ( ). its major advantage – assigning reads to lowest common ancestor (lca) nodes in ambiguous cases – turns into a problem in the infectivity prediction task, as transferring labels to lcas is often impossible ( ).
therefore, we focus on alignment-based approaches as the most accurate alternative to machine learning in this context. note that both alignment and k-nn can yield conflicting predictions for the individual mates in a read pair. what is more, blast and the mappers yield no prediction at all if no match is found. therefore, similarly to bartoszewicz et al. ( ), we used the accept anything operator to integrate binary predictions for read pairs and genomes. at least one match is needed to predict a label, and conflicting predictions are treated as if no match was found at all. missing predictions lower both true positive and true negative rates. filter visualization substring extraction in order to visualize the learned convolutional filters, we downsample a matching test set to , reads and pass it through the network. this is modelled after the method presented by alipanahi et al. ( ). for each filter and each input sequence, the authors extracted a subsequence leading to the highest activation, and created sequence logos from the obtained sequence sets ("max- activation"). we used the deepshap implementation ( ) of deeplift ( ) to extract score-weighted subsequences with the highest contribution score ("max-contrib") or all score- weighted subsequences with non-zero contributions ("all- contrib"). computing the latter was costly and did not yield better quality logos. we use an all-zero reference. as reads from real sequencing runs are usually not equally long, shorter reads must be padded with ns; the "unknown" nucleotide is also called whenever there is not enough evidence to assign any other to the raw sequencing signal. therefore, ns are "null" nucleotides and are a natural candidate for the reference input. we do not consider alternative solutions based on gc content or dinucleotide shuffling, as the input reads originate from multiple different species, and the sequence composition may itself be a strong marker of both virus and host taxonomy ( ). 
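the deepbind-like "max-activation" extraction mentioned above, together with the convention that ns are all-zero rows, can be sketched in a few lines of numpy. this is a simplified illustration and not the procedure used in the study: the kernel, the reads and the helper names are hypothetical, and the score-weighted "max-contrib" variant additionally requires contribution scores (e.g. from deepshap), which are not computed here.

```python
import numpy as np

ALPHABET = "ACGT"


def one_hot(seq):
    """One-hot encode a read; N (or any other symbol) becomes an all-zero row."""
    x = np.zeros((len(seq), 4))
    for pos, base in enumerate(seq.upper()):
        idx = ALPHABET.find(base)
        if idx >= 0:
            x[pos, idx] = 1.0
    return x


def max_activation_subsequence(seq, kernel, bias=0.0):
    """Return the subsequence with the highest ReLU activation of one filter."""
    k = kernel.shape[0]
    if len(seq) < k:
        return None
    windows = np.lib.stride_tricks.sliding_window_view(one_hot(seq), (k, 4))[:, 0]
    activations = np.maximum(np.einsum("lkc,kc->l", windows, kernel) + bias, 0.0)
    best = int(np.argmax(activations))
    if activations[best] <= 0.0:
        return None  # the filter is not activated anywhere in this read
    return seq[best:best + k], float(activations[best])


def position_counts(subsequences):
    """Per-position nucleotide counts, the raw material of a sequence logo."""
    counts = np.zeros((len(subsequences[0]), 4))
    for subsequence in subsequences:
        counts += one_hot(subsequence)
    return counts


# Hypothetical width-3 filter rewarding the motif ACG.
kernel = -np.ones((3, 4))
kernel[0, 0] = kernel[1, 1] = kernel[2, 2] = 1.0

reads = ["TTACGTT", "ACGNNNN", "NNNNNNN"]
hits = [max_activation_subsequence(read, kernel) for read in reads]
logo_counts = position_counts([hit[0] for hit in hits if hit is not None])
```

an all-n read contributes no activating subsequence for a filter with zero bias, which matches the view of ns as "null" inputs.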
we also avoid weight-normalization suggested for zero- references ( ), as it implicitly models the expected gc content of all possible input sequences, and assumes no ns present in the data. finally, we calculate average filter contributions to obtain a crude ranking of feature importance with regard to both the positive and negative class. partial shapley values building sequence logos involves calculating information content (ic) of each nucleotide at each position in a prospective dna motif. this can be then interpreted as measure of evolutionary sequence conservation. however, high ic does not necessarily imply that a given nucleotide is relevant in terms of its contribution to the classifier’s output. some sub-motifs may be present in the sequences used to build the logo, even if they do not contribute to the final prediction (or even a given filter’s activation). to test this hypothesis, we introduce partial shapley values. intuitively speaking, we capture the contributions of a nucleotide to the network’s output, but only in the context of a given intermediate neuron of the convolutional layer. more precisely, for any given feature xi, intermediate neuron yj and the output neuron z, we aim to measure how xi contributes to z while regarding only the fraction of the total contribution of xi that influences how yj contributes to z. although similarly named concepts were mentioned before as intermediate computation steps in a different context ( , ), we define and use partial shapley values to visualize contribution flow through convolutional filters. this differs from recently introduced contribution weight matrices ( ), where feature attributions are used as a representation of an identified transcription factor binding site irreducible to a given intermediate neuron. using the formalism of deeplift’s multipliers ( ) and their reinterpretation in shap ( ), we backpropagate the activation differences only along the paths "passing through" yj. in eq. 
1, we define partial multipliers $\mu^{(y_j)}_{x_i z}$ and express them in terms of shapley values $\varphi$ and activation differences w.r.t. the expected activation values (reference activation). calculating partial multipliers is equivalent to zeroing out the multipliers $m_{y_k z}$ for all $k \neq j$ before backpropagating $m_{y_j z}$ further.

$$\mu^{(y_j)}_{x_i z} = m_{x_i y_j}\, m_{y_j z} = \frac{\varphi_i(y_j, x)\,\varphi_j(z, y)}{(x_i - E[x_i])\,(y_j - E[y_j])} \tag{1}$$

we define partial shapley values $\phi^{(y_j)}_i(z, x)$ analogously to how shapley values can be approximated by a product of multipliers and input differences w.r.t. the reference (eq. 2):

$$\phi^{(y_j)}_i(z, x) = \mu^{(y_j)}_{x_i z}\,(x_i - E[x_i]) = \frac{\varphi_i(y_j, x)\,\varphi_j(z, y)}{y_j - E[y_j]} \tag{2}$$

from the chain rule for multipliers ( ), it follows that standard multipliers are a sum over all partial multipliers for a given layer $y$. therefore, shapley values as approximated by deeplift are a sum of partial shapley values for the layer $y$ (eq. 3).

$$\varphi_i(z, x) = m_{x_i z}\,(x_i - E[x_i]) = \sum_j \phi^{(y_j)}_i(z, x) \tag{3}$$

once we calculate the contributions of convolutional filters for the first layer, $\phi^{(y_j)}_i(z, x)$ for the first convolutional layer of a network with one-hot encoded inputs and an all-zero reference can be efficiently calculated using weight matrices and filter activation differences (eq. 4-5). first, in this case we do not traverse any non-linearities and can directly use the linear rule ( ) to calculate the contributions of $x_i$ to $y_j$ as a product of the weight $w_i$ and the input $x_i$. second, the input values may only be 0 or 1.
$$\varphi_i(y_j, x) = w_i x_i = \begin{cases} w_i, & \text{if } x_i = 1 \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

$$\phi^{(y_j)}_i(z, x) = \frac{w_i\,\varphi_j(z, y)}{y_j - E[y_j]} \tag{5}$$

resulting partial contributions can be visualized alongside the ic of each nucleotide of a convolutional kernel. to this end, we design extended sequence logos, where each nucleotide is colored according to its contribution. positive contributions are shown in red, negative contributions are blue, and near-zero contributions are gray. therefore, no information is lost compared to standard sequence logos, but the relevance of individual nucleotides and the filter as a whole can be easily seen. color saturation is limited by the reciprocal of a user-defined gain parameter, here set to nm, where n equals the number of input features xi (sequence length) and m equals the number of convolutional filters yj in a given layer.

genome-wide phenotype analysis

we create genome-wide phenotype analysis (gwpa) plots to analyse which parts of a viral genome are associated with the infectious phenotype. we scramble the genome into overlapping, bp long subsequences (pseudo-reads) without adding any sequencing noise. for the highest resolution, we use a stride of one nucleotide. for s. aureus, we used a stride of bp. we predict the infectious potential of each pseudo-read and average the obtained values at each position of the genome. analogously, we calculate average contributions of each nucleotide to the final prediction of the convolutional network. finally, we normalize raw infectious potentials into the [− . , . ] interval for a more intuitive graphical representation. we visualize the resulting nucleotide-resolution maps with igv ( ). for protein structures, we average the scores codon-wise to obtain contribution scores per amino acid and visualize them with pymol ( ). for well-annotated genomes, we compile a ranking of genes (or other genomic features) sorted by the average infectious potential within a given region.
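the pseudo-read averaging behind a gwpa profile and the subsequent region ranking can be sketched as follows. this is a minimal illustration under stated assumptions: `gwpa_profile` and `rank_regions` are hypothetical helper names, the gc-fraction scorer merely stands in for a trained network, and the toy genome, read length and region coordinates are invented. normalization and nucleotide contributions are left out.

```python
import numpy as np


def gwpa_profile(genome, predict, read_length, stride=1):
    """Average pseudo-read scores at every genome position."""
    n = len(genome)
    totals = np.zeros(n)
    counts = np.zeros(n)
    for start in range(0, n - read_length + 1, stride):
        score = predict(genome[start:start + read_length])
        totals[start:start + read_length] += score
        counts[start:start + read_length] += 1
    # Positions not covered by any pseudo-read are reported as NaN.
    return np.divide(totals, counts, out=np.full(n, np.nan), where=counts > 0)


def rank_regions(profile, regions):
    """Sort annotated regions (name -> (start, end)) by mean score, highest first."""
    means = {name: float(np.nanmean(profile[start:end]))
             for name, (start, end) in regions.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


def gc_fraction(read):
    """Stand-in scorer; a trained classifier would be called here instead."""
    return sum(base in "GC" for base in read) / len(read)


genome = "AAAAGGGGAAAA"
profile = gwpa_profile(genome, gc_fraction, read_length=4)
ranking = rank_regions(profile, {"low": (0, 4), "high": (4, 8)})
```

with a stride of one, every position is covered by up to `read_length` overlapping pseudo-reads, so the profile is smoothed over a window of that size.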
in addition to that, we scan the genome with the learned filters of the first convolutional layer to find genes enriched in subsequences yielding non-zero filter activations. we use gene ontology to connect the identified genes of interest with their molecular functions and the biological processes they are engaged in.

results

negative class definition

choosing which viruses should constitute the negative class is application dependent and influences the performance of the trained models. table s summarizes the prediction accuracy for different combinations of the training and test set composition. the models trained only on human and chordata-infecting viruses maintain similar, or even better performance when evaluated on viruses infecting a much broader host range, including bacteria. this suggests that the learned decision boundary separates human viruses from all the others surprisingly well. we hypothesize that the human host signal must be relatively strong and contained within the chordata host signal. a dropout rate of . resulted in the highest validation accuracy for cnnstr- and lstmstr. a rate of . was selected for the other models. adding more diversity to the negative class may still boost performance on more diverse test sets, as in the case of cnn trained on the "all" dataset (cnnall). this model performs slightly worse on viruses infecting hosts related to humans, but achieves higher accuracy than the "chordata"-trained models and the best recall overall. rebalancing the negative class using the "stratified" dataset helps to achieve higher performance on animal viruses while maintaining high overall accuracy. the lstms are outperformed by the cnns, but they can be used for shorter reads without retraining (see sections training and prediction performance).

prediction performance

we selected lstmall and cnnall for further evaluation. we used a single consumer-grade rtx ti gpu to measure inference speed. the cnn classifies reads/s and the lstm reads/s.
analyzing ten million reads takes only minutes using the faster model; linear speed-ups are possible if more gpus are available. therefore, the trained models achieve the high throughput necessary to analyze ngs datasets. table presents the results of a benchmark using the "all" test set. low performance of the k-nn classifier ( ) is caused by frequent conflicting predictions for each read in a read pair. in a single-read setting it achieves . % accuracy, while our best model achieves . % (table s ). although blast achieves high precision, it yields no predictions for over % of the samples. cnnall is the most sensitive and accurate. as expected, standard mapping approaches (bwa-mem and bowtie ) struggle with analysing novel pathogens – they are the most precise but the least sensitive. our approach outperforms them by - %. although we focus on the extreme case of read-based predictions, our method can also be used on assembled contigs and full genomes if they are available, as well as on read sets from pure, single-virus samples. we note that assembly itself does not yield any labels and a follow-up analysis (via alignment, machine learning or other approaches) is required to correctly classify metagenomic contigs in any case. we ran predictions on contigs without any size filtering with both k-nn and blast (table ). we present performance measures for both individual contigs and whole genome predictions based on contig-wise majority vote. we compare them to blast with read-wise majority vote ( ) and to read-wise average predictions of our networks, analogous to those presented previously for bacteria ( ). our method outperforms blast by . % and k-nn by . %, even though they have access to the full biological context (full sequences of all contigs in a genome), while we simply average outputs for short reads originating from the contigs.
table . classification performance in the fully open-view setting (all virus hosts), read pairs. acc. – accuracy, prec. – precision, rec. – recall, spec. – specificity. bowtie , bwa-mem and blast yield no predictions for over %, % and % of the samples, respectively. best performance in bold.
acc. prec. rec. spec.
cnnall (ours) . . . .
lstmall (ours) . . . .
k-nn . . . .
bowtie . . . .
bwa-mem . . . .
blast . . . .

we benchmarked our models against the human blood virome dataset used by zhang et al. ( ). our models outperform their k-nn classifier. as the positive class massively outnumbers the negative class, all models achieve over % precision. cnnall- performs best (table ). however, the positive class is dominated by viruses which are not necessarily novel. the cnn was more accurate on training data, so we expected it to detect those viruses easily. finally, we repeated the analysis in the "novel species" scenario. classifying novel viral species when restricted to chordata-infecting viruses is too challenging for practical purposes (table s ). read-wise predictions are not much better than random guesses for both blast and cnns. low precision of blast shows that it often recovers wrong labels even when it does find a match – sequence similarity is not a reliable predictor of the infectious potential in this setting. even if a whole genome is available, overall accuracy is low. the picture is very different in the fully open-view scenario (table ).
the cnn trained on the species-wise division of the "all" dataset (cnnsp-all) outperforms blast by a wide margin on both reads and genomes. strikingly, cnnsp-all predictions based on a single read pair achieve higher accuracy than blast predictions using whole genomes, mainly due to their significantly higher recall. what is more, pooling predictions from all the reads originating from a given genome does not improve overall cnnsp-all accuracy any further. as cnnsp-all does not reliably outperform its chordata-trained analog on the "chordata" dataset (cnnsp-cho, table s ), we suspect that its relatively high accuracy on the "all" dataset is caused by its high sensitivity while maintaining good specificity on non-chordata viruses.

table . classification performance, all hosts. whole available genomes. negative class is the majority class. bacc. – balanced accuracy, rec. – recall, spec. – specificity. blast (reads) and our networks use read-wise majority vote or output averaging to aggregate predictions over all reads from a genome. k-nn (genome) and blast (genome) use contig-wise majority vote. k-nn (contigs) and blast (contigs) represent performance on individual contigs treated as separate entities. k-nn (reads) was not used, as high conflicting prediction rates made read-wise aggregation impracticable.
bacc. aupr rec. spec.
cnnall (ours) . . . .
lstmall (ours) . . . .
blast (reads) . n/a . .
k-nn (genome) . n/a . .
blast (genome) . n/a . .
k-nn (contigs) . n/a . .
blast (contigs) . n/a . .

table . classification performance on the human blood virome dataset. positive class is the majority class. bacc. – balanced accuracy, rec. – recall, spec. – specificity.
bacc. aupr rec. spec.
cnnall- (ours) . > . . .
lstmall (ours) . > . . .
k-nn . . . .

table . classification performance, novel species. top: paired reads (see table ). blast yields predictions for only . % of the pairs. bottom: whole available genomes or contigs – negative class is the majority class (see table ). bacc. – balanced accuracy (equal to accuracy for the balanced paired-read dataset), rec. – recall, spec. – specificity. blast (reads) and our networks use read-wise majority vote or output averaging to aggregate predictions over all reads from a genome. blast (genome) uses contig-wise majority vote. blast (contigs) represents performance on individual contigs treated as separate entities. note that low precision is heavily affected by class imbalance.
bacc. prec. rec. spec.
top (paired reads):
cnnsp-all (ours) . . . .
blast . . . .
bottom (whole genomes or contigs):
cnnsp-all (ours) . . . .
blast (reads) . . . .
blast (genome) . . . .
blast (contigs) . . . .

filter visualization

over % of all contributing first-layer filters in cnnall have positive average contribution scores. we comment more on this fact in section nucleotide contribution logos. for cnnall, the average information content of our motifs is strongly correlated nucleotide-wise with ic of deepbind-like logos (spearman’s ρ> . , p< − for all contributing filter pairs except one). the difference in average ic is negligible ( . bit higher for "max-contrib", wilcoxon test, p< − ). therefore, our contribution logos represent analogous "motifs", while extracting additional, nucleotide-level interpretations. for exactly one filter, "max-contrib" and "max-activation" scores are not correlated. a deeper analysis reveals that this particular filter is activated by stretches of zeros (ns) – it is the only filter with a positive bias, and almost all of its weights are negative (with one near-zero positive). therefore, an overwhelming majority of its maximum activations are in fact padding artifacts. on the other hand, regions of unambiguous nucleotide sequences result in high positive contributions, since they correspond to a lack of filter activation, where an activation is present for the all-n reference. in fact, for over .
% of the reads, positive contributions occur at every single position. we suspect that the filter works as an "ambiguity detector". since ns are modelled as all-zero vectors in the one-hot encoding scheme used here, the network represents "meaningful" (i.e. unambiguous) regions of the input as a missing activation of the filter. this is supported by the fact that the filter lacks any further preference for the specific non-zero nucleotide type. since sequence logos presented here ignore ambiguous (i.e. noninformative) nucleotides, their ics for this filter are near-zero, preventing meaningful visualization. on the other hand, this ambiguity seems to play a role in the final classification decision, as contribution distributions are well-separated for both classes (fig. s ). we speculate that this could be caused by lower quality of the non-pathogen reference genomes, but understanding how exactly this information is used would require further investigation, including feature interactions at all layers of the network. importantly, only the contribution analysis reveals the relevance of the filter beyond simple activation and nucleotide overrepresentation. the choice of the reference input is crucial. in fig. we present example filters, visualized as "max-contrib" sequence logos based on mean partial shapley values for each nucleotide at each position. all nucleotides of the filters with the second-highest (fig. a) and the lowest (fig.
b) score have relatively strong contributions in accordance with the filters’ own contributions. however, we observe that some nucleotides consistently appear in the activating subsequences, but the sign of their contributions is opposite to the filter’s (low-ic nucleotides of a different color, fig. c). those "counter-contributions" may arise if a nucleotide with a negative weight forms a frequent motif with others with positive weights strong enough to activate the filter. we comment on this fact in the section nucleotide contribution logos. some filters seem to learn gapped motifs resembling a codon structure (fig. c). we extracted this filter from the original deepac network predicting bacterial pathogenicity ( ) where the counter-contributions are common, but we find similar filters in our networks as well (fig. s ). we scanned a genome of s. aureus subsp. aureus (refseq assembly accession: gcf_ . ) with this filter and discovered that the learned motif is indeed significantly enriched in coding sequences (fisher exact test with benjamini-hochberg correction, q< − ). it is also enriched in a number of specific genes. the one with the most hits (srap, q< − ) is a serine-rich adhesin involved in the pathogenesis of infective endocarditis and mediating binding to human platelets ( ). the filter seems to detect serine and glycine repeats in this particular gene (fig. s ), but a broader, cross-species, multi-gene analysis would be required to fully understand its activation patterns. an analogous analysis revealed that the second-highest contributing filter (fig. a) is overall enriched in coding sequences in both taï forest ebolavirus (q< − , refseq accession: nc_ ) and sars-cov- coronavirus (q= . × − , refseq accession: nc_ . ). the top hits are the nucleocapsid (n) protein gene of sars-cov- and the vp ebolavirus gene encoding a polymerase cofactor suppressing innate immune signaling (q< − ). 
genome-wide phenotype analysis we created a gwpa plot for the taï forest ebolavirus genome. most genes ( out of ) can be detected with visual inspection by finding peaks of elevated infectious potential score predicted by at least one of the models (fig. a). intergenic regions are characterized by lower mean scores. noticeably, most nucleotide contributions are positive, and low non-negative contributions coincide with regions of negative predictions. taken together with the surprisingly good generalization of chordata-trained classifiers and a dominance of positive filters discussed above, this suggests that our networks work as positive class detectors, treating all other sequences as “negative” by default. indeed, the reference sequence of all ns is predicted to be "non-pathogenic" with a score of . we ran a similar analysis of s. aureus using the built-in deepac models ( ) and our interpretation workflow. while a viral genome contains usually only a handful of genes, by compiling a ranking of annotated genes of the analyzed s. aureus strain we could test if the high-ranking regions are indeed associated with pathogenicity (table s ). indeed, out of three top-ranking genes with known biological names and gene ontology terms, sarr and sspb are directly engaged in virulence, while hupb regulates expression of virulence- involved genes in many pathogens ( ). in contrast to the viral models, both negative and positive contributions are present (fig. s ), and the model’s output for the all-n reference is slightly above the decision threshold ( . ). even though the network architecture of the viral and the bacterial model are the same, the latter learns a "two-sided" view of the data. we assume this must be a feature of the dataset itself. fig. b presents a gwpa plot for the whole genome of the sars-cov- coronavirus, successfully predicted to infect humans, even though the data was collected at least months before its emergence. 
interestingly, its mean infectious potential ( . as scored by cnnall) is relatively close to the decision threshold, while its closest known relative, a bat-infecting sarsr-cov ratg , is actually falsely classified as a human virus with a slightly lower mean infectious potential ( . ). what is more, the gene encoding the spike protein, which plays a significant role in host entry ( ), has a mean score slightly above the threshold for sars-cov- ( . ) and below the threshold for ratg ( . ). as shown in the gwpa plots of both viruses (fig. b and fig. s ), regions that the network has learned to associate with the infectious phenotype are distributed non-uniformly and tend to cluster together. this suggests that the low-confidence mean prediction for those viruses is not a result of random guessing, but of genuine ambiguity present in the data – and the misclassification of ratg could be indicative of a general zoonotic potential of sars-related coronaviruses. in fig. b, we highlighted the score peaks aligning with the spike protein gene (s), as well as the e and n genes, which were scored the highest (apart from an unconfirmed orf of just aa downstream of n) by the cnn and the lstm, respectively. correlation between the cnn and lstm outputs is significant, but species-dependent and moderate ( . for ebola, . for sars-cov- ), which suggests they capture complementary signals. fig. c shows the nucleotide-level contributions in a small peak within the receptor-binding domain (rbd) of the s protein, crucial for recognizing the host cell. the domain location was predicted with cd-search ( ) using the default parameters. the maximum score of this peak is noticeably higher for sars-cov- ( . ) than for its analog in ratg
it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nd/ . / i i “output” — / / — : — page — # i i i i i i preprint, yyyy, vol. xx, no. xx (a) (b) (c) (d) (e) (f) figure . nucleotide contribution logos of example filters. a: second-highest mean contribution score (cnnall). error bars correspond to bayesian % confidence intervals. b: lowest mean contribution score (cnnall). c: gaps resembling a codon structure, extracted from bartoszewicz et al. ( ). consensus sequence: cawcnncnncnncnn. d- f: analogous logos created with the deepbind-like "max-activation" approach. our "max-contrib" logos visualize contributions of individual nucleotides, including counter-contributions. ( . ). fig. presents the rbd in the structural context of the whole s protein (pdb id: vsb, ( )), as well as in complex with a sars-neutralizing antibody cr (pdb id: w , ( )). the high score peak roughly corresponds to one of the regions associated with reduced expression of the rbd ( ), located in the core-rbd subdomain. it covers over % of the cr epitope, as well as the neighbouring site of the n glycan. the latter is present in the epitope of another core-rbd targeting antibody, s ( ). all the per-residue average contributions in the region are positive (fig. s ), even in the regions of lower pathogenicity score, in accordance with the results presented in fig. c. discussion accurate predictions from short dna reads compared to the previous state-of-the-art in viral host prediction directly from next-generation sequencing reads ( ), our models drastically reduce the error rates. this holds also for novel viruses not present in the training set. generalization of virus-level chordata models to other host groups is a sign of a strong, “human” signal. 
we suspect our classifiers detect the positive class, treating all other regions of the sequence space as “negative” by default, exhibiting traits of a one-class classifier even without being explicitly trained to do so. we find further support for this hypothesis: the networks learn many more “positive” than “negative” filters and regions of near-zero nucleotide contributions (including the null reference sample) result in negative predictions. as this effect does not occur for bacteria, we expect it to be task- and data-dependent. while we ignore the simulated quality information here, investigating the role of sequencing noise will be an interesting follow-up study. although the data setup is crucial in general, the modelling step is also important, as shown by our comparison to the baseline k-nn model. the rc-nets are relatively simple, but they are invariant to reverse-complementarity and perform better than random forests, naïve bayes classifiers and standard nn architectures in another ngs task ( ). in the paired read scenario, the previously described k-nn approach fails, and standard, alignment-based homology testing algorithms cannot find any matches in more than % of the cases, resulting in relatively low accuracy. on a real human virome sample, where a main source of negative class reads is most likely contamination ( ), our method filters out non-human viruses with high specificity. in this scenario, the blast-derived ground-truth labels were mined using the complete database (as opposed to just a training set). in all cases, our results are only as good as the training data used; high quality labels and sequences are needed to develop trustworthy models. ideally, sources of error should be investigated with an in-depth analysis of a model’s performance on multiple genomes covering a wide selection of taxonomic units.
this is especially important as the method assumes no mechanistic link between an input sequence and the phenotype of interest, and the input sequence constitutes only a small fraction of the target genome without a wider biological context. still, it is possible to predict a label even from those small, local fragments. a similar effect was also observed for image classification with cnns ( ). virulence arises as a complex interplay between the host and the virus, so the predictions reflect only an estimated potential of the infectious phenotype. this mirrors the caveats of bacterial pathogenic potential prediction ( ), including the considerations of balancing computational cost, reliability of error estimates, size and composition of the reference database. even though deep learning outperforms the standard homology-based methods, it is still an open question whether it captures "functional" signals, or just a more flexible sequence similarity function. by the very nature of machine learning and sequence comparison in general, we expect similar viruses to yield similar predictions; in principle this could be used to assess the risk of a host-switching event. the interpretability suite presented here aims at shedding some light on this question, but more research is needed.

figure . taï forest ebolavirus and sars-cov- coronavirus genomes. top: score predicted by lstmall. middle: score predicted by cnnall. heatmap: nucleotide contributions of cnnall. bottom, in blue: reference sequence. a: taï forest ebolavirus. genes that can be detected by at least one model are highlighted in black. b: whole genome and sequences encoding the spike protein (s), envelope protein (e) and nucleocapsid protein (n). c: spike protein gene, a small peak (positions , - , , dashed line in fig. b) within the receptor-binding domain (predicted by cd-search, positions , - , ). binding to the receptor is crucial for entry to the host cell. local host adaptation could help switch hosts between the animal reservoir and humans.

figure . predicted infectious potentials plotted over the sars-cov- spike glycoprotein receptor-binding domain. a-c: top and side view of the spike protein. three receptor-binding domains (rbds) are colored in blue, white and red according to the predicted infectious potential of the corresponding genomic sequence. one of the domains is in the "up" conformation. red regions corresponding to the peak in fig. c are located in the core-rbd subdomain. d: rbd in complex with a sars-neutralizing antibody cr (green). the red region covers over % of the cr epitope, but spans also to the neighbouring fragments, including the site of the n glycan (carbohydrate in red stick representation). this is a part of the epitope of another neutralizing antibody, s . e: cartoon representation of fig. d. the red region is centered on two exposed α-helices surrounding the core β-sheet (lower score, white).

dual-use research and biosecurity while we focused on the ngs-based prediction scenario, our models could in principle be used to screen dna synthesis orders for potentially dangerous sequences in the context of cyberbiosecurity in synthetic biology. since standard, homology-based approaches like blast are not enough to guarantee accurate screening at a reasonable cost ( , , ), machine learning methods are a promising solution. this has been suggested before for the bacterial deepac models ( ), and is applicable to the viral networks presented here as well. however, this line of research can raise questions about possible dual-use. o’brien and nelson ( ) suggested that while the intended purpose of pathogenicity potential prediction is to mitigate biosecurity threats, it could actually enable designing new pathogens to cause maximal harm. the importance of this concern is difficult to overstate and it must be addressed. if an ml-guided, genome-wide phenotype optimization tool existed, it would indeed be a classical dual-use technology not unlike more established computer-aided design approaches for synthetic biology: potentially dangerous, but offering tremendous benefits (e.g. in agriculture, medicine or manufacturing) as well. however, the models presented here do not allow biologically sensible optimization of target sequences. for example, we find meaningless, low-complexity sequences of mononucleotide repeats corresponding to global maxima (infectious potential of . ). these artifacts highlight the fact that only some generally undefined regions of the theoretically possible sequence space are biologically relevant. what is more, we operate on short sequences constituting minuscule fractions of the whole genome with all its complexity.
although successful deep learning approaches for both protein ( , , ) and regulatory sequence design ( , , , ) do exist, moving from read-based classification to genome-wide phenotype optimization would require considerable research effort, if possible at all. this would entail capturing a wealth of biological contexts well beyond the capabilities of even the best classification models currently available. nucleotide contribution logos visualizing convolutional filters may help to identify more complex filter structures and disentangle the contributions of individual nucleotides from their "conservation" in contributing sequences. counter-contributions suggest that the information content and the contribution of a nucleotide are not necessarily correlated. visualizing learned motifs by aligning the activating sequences ( ) would not fully describe how the filter reacts to presented data. it seems that the assumption of nucleotide independence – which is crucial for treating deeplift as a method of estimating shapley values for input nucleotides ( ) – does not hold in full. indeed, k-mer distribution profiles are frequently used features for modelling dna sequences (as shown also by the dimer-shuffling method of generating reference sequences proposed by shrikumar et al. ( )). however, deeplift’s multiple successful applications in genomics indicate that the assumption probably holds approximately. we see information content and deeplift’s contribution values as two complementary channels that can be jointly visualized for better interpretability and explainability of cnns in genomics. filter enrichment analysis enables even deeper insight in the inner workings of the networks. we generate activation data for hundreds to thousands of species, genes and filters. yet, aggregation and interpretation of those results beyond case studies is non-trivial, and a promising avenue for further research. 
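to make the "information content" channel discussed above concrete, the following python sketch computes the per-position information content of a set of equal-length dna sequences in the classical sequence-logo sense, i.e. log2 of the alphabet size minus the shannon entropy of the observed nucleotide frequencies. it assumes a uniform background and applies no small-sample correction; the function name and the toy sequences are hypothetical. the contribution channel (e.g. deeplift scores) is not computed here and would have to come from the model itself.

```python
import math
from collections import Counter


def information_content(aligned_seqs, alphabet="ACGT"):
    """Per-position information content (bits) of equal-length DNA sequences.

    Uses IC = log2(len(alphabet)) - H(position), with a uniform background
    and no small-sample correction (a simplifying assumption).
    """
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must have equal length")
    max_bits = math.log2(len(alphabet))
    ic = []
    for pos in range(length):
        # Count only unambiguous symbols; e.g. N is ignored.
        counts = Counter(seq[pos] for seq in aligned_seqs if seq[pos] in alphabet)
        total = sum(counts.values())
        if total == 0:
            ic.append(0.0)  # no usable observations at this position
            continue
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        ic.append(max_bits - entropy)
    return ic


# Toy alignment (hypothetical): position 0 is fully conserved, position 3 is uniform.
toy_alignment = ["ACGT", "ACGA", "ACTC", "AGAG"]
ic = information_content(toy_alignment)
print([round(value, 3) for value in ic])
```

a fully conserved position reaches the 2-bit maximum regardless of whether the filter's contribution at that position is positive, negative or near zero, which is why the two channels are worth plotting side by side.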
genome-scale interpretability mapping predictions back to a target genome can be used both as a way of investigating a given model’s performance and as a method of genome analysis. gwpa plots of well-annotated genomes highlight the sequences with erroneous and correct phenotype predictions at both genome and gene level, and nucleotide-resolution contribution maps help track those regions down to individual amino acids. on the other hand, once a trusted model is developed, it can be used on newly emerging pathogens, such as the sars-cov- virus briefly analyzed in this work. therefore, we see gwpa applications in both probing the behaviour of artificial neural networks in pathogen genomics and finding regions of interest in weakly annotated genomes. what is more, the approach could be easily co-opted to genome-wide activation analyses of any arbitrary, intermediate neuron. the methods presented here may also be applied to other biological problems, and extending them to other hosts and pathogen groups, multi-class classification or gene identification is possible. however, experimental work and traditional sequence analysis are required to truly understand the biology behind host adaptation and distinguish true hits from false positives. conclusion we presented a new approach for predicting the host of a novel virus based on a single dna read or a read pair, cutting the error rates in half compared to the previous state-of-the-art.
for convolutional filters, we jointly visualize nucleotide contributions and information content. finally, we use gwpa plots to gain insights into the models’ behaviour and analyze the recently emerged sars-cov- virus. the approach presented here is implemented as a python package (see data availability) and a command line tool easily installable with bioconda ( ). data availability the datasets of simulated reads with associated metadata are hosted at https://doi.org/ . /zenodo. . the tool can be installed with bioconda (conda install deepacvir, requires setting up bioconda), docker (docker pull dacshpi/deepac) or pip (pip install deepacvir). detailed installation instructions, user guide and the main codebase (including the interpretability workflows presented here) are available at https://gitlab.com/dacs-hpi/deepac. source code of the plugin shipping the trained models, config files describing the architectures used and the models themselves are available at https://gitlab.com/dacs-hpi/deepac-vir. acknowledgements we gratefully acknowledge yong-zhen zhang and the scientists at the shanghai public health clinical center & school of public health, fudan university, who shared the sequence of the sars-cov- virus ahead of publication. we thank melania nowicka (max planck institute for molecular genetics) for inspiring discussions on efficient calculations of partial shapley values, vitor c. piro (hasso plattner institute) for discussions on traversing taxonomy graphs, lothar h. wieler (robert koch institute) for useful comments on the first draft of the manuscript and the anonymous reviewers for their suggestions and feedback. funding this work was supported by the german academic scholarship foundation (jmb), the bmbf computational life sciences initiative (project deepath, to byr) and the bmbf-funded de.nbi cloud within the german network for bioinformatics infrastructure (de.nbi) ( a b, a a, a a, a b, a a, a c, a a, a b). references .
calvignac-spencer, s., schulze, j. m., zickmann, f., and renard, b. y. ( ) clock rooting further demonstrates that guinea ebov is a member of the zaïre lineage. plos currents, . . vouga, m. and greub, g. (january, ) emerging bacterial pathogens: the past and beyond. clinical microbiology and infection, ( ), – . . trappe, k., marschall, t., and renard, b. y. (september, ) detecting horizontal gene transfer by mapping sequencing reads across species boundaries. bioinformatics, ( ), i –i . . leendertz, s. a. j., gogarten, j. f., düx, a., calvignac-spencer, s., and leendertz, f. h. (mar, ) assessing the evidence supporting fruit bats as the primary reservoirs for ebola viruses. ecohealth, ( ), – . . lecuit, m. and eloit, m. ( ) the diagnosis of infectious diseases by whole genome next generation sequencing: a new era is opening. frontiers in cellular and infection microbiology, , . . calistri, a. and palù, g. ( ) editorial commentary: unbiased next-generation sequencing and new pathogen discovery: undeniable advantages and still-existing drawbacks. clinical infectious diseases: an official publication of the infectious diseases society of america, ( ), – . . andrusch, a., dabrowski, p. w., klenner, j., tausch, s. h., kohl, c., osman, a. a., renard, b. y., and nitsche, a. ( ) paipline: pathogen identification in metagenomic and clinical next generation sequencing samples. bioinformatics, ( ), i –i . . herfst, s., schrauwen, e. j. a., linster, m., chutinimitkul, s., wit, e. d., munster, v. j., sorrell, e. m., bestebroer, t. m., burke, d. f., smith, d. j., rimmelzwaan, g. f., osterhaus, a. d. m. e., and fouchier, r. a. m. (june, ) airborne transmission of influenza a/h n virus between ferrets. science, ( ), – . . imai, m., watanabe, t., hatta, m., das, s. c., ozawa, m., shinya, k., zhong, g., hanson, a., katsura, h., watanabe, s., li, c., kawakami, e., yamada, s., kiso, m., suzuki, y., maher, e. a., neumann, g., and kawaoka, y. 
(june, ) experimental adaptation of an influenza h ha confers respiratory droplet transmission to a reassortant h ha/h n virus in ferrets. nature, ( ), – . . lipsitch, m. and inglesby, t. v. (december, ) moratorium on research intended to create novel potential pandemic pathogens. mbio, ( ). . noyce, r. s., lederman, s., and evans, d. h. (january, ) construction of an infectious horsepox virus vaccine from chemically synthesized dna fragments. plos one, ( ), e . . thiel, v. ( ) synthetic viruses-anything new?. plos pathogens, ( ), e . . edwards, r. a., mcnair, k., faust, k., raes, j., and dutilh, b. e. ( ) computational approaches to predict bacteriophage-host relationships. fems microbiology reviews, ( ), – . . eng, c. l., tong, j. c., and tan, t. w. ( ) predicting host tropism of influenza a virus proteins using random forest. bmc medical genomics, ( ), s . . xu, b., tan, z., li, k., jiang, t., and peng, y. (july, ) predicting the host of influenza viruses based on the word vector. peerj, , e . . li, h. and sun, f. ( ) comparative studies of alignment, alignment-free and svm based approaches for predicting the hosts of viruses based on viral sequences. scientific reports, ( ), . . mock, f., viehweger, a., barth, e., and marz, m. ( , ) vidhop, viral host prediction with deep learning. bioinformatics, btaa . . gałan, w., bąk, m., and jakubowska, m. ( ) host taxon predictor - a tool for predicting taxon of the host of a newly discovered virus. scientific reports, ( ), . . babayan, s. a., orton, r. j., and streicker, d. g. (november, ) predicting reservoir hosts and arthropod vectors from evolutionary signatures in rna virus genomes. science, ( ), – . . zhang, z., cai, z., tan, z., lu, c., jiang, t., zhang, g., and peng, y. ( ) rapid identification of human-infecting viruses. transboundary and emerging diseases, ( ), – . . poplin, r., chang, p.-c., alexander, d., schwartz, s., colthurst, t., ku, a., newburger, d., dijamco, j., nguyen, n., afshar, p. t., gross, s. s., dorfman, l., mclean, c. y., and depristo, m. a. ( ) a universal snp and small-indel variant caller using deep neural networks. nature biotechnology, ( ), – . . rizzo, r., fiannaca, a., la rosa, m., and urso, a. (june, ) classification experiments of dna sequences by using a deep neural network and chaos game representation. in proceedings of the th international conference on computer systems and technologies new york, ny, usa: association for computing machinery compsystech ’ pp. – . . löchel, h. f., eger, d., sperlea, t., and heider, d. (january, ) deep learning on chaos game representation for proteins. bioinformatics, ( ), – . . bartoszewicz, j. m., seidel, a., rentzsch, r., and renard, b. y. ( , ) deepac: predicting pathogenic potential of novel dna with reverse-complement neural networks. bioinformatics, ( ), – . . alipanahi, b., delong, a., weirauch, m. t., and frey, b. j. ( ) predicting the sequence specificities of dna- and rna-binding proteins by deep learning. nature biotechnology, ( ), – . . zhou, j. and troyanskaya, o. g. ( ) predicting effects of noncoding variants with deep learning–based sequence model. nature methods, ( ), – . . zeng, h., edwards, m. d., liu, g., and gifford, d. k. ( ) convolutional neural network architectures for predicting dna–protein binding. bioinformatics, ( ), i –i . .
quang, d. and xie, x. ( ) danq: a hybrid convolutional and recurrent deep neural network for quantifying the function of dna sequences. nucleic acids research, ( ), e –e . . kelley, d. r., snoek, j., and rinn, j. l. ( ) basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. genome research, ( ), – . . greenside, p., shimko, t., fordyce, p., and kundaje, a. ( ) discovering epistatic feature interactions from neural network models of regulatory dna sequences. bioinformatics, ( ), i –i . . nair, s., kim, d. s., perricone, j., and kundaje, a. (july, ) integrating regulatory dna sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts. bioinformatics, ( ), i –i . . avsec, Ž., weilert, m., shrikumar, a., alexandari, a., krueger, s., dalal, k., fropf, r., mcanany, c., gagneur, j., kundaje, a., and zeitlinger, j. (august, ) deep learning at base-resolution reveals motif syntax of the cis-regulatory code. biorxiv, p. . . mock, f., viehweger, a., barth, e., and marz, m. ( ) viral host prediction with deep learning. biorxiv, p. . . ren, j., song, k., deng, c., ahlgren, n. a., fuhrman, j. a., li, y., xie, x., and sun, f. (june, ) identifying viruses from metagenomic data by deep learning. arxiv: . [q-bio], arxiv: . . . tampuu, a., bzhalava, z., dillner, j., and vicente, r. (september, ) viraminer: deep learning on raw dna sequences for identifying viral genomes in human samples. plos one, ( ), e . . eraslan, g., avsec, Ž., gagneur, j., and theis, f. j. (july, ) deep learning: new computational modelling techniques for genomics. nature reviews genetics, ( ), – . . schneider, t. d. and stephens, r. m. (october, ) sequence logos: a new way to display consensus sequences. nucleic acids research, ( ), – . . crooks, g. e., hon, g., chandonia, j.-m., and brenner, s. e. (june, ) weblogo: a sequence logo generator. genome research, ( ), – . . lanchantin, j., singh, r., lin, z., and qi, y. 
( ) deep motif: visualizing genomic sequence classifications. corr, abs/ . . . lanchantin, j., singh, r., wang, b., and qi, y. ( ) deep motif dashboard: visualizing and understanding genomic sequences using deep neural networks. pacific symposium on biocomputing. pacific symposium on biocomputing, , – . . sundararajan, m., taly, a., and yan, q. ( ) gradients of counterfactuals. corr, abs/ . . . jha, a., aicher, j. k., singh, d., and barash, y. ( ) improving interpretability of deep learning models: splicing codes as a case study. biorxiv,. . shrikumar, a., greenside, p., and kundaje, a. (august, ) learning important features through propagating activation differences. in precup, d. and teh, y. w., (eds.), proceedings of the th international conference on machine learning, international convention centre, sydney, australia: pmlr vol. of proceedings of machine learning research, pp. – . . bach, s., binder, a., montavon, g., klauschen, f., müller, k.-r., and samek, w. (july, ) on pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. plos one, ( ), e . . lundberg, s. m. and lee, s.-i. ( ) a unified approach to interpreting model predictions. in guyon, i., luxburg, u. v., bengio, s., wallach, h., fergus, r., vishwanathan, s., and garnett, r., (eds.), advances in neural information processing systems , pp. – curran associates, inc. . shrikumar, a., tian, k., shcherbina, a., avsec, Ž., banerjee, a., sharmin, m., nair, s., and kundaje, a. (march, ) tf-modisco v . . . -alpha: technical note. arxiv: . [cs, q-bio, stat], arxiv: . . . altschul, s. f., gish, w., miller, w., myers, e. w., and lipman, d. j. ( ) basic local alignment search tool. journal of molecular biology, ( ), – . . camacho, c., coulouris, g., avagyan, v., ma, n., papadopoulos, j., bealer, k., and madden, t. l. (december, ) blast+: architecture and applications. bmc bioinformatics, ( ), . . 
wu, f., zhao, s., yu, b., chen, y.-m., wang, w., hu, y., song, z.- g., tao, z.-w., tian, j.-h., pei, y.-y., yuan, m.-l., zhang, y.-l., dai, f.-h., liu, y., wang, q.-m., zheng, j.-j., xu, l., holmes, e. c., and zhang, y.-z. (january, ) complete genome characterisation of a novel coronavirus associated with severe human respiratory disease in wuhan, china. biorxiv, p. . . . . . mihara, t., nishimura, y., shimizu, y., nishiyama, h., yoshikawa, g., uehara, h., hingamp, p., goto, s., and ogata, h. ( ) linking virus genomes with host taxonomy. viruses, ( ), . . king, a. m. q., adams, m. j., carstens, e. b., and lefkowitz, e. j., (eds.) ( ) virus taxonomy: ninth report of the international committee on taxonomy of viruses, academic press, london; waltham. . lefkowitz, e. j., dempsey, d. m., hendrickson, r. c., orton, r. j., siddell, s. g., and smith, d. b. (january, ) virus taxonomy: the database of the international committee on taxonomy of viruses (ictv). nucleic acids research, (d ), d –d . . holtgrewe, m. ( ) mason – a read simulator for second generation sequencing data. technical report fu berlin,. . deneke, c., rentzsch, r., and renard, b. y. ( ) paprbag: a machine learning approach for the detection of novel pathogens from ngs data. scientific reports, , . . moustafa, a., xie, c., kirkness, e., biggs, w., wong, e., turpaz, y., bloom, k., delwart, e., nelson, k. e., venter, j. c., and telenti, a. (march, ) the blood dna virome in , humans. plos pathogens, ( ), e . . gorbalenya, a. e., baker, s. c., baric, r. s., de groot, r. j., drosten, c., gulyaeva, a. a., haagmans, b. l., lauber, c., leontovich, a. m., neuman, b. w., penzar, d., perlman, s., poon, l. l. m., samborskiy, d. v., sidorov, i. a., sola, i., ziebuhr, j., and coronaviridae study group of the international committee on taxonomy of viruses (april, ) the species severe acute respiratory syndrome-related coronavirus : classifying -ncov and naming it sars-cov- . nature microbiology, ( ), – . . simmonds, p. 
and aiewsakun, p. (august, ) virus classification – where do you draw the line?. archives of virology, ( ), – . . van regenmortel, m. h. v. (january, ) chapter one - the species problem in virology. in kielian, m., mettenleiter, t. c., and roossinck, m. j., (eds.), advances in virus research, vol. , pp. – academic press. . li, h. and durbin, r. ( ) fast and accurate short read alignment with burrows–wheeler transform. bioinformatics, ( ), – . . langmead, b. and salzberg, s. l. ( - ) fast gapped-read alignment with bowtie . nature methods, ( ), – . . wood, d. e. and salzberg, s. l. ( ) kraken: ultrafast metagenomic sequence classification using exact alignments. genome biology, ( ), r . . nix, r. and kantarciouglu, m. (july, ) incentive compatible privacy-preserving distributed classification. ieee transactions on dependable and secure computing, ( ), – . . matejczyk, s. and michalak, t. ( ) solving influence maximization problem using methods from cooperative game theory., instytut podstaw informatyki pan, publication title: k . . thorvaldsdóttir, h., robinson, j. t., and mesirov, j. p. (march, ) integrative genomics viewer (igv): high-performance genomics data visualization and exploration. briefings in bioinformatics, ( ), – . . delano, w. l. and others ( ) pymol: an open-source molecular graphics tool. ccp newsletter on protein crystallography, ( ), – . .
profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs

tsung-yu lu, the human genome structural variation consortium, mark chaisson*

* corresponding author, mchaisso@usc.edu

department of quantitative and computational biology, university of southern california, california, usa

abstract

variable number tandem repeat sequences (vntr) are composed of consecutive repeats of short segments of dna with hypervariable repeat count and composition. they include protein coding sequences and are associated with clinical disorders.
it has been difficult to incorporate vntr analysis in disease studies that use short-read sequencing because the traditional approach of mapping to the human reference is less effective for repetitive and divergent sequences. we solve vntr mapping for short reads with a repeat-pangenome graph (rpgg), a data structure that encodes both the population diversity and repeat structure of vntr loci from multiple haplotype-resolved assemblies. we developed software to build an rpgg, and use the rpgg to estimate vntr composition with short reads. we used this to discover vntrs with length stratified by continental population, and novel expression quantitative trait loci, indicating that rpgg analysis of vntrs will be critical for future studies of diversity and disease.

(biorxiv preprint, this version posted january; not certified by peer review. the copyright holder is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd international license.)

introduction

the human genome is composed of roughly % simple sequence repeats (ssrs) (international human genome sequencing consortium ), loci composed of short, tandemly repeated motifs. these sequences are classified by motif length into short tandem repeats (strs) with a motif length of six nucleotides or fewer, and variable-number tandem repeats (vntrs) for repeats of longer motifs. ssrs are prone to hyper-mutability through motif copy number changes due to polymerase slippage during dna replication (viguera, canceill, and ehrlich ). variation in ssrs is associated with tandem repeat disorders
(trds) including amyotrophic lateral sclerosis and huntington’s disease (gatchel and zoghbi ), and vntrs are associated with a wide spectrum of complex traits and diseases including attention-deficit disorder, type diabetes and schizophrenia (hannan ). while str variation has been profiled in human populations (mallick et al. ) and used to find expression quantitative trait loci (eqtl) (fotsing et al. ; gymrek et al. ), and variation at vntr sequences may be detected for targeted loci (bakhtiari et al. ; dolzhenko et al. ), the landscape of vntr variation in populations and its effects on human phenotypes have not yet been examined genome-wide. large scale sequencing studies including the genomes project ( genomes project consortium et al. ), topmed (taliun et al. ) and dna sequencing by the genotype-tissue expression (gtex) project (gtex consortium ) rely on high-throughput short-read sequencing (srs) characterized by srs reads up to bases. alignment and standard approaches for detecting single-nucleotide variant (snv) and indel variation (insertions and deletions less than bases) using srs are unreliable in ssr loci (li et al., n.d.), and the majority of vntr structural variants (svs) are missed using sv detection algorithms with srs (chaisson et al. ). the full extent to which vntr loci differ has been made clearer by single-molecule long-read sequencing (lrs) and assembly. lrs assemblies have megabase scale contiguity and accurate consensus sequences (koren et al. ; chin et al. ) that may be used to detect vntr variation. nearly % of insertions and deletions discovered by lrs assemblies greater than bases are in str and vntr loci (chaisson et al. ), accounting for up to mbp per genome. furthermore, lrs assemblies reveal how vntr sequences differ by kilobases in length and by motif composition (song, lowe, and kingsley ). here we propose using a limited number of human lrs genomes sequenced for population references and diversity panels (chaisson et al. ; audano et al. ; seo et al.
; shi et al. ) to improve how vntr variation is detected using srs. it has been previously demonstrated that vntr variation discovered by lrs assemblies may be genotyped using srs (hickey et al. ; audano et al. ). however, the genotyping accuracy for vntr svs is considerably lower than the accuracy for genotyping other svs, owing to the complexity of representing vntr variation and mapping reads to sv loci. most existing tools support a limited description of the complexity of tandem repeats using a single motif, such as in gangstr (mousavi et al. ) and advntr (bakhtiari et al. ). while expansionhunter (dolzhenko et al. ) allows the repeat structure to be defined by a regular expression, it is mostly restricted to str genotyping and has not been extended to vntrs.
additionally, gangstr and advntr are designed to estimate the copy number of a repeat unit, which leaves the variation in motif sequences unexplored. furthermore, traditional genotyping (chen et al. ) tests for the presence of a known variant, and does not reveal the spectrum of copy number variation that exists in tandem repeat sequences. repeat length estimation in tools specialized for tandem repeat genotyping allows more biologically meaningful analyses (gymrek et al. ; saini et al. ; gymrek et al. ). an alternative approach to tackle the vntr genotyping problem is to use lrs assemblies as population-specific references that improve srs read mapping by adding sequences missing from the reference (du et al. ; shi et al. ). because missing sequences are enriched for vntrs (audano et al. ), haplotype-resolved lrs genomes may help improve alignment to vntr regions, as well as facilitate the development of a model to discover vntr variation by serving as a ground truth. the hypervariability of vntrs prevents a single assembly from serving as an optimal reference. instead, to improve both alignment and genotyping, multiple assemblies may be combined into a pangenome graph (pgg) (hickey et al. ; eggertsson et al. ; garrison et al. ; chen et al. ) composed of sequence-labeled vertices connected by edges such that haplotypes correspond to paths in the graph. sequences shared between haplotypes are stored in the same vertex, and genetic variation is represented by the structure of the graph. a conceptually similar construct is the repeat graph (pevzner, tang, and tesler ), with sequences repeated multiple times in a genome represented by the same vertex. graph analysis has been used to encode the elementary duplication structure of a genome (jiang et al. ) and for multiple sequence alignment of repetitive sequences with shuffled domains (raphael et al. ), making repeat graphs well-suited to represent vntrs that differ in both repeat count and composition.
here we propose the representation of human vntrs as a repeat-pangenome graph (rpgg) that encodes both the repeat structure and sequence diversity of vntr loci (figure c). the most straightforward approach that combines a pangenome graph and a repeat graph is a de bruijn graph, which was the basis of one of the earliest representations of a pangenome by the cortex method (iqbal, turner, and mcvean ; iqbal et al. ). the de bruijn graph has a vertex for every distinct sequence of length k in a genome (k-mer), and an edge connecting every two consecutive k-mers; thus k-mers occurring in multiple genomes or multiple times in the same genome are stored by the same vertex. while the cortex method stores entire genomes in a de bruijn graph, we construct a separate locus-rpgg for each vntr and store a genome as the collection of locus-rpggs, which deviates from the definition of a de bruijn graph because the same k-mer may be stored in multiple vertices.
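as a minimal sketch of the de bruijn construction described above, the python snippet below builds one graph per locus from a set of haplotype sequences. the function name `build_locus_graph`, the toy sequences and the choice of k = 4 are hypothetical illustrations and are not taken from danbing-tk; the sketch also ignores reverse complements and flanking sequence.

```python
def build_locus_graph(haplotypes, k):
    """Build a de Bruijn graph for one locus from haplotype sequences.

    Returns a dict mapping each distinct k-mer (vertex) to the set of
    k-mers that directly follow it in at least one haplotype (edges).
    """
    graph = {}
    for seq in haplotypes:
        for i in range(len(seq) - k + 1):
            successors = graph.setdefault(seq[i:i + k], set())
            if i + k < len(seq):  # the last k-mer of a sequence has no successor
                successors.add(seq[i + 1:i + k + 1])
    return graph


# Hypothetical toy locus: the motif "ACG" repeated two or three times.
hap1 = "TTACGACGGG"
hap2 = "TTACGACGACGGG"
graph = build_locus_graph([hap1, hap2], k=4)

# A genome is then stored as a collection of such graphs, one per locus,
# so the same k-mer may appear in more than one locus graph.
rpgg = {"locus_1": graph}

print(len(graph))               # number of vertices
print(sorted(graph["GACG"]))    # successors of the k-mer GACG
```

in this toy case the extra motif copy in the second haplotype adds no new vertices; the two haplotypes differ only by how often the cycle through `GACG` is traversed, which is the property that lets a single graph represent different repeat counts.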
we developed a toolkit, tandem repeat genotyping based on haplotype-derived pangenome graphs (danbing-tk), to identify vntr boundaries in assemblies, construct rpggs, align srs reads to the rpgg, and infer vntr motif composition and length in srs samples. this enables the alignment of srs datasets into an rpgg to discover population genetics of vntr loci, and to associate expression with vntr variation.

results

repeat pan-genome graph construction

our approach to build rpggs is to de novo assemble lrs genomes, and build de bruijn graphs on the assembled sequences at vntr loci, using srs genomes to ensure graph quality. we used public lrs data for individuals with diverse genetic backgrounds, including genomes from individual genome projects (seo et al. ; zook et al. ), structural variation studies (chaisson et al. ), and diversity panel sequencing (audano et al. ) (figure a, supplementary table ). each genome was sequenced by either pacbio single long read (slr) between, or high-fidelity (hifi) sequencing between and -fold coverage along with matched - -fold illumina sequencing (table ). this data reflects a wide range of technology revisions, sequencing depth, and data type; however, subsequent steps were taken to ensure accuracy of the rpgg through locus redundancy and srs alignments.
we developed a pipeline that partitions lrs reads by haplotype based on phased heterozygous snvs and assembles haplotypes separately by chromosome. when available, we used existing telomere-to-telomere snv and phase data provided by strand-seq and/or x genomics (porubsky et al. ; chaisson et al. ) with phase-block n size between . - . mb. for other datasets, long-read data were used to phase snvs. while this data has lower phase-block n (< . - mb), the individual locus-rpgg do not use long-range haplotype information and are not affected by phasing switch error. reads from each chromosome and haplotype were independently assembled using the flye assembler (kolmogorov et al. ) for a diploid of . - . mb n , with the range of assembly contiguity reflected by the diversity of input data. in this study, the number of resolved vntr loci is a more accurate measurement of useful assembly contiguity than n because a disjoint rpgg is generated for each vntr locus. an initial set of , vntr intervals with motif size > bp, minimal length > bp and < k bp (mean length= bp in grch , methods, supplementary table ) were annotated by tandem repeats finder (trf) (benson ) , and then mapped onto contig coordinates using pairwise contig alignments. long vntr loci tended to have fragmented trf annotation, which can cause erroneous length estimates in downstream analysis and fail to properly interpret repeat structures as a whole such as in advntr-nn (supplementary fig. ). during locus assignment, danbing-tk expands boundaries and merges loci to ensure boundaries of all vntrs are well-defined and harmonized across genomes (methods) (figure b). in practice, we found that , / , ( %) of the vntr loci are subject to boundary expansion, with an average expansion size of bp. the set of vntrs that can be properly annotated ranges from , - , depending on the assembly quality, with a final set of , loci (mean length= bp) across genomes (supplementary fig. ). 
the rpggs are constructed as disjoint bi-directional de bruijn graphs of each vntr locus and flanking bases from the haplotype-resolved assemblies. in a bi-directional de bruijn graph, each distinct sequence of length k (k-mer) and its reverse complement map to a vertex, and each sequence of length k + 1 connects the vertices to which the two composite k-mers map. there was little effect on downstream analysis for values of k between and , and so k = was used for all applications. to remove spurious vertices and edges from assembly consensus errors, srs from genomes matching the lrs samples were mapped to the rpgg, and k-mers not mapped by srs were removed from the graph (average of per locus). using the number of vertices as a proxy for sampled genetic diversity, we find that % ( , , new nodes) of the sequences novel with respect to grch ( , , nodes) are discovered after the inclusion of genomes, with diversity linearly increasing per genome after the first four genomes are added to the rpgg ( , , nodes, figure c). the alignment of a read to an rpgg may be defined by the path in the rpgg with a sequence label that has the minimum edit distance to the read among all possible paths. we used error-free bp paired end reads simulated from six genomes (hg , hg , hg , hg , na and na ) to evaluate how reads are aligned to the rpgg.
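the bi-directional vertices and the pruning step described above can be sketched as follows. this is an illustration under stated assumptions: representing a k-mer and its reverse complement by the lexicographically smaller of the two is a common convention rather than something stated in the text, and pruning is simplified here to exact k-mer presence in the reads instead of danbing-tk's read mapping. all names and sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return one representative for a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_kmers(sequences, k):
    """Collect the canonical k-mers of a list of sequences."""
    return {
        canonical(seq[i:i + k])
        for seq in sequences
        for i in range(len(seq) - k + 1)
    }


def prune(assembly_kmers, read_kmers):
    """Keep only assembly k-mers that are also supported by short reads."""
    return assembly_kmers & read_kmers


# Hypothetical toy data: the second read comes from the reverse strand.
assembly_kmers = canonical_kmers(["GATTACA"], k=5)
read_kmers = canonical_kmers(["GATTA", "TGTAA"], k=5)
supported = prune(assembly_kmers, read_kmers)
print(sorted(supported))
```

the k-mer `ATTAC` is dropped because no read supports it, while `TTACA` is kept through its reverse complement `TGTAA`.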
while several methods exist to find alignments that do not reuse cycles (garrison et al. ; rakocevic et al. ), alignment with cycles is a more challenging problem recently solved by the graphaligner method to map long reads to pangenome graphs (rautiainen, mäkinen, and marschall ). although > . % of the reads simulated from vntr loci were aligned, . % of reads matched with less than % identity, indicating misalignment. we developed an alternative approach tuned for rpgg alignments in danbing-tk (figure d) to realign all srs reads within a bam/fastq file to the rpgg in two passes, first by finding locus-rpggs with a high number (> in each end) of shared k-mers with reads, and next by threading the paired-end reads through the locus-rpgg, allowing for up to two edits (mismatch, insertion, or deletion) and at least matched k-mers per read against the threaded path (methods). using danbing-tk, . % of vntr-simulated reads were aligned with > % identity. when reads from the entire genome are considered, for . % of the loci, danbing-tk can map > % of the reads back to their original vntr regions. misaligned reads from either other vntr loci or untracked regions target relatively few loci; . % ( , / , ) of loci have at least one read misaligned from outside the locus. the graph pruning step is the primary cause of missed alignments, and affects on average , loci per assembly. on real data, danbing-tk required . gb of memory to map base paired-end reads at .
mb/sec on cores.

read-to-graph alignment in vntr regions

alignment of srs reads to the rpgg enables estimation of vntr length and motif composition. the counts of k-mers in srs reads mapped to the rpgg are reported by danbing-tk for each locus. for $m$ samples and $l$ vntr loci, the result of an alignment is $l$ count matrices of dimension $m \times n_i$, where $n_i$ is the number of vertices in the de bruijn graph on the locus $i$, excluding flanking sequences. if srs reads from a genome were sequenced without bias, sampled uniformly, and mapped without error to the rpgg, the count of a k-mer in a locus mapped by an srs sample should scale by a factor of read depth with the sum of the count of the k-mer from the locus of both assembled haplotypes for the same genome. the quality of alignment (aln-$r^2$) and sequencing bias were measured by comparing the k-mer counts from the matched illumina and lrs genomes (figure a). in total, % ( , / , ) of loci had a mean aln-$r^2$ ≥ . between srs and assembly k-mer counts, and were marked as “valid” loci to carry forward for downstream diversity and expression analysis (figure b). valid loci had an average length of bp, compared to bp in the entire database (figure c). vntr loci that did not align well (invalid) were enriched for sequences that map within alu ( , ), sva ( , ), and other , mobile elements (supplementary fig. ); loci with false mapping in the simulation experiment are also enriched in the invalid set (supplementary table ). specifically, . % ( , / , ) of loci with false-positive (fp) mapping and . % ( , / , ) of loci with false-negative (fn) mapping are not marked as valid. loci with false mapping but retained in the final set have lower but still decent length prediction accuracy ( . versus . ). the complete rpgg on valid loci contains , , vertices, in contrast to the corresponding rpgg only on grch (repeat-grch ), which has , , vertices.
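the comparison of short-read and assembly k-mer counts for one locus can be sketched as a squared pearson correlation, which equals the r² of a simple linear fit with an intercept. whether danbing-tk fits an intercept is not stated in this section, so that choice, the function name `aln_r2` and the toy counts below are assumptions for illustration only.

```python
def aln_r2(read_counts, assembly_counts):
    """Squared Pearson correlation between two k-mer count vectors."""
    n = len(read_counts)
    mean_x = sum(assembly_counts) / n
    mean_y = sum(read_counts) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(assembly_counts, read_counts))
    sxx = sum((x - mean_x) ** 2 for x in assembly_counts)
    syy = sum((y - mean_y) ** 2 for y in read_counts)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no information about the fit
    return sxy * sxy / (sxx * syy)


# Hypothetical locus with four k-mers; assembly counts are summed over
# both haplotypes, read counts are what a short-read sample produced.
assembly = [1, 2, 3, 4]
unbiased_reads = [30, 60, 90, 120]   # scales with read depth, no error
dropout_reads = [30, 60, 90, 0]      # the most repeated k-mer was missed

print(aln_r2(unbiased_reads, assembly))
print(aln_r2(dropout_reads, assembly))
```

under the idealized conditions described above the counts scale exactly with read depth and the statistic reaches 1; reads that fail to map to part of the locus pull it toward 0.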
we validate that the additional vertices in the rpgg are indeed important for accurately recruiting reads pertinent to a vntr locus, using the cacna c vntr as an example (figure d). it is known that the reference sequence at this locus is truncated compared to the majority of the populations ( bp in grch versus , bp averaged across genomes). the limited sequence diversity provided by repeat-grch at this locus failed to recruit reads that map to paths existing in the rpgg but missing or only partially represented in repeat-grch . a linear fit between the k-mers from mapped reads and the ground truth assemblies shows that there is a -fold gain in slope, or measured read depth, when using the rpgg compared to repeat-grch (figure e). the k-mer counts in the rpggs also correlate better with the assembly k-mer counts compared to the repeat-grch (aln-$r^2$ = . versus . ). new genomes with arbitrary combinations of motifs and copy numbers in vntrs should still align to an rpgg as long as the motifs are represented in the graph. we used leave-one-out analysis to evaluate alignment of novel genomes to rpggs and estimation of vntr length.
in each experiment, an rpgg was constructed with one lrs genome missing. srs reads from the missing genome were mapped into the rpgg, and the estimated locus lengths were compared to the average diploid lengths of corresponding loci in the missing lrs assembly. the locus length is estimated as the adjusted sum of k-mer counts $kms$ mapped from srs sample $s$: $kms / (cov_s \times \hat{b})$, where $cov_s$ is the sequencing depth of $s$ and $\hat{b}$ is a correction for locus-specific sampling bias (lsb). because the srs datasets used in this study during pangenome construction were collected from a wide variety of studies with different biases, there was no consistent lsb in either repetitive or nonrepetitive regions for samples from different sequencing runs (supplementary fig. - ). however, principal component analysis (pca) of repetitive and nonrepetitive regions showed highly similar projection patterns (supplementary fig. ), which enabled using lsb in nonrepetitive regions as a proxy for finding the nearest neighbor of lsb in vntr regions (supplementary fig. ). leveraging this finding, a set of nonrepetitive control regions was used to estimate the lsb of an unseen srs sample (methods), giving a median length-prediction accuracy of . for unrelated genomes (figure a left, supplementary fig. ). the read depth of a repetitive region correlates to the locus length when aligning short reads to a linear reference
genome. however, estimation of vntr length from read depth has an accuracy of . (figure a left). we also compared the performance for length prediction using the rpgg versus repeat-grch , and observed a % improvement in accuracy ( . versus . , figure a left, supplementary fig. ). the overall error rate of all loci (n= , ), measured with mean absolute percentage error (mape), is also significantly lower when using rpggs (mape= . , figure a right) compared with the repeat-grch ( . , paired t-test p = . ⨉ - ) or reference-aligned read depth ( . , paired t-test p = . ⨉ - ). furthermore, a % reduction in error size is observed for the , loci poorly genotyped (mape > . ) using repeat-grch (figure b, mape= . versus . ).

profiling vntr length and motif diversity

to explore global diversity of vntr sequences and potential functional impact, we aligned reads from , individuals from diverse populations sequenced at -fold coverage by the -genomes project ( kgp) (fairley et al. ; genomes project consortium et al. ), and gtex genomes (gtex consortium ) to the rpgg. the fraction of reads from these datasets that align to the rpgg ranges from . %- . %, similar to the matched lrs/srs data ( . %). pca on the lsb of both datasets showed the kgp and gtex genomes as separate clusters in both repetitive and nonrepetitive regions (supplementary fig. ), indicating experiment-specific bias that prevents cross data set comparisons.
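the length estimate used in the leave-one-out analysis (the sum of mapped k-mer counts divided by sequencing depth times the bias correction) and the mean absolute percentage error used to score it can be sketched as below. the function names, argument names and all numbers are hypothetical; in particular the bias values are made up rather than estimated from control regions as the paper describes.

```python
def estimate_length(kms, cov, bias):
    """Adjusted sum of k-mer counts: kms / (cov * bias)."""
    return kms / (cov * bias)


def mape(predicted, truth):
    """Mean absolute percentage error, expressed as a fraction."""
    return sum(abs(p - t) / t for p, t in zip(predicted, truth)) / len(truth)


# Two hypothetical loci from one short-read sample at 30-fold coverage.
predicted = [
    estimate_length(kms=3000, cov=30, bias=1.0),
    estimate_length(kms=6600, cov=30, bias=1.1),
]
# Hypothetical lengths of the same loci in the held-out assembly.
truth = [100, 250]

error = mape(predicted, truth)
print(predicted, error)
```

dividing by the bias term matters because a locus that is systematically over-sampled in a given sequencing run would otherwise look longer than it is.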
consistent with the finding in the previous leave-one-out analysis, genomes from the same study cluster together in the pca plot of lsb, and so within each dataset and locus, k-mer counts from srs reads normalized by sequencing depth were used to compare vntr content across genomes. the k-mer dosage, $kms / cov$, was used as a proxy for locus length to compare tandem repeat variation across populations in the kgp genomes. the kgp samples contain individuals from african ( . %), east asian ( . %), european ( . %), admixed american ( . %), and south asian ( . %) populations. when comparing the average population length to the global average length, . % ( , / , ) of loci have differential length between populations (fdr= . on anova p values), with similar distributions of differential length when loci are stratified by the accuracy of length prediction (figure a). population stratification was calculated using the $v_{st}$ statistic (redon et al. ) on vntr length (figure b). previous studies have used > standard deviations above the mean to define highly stratified copy number variants (sudmant et al. ). under this measure, variants are highly stratified, including that overlap genes; however, this is not significantly enriched (p= . , one-sided permutation test). two of the top five loci ranked by $v_{st}$ are intronic: a base vntr in plcl ($v_{st}$ = . ), and a base locus in spata ($v_{st}$ = . ) (figure c,d).
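the v_st statistic of redon et al. compares the total variance of a quantity with the size-weighted average of the within-population variances, v_st = (v_t - v_s) / v_t. the sketch below applies that definition to per-sample length estimates for one locus. the use of population variance (`statistics.pvariance`), the function name `vst` and the toy values are assumptions made for illustration; the exact variance estimator used in the paper is not given in this section.

```python
from statistics import pvariance


def vst(lengths_by_population):
    """V_ST = (V_T - V_S) / V_T for one locus.

    V_T is the variance over all samples; V_S is the average of the
    within-population variances, weighted by population size.
    """
    all_values = [v for values in lengths_by_population.values() for v in values]
    v_total = pvariance(all_values)
    if v_total == 0:
        return 0.0  # no variation at all, hence no stratification
    v_within = sum(
        len(values) * pvariance(values)
        for values in lengths_by_population.values()
    ) / len(all_values)
    return (v_total - v_within) / v_total


# Hypothetical k-mer dosages for one locus in two populations.
toy_locus = {
    "pop_a": [10.0, 10.0, 12.0, 12.0],
    "pop_b": [20.0, 20.0, 22.0, 22.0],
}
score = vst(toy_locus)
print(score)
```

the statistic approaches 1 when nearly all variance lies between populations and falls toward 0 as within-population variance grows, which is why a noisy length estimate lowers it.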
these values for $v_{st}$ are lower than what is observed for large copy number variants (redon et al. ) and may be the result of neutral variation; however, this may be affected by the high variance of the length estimate, as $v_{st}$ decreases as the variance of the copy number/dosage values increases (supplementary methods). vntr loci that are unstable may undergo hyper-expansion and are implicated as a mechanism of multiple diseases (hannan ). to discover new potentially unstable loci, we searched the kg genomes for evidence of rare vntr hyper-expansion. loci were screened for individuals with extreme (> standard deviations) variation, and then filtered for deletions or unreliable samples (methods) to characterize loci as potentially unstable. these loci are inside genes and are significantly reduced from the number expected by chance (p< ⨉ - , one-sided permutation test; n= , ). of these loci, have an individual with > standard deviations above the mean, of which two overlap genes, kcna and grm (supplemental fig. ). alignment to an rpgg provides information about motif usage in addition to estimates of vntr length because genomes with different motif composition will align to different vertices in the graph. to detect differential motif usage, we searched for loci with a k-mer that was more frequent in one population than another and not simply explained by a difference in locus length, comparing african (afr) and east asian (eas) populations for maximal genetic diversity. lasso regression against locus length was used to find the k-mer with the most variance explained (vex) in eas genomes, denoted as the most informative k-mer (mi-kmer). two
statistics are of interest when comparing the two populations: the difference in the count of mi-kmers ($kmc_d$) and the difference between proportion of vex ($r^2_d$) by mi-kmers. $kmc_d$ describes the usage of an mi-kmer in one population relative to another, while $r^2_d$ indicates the degree to which the mi-kmer is involved in repeat contraction or expansion in one population relative to another. we observe that , loci have significant differences in the usage of mi-kmers between the two populations (two-sided p < . , bootstrap, supplementary fig. ). among these, the mi-kmers of , loci are crucial to length variation in the eas but not in the afr population (two-sided p < . , bootstrap) (figure e, supplementary fig. ). a top example of these loci with $r^2$ of at least . in the eas population was visualized with a heatmap of relative k-mer count from both populations, and clearly showed differential usage of cycles in the rpgg (figure f).

association of vntr with nearby gene expression

because the danbing-tk length estimates showed population genetic patterns expected for human diversity, we assessed whether danbing-tk alignments could detect vntr variation with functional impact. genomes from the gtex project were mapped into the rpgg to discover loci that have an effect on nearby gene expression in a length-dependent manner. a total of / genomes with matching expression data passed quality filtering (methods). similar to the population analysis, the k-mer dosage was used as a proxy for locus length. methods previously used to discover eqtl using str genotyping (fotsing et al. ) were applied to the danbing-tk alignments.
in sum, , vntrs within kb to , gtex gene-annotations (including genes, lncrna, and other transcripts) were tested for association, with a total of , tests and approximately . vntrs tested per gene. using a gene-level fdr cutoff of %, we find eqtl (evntrs) (figure a), among which ( . %) discoveries are novel (supplementary table ), indicating that the spectrum of association between tandem repeat variation and expression extends beyond the lengths and the types of ssr considered in previous str (mousavi et al. ) and vntr (bakhtiari et al. ) studies. both positive and negative effects were observed among evntrs (figure b). more evntrs with positive effect size were found than with a negative effect size ( versus , binomial test p = . ), with an average effect of + . (from + . to + . ) versus − . (from − . to − . ), respectively. evntrs tend to be closer to telomeres relative to all vntrs (mann–whitney u test p = . ⨉ - , supplementary fig. ). because many exons contain vntr sequences, expression measured by read depth should increase with length of the vntr, and there is an . -fold enrichment of evntrs in coding regions as expected. the evntrs have the potential to yield insight to disease.
in one example, an intronic evntr at chr : , , - , , flanks exon of erap (figure d, supplementary fig. ). the evntr has a - . effect size and was reported across tissues. it colocalizes with a regulatory hotspot with peaks of histone markers, dnase and different chip signals. the protein product of erap , or endoplasmic reticulum aminopeptidase , is a zinc metalloaminopeptidase involved in the process of class i mhc mediated antigen presentation and innate immune response. it has been reported to be associated with several diseases including ankylosing spondylitis (wellcome trust case control consortium et al. ) and crohn’s disease (franke et al. ). abnormal expansion of the vntr might increase autoimmune disease risk through reducing erap expression, leaving longer and more antigenic peptides, yet potentially higher fitness against virus infection (ye et al. ). this vntr is a unique sequence in grch that is a bp tandem duplication in / of the haplotypes. another example is an intergenic vntr at chr : , , - , , that associates with the expression of kansl ~ kb upstream (figure c, supplementary fig. ). the evntr has a maximal effect size of + . and is significant across tissues. the protein product of kansl , or kat regulatory nsl complex subunit , is a part of the histone acetylation machinery. deletion of this gene is linked to koolen-de vries syndrome (koolen et al. ), and the locus is associated with parkinson disease (witoelar et al. ). the evntr colocalizes with strong chip signals. the association of this vntr with the epigenetic landscape warrants further investigation.
discussion

previous commentaries have proposed that variation in vntr loci may represent a component of undiagnosed disease and missing heritability (hannan ), which has remained difficult to profile even with whole genome sequencing (mousavi et al. ). to address this, we have proposed an approach that combines multiple genomes into a pangenome graph that represents the repeat structure of a population. this is supported by the software danbing-tk and its associated rpgg. we used danbing-tk to generate a pangenome from haplotype-resolved assemblies, and applied it to detect vntr variation across populations and to discover eqtl. the structure of the rpgg can help to organize the diversity of assembled vntr sequences with respect to the standard reference. in particular, % of the graph structure is novel after the addition of genomes to the rpgg relative to repeat-grch . combined with the observation that using the -genome rpgg gives a % decrease in length prediction error, this indicates that the pan-genomes add detail for the missing variation. with the availability of additional genomes sequenced through the pangenome reference consortium ( https://humanpangenome.org/ ) and the hgsvc ( https://www.internationalgenome.org ), combined with advanced haplotype-resolved assembly methods (porubsky et al. ), the spectrum of this variation will be revealed in the near future. while we anticipate that eventually the full spectrum of vntr diversity will be revealed through lrs of the entire kg, the rpgg analysis will help organize analysis by characterizing repeat domains.
for example, with our approach, we are able to detect , loci with differential motif usage between populations, which could be difficult to characterize using an approach such as multiple-sequence alignment of vntr sequences from assembled genomes. there are several caveats to our approach. in contrast to other pangenome approaches (garrison et al. ; rakocevic et al. ), danbing-tk does not keep track of a reference (e.g. grch ) coordinate system. furthermore, because it is often not possible to reconstruct a unique path in an rpgg, only counts of mapped reads are reported rather than the order of traversal of the rpgg. an additional caveat of our approach is that genotype is calculated as a continuum of k-mer dosage rather than discrete units, prohibiting direct calculation of linkage-disequilibrium for fine-scale mapping (lapierre et al., n.d.). finally, this approach only profiles loci where k-mer counts from reads and assemblies are correlated; loci for which every k-mer appears the same number of times are excluded from analysis (on average , / , per genome). the rich data provided by danbing-tk and pangenome analysis provide the basis for additional association studies.
while most analysis in this study focused on the diversity of vntr length or association of length and expression, it is possible to query differential motif usage using the rpgg. the ability to detect motifs that have differential usage between populations brings the possibility of detecting differential motif usage between cases and controls in association studies. this can help distinguish stabilizing versus fragile motifs (braida et al. ), or resolve some of the problem of missing heritability by discovering new associations between motif and disease (song, lowe, and kingsley ). finally, this work is a part of ongoing pangenome graph analysis (paten et al. ; li, feng, and chu ), and represents an approach to generating pangenome graphs in loci that have difficult multiple sequence alignments or degenerate graph topologies. additional methods may be developed to harmonize danbing-tk rpggs with genome-wide pangenome graphs constructed from other methods.

figure . sequence diversity of vntrs in human populations. a, global diversity of sms assemblies. b, dot-plot analysis of the vntr locus chr : - (ski intron vntr) in genomes that demonstrate varying motif usage and length. c, diversity of rpgg as genomes are incorporated, measured by the number of k-mers in the , vntr graphs. total graph size built from grch and an average genome are also shown. d, danbing-tk workflow analysis.
(top) vntr loci defined from the reference are used to map haplotype loci. each locus is converted to a de bruijn graph, from which the collection of graphs is the rpgg. the de bruijn graphs shown illustrate sequences missing from the rpgg built only on grch . the alignments may be either used to select which loci may be accurately mapped in the rpgg using srs that match the assemblies (red), or may be used to estimate lengths on sample datasets (blue).

table . source genomes for rpgg. continental populations represented are east asian (eas), european (eur), admixed amerindian (amr), south asian (sas), and african (afr). coverage is estimated diploid coverage based on alignment to grch . assembly n is of haplotype-resolved assemblies. the fraction of vntr annotated are all vntr with at least flanking bases assembled.

columns: genome, continental population, study, assembly n (mb), fraction of vntr annotated, ancestry, cov
ak eas kg . . korean
hg eur dp . . finnish
hg eas hgsvg . . han chinese
hg eas hgsvg . . han chinese
hg eas hgsvg . . han chinese
hg amr hgsvg . . puerto rican
hg amr hgsvg . . puerto rican
hg amr hgsvg . . puerto rican
hg amr dp . . colombian
hg eas dp . . vietnamese
hg amr dp . . peruvian
hg afr dp . . gambian
hg sas dp . . telugu
na eur dp . . central european
na afr hgsvg . . yoruba
na afr hgsvg . . yoruba
na afr hgsvg . . yoruba
na afr dp . luhya
na eur giab . . ashkenazim

figure . mapping short reads to repeat-pangenome graphs. a, an example of evaluating the alignment quality of a locus mapped by srs reads. the alignment quality is measured by the $r^2$ of a linear fit between the k-mer counts from the ground truth assemblies and from the mapped reads (methods). b, distribution of the alignment quality scores of , loci. loci with alignment quality less than . when averaged across samples are removed from downstream analysis (methods). c, distribution of vntr lengths in grch removed or retained for downstream analysis. d-e, comparing the read mapping results of the cacna c vntr using rpgg or repeat-grch . the k-mer counts in each graph and the differences are visualized with edge width and color saturation (d). the k-mer counts from the ground truth assemblies are regressed against the counts from reads mapped to the rpgg (red) and repeat-grch (blue), respectively (e).

figure . vntr length prediction. a, accuracies of vntr length prediction measured for each genome (left) and each locus (right). mean absolute percentage error (mape) in vntr length is averaged across loci and genomes, respectively. lengths were predicted based on repeat-pangenome graphs (rpgg), repeat-grch (rhg) or naive read depth method (rd), respectively.
boxes span from the lower quartile to the upper quartile, with horizontal lines indicating the median. whiskers extend to points that are within . interquartile range (iqr) from the upper or the lower quartiles. b, relative performance of rpgg versus repeat-grch . loci are ordered along the x-axis by genotyping accuracy in repeat-grch . the y-axis shows the decrease in mape using rpgg versus repeat-grch . the subplot shows loci poorly genotyped (mape> . ) in repeat-grch . the red dotted line indicates the baseline without any improvement.

figure . population properties of vntr loci. a, ratios of median length between populations for loci with significant differences in average length. loci are stratified by prediction accuracy into low (< . ), medium ( . - . ), and high ( . +). b, manhattan plot of $V_{ST}$ values. c-d, the distribution of estimated length via k-mer dosage in continental populations for plcl and spata vntr loci, selected to visualize the distribution of dosage in different populations. each point is an individual. e, differential usage and expansion of motifs between the eas and afr populations. for each locus, the proportion of variance explained by the most informative k-mer in the eas is shown for the eas and afr populations on the x and y axes, respectively. points are colored by the difference in normalized k-mer counts, with red and blue indicating k-mers more abundant in eas and afr populations, respectively. f, an example vntr with differential motif usage. edges are colored if the k-mer count is biased toward a certain population.
the black arrow indicates the location of the k-mer that explains the most variance of vntr length in the eas population.

figure . cis-eqtl mapping of vntrs. a, evntr discoveries in human tissues. the quantile-quantile plot shows the observed p value of each association test versus the p value drawn from the expected uniform distribution. black dots indicate the permutation results from the top % associated (gene, vntr) pairs in each tissue. the regression plots for erap and kansl are shown in c and d. b, effect size distribution of significant associations from all tissues. c-d, genomic view of disease-related (egene, evntr) pairs ( erap , chr : - ) (c) and ( kansl , chr : - ) (d) are shown. red boxes indicate the location of egenes and evntrs.

materials and methods

pangenome construction

initial discovery of tandem repeats: trf v . (option: -f -d -h) (benson ) was used to roughly annotate the ssr regions of five pacbio assemblies (ak , hg , hg , na , na ). the scope of this work focuses on vntrs that cannot be resolved by typical short read sequencing methods. we selected the set of ssr loci with a motif size greater than bp and a total length greater than bp and less than kbp.
for each haplotype, the selected vntr loci were mapped to the grch reference genome to identify homologous vntr loci. to maintain data quality, vntr loci that could not be assigned homology were removed from the datasets.

boundary expansion of vntrs: the biological boundaries of a vntr are ill-defined; vntrs with sparse recurring motifs, transitions between different motifs, or a nested motif structure often fail to be fully annotated by trf. a misannotation of vntr boundaries can cause erroneous length estimates. to avoid the propagation of this error to downstream analysis, we developed a multiple boundary expansion algorithm to recover the proper boundary for each vntr across all haplotypes, including the remaining genomes (hg , hg , hg , hg , hg , hg , hg , hg , hg , hg , na , na , na and na ). the algorithm maintains an invariant: the flanking sequence in any of the haplotypes does not share k-mers with the vntr regions from all haplotypes. vntr boundaries in each haplotype are iteratively expanded until the invariant holds or the expansion exceeds kbp in either the ’ or ’ direction. the size of the flanking regions is chosen to be bp, which is approximately the upper bound of the insert size of typical srs reads. the following qc step removes a haplotype if its vntr annotation is within bp to breakpoints or if the orthology mapping location to grch is different from the majority of haplotypes. a vntr locus with the number of supporting haplotypes less than % of the total number of haplotypes is also removed. adjacent vntr loci within bp to each other in any of the haplotypes will
induce a merging step over all haplotypes. haplotypes with distance between adjacent loci inconsistent with the majority of haplotypes are removed. finally, vntr loci with the number of supporting haplotypes less than % of the total number of haplotypes are removed, leaving , of the initial , loci.

read-to-graph alignment: for the two haplotypes of an individual, three data structures are used to encode the information of all vntr loci, including vntrs and their bp flanking sequences. the first data structure allows fast locus lookup for each k-mer ( k = ) by hashing each canonical k-mer in the vntrs and the flanking sequences to the index of the original locus. the second data structure enables graph threading by storing a bi-directional de bruijn graph for each locus. the third data structure is used for counting k-mers originating from vntrs. the read mapping algorithm maps each pair of illumina paired-end reads to a unique vntr locus in three phases: (1) in the k-mer set mapping phase, the read pair is converted to a pair of canonical k-mer multisets. the vntr locus with the highest count of intersected k-mers is detected with the first data structure. (2) in the threading phase, the algorithm tries to map the k-mers in the read pair to the bi-directional de bruijn graph such that the mapping forms a continuous path/cycle. to account for sequencing and assembly errors, the algorithm is allowed to edit a limited number of nucleotides in a read if no matching k-mer is found in the graph. the read pair is determined feasible to map to a vntr locus if the number of mapped k-mers is above an empirical threshold. (3) in the k-mer counting phase, canonical k-mers of the feasible read pair are counted if they exist in the vntr locus. finally, the read mapping algorithm returns the k-mer counts for all loci as mapped by srs reads. alignment timing was conducted on an intel xeon e - v . ghz node.
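The first and third phases of the read mapping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the danbing-tk implementation: the function names (`canonical`, `build_index`, `best_locus`, `count_kmers`) and the `min_hits` parameter are hypothetical, the index maps a k-mer to a set of loci rather than to a single locus, and the threading phase with error correction is omitted entirely.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmers(seq, k):
    """Yield the canonical k-mers of seq in order of occurrence."""
    for i in range(len(seq) - k + 1):
        yield canonical(seq[i:i + k])


def build_index(loci, k):
    """Map each canonical k-mer to the ids of the loci that contain it.

    Assumption: a set of loci per k-mer is used here for simplicity.
    """
    index = {}
    for locus_id, seq in loci.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(locus_id)
    return index


def best_locus(read_pair, index, k, min_hits=1):
    """Phase 1: pick the locus sharing the most k-mers with the read pair.

    min_hits is a hypothetical stand-in for the empirical threshold.
    """
    hits = Counter()
    for read in read_pair:
        for km in kmers(read, k):
            for locus_id in index.get(km, ()):
                hits[locus_id] += 1
    if not hits:
        return None
    locus_id, n_hits = hits.most_common(1)[0]
    return locus_id if n_hits >= min_hits else None


def count_kmers(read_pair, locus_kmers, k):
    """Phase 3: count the read k-mers that exist in the chosen locus."""
    counts = Counter()
    for read in read_pair:
        for km in kmers(read, k):
            if km in locus_kmers:
                counts[km] += 1
    return counts


# Toy usage with two made-up loci and k = 4.
loci = {"A": "ACGTACGTTT", "B": "GGGCCCAAAT"}
index = build_index(loci, 4)
pair = ("ACGTAC", "AAACGT")
locus = best_locus(pair, index, 4)
counts = count_kmers(pair, set(kmers(loci[locus], 4)), 4)
```

Using canonical k-mers means a read and its reverse complement hit the same index entries, so strand does not need to be tracked during lookup.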
graph pruning and merging: pan-genome representation provides a more thorough description of vntr diversity and reduces reference allele bias, which effectively improves the quality of read mapping and downstream analysis. because haplotypes assembled from long read datasets are error prone in vntr regions, it is necessary to prune the graphs/k-mers before merging them as a pan-genome. we ran the read mapping algorithm with error correction disabled so as to detect k-mers unsupported by srs reads. the three data structures were updated by deleting all unsupported k-mers for each locus. by pooling and merging the reference regions corresponding to the vntr regions in all individuals, we obtained a set of “pan-reference” regions, each indicating a location in grch that is likely to map to a vntr region in any other unseen haplotype. by referencing the mapping relation of vntr loci across individuals, we encoded the variability of each vntr locus by merging the three data structures across individuals.

alignment quality analysis: to evaluate the quality of the haplotype assemblies and the performance of the read mapping algorithm, vntr k-mer counts in the original assemblies were regressed against those mapped from srs reads. the $r^2$ of the linear fit was used as the alignment quality score (referred to as aln-$r^2$). to measure alignment quality in the pan-genome setting, only the k-mer set derived from the genotyped individual was retained as the input for regression.

data filtering: a final set of , vntr regions was called by filtering based on aln-$r^2$.
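The alignment quality score and the filtering step can be illustrated with a short sketch. It assumes that the score is the coefficient of determination of a simple least-squares line through (assembly count, read count) pairs, one pair per k-mer of the locus; the helper names and the `threshold` argument are hypothetical, and the actual cutoff is not reproduced here.

```python
def r_squared(x, y):
    """Coefficient of determination of the least-squares line y ~ a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        # Every k-mer has the same count, so no linear fit is informative.
        return 0.0
    return sxy * sxy / (sxx * syy)


def aln_r2(assembly_counts, read_counts):
    """Score one locus in one individual from two k-mer -> count mappings."""
    keys = sorted(assembly_counts)
    x = [assembly_counts[km] for km in keys]
    y = [read_counts.get(km, 0) for km in keys]
    return r_squared(x, y)


def filter_loci(scores_by_locus, threshold):
    """Keep loci whose mean score across individuals reaches the threshold."""
    kept = []
    for locus_id, scores in scores_by_locus.items():
        if sum(scores) / len(scores) >= threshold:
            kept.append(locus_id)
    return kept


# Toy usage: read counts are an exact multiple of the assembly counts.
score = aln_r2({"AAAC": 1, "ACGT": 2, "CGTA": 3},
               {"AAAC": 30, "ACGT": 60, "CGTA": 90})
kept = filter_loci({"l1": [0.9, 1.0], "l2": [0.2, 0.4]}, threshold=0.5)
```

A locus in which every k-mer has the same count has zero variance in one axis, which is why the sketch returns 0.0 in that case instead of dividing by zero.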
the quality of a locus was measured by the mean aln-$r^2$ across individuals. loci with mean aln-$r^2$ below . were removed from the final call set. the final pan-genome graphs were used to genotype large illumina datasets, measure length prediction accuracy, analyze population structures and map eqtl.

predicting vntr lengths: read depths at vntr regions usually vary considerably from locus to locus. furthermore, the sampling biases of different sequencing runs also differ, which limits our ability to genotype the accurate length of vntrs. to account for this, we compute locus-specific biases (lsbs) $b_s$ for each sample $s$, a tuple of (genome $g$, sequencing run), as follows:

$$b_s = \dfrac{kms_s}{cov_s \times L_g},$$

where $L_g$ is the ground truth vntr lengths of , loci in genome $g$; $kms_s$ is the sum of k-mer counts in each locus mapped by sample $s$; $cov_s$ is the global read depth of sample $s$ estimated by averaging the read depths of unique regions without any types of repeats or duplications.
the ground truth vntr length of a locus $l$ in genome $g$ is averaged across haplotypes:

$$L_{g,l} = \dfrac{1}{H} \sum_{h=1}^{H} L_{g,h,l},$$

where $H$ is the number of haplotype(s) in genome $g$, i.e. for normal individuals and for complete hydatidiform mole (chm) samples. with the above bias terms, the vntr length of locus $l$ in sample $s$ can be computed by:

$$L_{s,l} = \dfrac{kms_{s,l}}{cov_s \times b_{\hat{s}}},$$

where $cov_s$ is the same as described above; $b_{\hat{s}}$ is the estimated lsbs computed from sample $\hat{s}$ with ground truth vntr lengths; $kms_{s,l}$ is the sum of k-mer counts of locus $l$ mapped by sample $s$. we assume the lsbs that best approximate $b_s$ come from samples within the same sequencing run. without prior knowledge on the ground truth vntr lengths of $s$ and therefore $b_s$, we determine the “closest” sample $\hat{s}$ w.r.t. $s$ based on $r^2$ between the read depths, $rd$, of the unique regions as follows:

$$\hat{s} = \operatorname*{argmax}_{s' \in GT,\ s' \neq s} r^2(rd_{s'}, rd_s),$$

where $GT$ is the set of samples with ground truths and within the same sequencing run as $s$. we cross-validate our approach by leaving one sample out of the pan-genome database and evaluating the prediction accuracy on the excluded sample. for comparison, vntr lengths were also estimated by a read depth method. for each vntr region, the read depth, computed with samtools bedcov -j, was divided by the global read depth, computed from the nonrepetitive regions, to give the length estimate.
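The length prediction described above reduces to a few arithmetic steps, sketched below. The sketch takes the locus-specific bias to be the summed k-mer count of a locus divided by the product of global read depth and ground truth length, and the predicted length to be the summed k-mer count divided by the product of read depth and the bias borrowed from the most similar reference sample. All function names, the dictionary layout, and the toy numbers are assumptions made for illustration.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)


def haplotype_mean_length(hap_lengths):
    """Ground truth length of one locus, averaged over the haplotypes."""
    return sum(hap_lengths) / len(hap_lengths)


def locus_biases(kms, cov, true_lengths):
    """Bias per locus for a sample with known lengths: kms / (cov * length)."""
    return {l: kms[l] / (cov * true_lengths[l]) for l in true_lengths}


def closest_reference(target_rd, reference_rd):
    """Pick the reference sample whose read depths over unique regions
    correlate best (highest r^2) with those of the target sample."""
    return max(reference_rd, key=lambda s: r_squared(reference_rd[s], target_rd))


def predict_lengths(kms, cov, biases):
    """Predicted length per locus: kms / (cov * borrowed bias)."""
    return {l: kms[l] / (cov * biases[l]) for l in biases}


# Toy usage: one locus, two candidate reference samples (made-up numbers).
truth = {"locusA": haplotype_mean_length([90, 110])}
bias_by_ref = {
    "ref1": locus_biases({"locusA": 2400}, 30.0, truth),
    "ref2": locus_biases({"locusA": 3300}, 30.0, truth),
}
ref = closest_reference([10, 20, 30], {"ref1": [1, 2, 3], "ref2": [3, 1, 2]})
predicted = predict_lengths({"locusA": 4800}, 30.0, bias_by_ref[ref])
```

Borrowing the bias from the reference with the most similar read-depth profile is what lets a sample without ground truth lengths reuse a correction learned elsewhere in the same sequencing run.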
comparing with graphaligner: the compact de bruijn graph of each vntr locus was generated with bcalm v . .
(option: -kmer-size -abundance-min ) using the vntr sequences from all assemblies as input. gfa files were then reindexed and concatenated to generate the rpggs for , loci. error-free paired-end reads were simulated from all vntr regions at x coverage with bp read length and bp insert size ( bp gap between each end). reads were aligned to the rpgg using graphaligner v . . with option -x dbg --seeds-minimizer-length . reads with alignment identity > % were counted from the output gam file. to compare in a similar setting, danbing-tk was run with option -gc -thcth -k -cth -rth . to assert > % identity for all reads aligned, given that (read_length - kmer_size + 1) × . = .

$V_{ST}$ calculation: $V_{ST}$ was calculated according to (redon et al. ):

$$V_{ST} = \max\left(0,\ \dfrac{var_{all} - \frac{1}{N} \sum_{p \in P} var_p \times n_p}{var_{all}}\right)$$

top $V_{ST}$ loci were considered as the sites with $V_{ST}$ at least three standard deviations above the mean.

identifying unstable loci: a locus was annotated as a candidate for being unstable if at least one individual had outlying k-mer dosage ≥ six standard deviations above the mean, using population and locus specific summary statistics on data discarding individuals with zero no individuals had dosage less than or a bimodal distribution was not detected (diptest v . - , p > . ). among this set, the number of times each genome appeared as an outlier was used to select a set of genomes with an over abundant contribution to fragile loci. any candidate locus with an individual that was an outlier in at least four other loci was removed from the candidate list. the loci were compared to gencode v , excluding readthrough, pseudogenes, noncoding rna, and nonsense transcripts.
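As a rough illustration of the $V_{ST}$ statistic used in this section, the sketch below computes the fraction of total dosage variance that is not explained by the size-weighted average of within-population variances, floored at zero, and then flags loci at least three standard deviations above the mean. It assumes population (not sample) variance throughout, which the text does not specify; the function names and toy data are hypothetical.

```python
def pvariance(values):
    """Population variance (divides by n); an assumption of this sketch."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def v_st(populations):
    """V_ST for one locus from a mapping of population name -> dosage values."""
    all_values = [v for vals in populations.values() for v in vals]
    var_all = pvariance(all_values)
    if var_all == 0:
        return 0.0
    n_total = len(all_values)
    # Within-population variance, weighted by population size.
    within = sum(pvariance(vals) * len(vals) for vals in populations.values())
    within /= n_total
    return max(0.0, (var_all - within) / var_all)


def top_loci(vst_by_locus, n_sd=3.0):
    """Loci whose V_ST is at least n_sd standard deviations above the mean."""
    values = list(vst_by_locus.values())
    mean = sum(values) / len(values)
    sd = pvariance(values) ** 0.5
    return [l for l, v in vst_by_locus.items() if v >= mean + n_sd * sd]


# Toy usage: fully separated populations versus identical populations.
separated = v_st({"p1": [1, 1, 1], "p2": [3, 3, 3]})
identical = v_st({"p1": [1, 3], "p2": [1, 3]})
scores = {"locus%d" % i: 0.0 for i in range(20)}
scores["hit"] = 1.0
flagged = top_loci(scores)
```

When all variance lies between populations the statistic reaches 1, and when populations are indistinguishable it falls to 0, which matches the intuition that noisier length estimates inflate within-population variance and pull $V_{ST}$ down.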
identifying differential motif usage and expansion: sample outliers in the genomes were detected from the read sampling biases over control regions and the tr dosages over , loci using dbscan. a total of / , samples were removed from downstream analysis. we use the eas population as the reference for measuring differential motif usage and expansion. initially, a lasso fit using the statsmodels.api.ols function in python statsmodels v . . (seabold and perktold ) was performed for each locus to identify the k-mer with the most variance explained (vex) in vntr lengths using the following formula:

$$y = Xb + \epsilon,$$

where $y \in \mathbb{R}^n$ is the vntr length of $n$ individuals in the eas population; $X \in \mathbb{R}^{n \times m}$ is the k-mer dosage matrix for $n$ individuals with $m$ k-mers; $b \in \mathbb{R}^m$ is the model coefficient, and $\epsilon \sim N(0, \sigma^2)$ is the error term. the lasso penalty weight $\alpha$ was scanned starting at . with a step size of − . until at least one covariate has a positive weight or $\alpha$ is below . . the k-mer with the highest weight is denoted as the most informative k-mer (mi-kmer) for the locus. to identify loci with differential motif usage between populations, we subtracted the median count of the mi-kmer of the afr from the eas population for each locus, denoted as $kmc_d$. the null distribution of $kmc_d$ was estimated by bootstrap. specifically, eas individuals were sampled with replacement $n_{EAS} + n_{AFR}$ times, matching the sample sizes of the eas and afr populations, respectively. the bootstrap statistics, $kmc_d^*$, were computed by subtracting the median count of the mi-kmer of the last $n_{AFR}$ from the first $n_{EAS}$ bootstrap samples for each locus.
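The bootstrap for differential motif usage can be sketched as follows. The resampling mirrors the description above: draws are taken with replacement from the eas counts only, the first block plays the role of eas and the last block the role of afr, and the difference of medians is recorded. The conversion of the null distribution into a two-sided p value (with an add-one correction) is an assumption of this sketch, as are the function names and the toy counts.

```python
import random
import statistics


def kmc_d(eas_counts, afr_counts):
    """Observed statistic: median mi-kmer count in eas minus that in afr."""
    return statistics.median(eas_counts) - statistics.median(afr_counts)


def bootstrap_null(eas_counts, n_eas, n_afr, n_boot, rng):
    """Null distribution of the statistic, resampling eas individuals only."""
    null = []
    for _ in range(n_boot):
        draw = rng.choices(eas_counts, k=n_eas + n_afr)
        first, last = draw[:n_eas], draw[n_eas:]
        null.append(statistics.median(first) - statistics.median(last))
    return null


def two_sided_p(observed, null):
    """Two-sided empirical p value with an add-one correction (assumption)."""
    extreme = sum(1 for v in null if abs(v) >= abs(observed))
    return (extreme + 1) / (len(null) + 1)


# Toy usage with made-up counts; identical eas counts give an all-zero null.
eas = [5] * 10
afr = [2] * 8
rng = random.Random(0)
observed = kmc_d(eas, afr)
null = bootstrap_null(eas, len(eas), len(afr), n_boot=99, rng=rng)
p_value = two_sided_p(observed, null)
```

Because both bootstrap blocks come from the same population, the null reflects how much the median difference varies from sampling alone at the two sample sizes.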
the estimated null distribution is then used to determine the threshold for calling a locus as having significant differential motif usage between populations (two-sided p < . ). to identify loci with differential motif expansion between populations, we subtracted the proportion of vex by the mi-kmer in the afr from the eas population, denoted as $r^{2}_{d}$. the null distribution of $r^{2}_{d}$ was estimated by bootstrap in a similar sampling procedure as $kmc_d$, except for subtracting the proportion of vex by the mi-kmer in the last $n_{afr}$ from the first $n_{eas}$ bootstrap samples for each locus. the estimated null
distribution is used to determine the threshold for calling a locus as having significant differential motif expansion between populations (two-sided p < . ).

eqtl mapping

retrieving datasets: wgs datasets of individuals, normalized gene expression matrices and covariates of all tissues are accessed from the gtex analysis v (dbgap accession phs .v .p ).

genotype data preprocessing: vntr lengths are genotyped using danbing-tk with options: -gc -thcth -cth -rth . . all the k-mer counts of a locus are summed and adjusted by global read depth and ploidy to represent the approximate length of a locus. sample outliers were detected from the read sampling biases over control regions and the tr dosages over , loci using dbscan. a total of / samples were removed from downstream analysis. adjusted values are then z-score normalized as input for eqtl mapping.

expression data preprocessing: the downloaded expression matrices are already preprocessed such that outliers are rejected and expression counts are quantile normalized to a standard normal distribution. confounding factors such as sex, sequencing platform, amplification method, technical variations and population structure are removed prior to eqtl mapping to avoid spurious associations. technical variations are corrected with the covariates, including peer factors, provided by the gtex consortium. population structures are corrected with the top principal components (pcs) from the snp matrix of all samples. in particular, principal component analysis (pca) was performed jointly on the intersection of the snp sets from gtex samples and kgp omni . snp genotyping arrays (ftp://ftp. genomes.ebi.ac.uk/vol /ftp/release/ /supporting/hd_genotype_chip/all.chip.omni_broad_sanger_combined.
.snps.genotypes.vcf.gz). this is done by first using crossmap v . . to liftover the snp sites from omni . arrays to grch , followed by extracting the intersection of the two snp sets using vcftools isec. the snp set is further reduced by ld-pruning with plink v . b . using the options: --indep , leaving a total of , sites. finally, pca on the joint snp matrix was done by smartpca v . the normalized expression matrix is residualized with the above covariates using the following formula: $Y = (I - H)Y'$, $H = C(C^{T}C)^{-1}C^{T}$, where $Y$ is the residualized expression matrix; $Y'$ is the normalized expression matrix; $H$ is the projection matrix; $I$ is the identity matrix; $C$ is the covariate matrix where each column corresponds to a covariate mentioned above. the residualized expression values are z-score normalized as the input of eqtl mapping.

association test: vntrs within kb to a gene are included for eqtl mapping. linear regression was done using the statsmodels.api.OLS function in python statsmodels v . . (seabold and perktold ) with expression values as the dependent variable and genotype values as the independent variable. nominal p values are computed by performing t tests on the slope. adjusted p values are computed by bonferroni correction on nominal p values. under the assumption of at most one causal vntr per gene, we control the gene-level false discovery rate at %. specifically, the adjusted p values of the lead vntr for each gene are taken as input for the benjamini-hochberg procedure using statsmodels.stats.multitest.fdrcorrection v . . . lead vntrs that passed the procedure are identified as evntrs.
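the two-stage multiple-testing scheme in the association test can be sketched in plain python. this is a hedged illustration, not the published pipeline: the function names and the nested-dictionary input are hypothetical, the bonferroni factor is assumed to be the number of vntrs tested for each gene, and the benjamini-hochberg step is written out by hand here instead of calling statsmodels so that the sketch has no dependencies. the fdr level `alpha` is left as a parameter because the value is not recoverable from the text.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values, alpha):
    """Step-up BH procedure; returns True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank  # largest rank meeting the BH condition
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= k_max
    return rejected


def call_evntrs(nominal_p_by_gene, alpha):
    """nominal_p_by_gene: {gene: {vntr: nominal p}} (hypothetical layout)."""
    lead = {}
    for gene, p_by_vntr in nominal_p_by_gene.items():
        vntrs = list(p_by_vntr)
        # assumption: Bonferroni over the VNTRs tested for this gene
        adjusted = bonferroni([p_by_vntr[v] for v in vntrs])
        best = min(range(len(vntrs)), key=lambda i: adjusted[i])
        lead[gene] = (vntrs[best], adjusted[best])
    genes = list(lead)
    passed = benjamini_hochberg([lead[g][1] for g in genes], alpha)
    return {g: lead[g][0] for g, ok in zip(genes, passed) if ok}
```

taking only the lead vntr per gene into the second stage is what makes the false discovery rate a gene-level quantity.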
data availability the overall analysis pipeline is delivered in a software package at https://github.com/chaissonlab/danbing-tk . genomes acknowledgement: .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://www.codecogs.com/eqnedit.php?latex=y% d(i-h)y% # https://www.codecogs.com/eqnedit.php?latex=h% dc(c% etc)% e% b- % dc% et# https://www.codecogs.com/eqnedit.php?latex=y# https://www.codecogs.com/eqnedit.php?latex=y% # https://www.codecogs.com/eqnedit.php?latex=h# https://www.codecogs.com/eqnedit.php?latex=i# https://www.codecogs.com/eqnedit.php?latex=c# https://github.com/chaissonlab/danbing-tk https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / the following cell lines/dna samples were obtained from the nigms human genetic cell repository at the coriell institute for medical research: [na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na . na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na ,, na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na , na ]. these data were generated at the new york genome center with funds provided by nhgri grant um hg - s . data accession ids are given in supplementary table s . references. genomes project consortium, adam auton, lisa d. brooks, richard m. durbin, erik p. garrison, hyun min kang, jan o. korbel, et al. . “a global reference for human genetic variation.” nature ( ): – . 
audano, peter a., arvis sulovari, tina a. graves-lindsay, stuart cantsilieris, melanie sorensen, annemarie e. welch, max l. dougherty, et al. . “characterizing the major structural variant alleles of the human genome.” cell ( ): – .e . bakhtiari, mehrdad, jonghun park, yuan-chun ding, sharona shleizer-burko, susan l. neuhausen, bjarni v. halldórsson, kári stefánsson, melissa gymrek, and vineet bafna. . “variable number tandem repeats mediate the expression of proximal genes.” biorxiv . https://doi.org/ . / . . . . bakhtiari, mehrdad, sharona shleizer-burko, melissa gymrek, vikas bansal, and vineet bafna. . “targeted genotyping of variable number tandem repeats with advntr.” genome research ( ): – . benson, g. . “tandem repeats finder: a program to analyze dna sequences.” nucleic acids research . https://doi.org/ . /nar/ . . . braida, claudia, rhoda k. a. stefanatos, berit adam, navdeep mahajan, hubert j. m. smeets, florence niel, cyril goizet, et al. . “variant ccg and ggc repeats within the ctg expansion dramatically modify mutational dynamics and likely contribute toward unusual symptoms in some myotonic dystrophy type patients.” human molecular genetics ( ): – .
chaisson, mark j. p., ashley d. sanders, xuefang zhao, ankit malhotra, david porubsky, tobias rausch, eugene j. gardner, et al. . “multi-platform discovery of haplotype-resolved structural variation in human genomes.” nature communications ( ): . chen, sai, peter krusche, egor dolzhenko, rachel m. sherman, roman petrovski, felix schlesinger, melanie kirsche, et al. . “paragraph: a graph-based structural variant genotyper for short-read sequence data.” genome biology ( ): . chin, chen-shan, paul peluso, fritz j. sedlazeck, maria nattestad, gregory t. concepcion, alicia clum, christopher dunn, et al. .
“phased diploid genome assembly with single-molecule real-time sequencing.” nature methods ( ): – . consortium, gtex, and gtex consortium. . “genetic effects on gene expression across human tissues.” nature . https://doi.org/ . /nature . consortium, international human genome sequencing, and international human genome sequencing consortium. . “initial sequencing and analysis of the human genome.” nature . https://doi.org/ . / . dolzhenko, egor, viraj deshpande, felix schlesinger, peter krusche, roman petrovski, sai chen, dorothea emig-agius, et al. . “expansionhunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions.” bioinformatics ( ): – . du, zhenglin, liang ma, hongzhu qu, wei chen, bing zhang, xi lu, weibo zhai, et al. . “whole genome analyses of chinese population and de novo assembly of a northern han genome.” genomics, proteomics & bioinformatics ( ): – . eggertsson, hannes p., snaedis kristmundsdottir, doruk beyter, hakon jonsson, astros skuladottir, marteinn t. hardarson, daniel f. gudbjartsson, kari stefansson, bjarni v. halldorsson, and pall melsted. . “graphtyper enables population-scale genotyping of structural variation using pangenome graphs.” nature communications . https://doi.org/ . /s - - - . fairley, susan, ernesto lowy-gallego, emily perry, and paul flicek. . “the international genome sample resource (igsr) collection of open human genomic variation resources.” nucleic acids research (d ): d – . fotsing, stephanie feupe, jonathan margoliash, catherine wang, shubham saini, richard yanicky, sharona shleizer-burko, alon goren, and melissa gymrek. . “the impact of short tandem repeat variation on gene expression.” nature genetics ( ): – . franke, andre, dermot p. b. mcgovern, jeffrey c. barrett, kai wang, graham l. radford-smith, tariq ahmad, charlie w. lees, et al. . “genome-wide meta-analysis increases to the number of confirmed crohn’s disease susceptibility loci.” nature genetics ( ): – . 
garrison, erik, jouni sirén, adam m. novak, glenn hickey, jordan m. eizenga, eric t. dawson, william jones, et al. . “variation graph toolkit improves read mapping by representing genetic variation in the reference.” nature biotechnology ( ): – . gatchel, jennifer r., and huda y. zoghbi. . “diseases of unstable repeat expansion: mechanisms and common principles.” nature reviews. genetics ( ): – . gymrek, melissa, thomas willems, audrey guilmatre, haoyang zeng, barak markus, stoyan georgiev, mark j. daly, et al. . “abundant contribution of short tandem repeats to gene expression variation in humans.” nature genetics ( ): – . gymrek, melissa, thomas willems, david reich, and yaniv erlich. . “interpreting short tandem repeat variations in humans using mutational constraint.” nature genetics . https://doi.org/ . /ng. . hannan, anthony j. . “tandem repeat polymorphisms: modulators of disease susceptibility and candidates for ‘missing heritability.’” trends in genetics . https://doi.org/ . /j.tig. . . . hannan, anthony j. . “tandem repeats mediating genetic plasticity in health and disease.” nature reviews. genetics ( ): – . hickey, glenn, david heller, jean monlong, jonas a. sibbesen, jouni sirén, jordan eizenga, eric t. dawson,
erik garrison, adam m. novak, and benedict paten. . “genotyping structural variants in pangenome graphs using the vg toolkit.” genome biology ( ): . iqbal, zamin, mario caccamo, isaac turner, paul flicek, and gil mcvean. .
“de novo assembly and genotyping of variants using colored de bruijn graphs.” nature genetics ( ): – . iqbal, zamin, isaac turner, and gil mcvean. . “high-throughput microbial population genomics using the cortex variation assembler.” bioinformatics . https://doi.org/ . /bioinformatics/bts . jiang, zhaoshi, haixu tang, mario ventura, maria francesca cardone, tomas marques-bonet, xinwei she, pavel a. pevzner, and evan e. eichler. . “ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution.” nature genetics ( ): – . kolmogorov, mikhail, jeffrey yuan, yu lin, and pavel a. pevzner. . “assembly of long, error-prone reads using repeat graphs.” nature biotechnology ( ): – . koolen, d. a., a. j. sharp, j. a. hurst, h. v. firth, s. j. l. knight, a. goldenberg, p. saugier-veber, et al. . “clinical and molecular delineation of the q . microdeletion syndrome.” journal of medical genetics ( ): – . koren, sergey, brian p. walenz, konstantin berlin, jason r. miller, nicholas h. bergman, and adam m. phillippy. . “canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.” genome research ( ): – . lapierre, nathan, kodi taraszka, helen huang, rosemary he, farhad hormozdiari, and eleazar eskin. n.d. “identifying causal variants by fine mapping across multiple studies.” https://doi.org/ . / . . . . li, heng, jonathan m. bloom, yossi farjoun, mark fleharty, laura gauthier, benjamin neale, and daniel macarthur. n.d. “new synthetic-diploid benchmark for accurate variant calling evaluation.” https://doi.org/ . / . li, heng, xiaowen feng, and chong chu. . “the design and construction of reference pangenome graphs with minigraph.” genome biology ( ): . mallick, swapan, heng li, mark lipson, iain mathieson, melissa gymrek, fernando racimo, mengyao zhao, et al. . “the simons genome diversity project: genomes from diverse populations.” nature ( ): – . 
mousavi, nima, sharona shleizer-burko, richard yanicky, and melissa gymrek. . “profiling the genome-wide landscape of tandem repeat expansions.” nucleic acids research ( ): e . paten, benedict, adam m. novak, jordan m. eizenga, and erik garrison. . “genome graphs and the evolution of genome inference.” genome research ( ): – . pevzner, pavel a., haixu tang, and glenn tesler. . “de novo repeat classification and fragment assembly.” genome research ( ): – . porubsky, david, shilpa garg, ashley d. sanders, jan o. korbel, victor guryev, peter m. lansdorp, and tobias marschall. . “dense and accurate whole-chromosome haplotyping of individual genomes.” nature communications ( ): . porubsky, david, human genome structural variation consortium, peter ebert, peter a. audano, mitchell r. vollger, william t. harvey, pierre marijon, et al. . “fully phased human genome assembly without parental data using single-cell strand sequencing and long reads.” nature biotechnology . https://doi.org/ . /s - - - . rakocevic, goran, vladimir semenyuk, wan-ping lee, james spencer, john browning, ivan j. johnson, vladan arsenijevic, et al. . “fast and accurate genomic analyses using genome graphs.” nature genetics . https://doi.org/ . /s - - - . raphael, benjamin, degui zhi, haixu tang, and pavel pevzner. . “a novel method for multiple alignment of sequences with repeated and shuffled elements.” genome research ( ): – . rautiainen, mikko, veli mäkinen, and tobias marschall. . “bit-parallel sequence-to-graph alignment.” bioinformatics ( ): – .
redon, richard, shumpei ishikawa, karen r. fitch, lars feuk, george h. perry, t. daniel andrews, heike fiegler, et al. . “global variation in copy number in the human genome.” nature ( ): – .
saini, shubham, ileena mitra, nima mousavi, stephanie feupe fotsing, and melissa gymrek. . “a reference haplotype panel for genome-wide imputation of short tandem repeats.” nature communications ( ): . seo, jeong-sun, arang rhie, junsoo kim, sangjin lee, min-hwan sohn, chang-uk kim, alex hastie, et al. . “de novo assembly and phasing of a korean human genome.” nature ( ): – . shi, lingling, yunfei guo, chengliang dong, john huddleston, hui yang, xiaolu han, aisi fu, et al. . “long-read sequencing and de novo assembly of a chinese genome.” nature communications (june): . song, janet h. t., craig b. lowe, and david m. kingsley. . “characterization of a human-specific tandem repeat associated with bipolar disorder and schizophrenia.” american journal of human genetics ( ): – . sudmant, peter h., swapan mallick, bradley j. nelson, fereydoun hormozdiari, niklas krumm, john huddleston, bradley p. coe, et al. . “global diversity, population stratification, and selection of human copy-number variation.” science ( ): aab . taliun, daniel, daniel n. harris, michael d. kessler, jedidiah carlson, zachary a. szpiech, raul torres, sarah a. gagliano taliun, et al. . “sequencing of , diverse genomes from the nhlbi topmed program.” biorxiv . https://doi.org/ . / . viguera, e., d. canceill, and s. d. ehrlich. . “replication slippage involves dna polymerase pausing and dissociation.” the embo journal ( ): – . wellcome trust case control consortium, australo-anglo-american spondylitis consortium (tasc), paul r. burton, david g. clayton, lon r. cardon, nick craddock, panos deloukas, et al. . “association scan of , nonsynonymous snps in four diseases identifies autoimmunity variants.” nature genetics ( ): – . witoelar, aree, iris e. jansen, yunpeng wang, rahul s. desikan, j. raphael gibbs, cornelis blauwendraat, wesley k. thompson, et al. . “genome-wide pleiotropy between parkinson disease and autoimmune diseases.” jama neurology ( ): – . 
ye, chun jimmie, jenny chen, alexandra-chloé villani, rachel e. gate, meena subramaniam, tushar bhangale, mark n. lee, et al. . “genetic analysis of isoform usage in the human anti-viral response reveals influenza-specific regulation of transcripts under balancing selection.” genome research ( ): – . zook, justin m., nancy f. hansen, nathan d. olson, lesley chapman, james c. mullikin, chunlin xiao, stephen sherry, et al. . “a robust benchmark for detection of germline large deletions and insertions.” nature biotechnology , june. https://doi.org/ . /s - - - .

author contributions. t.y.l. and m.j.p.c. performed data analysis and wrote the manuscript. m.j.p.c. supervised the work. hgsvc generated sequencing data.
a read count-based method to detect multiplets and their cellular origins from snatac-seq data

asa thibodeau*, alper eroglu*, nathan lawlor, djamel nehar-belaid, romy kursawe, radu marches, george a. kuchel, jacques banchereau, michael l. stitzel, a. ercument cicek, duygu ucar

the jackson laboratory for genomic medicine, farmington, ct, , usa university of connecticut center on aging, uconn health center, farmington, ct, , usa department of genetics and genome sciences, university of connecticut health center, farmington, ct, , usa institute for systems genomics, university of connecticut health center, farmington, ct, , usa.
computer engineering department, bilkent university, ankara, , turkey computational biology department, carnegie mellon university, pittsburgh, pa, , usa * these authors contributed equally to this work. correspondence: duygu.ucar@jax.org abstract similar to other droplet-based single cell assays, single nucleus atac-seq (snatac-seq) data harbor multiplets that confound downstream analyses. detecting multiplets in snatac-seq data is particularly challenging due to its sparsity and trinary nature ( reads: closed chromatin, : open in one allele, : open in both alleles), yet offers a unique opportunity to infer multiplets when > uniquely aligned reads are observed at multiple loci. here, we implemented the first read count-based multiplet detection method, atac-doubletdetector, that detects multiplets independently of cell-type. using pbmc and pancreatic islet datasets, atac-doubletdetector captured simulated heterotypic multiplets (different cell-types) with ~ . recall, showing ~ % improvement over state of the art. atac-doubletdetector detected homotypic multiplets with ~ . recall, representing the first method to detect multiplets originating from the same cell type. using our novel clustering-based algorithm, multiplets were annotated to their cellular origins with ~ % accuracy. application of atac-doubletdetector will improve downstream analysis of snatac-seq. (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . main single nucleus atac-seq (snatac-seq) – technology is widely used to study epigenomes of diverse cells and tissues with increased resolution , . however, as with other droplet based single cell technologies, snatac-seq data harbor multiplet nuclei . 
the presence of multiplets can confound downstream analyses by introducing combined epigenomic profiles that originate from two or more nuclei, increasing the difficulty of clustering and comparing different cell types within a sample. compared to other single cell assays, the difficulty of detecting multiplets in snatac-seq is further increased due to data sparsity and the trinary nature of chromatin accessibility levels (e.g., reads: closed chromatin, : open in one allele, : open in both alleles). the current state of the art for detecting multiplets in snatac-seq data adapt detection methods developed for single cell rnaseq (scrna-seq). notably, two snatac-seq data analysis packages, snapatac and archr , either employ or implement a method similar to multiplet detection methods (i.e., doubletfinder and scrublet ) for scrna-seq. in these methods, synthetic heterotypic multiplets (i.e., originating from different cell types) are simulated by combining profiles of two or more cells, which are then used to detect putative multiplets based on cluster similarity. such algorithms assume that multiplets and singlets exhibit distinct genomic profiles, which becomes problematic when true singlets share genomic profiles with two or more cell types. under this assumption, these methods will fail to detect homotypic multiplets (i.e., originating from the same cell type) since their overall genomic profile is considered to be similar to that of the underlying cell type. however, homotypic multiplets are characterized by increased read counts compared to singlets, suggesting new methods that utilize read counts can detect them. in order to overcome the limitations of existing methods to detect both homotypic and heterotypic multiplets, we developed a novel multiplet detection method, atac- doubletdetector, that exploits read count distributions to infer multiplets in snatac-seq data. 
atac-doubletdetector’s efficacy was tested in two snatac-seq datasets generated from peripheral blood mononuclear cell (pbmc) samples (n= ) and pancreatic islet tissues (n= ). we identified multiplets in these tissues and quantified the algorithm’s efficacy using simulated homotypic and heterotypic multiplets. we found that when snatac-seq samples were adequately sequenced (e.g., > k valid read pairs per cell), atac-doubletdetector proved very effective for detecting both homotypic and heterotypic multiplets (recall ranging from . - . in pbmcs). in addition, atac-doubletdetector includes a novel clustering-based algorithm that accurately annotates the cellular origins of detected multiplets ( % average accuracy in our simulations), providing further data quality insights. atac-doubletdetector is provided as a user-friendly computational framework with documentation and source code freely available at: https://github.com/ucarlab/atac-doubletdetector.
results
atac-doubletdetector leverages the fact that the expected number of uniquely aligned reads for a given locus ranges from to per nucleus in snatac-seq data: = closed chromatin, = open in one allele (i.e., from either maternal or paternal chromosomes), = open in two alleles (i.e., both maternal and paternal chromosomes) (fig. a). a locus can have more than two reads (> ) when: ) it contains repetitive sequences; ) there are sequencing or alignment errors; or ) reads stem from multiplet nuclei. in the case of multiplets, we expect to observe many loci with > reads since their epigenomic profiles are derived from two or more nuclei, resulting in increased accessible dna.
atac-doubletdetector identifies all loci with > reads for each cell/nucleus (fig. b) by utilizing sorted read alignments to detect their overlapping read intervals ( - bp on average across all samples). a unified list of these loci across all nuclei is then generated to quantify the number of occurrences where > reads align to a locus in a given nucleus (fig. c). as a proof of concept, highly significant multiplets (p-values < - ) can be clearly seen harboring many more loci with > reads ( - loci) than average (~ loci per nucleus) (extended data fig. ). random occurrences of loci with > reads (i.e., due to sequencing or alignment errors) were modeled with the poisson cumulative distribution function using the mean number of overlaps detected across all cells. nuclei that harbor significantly more loci with > reads are identified as multiplets based on their deviations from the distribution using false discovery rate (fdr) (fig. c). to trace multiplets back to their cellular origins, we employed a clustering-based algorithm as part of the atac-doubletdetector framework. marker peaks are detected to generate reference accessibility profiles for each cell type using single cell clustering. epigenomic similarity scores at marker peaks are then used to compare multiplet profiles with singlet profiles to differentiate between heterotypic and homotypic multiplets and annotate them. we demonstrate the utility and performance of our computational framework by applying our methods in pbmc and islet sample datasets (fig. d). first, we simulated artificial multiplets in pbmc and islet samples and quantified atac-doubletdetector’s ability to identify and annotate these multiplets. second, we compared atac-doubletdetector to archr, measuring their overall performances and their ability to detect simulated heterotypic and homotypic multiplets. finally, we measured the efficacy of our annotation method and analyzed multiplet cellular origins to understand whether cell type influences the rate of multiplet occurrences.
atac-doubletdetector detects heterotypic and homotypic multiplets in pbmc and islet samples.
we generated snatac-seq libraries from two human pbmc and two human pancreatic islet samples using the x genomics chromium platform . sequence reads were preprocessed using the cell ranger atac pipeline (methods), resulting in an average of , and , nuclei per sample and an average of , and , valid read pairs per cell for pbmc and islet samples respectively (fig. a). valid read pairs refer to all pairs of paired end reads that align to autosomes and pass quality control flags/thresholds (methods). despite deeper sequencing for islet samples, fewer valid read pairs were observed in islet samples compared to pbmc samples (fig. b), which can be explained by increased mitochondrial reads in islets ( , , and , , total reads aligned to chrm) compared to pbmcs ( , , and , total reads aligned to chrm). nuclei clustering using an in-house implementation (methods) of a two-pass clustering method for snatac-seq data identified and clusters for pbmc and pbmc . correlating pseudo-bulk accessibility profiles of these clusters with accessibility maps from sorted bulk atac-seq data (extended data fig. a,b) grouped them into major cell types: myeloid (including cd +, cd monocytes and conventional dendritic cells), b, cd + t, cd + t, and nk cells (extended data fig. c,d). these annotations were confirmed based on chromatin accessibility patterns at cell-specific marker genes (extended data fig. a,b).
the same clustering procedure identified and distinct clusters for islet and islet , which were then annotated as alpha, beta, delta, and ductal cells by integrating their accessibility profiles with in-house islet scrna-seq data (extended data fig. a,b). these annotations were confirmed by analyzing the chromatin accessibility patterns at known cell-specific marker genes (extended data fig. c,d). we applied atac-doubletdetector to pbmc and human islet samples using an fdr cutoff of . (methods). nuclei detected as multiplets were distributed throughout all clusters (fig. c-d, extended data fig. ) and in one case (pbmc ) multiplets formed their own distinct cluster (see selected multiplets in fig. d). the percentage of detected multiplets was higher in pbmcs ( %, . %) compared to islets ( % for both samples) (fig. e), which is likely due to the lower number of valid read pairs per nucleus in islets as previously mentioned (fig. b). to further study the biological relevance of these detected multiplets, we selected a cluster which exclusively encompassed multiplets (fig. d; pbmc selected multiplets) and analyzed their chromatin accessibility profiles (fig. f). the selected multiplets were characterized by high chromatin accessibility at the promoters of both cd g (t cell marker gene) and lyz (monocyte marker gene), suggesting t cell-monocyte multiplets. these results demonstrate how read count distribution information from snatac-seq can be used to effectively detect multiplets.
atac-doubletdetector effectively detects simulated heterotypic and homotypic multiplets.
to quantify the efficacy of atac-doubletdetector, we generated artificial multiplets by randomly selecting % of nuclei in a sample and pairing them together to artificially form multiplets (repeated times per sample). this resulted in artificial multiplets at . % of the total number of nuclei within a sample. these artificial multiplets serve as positive multiplet examples and enable us to measure recall (i.e., the fraction of detected artificial multiplets among all artificial multiplets introduced in the sample). we first evaluated atac-doubletdetector’s ability to detect heterotypic, homotypic, and a combination of both multiplet types. we then compared its performance to that of another method, archr . atac-doubletdetector detected heterotypic multiplets introduced in pbmc samples with high recall (average recall . for pbmc and . for pbmc over runs), outperforming archr ( . and . respectively) (fig. a). average recall for atac-doubletdetector was lower in islet and islet than in pbmcs ( . and . average recall respectively) whereas the average recall showed improvement for archr ( . and . average recall respectively). the decreased performance of atac-doubletdetector in islets can be explained by the low number of valid read pairs per nucleus in islet samples compared to pbmcs (fig b). notably, atac-doubletdetector was equally effective for detecting homotypic multiplets (average recall . and . for pbmc and pbmc , . and . for islet and islet ) (fig. b), demonstrating the utility of using read counts to detect multiplets. as expected, archr had low recall for detecting homotypic multiplets (average between . and . for all samples), as this algorithm identifies multiplets with distinct genomic profiles from singlets. finally, we measured the efficacy to simultaneously detect both types of multiplets by introducing a more realistic heterotypic and homotypic multiplet : ratio (extended data fig. a).
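the evaluation described above can be sketched in a few lines. this is a minimal illustration and not the authors' code: the function names `pair_nuclei` and `recall`, the use of string barcodes, and the `fraction` argument are all assumptions made for the example, and the step that merges the reads of each pair into one artificial multiplet is omitted.

```python
import random
from typing import List, Sequence, Set, Tuple


def pair_nuclei(barcodes: Sequence[str], fraction: float,
                rng: random.Random) -> List[Tuple[str, str]]:
    """Randomly pick a fraction of nuclei and pair them up (hypothetical helper)."""
    n_selected = int(len(barcodes) * fraction)
    n_selected -= n_selected % 2  # an even number is needed to form pairs
    chosen = rng.sample(list(barcodes), n_selected)
    return [(chosen[i], chosen[i + 1]) for i in range(0, n_selected, 2)]


def recall(detected: Set[str], artificial: Set[str]) -> float:
    """Fraction of artificial multiplets that were detected."""
    if not artificial:
        raise ValueError("no artificial multiplets were introduced")
    return len(detected & artificial) / len(artificial)


if __name__ == "__main__":
    nuclei = [f"nucleus_{i}" for i in range(100)]
    pairs = pair_nuclei(nuclei, fraction=0.1, rng=random.Random(0))
    print(len(pairs), "artificial multiplets formed")
```

because each artificial multiplet consumes two nuclei, the number of artificial multiplets is half the number of selected nuclei, which is why the text reports them as a smaller percentage of the sample than the fraction selected.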
as expected, atac-doubletdetector’s average recall values were similar ( . and . for pbmc and pbmc , . and . for islet and islet respectively), while those of archr were lower ( . and . for pbmc and pbmc , . and . for islet and islet ), likely due to its poor homotypic multiplet detection performance. to further study how the valid read pairs influence atac-doubletdetector’s performance, we generated artificial multiplets using cells with a range of reads per nucleus (fig c-d, extended data fig. b). we observed a noticeable increase in average recall (> . recall) for atac-doubletdetector when the number of valid read pairs was above . k, corresponding to an average of . k valid read pairs per nucleus. in contrast, archr did not show significant differences in performance with respect to the number of valid read pairs per nucleus (extended data fig. b), as it relies more on genomic profile similarity to detect multiplets. more exhaustive analyses of repetitions per sample further confirmed that the majority ( %, % for pbmc and pbmc and %, % for islet and islet ) of multiplets with > k valid read pairs (i.e., multiplets formed from nuclei with k valid read pairs each) were detected with this method (extended data fig. ). together, these analyses suggest that when > k valid read pairs are captured per nucleus, atac-doubletdetector is very effective in detecting both homotypic and heterotypic multiplets from snatac-seq data. to compare atac-doubletdetector and archr performances, we ran archr with recommended parameter settings (i.e., k= nearest neighbors and . filter ratio). only to multiplets across all samples were detected by both methods (fig e-f, extended data fig. , extended data fig.
a-b) and the majority of these multiplets were among the ones that formed their own clusters (i.e., heterotypic multiplets). for example, the majority of selected multiplets detected in cluster in fig d were detected by both methods (extended data fig. ); these are multiplets that have unique epigenomic profiles and are hence easier to detect with the synthetic multiplet-based method employed by archr. notably, . % of delta cells were identified as multiplets by archr for islet (figure f, extended data fig. ). delta cells resemble both alpha and beta cells in their genomic profile, hence these cells were mistakenly detected as multiplets by archr, demonstrating a pitfall for synthetic multiplet-based methods. multiplets are expected to have higher read counts than singlets since they combine chromatin accessibility profiles of more than one nucleus. in alignment with this, multiplets detected by atac-doubletdetector had significantly higher valid read pair counts compared to singlets (average valid read pairs of , for multiplets and , for singlets for all samples) (p-values < . x - ). in contrast, read counts for archr multiplets were significantly lower (average p-values < . x - ) than atac-doubletdetector multiplets, with read counts closer to those of singlets (average read count per cell , for archr multiplets and , for singlets) (extended data fig. c). in summary, these analyses showed that when there is a sufficient number of valid read pairs per cell (> k), count based methods are advantageous over synthetic multiplet-based methods as they can accurately detect both homotypic and heterotypic multiplets.
marker peaks can effectively annotate cellular origins of multiplets.
cellular origin annotations of multiplets were inferred using a three-step algorithm (fig. a). first, nuclei were clustered and annotated to their respective cell types. second, marker peaks were detected for each cluster/cell type. third, we calculated epigenomic similarity of each multiplet to different cell types by counting marker peak reads for the multiplet and the k= nearest neighbor nuclei (methods). cluster similarity scores were then used to annotate multiplets. for example, in pbmcs, for each multiplet we calculated scores, where each score represents the similarity of the multiplet epigenome to that of the five studied clusters (figure b). the distribution of these similarity scores are used to first distinguish heterotypic and homotypic multiplets, by comparing their profiles to annotated singlets (methods). for example, in pbmc , nuclei in b cell cluster (cluster ) had high similarity score for b cell marker peaks and low scores for all other cell types (figure b). in contrast, nuclei in cluster had high similarity scores for nk, cd + t, cd + t and myeloid cells, a signature of heterotypic multiplets (fig. b). once the multiplet type is identified, their cellular origins are annotated using the highest scoring cell type(s). we evaluated the efficacy of this annotation pipeline using artificial multiplets, where cells were randomly selected and paired together to form both heterotypic and homotypic multiplets. using these artificial multiplets, we categorized multiplets as homotypic or heterotypic and annotated multiplets with respect to the number of cell types associated with them. we identified the cellular origins of both types of multiplets with an average accuracy of . %, . % in pbmc , pbmc and . %, . % in islet , islet (fig. c). for example, in pbmc , % of all simulated b and myeloid multiplets were correctly annotated. 
cell types that have similar functions, hence similar epigenomes, showed lower annotation accuracies, such as % for simulated nk and cd + t cells. our annotations were equally effective for annotating both homotypic and heterotypic multiplets, showing . % accuracy on average to annotate homotypic multiplets and . % accuracy to annotate heterotypic multiplets.
multiplet cell-type compositions reflect cellular compositions of the underlying tissue.
using atac-doubletdetector’s annotation pipeline, we annotated all detected multiplets in pbmcs and islets. inspection of aggregate accessibility profiles at marker gene promoters (ms a , cd g, cd , cd a, trem , nkg , and klrf ) for each cell type in pbmc (fig. a) revealed that annotated multiplets have accessibility at relevant marker gene promoters. for instance, homotypic b cell multiplets had strong signal at the promoter of b cell marker gene ms a , whereas heterotypic multiplets originating from cd + t cell and b cells had high accessibility signals for both b cell marker gene ms a and cd + t cell marker gene cd a. as expected, homotypic multiplets clustered together with the underlying cell type, whereas heterotypic multiplets typically formed their own clusters (fig. b-c, extended data fig. a-b). the majority of heterotypic multiplets for islet were found between major cell type clusters and near the delta cell cluster while homotypic multiplets resided within the boundaries of singular cell type clusters (fig. d). for pbmc , the majority of multiplets resided within the multiplet cluster we previously identified and as a subcluster of cd + t cells (fig. e). as before, homotypic multiplets were found within corresponding cell type clusters.
overall, the majority of detected multiplets were homotypic ( . - . % in islets, - . % in pbmcs), with cell types being distributed with respect to their cell proportions for both homotypic and heterotypic multiplet types (fig. d-e, extended data fig. c-d). indeed, in both tissues, the propensity of a cell type to form a multiplet was positively correlated with the percent of that cell type within the tissue (pearson’s r = . , . , p-value < . , . for pbmc and pbmc , pearson’s r = . , . p-value < . , . for islet and islet ) (fig. f-g, extended data fig. e-f), suggesting that snatac-seq multiplets are more likely to occur randomly than through specific interactions between nuclei. for example, the most abundant cell type in islet was beta cells ( . % of the cell population) which contributed to . % of multiplets (fig. f). heterotypic multiplet annotations in islet samples mostly originated from alpha, beta and delta cells. in pbmcs, the most frequent heterotypic multiplets were the ones stemming from cd + t and cd + t cells (fig. f, extended data fig. e).
discussion
detecting and discarding multiplets from snatac-seq data is a critical step for improving data quality as multiplets can form their own clusters and can confound downstream analyses. atac-doubletdetector exploits read count distributions for a given nucleus to effectively detect and eliminate multiplets without requiring prior knowledge of cell-type information. it accomplishes this by first efficiently counting loci with > uniquely aligned reads per nucleus and identifying nuclei with read count distributions deviating from expectations.
unlike other methods that utilize artificial multiplet examples to identify putative multiplets (i.e., archr), atac- doubletdetector is capable of detecting both homotypic (i.e., multiplets originating from the same cell type) and heterotypic multiplets (i.e., multiplets originating from different cell types). eliminating heterotypic multiplets is essential for improved clustering and differential analyses between clusters and samples, whereas homotypic multiplets introduce bias in allele-specific analyses. hence, detecting and removing both types of multiplets will improve downstream analyses. the number of valid read pairs per cells is the most important factor affecting the performance of atac- doubletdetector. when read depth per nucleus is sufficiently high (e.g., > k read pairs per nucleus), atac- doubletdetector is very effective in detecting both heterotypic and homotypic multiplets (average recall = . to detect artificial multiplets in pbmcs). since atac-doubletdetector does not depend on artificial multiplet examples, it is not inherently biased towards cell types that resemble others. for example, in islets, delta cells transcriptionally resemble alpha and beta cells, hence artificial multiplets generated by combining alpha and beta cells have genomic profiles that resemble delta cells. these instances are particularly challenging for methods that depend on artificial multiplet examples (e.g., archr for snatac , doubletfinder and scrublet for scrna- seq). in alignment with this, archr categorized . % of delta cells as multiplets in islet . given the success of atac-doubletdetector for identifying multiplets from snatac-seq data with enough reads per nuclei, it can also be effective in detecting and eliminating multiplets in recent multi-ome transcriptome and epigenome assays . epigenomic signal at marker peaks is an effective way to annotate cellular origins of multiplets, where we achieved . % accuracy on average in simulations. 
annotations of detected multiplets showed that the majority are homotypic. furthermore, the propensity of nuclei to form multiplets was positively correlated with the abundance of that cell type within the tissue. since cells are lysed and nuclei are profiled in snatac-seq protocols , these assays are unlikely to be prone to biological multiplets arising from cell-cell interactions. therefore, snatac-seq multiplets likely occur randomly among all cells; hence the most abundant cells are the most likely to form multiplets. quantifying the efficacy of multiplet detection methods is a challenging task since true examples of singlets and multiplets are not known. to overcome this challenge, we evaluated atac-doubletdetector’s ability to capture multiplets by simulating artificial multiplets, enabling us to measure recall. atac-doubletdetector identified - . % of cells as multiplets in islet and pbmc samples, which was in alignment with expectations. hence, we believe false positive calls are also limited in our method. although we quantified our method by forming artificial multiplets, the atac-doubletdetector pipeline can be easily extended to capture and annotate multiplets that include data from multiple nuclei. multiplets are inevitable in single cell sequencing and performing better data analyses calls for their removal. atac-doubletdetector introduces a novel and effective count-based solution for detecting multiplets and provides a framework for annotating their cellular origins, improving future downstream analyses. atac-doubletdetector code and documentation are freely available at https://github.com/ucarlab/atac-doubletdetector, providing an easy to use interface for users of all backgrounds.
our multiplet detection algorithm is fast and can be incorporated into data analysis pipelines, where processing of an average library (i.e., ~ , cells at ~ , valid read pairs per cell) takes < minutes.
methods
snatac-seq cell labeling, capture, library preparation, and sequencing.
for single nucleus atac sequencing (snatacseq) experiments, viable single cell suspensions from each sample were used to generate snatacseq data using the x chromium platform according to the manufacturer’s protocols (demonstrated protocol nuclei isolation for atac sequencing document cg ; chromium single cell atac_user guide revb document cg ). briefly, > , cells of interest were centrifuged, the supernatant was removed without disrupting the cell pellet, lysis buffer was added for minutes on ice to generate isolated and permeabilized nuclei, followed by quenching by dilution with wash buffer. after centrifugation to pellet the washed nuclei, diluted nuclei buffer was used to re-suspend nuclei at the desired nuclei concentration as determined using a countess ii fl automated cell counter and combined with atac buffer and atac enzyme to form a transposition mix. transposed nuclei were immediately combined with barcoding reagent, reducing agent b and barcoding enzyme and loaded onto a x chromium chip e for droplet generation, followed by library construction. the barcoded sequencing libraries were subjected to bead clean-up and checked for quality on an agilent tapestation, quantified by qpcr (kapa biosystems library quantification kit for illumina platforms), and pooled for sequencing on an illumina novaseq s flow cell (paired-end libraries x bp).
human islet isolation
human islets were obtained through partnerships with the integrated islet distribution program (iidp, http://iidp.coh.org/). assessment of human islet function was performed by islet gsis static incubation assay on the day after arrival, following the iidp protocol. primary human islets were cultured in prodo media (pim-s + supplements pim-g + pim-abs) in % co at oc for ~ hours prior to beginning studies. in preparation of single cell suspension for x platform, human islets were dispersed with stempro accutase (thermo fisher scientific) ml/ ieq for min at oc. islet single cell suspension was washed three times in pbs- . % bsa and cell number determined using countess ii fl automated cell counter (life tech). nuclei isolation for single cell atac sequencing was performed following the x protocol (https://assets.ctfassets.net/an im xiti/ g d ngcw ab dfqppho/ a fb ea a c cb d /cg _demonstratedprotocol_nucleiisolation_atac_sequencing_revd.pdf, based on the omni nucleiprep by corces et al. ).
identifying snatac-seq loci with > reads.
position sorted paired-end read alignments from snatac-seq data are compared to detect all loci with > unique reads per nucleus. to avoid instances where reads overlap due to technical reasons, we retained only read pairs that satisfy the following flag values in the htsjdk library: ) readpairedflag = true, ) readunmappedflag = false, ) mateunmappedflag = false, ) secondaryorsupplementary = false, ) duplicatereadflag = false, and referenceindex == matereferenceindex (i.e., read pairs map to the same chromosome).
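the flag-based filter can be illustrated with a small predicate. this is a sketch only: it reads the listed flag values as the conditions a read must satisfy to be kept (consistent with the note that read pairs map to the same chromosome), and the `ReadRecord` class and its field names are hypothetical stand-ins that mirror the flags named above rather than the htsjdk api itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadRecord:
    """Hypothetical, minimal stand-in for an aligned read and its flags."""
    paired: bool
    unmapped: bool
    mate_unmapped: bool
    secondary_or_supplementary: bool
    duplicate: bool
    reference_index: int
    mate_reference_index: int


def passes_flag_filter(read: ReadRecord) -> bool:
    """Keep primary, non-duplicate reads whose mate maps to the same chromosome."""
    return (
        read.paired
        and not read.unmapped
        and not read.mate_unmapped
        and not read.secondary_or_supplementary
        and not read.duplicate
        and read.reference_index == read.mate_reference_index
    )


if __name__ == "__main__":
    good = ReadRecord(True, False, False, False, False, 0, 0)
    split = ReadRecord(True, False, False, False, False, 0, 3)
    print(passes_flag_filter(good), passes_flag_filter(split))
```

the mapping quality and insert size thresholds described next would be applied as further conditions on the reads that pass this predicate.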
to reduce overlaps due to alignment errors, reads are excluded based on i) mapping quality scores less than or equal to , and ii) insert sizes (i.e., the end to end distance between ’ and ’ read positions) greater than bp (~ nucleosomes) in length. to identify instances of > reads overlapping at any specific locus, all intervals are identified for which an overlap was observed for at least two valid read pairs. reads defining each interval are then compared to one another to identify all subintervals that exceed the specified overlap threshold (i.e., ). to efficiently identify these subintervals, for each subset, interval breakpoints were defined at the start and end positions of each paired end read. for each interval breakpoint, an integer value of was assigned to all breakpoints originating from start positions, and - to all breakpoints originating from an end position. interval breakpoints are then visited in position sorted order to generate a cumulative sum based on the assigned values at each breakpoint. the cumulative sum indicates the total number of overlaps between two consecutive interval breakpoints and efficiently identifies all sub-intervals with a number of overlaps greater than the specified threshold. once all subintervals satisfying the threshold are identified for a subset of reads, the algorithm repeats this process for the remaining paired end read subsets. each step is performed using a linear time algorithm (i.e., o(n), n is the number of total reads), with an additional o(log(m)) (m equals the number of nuclei) overhead for each read to identify their respective nucleus origin, resulting in o(n*log(m)) runtime. the runtime can be reduced to an expected o(n) runtime by instead using an appropriate hash function for cell identifiers/barcodes. note that this algorithm assumes that reads are sorted beforehand; otherwise, the runtime is dominated by the time it takes to sort reads by their chromosome and start positions (i.e., o(n*log(n))).
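the breakpoint and cumulative sum procedure can be sketched as a sweep over one chromosome. this is an illustrative reimplementation under stated assumptions, not the released code: intervals are treated as half-open `(start, end)` pairs, the default `threshold=2` follows the statement that a locus can have more than two reads, the function names are hypothetical, and the sketch sorts breakpoints itself (so it runs in o(n*log(n))) instead of relying on position sorted input. nuclei are grouped with a dictionary, in line with the hash-based lookup mentioned above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome


def loci_exceeding_depth(fragments: Iterable[Interval],
                         threshold: int = 2) -> List[Interval]:
    """Return sub-intervals covered by more than `threshold` fragments."""
    breakpoints = []
    for start, end in fragments:
        breakpoints.append((start, 1))   # a fragment opens here
        breakpoints.append((end, -1))    # a fragment closes here
    breakpoints.sort()

    loci: List[Interval] = []
    depth = 0
    open_at = None
    i = 0
    while i < len(breakpoints):
        position = breakpoints[i][0]
        # apply every breakpoint at this position before testing the depth
        while i < len(breakpoints) and breakpoints[i][0] == position:
            depth += breakpoints[i][1]
            i += 1
        if depth > threshold and open_at is None:
            open_at = position
        elif depth <= threshold and open_at is not None:
            loci.append((open_at, position))
            open_at = None
    return loci


def loci_per_nucleus(reads: Iterable[Tuple[str, int, int]],
                     threshold: int = 2) -> Dict[str, List[Interval]]:
    """Group (barcode, start, end) records by barcode, then sweep each group."""
    by_barcode: Dict[str, List[Interval]] = defaultdict(list)
    for barcode, start, end in reads:
        by_barcode[barcode].append((start, end))
    return {bc: loci_exceeding_depth(frags, threshold)
            for bc, frags in by_barcode.items()}


if __name__ == "__main__":
    print(loci_exceeding_depth([(0, 10), (5, 15), (8, 20)]))
```

applying all breakpoints that share a position before testing the depth means that a fragment ending exactly where another begins does not count as an overlap, and that a region stays open when one fragment is replaced by another at the same coordinate.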
detecting significant multiplets from snatac overlap counts.
loci with > reads were first filtered using simple repeats, segmental duplications, repeat masker and blacklist regions obtained from the ucsc genome browser and encode , . next, filtered regions from all nuclei were merged if they overlapped by at least one base pair. using this unified list of loci, a binary matrix was generated where rows in the matrix represent loci with > reads for at least one nucleus, and the columns represent the individual cells within the sample. values within the matrix were assigned to if the cell and genomic region combination observed > reads overlapping, and otherwise. from this matrix, multiplets can be detected using column sums (i.e., the total number of > read overlap instances for each nucleus) while repetitive element sequences can be inferred using row sums (i.e., the total number of cells observing > reads at the same locus). the events of observing > reads overlapping within the same region for multiple cells or across multiple regions within the same cell can be modeled using the poisson distribution. occurrences of these events are independent, counted within set intervals (i.e., counting regions across the entire genome within cells or counting cells within the same genomic regions), are either present or not within these intervals, and have a constant average rate of occurring, satisfying the assumptions of the poisson distribution. we therefore detected significant multiplets and inferred repetitive sequences using the poisson cumulative distribution function, using respective mean row and column sum counts as the expected values to calculate poisson probabilities.
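a compact way to see how column sums become multiplet calls is sketched below. it is an assumption-laden illustration rather than the published implementation: the upper tail is taken as p(x >= k), the probabilities are adjusted with the benjamini-hochberg procedure used in this method, the function names are hypothetical, and the `fdr` value in the usage line is an arbitrary example rather than the cutoff used in the study.

```python
import math
from typing import List


def poisson_pmf(i: int, mean: float) -> float:
    """P(X = i) for X ~ Poisson(mean), evaluated in log space."""
    return math.exp(i * math.log(mean) - mean - math.lgamma(i + 1))


def poisson_upper_tail(k: int, mean: float) -> float:
    """P(X >= k) for X ~ Poisson(mean)."""
    if k <= 0:
        return 1.0
    if mean <= 0:
        return 0.0
    if k <= mean:
        # the lower tail is small here, so the complement is accurate
        return max(0.0, 1.0 - sum(poisson_pmf(i, mean) for i in range(k)))
    # above the mean the terms shrink, so sum the tail directly
    term = poisson_pmf(k, mean)
    total = 0.0
    i = k
    while term > total * 1e-17:
        total += term
        i += 1
        term *= mean / i
    return min(1.0, total)


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def call_multiplets(overlap_counts: List[int], fdr: float) -> List[int]:
    """Indices of nuclei whose count of high-depth loci is unexpectedly large."""
    mean = sum(overlap_counts) / len(overlap_counts)
    p_values = [poisson_upper_tail(count, mean) for count in overlap_counts]
    adjusted = benjamini_hochberg(p_values)
    return [i for i, q in enumerate(adjusted) if q < fdr]


if __name__ == "__main__":
    counts = [3, 4, 2, 5, 3, 4, 60, 3]  # made-up column sums for eight nuclei
    print(call_multiplets(counts, fdr=0.05))
```

the same two functions can be applied to row sums to flag loci that behave like repetitive sequence, which the method removes before testing nuclei.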
in this process, we first use poisson probabilities to infer repetitive sequences where a significant number of nuclei observe > reads at the same genomic region. all inferred repetitive sequence loci are removed from further analysis. next, we calculate the poisson probability of observing more loci with > reads than expected in a nucleus (i.e., multiplets) using column sums. poisson probabilities for both repetitive sequence inference and multiplet detection were corrected using the benjamini-hochberg procedure to adjust for multiple hypothesis testing. repetitive sequences and multiplets were called by selecting regions or cells with adjusted poisson probabilities less than . .
multiplet annotation pipeline. detected multiplets are annotated using clusters identified for snatac-seq samples, merging them with respect to specific cell types present in the cell population. in our study, pbmc clusters were merged to represent cd +t, cd +t, natural killer (nk), myeloid and b cells and islet clusters were merged to represent alpha, beta, delta and ductal cells. marker peaks for all cell type clusters with at least cells were identified with the findmarkers function in seurat, using the logistic regression setting. for the sake of consistency, the top marker peaks are then identified for each cell type cluster based on bonferroni adjusted p-value of average log fold changes. to account for data sparsity in snatac-seq data, aggregate read profiles are calculated for each cell and marker peak. aggregate read profiles are found by taking average read counts for each cell’s nearest neighbors using the top singular value decomposition (svd) components.
the empirical cumulative distribution function in r (i.e., ecdf) is then used to find the abundance of reads for each cluster’s marker peaks. distribution scores represent the percent of each cell type’s accessibility profiles present within the cell. in order to distinguish multiplet types (i.e., heterotypic or homotypic), singlet profiles were calculated for each cell type in the sample. for each cell type’s singlet cells, abundance scores at every marker peak were averaged to find the representative abundance score profile for that cell type. multiplets that have a profile close to their abundant cell type’s singlet profile were classified as homotypic. euclidean distance was used to measure the similarity between the profiles of multiplets and singlets. mixture models were then fitted to the distances with the mclust r package to group multiplets by their closeness to their corresponding cell type’s singlet profile. multiplets in the group with the largest distance to the singlet profile are considered heterotypic. multiplets are then annotated using the top (for homotypic) or (for heterotypic) abundance scores.
snatac-seq nuclei clustering. to cluster nuclei from snatac-seq data, we employed an in-house implementation (https://github.com/ucarlab/snatacclusteringpipeline) of a previously described two pass clustering method, with notable differences. first, we restrict the number of . kb bins in the first pass clustering to the top k bins, up from k bins. for second pass clustering, we increase the number of peaks to include all peaks identified in pass up to k.
integration of scrna-seq and snatac-seq data. integrative clustering and analysis of single cell transcriptomes and single nucleus epigenomes was performed using the r package seurat. first, gene activity scores were derived from the resultant snatac-seq peak count-matrix using the creategeneactivitymatrix function with default parameters. next, single nuclei with < , total read counts were discarded from analyses.
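Returning to the multiplet-type classification described above, the building blocks (an empirical CDF, an averaged singlet profile per cell type, and a Euclidean distance between profiles) can be sketched as follows. The helper names are hypothetical and the exact way abundance scores are derived from the ECDF is an assumption; the Gaussian mixture step performed with mclust in the text is not reproduced here.

```python
import math
from bisect import bisect_right
from typing import Callable, List, Sequence


def ecdf(values: Sequence[float]) -> Callable[[float], float]:
    """Empirical CDF: F(x) is the fraction of observations <= x (as R's ecdf)."""
    xs = sorted(values)
    n = len(xs)

    def F(x: float) -> float:
        return bisect_right(xs, x) / n

    return F


def mean_profile(profiles: Sequence[Sequence[float]]) -> List[float]:
    """Average abundance score at each marker peak across singlet cells."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two abundance score profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Under these assumptions, a multiplet whose distance to its dominant cell type's `mean_profile` is small would be a candidate homotypic multiplet, while the group with the largest distances would be treated as heterotypic.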
the resultant single nuclei and gene activity scores were log normalized and scaled. using the processed scrna-seq data (also analyzed with seurat), we identified anchors between the snatac-seq gene activity score matrix and scrna-seq gene expression matrix following the methodology described by butler et al. ( ). after identifying anchors between the datasets, cell-type labels from the scrna-seq dataset were transferred to the snatac-seq dataset and a prediction and confidence score was assigned for each cell.
simulating artificial multiplets to measure multiplet detection performances. to measure recall for detecting multiplets, artificial multiplets were simulated by combining accessibility profiles of nuclei within each sample population tested. for each sample, cells equal to % of the total cell population were randomly selected and paired together to introduce artificial multiplets equivalent to . % of the total population. introducing . % artificial multiplets ensured that they were not the majority compared to real multiplets ( - % of cells across all samples) present in the data. cell pairs were randomly reselected until they formed heterotypic, homotypic, or a : ratio of heterotypic and homotypic multiplets based on cell type annotations. simulations measuring the number of valid read pairs per nucleus did not have restrictions based on cell type and were selected based on read depth when stratifying by number of valid read pairs (i.e., fig. c-d, extended data fig. b) or completely at random (i.e., extended data fig. ).
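The pair-selection step above can be illustrated with a short rejection-sampling sketch. The function names, the `kind` argument, and the fixed seed are assumptions made for the example. The number of pairs is passed in directly because the percentages are not recoverable from the extracted text. Relabelling the second barcode of each pair with the first is shown as one plausible way to merge two nuclei into a single artificial multiplet.

```python
import random
from typing import Dict, List, Sequence, Tuple


def pick_pairs(
    barcodes: Sequence[str],
    cell_type: Dict[str, str],
    n_pairs: int,
    kind: str = "any",      # "heterotypic", "homotypic", or "any" (hypothetical flag)
    seed: int = 0,
    max_tries: int = 10000,
) -> List[Tuple[str, str]]:
    """Randomly draw disjoint cell pairs, redrawing until each pair matches `kind`."""
    rng = random.Random(seed)
    available = list(barcodes)
    pairs: List[Tuple[str, str]] = []
    tries = 0
    while len(pairs) < n_pairs and len(available) >= 2 and tries < max_tries:
        tries += 1
        a, b = rng.sample(available, 2)
        same_type = cell_type[a] == cell_type[b]
        if kind == "homotypic" and not same_type:
            continue  # reject and redraw
        if kind == "heterotypic" and same_type:
            continue  # reject and redraw
        pairs.append((a, b))
        available.remove(a)
        available.remove(b)
    return pairs


def barcode_mapping(pairs: Sequence[Tuple[str, str]]) -> Dict[str, str]:
    """Map the second barcode of each pair onto the first, merging their reads."""
    return {second: first for first, second in pairs}
```

A 1:1 mixture of the two multiplet types could be obtained by calling `pick_pairs` once per type with half the pairs each, using disjoint pools of barcodes.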
once cell pairs were identified, artificial multiplets were introduced by generating modified barcode mappings (for atac-doubletdetector) or barcodes in fragment files (for archr), which assigned artificial multiplet reads to the same cell identifier (i.e., the first nucleus in the pair). artificial multiplets were simulated or runs depending on the analysis.
code availability
atac-doubletdetector is provided as a user-friendly computational framework with documentation and source code freely available at: https://github.com/ucarlab/atac-doubletdetector.
references
buenrostro, j. d. et al. single-cell chromatin accessibility reveals principles of regulatory variation. nature , – ( ).
cusanovich, d. a. et al. multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing. science , – ( ).
satpathy, a. t. et al. massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral t cell exhaustion. nat. biotechnol. , – ( ).
rai, v. et al. single-cell atac-seq in human pancreatic islets and deep learning upscaling of rare cells reveals cell-specific type diabetes regulatory signatures. mol. metab. , – ( ).
lareau, c. a., ma, s., duarte, f. m. & buenrostro, j. d. inference and effects of barcode multiplets in droplet-based single-cell assays. nat. commun. , ( ).
fang, r. et al. snapatac: a comprehensive analysis package for single cell atac-seq. https://www.biorxiv.org/content/ . / v ( ).
granja, j. m. et al. archr: an integrative and scalable software package for single-cell chromatin accessibility analysis. http://biorxiv.org/lookup/doi/ . / . . . ( ) doi: . / . . . .
mcginnis, c. s., murrow, l. m. & gartner, z. j. doubletfinder: doublet detection in single-cell rna sequencing data using artificial nearest neighbors. cell syst. , - .e ( ).
wolock, s. l., lopez, r. & klein, a. m. scrublet: computational identification of cell doublets in single-cell transcriptomic data. cell syst. , - .e ( ).
ucar, d. et al. the chromatin accessibility signature of human immune aging stems from cd + t cells. j. exp. med. , – ( ).
lawlor, n. et al. single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type diabetes. genome res. , – ( ).
ma, s. et al. chromatin potential identified by shared single-cell profiling of rna and chromatin. cell , - .e ( ).
corces, m. r. et al. an improved atac-seq protocol reduces background and enables interrogation of frozen tissues. nat. methods , – ( ).
li, h. et al. the sequence alignment/map format and samtools. bioinforma. oxf. engl. , – ( ).
haeussler, m. et al. the ucsc genome browser database: update. nucleic acids res. , d – d ( ).
encode project consortium. an integrated encyclopedia of dna elements in the human genome. nature , – ( ).
davis, c. a. et al. the encyclopedia of dna elements (encode): data portal update. nucleic acids res. , d –d ( ).
butler, a., hoffman, p., smibert, p., papalexi, e. & satija, r. integrating single-cell transcriptomic data across different conditions, technologies, and species. nat. biotechnol. , – ( ).
scrucca, l., fop, m., murphy, t. b. & raftery, a. e. mclust : clustering, classification and density estimation using gaussian finite mixture models. r j. , – ( ).
stuart, t. et al. comprehensive integration of single-cell data. cell , - .e ( ).
fig. : overview of detecting multiplets in snatac-seq. a, tn transposase cleaves accessible dna at maternal and paternal chromosomes. the number of atac-seq reads per locus per nucleus is expected to be , , or . b, instances where more than (> ) reads are observed for any locus in a cell are identified using an efficient algorithm for counting the number of overlapping reads. c, the poisson cumulative distribution function is used to detect multiplets based on deviations from the expected number of loci with > reads. d, overview of downstream analyses: ) quantification of multiplet detection performances using artificial multiplets, ) comparison of atac-doubletdetector to the alternative method archr, ) annotating cellular origins of multiplets using a clustering-based method.
fig. : atac-doubletdetector identifies heterotypic and homotypic multiplets in human pbmc snatac-seq data. a, summary of snatac-seq samples generated and used in this study from human pbmc and islets. b, valid read pair distributions for pbmc and islet snatac-seq samples. c, pbmc clusters were annotated based on their correlations with sorted bulk atac-seq data (see extended data fig. ). d, all multiplets (heterotypic and homotypic) detected by atac-doubletdetector in pbmc . selected multiplets refer to multiplets for which aggregated profiles are shown in panel f of this figure.
e, the number of cells and percentage of multiplets detected by atac-doubletdetector in pbmc and islet samples. f, chromatin accessibility profiles of cd + t, myeloid, and selected multiplets around the t cell marker gene (cd g) and the myeloid cell marker gene (lyz). cd + t and myeloid cells show strong accessibility signals for their relevant marker genes while selected multiplets have accessible chromatin for both marker genes.
fig. : atac-doubletdetector detects multiplets with high recall when read depth is sufficient. a-b, recall for detecting heterotypic (a) and homotypic (b) artificial multiplets. atac-doubletdetector consistently detected both heterotypic and homotypic multiplets with similar recall, while archr was only effective for predicting heterotypic multiplets for data with high heterogeneity. c-d, performance of detecting artificial multiplets at increasing valid read pair (insertions) distributions for pbmc (c) and islet (d). atac-doubletdetector effectively detects multiplets at > k valid read pairs per nucleus. archr’s performance did not show the same dependence on read depth. e, reference annotations for islet . islet annotations correspond to alpha, beta, delta and ductal cell types. f, representative umap plots for multiplets detected by atac-doubletdetector and archr for islet (other samples shown in extended data fig. ). we identified islet clusters for alpha, beta, delta, and ductal cells. the majority of multiplets detected were not shared between the two methods. heterotypic multiplets were the most common. note: archr detected the majority of delta cells as multiplets.
fig. : multiplet cell-type origins are predicted with high accuracy. a, overview of the cell origin annotation pipeline. first, cells are clustered. second, marker peaks are identified. third, multiplets and their k-nearest neighbor cells are used to generate cluster similarity scores. b, example of aggregate cluster profiles for predicting cell origin annotations. clusters corresponding to cell types observe strong signal for their respective cell types (e.g., cluster ) while clusters corresponding to multiplets show a mixed profile of cell types (e.g., cluster ). c, heatmaps of cell origin annotation accuracies for predicting artificial multiplets derived from cells of the specific cell type pairings. multiplet annotations showed high accuracies for the majority of cell type compositions.
fig. : majority of multiplets are homotypic and correspond to cell type proportions. a, accessibility maps for cell origin annotations for multiplets identified in pbmc . homotypic multiplets observe strong signal for their respective marker genes. heterotypic multiplets observe a combined signal at the marker genes of their annotated cell types. b-c, umap clustering for heterotypic and homotypic multiplet annotations in pbmc (b) and islet (c). heterotypic multiplets are found between major cell type clusters. homotypic multiplets are observed on the periphery of major cell type clusters. d-e, heterotypic and homotypic multiplet cell distributions (left bars).
homotypic cell type annotations (right bars) for pbmc (d) and islet (e) samples. the majority of multiplets are annotated as homotypic. homotypic cell type distributions are similar to the overall proportions of each cell type in their respective samples. f-g, cell and multiplet proportions for pbmc (f) and islet (g). multiplet cell type proportions are highly correlated with overall cell proportions.
extended data fig. : multiplets observe many loci with > reads. the binary matrix of loci with > reads per cell reveals high confidence multiplets (marked by arrows) that harbor many loci with > reads. these multiplets can be clearly seen compared to the other cells in the subset.
extended data fig. : pseudo-bulk snatac-seq profile correlations with sorted bulk atac-seq revealed major cell types. a, b, spearman correlation heatmaps between pseudo-bulk (snatac) and sorted bulk atac-seq accessibility profiles for pbmc (a) and pbmc (b). pseudo-bulk profiles cluster with four major cell types: myeloid, b, cd + t, cd + t and natural killer (nk). c, d, annotated umap clusters for pbmc (c) and pbmc (d). myeloid and b cells form distinct clusters for both samples. cd +t, cd +t and nk cell types share more accessible loci and tend to cluster more closely to one another.
extended data fig. : annotated snatac-seq clusters reflect accessibility at cell specific promoters. a, b, annotated umaps for pbmc (a) and pbmc (b) at the promoters of cd g (t-cell marker), cd (cd + t cell marker), cd a (cd + t cell marker), ms a (b cell marker), nkg (nk cell marker), and trem (myeloid cell marker). accessibility was binarized to or based on the presence or absence of a read within these promoters. using these markers, b and myeloid cell types are clearly annotated with their respective markers. cd + t and cd + t cells can be observed by combining cd g with cd and cd a markers respectively, whereas nk cells can be seen using nkg and excluding nuclei with accessibility at the cd g promoter.
extended data fig. : islet snatac-seq clusters correspond to scrna-seq and cell marker annotations. a, b, umap clusters of snatac-seq data for islet (a) and islet (b) annotated as alpha, beta, delta or ductal cells via integration with annotated scrna-seq data. four distinct clusters are observed with these cell types. c, d, cell specific clusters correspond to their respective marker peaks for both islet (c) and islet (d). accessibility was binarized to or based on the presence or absence of a read within these promoters. alpha, beta, delta and ductal cells are clearly identified with their respective marker genes: gcg (alpha), ins (beta), sst (delta), and krt (ductal).
extended data fig. : multiplets are distributed throughout snatac-seq clusters. multiplet annotated umap clustering of pbmc , pbmc , islet and islet reveals that multiplets are distributed throughout all identified clusters and in some cases form their own multiplet clusters (i.e., center cluster in pbmc ). multiplets between major cell type clusters are likely to be heterotypic whereas multiplets at the periphery of annotated clusters are likely to be homotypic.
extended data fig. : atac-doubletdetector detects both homotypic and heterotypic multiplets at high read depth. a, recall for detecting both homotypic and heterotypic artificial multiplets at a : ratio. atac-doubletdetector did not show noticeable differences in performance, owing to its robustness for detecting both multiplet types. archr showed reduced performance compared to heterotypic multiplet only detection due to the inclusion of homotypic multiplets. b, recall for multiplets stratified by read count distributions (top for each sample) and valid read pair distributions for each multiplet subset (bottom for each sample). atac-doubletdetector performance increased when the number of valid read pairs exceeded ~ k per nucleus, suggesting multiplets can be reliably detected when nuclei have > k valid read pairs each. archr did not show significant differences in performance due to read depth.
extended data fig. : artificial multiplets are detected when combined valid read pairs exceed k. for each sample, multiplets were detected (top left for each sample) or not detected (top right for each sample), depending on whether one or both nuclei exceeded k valid read pairs. histograms of combined profiles revealed that the majority of detected multiplets (bottom left for each sample) had at least k valid read pairs while multiplets not detected were those with less than k valid read pairs (bottom right for each sample). when nuclei are sequenced for k valid reads per nucleus, multiplets will harbor k valid read pairs and can be detected by atac-doubletdetector.
extended data fig. : atac-doubletdetector and archr identify different multiplet subsets. umap clusters annotating atac-doubletdetector multiplets (green), archr multiplets (orange), or their intersection (black). the majority of multiplets detected by both atac-doubletdetector and archr were between major cell type clusters (i.e., heterotypic multiplets).
extended data fig. : atac-doubletdetector and archr multiplet comparisons reveal the nature of their underlying algorithms. a, venn diagrams and total number of multiplets detected by atac-doubletdetector and archr. only a small subset of multiplets is detected by both methods.
b, total number of nuclei and multiplets detected by each method. differences in number of nuclei are due to differences in inputs (i.e., alignment (bam) files for atac-doubletdetector and fragment files (cell ranger output) for archr). overall, archr detects more multiplets using default parameters than atac-doubletdetector. c, valid read pair distributions between multiplets and singlets detected by atac-doubletdetector and archr. differences in number of valid read pairs between multiplets and singlets were more significant for atac-doubletdetector than archr, while the number of valid read pairs for atac-doubletdetector multiplets was significantly greater than for archr multiplets.
extended data fig. : multiplet annotations correspond to cell proportions. a-b, umap clustering for heterotypic and homotypic multiplet annotations in pbmc (a) and islet (b). heterotypic multiplets are found between major cell type clusters. homotypic multiplets are observed on the periphery of major cell type clusters. c-d, heterotypic cell type annotations for pbmc (d) and islet (e) samples. the majority of multiplets are annotated as homotypic. f-g, cell and multiplet proportions for pbmc (f) and islet (g). multiplet cell type proportions are highly correlated with overall cell proportions. islet observed more beta cell multiplets than other cell types/samples, reducing correlation and significance for islet .
structural genetics of circulating variants affecting the sars-cov- spike / human ace complex
francesco ortuso, daniele mercatelli, pietro hiram guzzi, federico manuel giorgi*
department of health sciences, university “magna græcia” of catanzaro, catanzaro, italy
net science srl, c/o university “magna græcia” of catanzaro, catanzaro, italy
department of pharmacy and biotechnology, university of bologna, bologna, italy
department of surgical and medical sciences, university “magna græcia” of catanzaro, catanzaro, italy
* corresponding author e-mail: federico.giorgi@unibo.it (fmg)
orcids
francesco ortuso: - - -
daniele mercatelli: - - -
pietro hiram guzzi: - - -
federico manuel giorgi: - - -
classification
biophysics and computational biology
keywords
sars-cov- , covid- , mutations, spike, ace
author contributions
fmg, phg and fo designed the study. fo designed and performed the structural analysis. fmg designed the genetics analysis. fmg and dm performed the genetics analysis. fmg financially supported the study. phg drafted the manuscript and performed literature search. all authors contributed to the writing of the final version of the manuscript.
.cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . /
abstract
sars-cov- entry in human cells is mediated by the interaction between the viral spike protein and the human ace receptor. this mechanism evolved from the ancestor bat coronavirus and is currently one of the main targets for antiviral strategies.
however, there currently exist several spike protein variants in the sars-cov- population as the result of mutations, and it is unclear if these variants may exert a specific effect on the affinity with ace which, in turn, is also characterized by multiple alleles in the human population. in the current study, the gbpm analysis, originally developed for highlighting host-guest interaction features, has been applied to define the key amino acids responsible for the spike/ace molecular recognition, using four different crystallographic structures. then, we intersected these structural results with the current mutational status, based on more than , sequenced cases, in the sars-cov- population. we identified several spike mutations that affect residues interacting with ace and that are found in at least distinct patients: s n, n k, n y, y f, e k, k n, s i and g s. among these, mutation n y in particular is one of the events characterizing sars-cov- lineage b. . . , which has recently risen in frequency in europe. we also identified five ace rare variants that may affect interaction with spike and susceptibility to infection: s p, e k, m i, e g and g v.
significance statement
we developed a method to identify key amino acids responsible for the initial interaction between sars-cov- (the covid- virus) and human cells, through the analysis of spike/ace complexes. we further identified which of these amino acids show variants in the viral and human populations. our results will help scientists and clinicians alike to identify the possible role of present and future spike and ace sequence variants in cell entry and general susceptibility to infection.
abbreviations
aa: amino acid
ace : angiotensin-converting enzyme
covid- : coronavirus disease
gbpm: grid based pharmacophore model
iep: interaction energy point
mifs: molecular interaction fields
orf: open reading frame
pdb: protein data bank
rbd: spike receptor binding domain with ace
rmsd: root mean square deviation
sars-cov- : severe acute respiratory syndrome coronavirus
main text
introduction
the severe acute respiratory syndrome coronavirus (sars-cov- ) emerged in late ( ) as the etiological cause of a pandemic of severe proportions dubbed coronavirus disease (covid- ). the disease has reached virtually every country in the globe ( ), with more than , , confirmed cases and more than , , deaths (source: world health organization). sars-cov- is characterized by a , -long single stranded rna genome, densely packed in open reading frames (orfs); the orf encodes a polyprotein which is further split into proteins, for a total of proteins ( ). the second orf encodes the spike (s) protein, which is the key protagonist in the viral entry into host cells, through its interaction with human epithelial cell receptors angiotensin converting enzyme (ace ) ( ), transmembrane serine protease (tmprss ) ( ), furin ( ) and cd ( ). investigators have focused their attention on the spike/ace interaction, trying to disrupt it as a potential anti-covid- therapy, using small drugs ( ) or spike fragments ( ). using x-ray crystallography, some models of the spike/ace complex have been generated ( – ), providing a structural instrument for the analysis of this key interaction.
these models determined that the receptor binding domain (rbd) of spike, directly interacting with ace , is a compact structure of ~ amino acids (aas) over a total of aas of the full-length spike. the sars-cov- spike protein adapted through successive mutations from a wild bat beta-coronavirus ( ), in order to exploit the n-terminal ace peptidase domain conformation. as a result, sars-cov- spike can establish a strong interaction with the human cell surface, allowing the virus to fuse its membrane with that of the host cell, releasing its proteins and genetic material and starting its replication cycle ( ). while sars-cov- shows low mutability ( ), with fewer than predicted events/year ( ), the virus is in continuous evolution from the original wuhan reference sequence (nc_ . ) ( ), and there are currently at least major variants circulating in the population ( , ). some of these strains are characterized by a mutation in spike, at aa , where an aspartic acid (d) is substituted by a glycine (g) ( ). in fact, the spike d g mutation gives the name to the most frequent viral clade (g), which was first detected in europe at the end of january , and is currently present in all continents, with increasing frequency over time ( ). d g does not fall within the putative rbd (aa ~ - ), but some studies suggest it may have a clinically relevant role: d g is positively correlated with increased case fatality rate ( ), and it shows increased transmissibility and infectivity compared to the reference genome ( ). in vitro studies show that viruses carrying the d g spike mutation have an increased viral load and cytopathic effect in cultured vero cells ( ). despite these preliminary observations, there are still several doubts about the molecular effects of the d g variant ( ).
other recurring spike mutations have been observed in the population worldwide, although at frequencies of % or below ( ); some of these mutations fall within the rbd and therefore may have a direct role in ace interaction. on the other hand, genetic variants of ace in the human population may influence susceptibility or resistance to sars-cov- infection, possibly contributing to the difference in clinical features observed in covid- patients ( ). the ace gene is located on chromosome xp . and consists of exons, coding for an aas long protein exposed on the cell surface of a variety of human organs, including kidneys, heart, brain, gastrointestinal tract, and lungs ( ). it is unclear if tissue-expression patterns of ace may be linked to the severity of symptoms or outcomes of sars-cov- infections; however, ace levels in lungs were found to be increased in patients with comorbidities associated with severe covid- clinical manifestations ( ), whereas polymorphisms of ace have already been described to play a role in hypertension and cardiovascular diseases ( ), particularly in association with type diabetes ( ), all conditions predisposing to an increased risk of dying from covid- ( ). despite early studies, the presence of spike mutations potentially altering the binding with ace is still largely under-investigated, as is the role of ace variants in the human population in determining patient-specific molecular interactions between these two proteins.
in the present study, we aim to detect which spike and ace aas are the most important in determining the sars-cov- entry interaction and to analyze which ones have already mutated in the population. the task is clinically relevant, providing a functional characterization of present and future mutations targeting the ace /spike binding and detected by sequencing sars-cov- on a patient-specific basis. the variability of both proteins must be taken into consideration in the process of developing anti-covid- strategies, such as the spike-based vaccine currently deployed by the national institute of allergy and infectious diseases and moderna ( ).
results
we set out to analyze the key aas involved in the spike/ace interaction, in order to highlight which ones may alter the binding affinity and therefore the etiological and clinical properties of different sars-cov- variants on different patients. following that, we determined which spike and ace aa variations relevant for this interaction have been observed in the sars-cov- and human population, respectively.
structural analysis of spike/ace interaction
we obtained structural models of the sars-cov- spike interacting with the human ace from three recent x-ray structures, deposited on the protein data bank: lzg ( ), m j ( ) and vw ( ). for vw , two spike/ace complexes were available, so we report results for both as vw -a and vw -b, separately. all models show the core domains of interaction, located in the region of aa - for spike and in the region aa - of ace . full length proteins would be aas (the only known spike isoform, from reference sars-cov- genome nc_ . ) and aas (ace isoform , uniprot id q byf - ). selected pdb entries are wild type and their primary sequence and the higher order structures were identical. residues - were missing in vw -b.
to investigate conformational variability, pdb complexes were aligned by backbone and the root mean square deviation (rmsd) was computed on all equivalent non-hydrogen atoms. rmsd data showed some conformational flexibility, which confirmed our idea to take into account all pdb structures in the next investigation (fig ). the gbpm method was originally developed for identifying and scoring pharmacophore and protein-protein interaction key features by combining grid molecular interaction fields (mifs) according to the grab tool algorithm ( ). in the present study, gbpm has been applied to all selected complex models considering spike and ace either as host or guest. dry, n and o grid probes were considered for describing hydrophobic, hydrogen bond donor and hydrogen bond acceptor interaction. for each probe, a cut-off for highlighting the most relevant mif points was fixed at % above the corresponding global minimum interaction energy value. with respect to the known gbpm application, where pharmacophore features are used for virtual screening purposes, here these data guided the identification of complex-stabilizing aas. in fact, spike or ace- residues, within Å from gbpm points, were marked as relevant in the host-guest recognition and were qualitatively scored by assigning them the corresponding gbpm energy. if a certain residue was suggested by more than one gbpm point, its score was computed as the sum of the related gbpm point energies (fig. ).
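the residue scoring rule described above can be sketched in a few lines of python. this is a minimal illustration, not the authors' implementation: the function name `score_residues`, the input layout (lists of labelled atom coordinates and of gbpm points with energies) and the way a residue-to-point distance is measured (any atom of the residue within the cutoff) are all assumptions, and the distance cutoff is left as a parameter because its value is not legible in the text.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Coord = Tuple[float, float, float]


def score_residues(
    atoms: Iterable[Tuple[str, Coord]],
    gbpm_points: Iterable[Tuple[Coord, float]],
    cutoff: float,
) -> Dict[str, float]:
    """Sum GBPM point energies over residues lying within `cutoff` of each point.

    atoms: (residue_label, xyz) pairs, one per atom (hypothetical layout).
    gbpm_points: (xyz, energy) pairs, one per selected GBPM feature.
    cutoff: distance threshold in the same unit as the coordinates (assumed value).
    """
    atoms = list(atoms)
    scores: Dict[str, float] = defaultdict(float)
    for point_xyz, energy in gbpm_points:
        # A residue is counted once per point, even if several of its atoms are close.
        nearby = {res for res, xyz in atoms if math.dist(xyz, point_xyz) <= cutoff}
        for res in nearby:
            scores[res] += energy
    return dict(scores)
```

residues that are not close to any gbpm point simply do not appear in the result, which matches the idea that only residues suggested by at least one point receive a score.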
finally, for each selected residue, the score averaged over the four models was considered for estimating its role in complex stabilization. taking into account their average scores, spike and ace aas were divided by quartiles to facilitate the interpretation of the results: quartile (q ) includes the strongest complex stabilization contributors; quartile (q ) contains residues less important than those reported in q but more relevant than those included in quartile (q ); quartile (q ) indicates the weakest predicted interacting aas. such an extension of the original approach allowed us to highlight known relevant interaction residues of both spike (table ) and ace- (table ). essentially the same number of aas was highlighted for spike ( aas) and ace ( aas). the average score was also in the same range. spike had a larger q population than ace : and aas, respectively. the opposite scenario was observed in q , which accounted for residues for spike and for ace . no remarkable difference was observed between spike and ace for q and q . we reasoned that mutations and variants in q residues could have a more relevant impact on complex stability. the analysis of all designed gbpm suggested that the spike - ace molecular recognition is largely sustained by polar interactions, such as hydrogen bonds, and by very few putative hydrophobic contributions (table ).
mutational analysis of sars-cov- spike
we analyzed , publicly available sars-cov- full-length genome sequences collected worldwide and deposited on the gisaid database on december , ( ). from these, we obtained , samples containing at least one aa-changing mutation in the spike protein. a total of , different aa-changing mutations were detected in the , aa-long spike sequence. however, many of these are unique events (or possibly even sequencing errors), as only , mutations were found in more than one sample, were found in more than ten samples, and in more than one hundred samples (supplementary file ).
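the averaging and quartile split of residue scores described earlier in this section can be illustrated with a short python sketch. the text does not state how the quartile boundaries were derived, so the rank-based split used here (equal-sized bins after sorting from most to least negative average score) is an assumption, as are the function names and the dictionary layout.

```python
from statistics import mean
from typing import Dict, List


def average_scores(per_model: Dict[str, List[float]]) -> Dict[str, float]:
    """Average each residue's GBPM score over the available models."""
    return {res: mean(values) for res, values in per_model.items()}


def assign_quartiles(avg: Dict[str, float]) -> Dict[str, str]:
    """Label residues q1..q4 by rank; q1 holds the most negative (strongest) scores.

    Assumption: equal-sized rank bins, not necessarily the binning used in the paper.
    """
    ranked = sorted(avg, key=avg.get)  # most negative average first
    n = len(ranked)
    return {res: f"q{(i * 4) // n + 1}" for i, res in enumerate(ranked)}
```

with this convention a residue labelled q1 is among the strongest predicted contributors to complex stabilization, and q4 among the weakest.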
we then focused on mutations located in the spike rbd (aa - ) with predicted interaction contribution, as assessed by our gbpm method. the majority of mutations here are found in only a handful of samples (table and fig a), with a few notable exceptions. the mutations s n and n k are the most frequent in the current population and were identified in , patients ( . %) and , patients ( . %) respectively. these two variants (n k and s n) are also amongst the top most frequent in the population and involve two positions productively contributing to the interaction between spike and ace , according to gbpm (see table and fig for locations and ). the graphical inspection of the pdb structures revealed that spike asparagine (n) , ranked at gbpm q , is mainly involved in intra-protein interaction. in fact, by means of its backbone sp oxygen atom, n accepts one hydrogen bond from the spike serine sidechain and, by its sidechain amide group, donates one hydrogen bond to the spike proline backbone: all these aas are located in a random coil loop of spike, so n k may only minimally modify the spike-ace recognition. on the other hand, after the theoretical mutation of the asparagine with a lysine, it is possible to predict a productive electrostatic interaction between the new net positively charged residue and the ace glutamate . such a long-distance interaction could improve the stabilization of the complex with respect to the spike wild type (figure s ). a similar effect could be attributed to the mutation at position . serine (s) is a weak contributor to the complex interaction.
in all pdb entries we selected, serine is located in a solvent-exposed random coil loop. no interaction with ace or spike residues can be observed. accordingly, the gbpm analysis included this residue in q . conversely, its mutation to asparagine (s n), in our in silico model, revealed the possibility of establishing a hydrogen bond with the ace serine , which could result in a stabilization of the complex (figure s ). moreover, position is also affected by three other events with lower occurrence: s i, s r and s g, with , and observations (table ). among these, s r could be the most interesting one: a net positively charged residue, such as arginine (r), can establish a weak electrostatic interaction with ace glutamate , as suggested by a theoretical model we built. s i and s g could modify the conformation of a random coil segment, so they do not appear very relevant. conversely, s n and s r could productively contribute to the spike ace complex stabilization. of course, deeper theoretical and experimental investigations should be carried out to confirm this hypothesis. unfortunately, full-scale simulations cannot be rigorously performed today because the available d structural models report only fragments of the complex between spike and ace . the third most common mutation, n y (fig ), targets an aa predicted to have a strong role in the interaction in all four models, sitting in the gbpm q . n y was detected in , patients ( . % of the dataset), the majority of which were located in the united kingdom ( ). from a structural point of view, we predict that a substitution, at position , of an asparagine (n) with a tyrosine (y) may have an effect: their topological polar surface area (tpsa), equal to . and to . Å respectively, is different; however, both their sidechains can donate/accept a hydrogen bond. therefore, their contribution to complex stabilization may be slightly different, also taking into account the chemical environment.
in fact, the wild type asparagine donates one hydrogen bond to ace tyrosine : such an interaction could also be possible for the n y mutant or, as we observed in our theoretical model, it could be replaced by pi-pi stacking (figure s ). the rapid increase in frequency of mutation n y has been recently observed in the united kingdom and other countries, as it is one of the variants characterizing lineage b . . ( ). the asparagine/tyrosine substitution in spike position could contribute to an evolutionary advantage for this lineage, based on differential affinity for the human receptor ace ( , ). a less frequent mutation amongst those predicted to contribute to the ace /spike interaction is g s, detected in samples ( . %), and supported by three out of four structural models (table , fig b). the glycine (g) was included by gbpm analysis in q : its contribution to the complex stabilization is weak. unlike the other mutations described here, the replacement of glycine with a serine (s) could have more evident effects on spike ace molecular recognition. in fact, in all pdb entries, the alpha carbon of this glycine is very close, about Å, to the sidechain amide group of the ace glutamine . no productive interaction can be established between these two aas, but the substitution of the spike glycine with a serine could allow one inter-protein hydrogen bond to ace glutamine .
moreover, g s could establish the same interaction with spike glutamine , which could stabilize the conformation of a random coil segment of the viral protein, resulting in a better pre-organization for ace recognition (figure s ). another spike residue predicted by our analysis to play a relevant role in ace recognition is the glutamine (table ). the gisaid data revealed that this amino acid is rarely replaced by a leucine (q l) or by an arginine (q r). these mutations could affect the recognition of ace in opposite ways. spike glutamine is involved in a hydrogen bond with ace glutamate . the mutation q l cannot establish such a productive contribution and could only interact hydrophobically with spike leucine . conversely, q r could locate its net positively charged sidechain into an ace pocket delimited by aspartate , histidine and glutamate . such a positioning could produce a remarkable electrostatic stabilization of the complex (figure s ). in general, we could observe that aas with the strongest evidence for interaction contribution in the spike/ace interface tend not to diverge from the reference (fig b), which may indicate a solid evolutionary constraint to maintain the interface residues unchanged. for example, one of the most relevant st quartile aa in the ace /spike interaction, glutamine (q) , is rarely mutated, with cases of q l, of q * (the substitution of q with a stop codon), of q k, and of q r and q h. one possible exception is the aforementioned spike mutation n y, located in the strongest st quartile gbpm-predicted aa for ace binding, which was found in a considerable number of different patients.
mutational analysis of human ace
we also investigated the variants of human ace , since these could constitute the basis for patient-specific covid- susceptibility and severity. the ace protein sequence is highly conserved across vertebrates ( ) and also within the human species ( ), with the most frequent missense mutation (rs , n d) present in .
% of the world population (supplementary file ). our analysis shows that only variants of ace detected in the human population are also located in the ace /spike direct binding interface (table and fig ). of these, rs (causing an s p aa variant) is both the most frequent in the population ( . %) and the most relevant in the interaction with the viral protein, with a gbpm score of - . (q ) and support from all models (table ). the rs snp frequency is higher in the population of african descent ( . %). the second snp, rs (e g, table ), is a very rare allele ( . %) in the european (non-finnish) asian population. the rs (m i) snp is also a very rare allele ( . %) found in the african population. e k (rs ) is more frequent in the finnish ( . %) and g v (rs ) in the european non-finnish ( . %) population. none of these five snps has a reported clinical significance, according to dbsnp and literature search ( ). it should be noted that m i, together with s p, has been predicted to adversely affect ace stability ( ). m i, together with e g, has been simulated to increase binding affinity with spike when compared to wild type ace , hypothesizing greater susceptibility to sars-cov- for patients carrying these variants ( ). in contrast, e k ( ) and g v ( ) were predicted to possess a lower affinity with spike, suggesting lower susceptibility to the infection. however, while describing potential explanations for the existence of a possible predisposing genetic background to infection, all these studies remain inconclusive in linking allele variants to covid- susceptibility.
structurally, the s p variant may greatly differ from the reference sequence in the interaction with spike: serine (s) is a polar residue, able to accept and donate a hydrogen bond by means of its side chain alcoholic group. proline (p), on the other hand, cannot be involved in hydrogen bonding, and therefore should establish a weaker interaction with spike. in fact, the ace serine sidechain donates a hydrogen bond to the spike alanine backbone (figure s ) and could potentially establish the same interaction with spike glycine (g) , which could also be mutated (table ). both methionine (m) and glutamate (e) are in q , minimally contributing to spike ace recognition (figures s and s ). they are located within two alpha helices, so their mutation could modify the secondary structure of ace , leading to a different affinity for spike. such a possibility should be more evident in the case of e g because the glutamate sidechain is involved in a hydrogen bond with ace- glutamine .
discussion
sars-cov- spike evolved through a series of adaptive mutations that increased its affinity for the human ace receptor ( ). there is no reason to believe that the evolution and adaptation of the virus will stop, making continuous sequencing and mutational tracking studies of paramount importance to strategically contain covid- ( ). in our study, we highlighted which specific locations of spike can influence the ace molecular recognition, required for the viral entry into the host cell ( ). we further showed that some mutations are already present in the sars-cov- population that may weakly affect the interaction with the human receptor, specifically spike n k, s n and n y. these mutations are rising in the viral population (> %) and in particular n y is one of the key mutations characterizing lineage b. . . ( ), which has seen a recent dramatic increase in frequency in the united kingdom ( ).
having identified this mutation indicates that our combination of targeted mutation frequency and gbpm is a useful pipeline to monitor events in the key region used by sars-cov- to recognize and enter human bronchial cells. the same approach can be used to monitor, in the future, whether any of these events will increase in frequency, suggesting an adaptation to the human host leveraging a higher affinity with ace . on the other hand, we studied the ace variants in the human population, identifying loci that can affect the binding with sars-cov- spike. they are all rare variants, with the most frequent, s p, present in . % of the population, and with no known clinical significance. however, other in silico studies have predicted their role in decreasing ace stability (s p and m i) ( ), and in altering the affinity with spike (increasing it: m i and e g ( ); decreasing it: e k ( ) and g v ( )). the most common ace variant, rs (n d), is not located in the binding region, and so far its predicted effects on the etiopathology of covid- are still largely conjectural and associated with neurological complications via mechanisms probably independent of direct interaction with spike ( ). it remains to be seen whether, in the future, the combination of spike and ace sequences will produce novel and unexpected covid- specificities that will require granular efforts in developing wider-spectrum anti-sars-cov- strategies, such as vaccines or antiviral drugs.
so far, our analysis has shown a location on the spike/ace complex where both proteins vary in the viral/human population, specifically on ace s and spike a /g . while, as described in our results, these mutations on spike are not likely to strongly affect the interaction surface, future combinations of ace /spike variants may have peculiar effects that will require constant mutation monitoring. identifying single or multiple aas involved in this viral entry interaction will allow for personalized diagnosis and clinical prediction based on the specific combination of sars-cov- strain and ace variant. personalized covid- treatment will require targeted sequencing of the patient ace and spike, to identify the combination causing the specific case. this technical obstacle can be further complicated by the intra-host genetic variability of sars-cov- , which has recently been reported from rna-sequencing studies ( ). structural investigation will benefit, in the near future, from the availability of experimental structural models reporting the complete sequence of both spike and ace , or at least spike. this will allow more rigorous computational analyses (e.g. molecular dynamics simulation, free energy perturbation) on the effect of mutations on the spike/ace recognition. beyond the complex investigated in this manuscript, our approach can be fully extended to any other partners in the sars-cov- /human interactome, for example the recently discovered interaction between viral protease nsp ( ) and human histone deacetylase hdac ( ), which is indirectly responsible for the transcriptional activation of pro-inflammatory genes. our approach can also be extended to other viruses exploiting human receptors as an entry mechanism, such as cd for the human immunodeficiency virus (hiv) or tim- for the ebola virus ( ).
materials and methods
structural analysis
the pdb ( ) was searched for high resolution spike/ace complexes.
pdb entries lzg ( ), m j ( ) and vw ( ), reporting the spike rbd interacting with ace , were retrieved and taken into account for our gbpm analysis ( ). this computational approach compares grid ( ) molecular interaction fields (mifs) computed on a generic complex (a) and on its host (b) and guest (c) components, separately. mifs describe the interaction between a certain probe and a certain target. if the target is a complex, then, depending on the selected area, the mif energies can refer to the interaction between the probe and one of the complex subunits or, at the host/guest interface, with both of them. the gbpm analysis objectively highlights the latter. five steps are required:
( ) the complex a is disassembled into its subunits b and c;
( ) mifs are computed on a, b and c by using the most appropriate grid probes. a hydrogen bond acceptor/donor and a generic hydrophobic probe can describe the basic interaction. because grid mifs are stored as a d matrix of interaction energy points (iep), the same box dimensions are adopted in all calculations;
( ) each iep of b is compared with the equivalent point of a, generating a new mif named d. the following algorithm, available in the grab tool, is applied: if iep(a) > and iep(b) > then iep(d) = ; if iep(a) > and iep(b) < then iep(d) = iep(b); if iep(a) < and iep(b) > then iep(d) = -iep(a); if iep(a) < and iep(b) < then iep(d) = iep(a)-iep(b). the resulting mif d reports as negative energy values the productive interaction between the grid probe and b and the interface a and b;
( ) in order to obscure the interaction between the probe and b, mifs d and c are compared, by using the grab approach, producing a new mif e;
( ) the most relevant interaction points (gbpm features) of the mif e are, finally, selected taking into account an energy cutoff % above the global minimum.
supplementary figures focusing on the most relevant mutations are available in supplementary file . before starting the gbpm analysis, co-crystallized water molecules were removed from the pdb structures. in vw , showing two spike-ace complexes, namely chains a-e and b-f, both structures have been investigated and further reported as model a and b, respectively. all selected complexes have been conformationally compared with each other by alignment and by computing the rmsd on the cartesian coordinates of equivalent non-hydrogen atoms. dry, n and o original grid probes have been used to highlight hydrophobic, hydrogen bond donor and hydrogen bond acceptor areas. in order to identify the most relevant residues of both spike and ace , we conceptually and technically extended the gbpm algorithm, originally designed for drug/target interactions ( ). in the gbpm analysis presented here, the two interacting proteins have been considered either as host or guest units, and relevant aas were selected if their distance from gbpm features was lower than or equal to Å. for each pdb model, the selected residues were scored as the sum of the corresponding gbpm feature interaction energies. in order to prevent unrealistic distortion of the spike-ace complex, due to the use of structures not covering the full length of the interacting proteins, the effect of mutations has been qualitatively estimated by means of the mutagenesis tool implemented in pymol software ( ).
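the grab comparison rule and the final energy-cutoff selection described in the five gbpm steps above can be sketched as follows. this is an illustrative python sketch, not the grab tool itself. several points are assumptions: the numeric thresholds in the rule are not legible in the text and are taken to be zero, a value of exactly zero is treated as non-negative, mifs are represented as flat lists of equal length, and "x % above the global minimum" is interpreted as `minimum + x/100 * |minimum|`. the cutoff percentage is a parameter because its value is not given here.

```python
from typing import List, Sequence


def grab_point(iep_a: float, iep_b: float) -> float:
    """Combine one pair of interaction energy points (assumed zero thresholds)."""
    if iep_a >= 0 and iep_b >= 0:
        return 0.0                # assumption: the elided result is zero
    if iep_a >= 0 and iep_b < 0:
        return iep_b
    if iep_a < 0 and iep_b >= 0:
        return -iep_a
    return iep_a - iep_b          # both negative


def grab_compare(mif_a: Sequence[float], mif_b: Sequence[float]) -> List[float]:
    """Apply the point-wise rule to two MIFs computed on the same grid box."""
    if len(mif_a) != len(mif_b):
        raise ValueError("MIFs must share the same box dimensions")
    return [grab_point(a, b) for a, b in zip(mif_a, mif_b)]


def select_features(mif_e: Sequence[float], cutoff_percent: float) -> List[int]:
    """Return indices of points within cutoff_percent above the global minimum."""
    if not mif_e:
        return []
    e_min = min(mif_e)
    if e_min >= 0:
        return []                 # no favourable (negative) interaction present
    threshold = e_min + abs(e_min) * cutoff_percent / 100.0
    return [i for i, e in enumerate(mif_e) if e <= threshold]
```

under these assumptions the workflow would read `mif_d = grab_compare(mif_a, mif_b)`, then `mif_e = grab_compare(mif_d, mif_c)`, then `select_features(mif_e, cutoff_percent)`; the argument order in the second comparison is also an assumption, since the text only says that d and c are compared.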
wild type residues have been replaced by the mutation and the new sidechain conformations have been optimized taking into account the neighboring aas. the graphical analysis was carried out on the most populated predicted rotamers. on the basis of its better x-ray resolution, the m j pdb structure has been selected for the above reported investigation.
genetic analysis
sars-cov- genome sequences from human hosts and accounting for a total of , submissions were obtained from the gisaid database on october ( ). low quality (with more than % uncharacterized nucleotides) and incomplete (< , nucleotides, based on a total reference length of , ) sequences were removed. the resulting , genome sequences were aligned to the reference sars-cov- wuhan genome (ncbi entry nc_ . ) using the nucmer algorithm ( ). position-specific nucleotide differences were merged for neighboring events and converted into protein mutations using the coronapp annotator ( ). the results were further filtered for aa-changing mutations targeting the spike protein. ace variants in the human population were extracted from the gnomad database, v , july ( ). we considered only missense variants affecting specific aas in the protein sequence, for a total of entries (supplementary file ). graph generation was performed with the r statistical software and the corto package v . . ( ).
acknowledgments
we thank the italian ministry of education and research for their financial support under the montalcini initiative. we thank prof. giovanni perini for his continued support and scientific enthusiasm, prof.
massimo battistini for his lessons on logic and writing, prof. elena bacchelli for her suggestions on the use of gnomad, and prof. stefano alcaro who provided the computational resources required by the gbpm analysis. finally, we thank mr. george wolf for the final proofreading of the manuscript.
references
figures and tables
figure . conformational comparison of spike-ace pdb complexes: (a) alignment of pdb entries, spike and ace are respectively surrounded by cyan and orange fog, and (b) bar graph showing rmsd (in Å) computed on structures aligned without hydrogen atoms (x-axis: pdb entries lzg, m j, vw -a and vw -b; y-axis: rmsd in Å).
figure .
summary of the pipeline adopted by gbpm to identify key residues contributing to the sars-cov- spike / human ace interface. spike is depicted in cyan, and ace in orange, based on the lzg pdb model ( ). residues highlighted by gbpm are then tested for mutation frequency in the worldwide sars-cov- population.
figure . d ribbon representation of the interaction domains of sars-cov- spike (left, orange) and human ace (right, green), based on the crystal structure lzg deposited on protein data bank and produced by ( ). the positions of the three most frequent spike mutations in the interacting region (aa - ) with a non-zero gbpm score are indicated: n k, n y and s n.
figure . (a) occurrence of aa-changing variants on sars-cov- spike protein. x-axis indicates the position of the affected aa. y-axis indicates the log of the number of occurrences of the variant in the sars-cov- dataset. labels indicate variants affecting ace /spike binding and detected in at least sars-cov- sequences. vertical dashed lines indicate crystalized region analyzed (aa – ). the d g variant, located outside the rbd, is also indicated.
(b) scatter plot indicating the occurrence of the variant in the population (x-axis) and the gbpm score of the reference aa in the model (y-axis). mutations with non-zero gbpm score are indicated. cc indicates the pearson correlation coefficient and p indicates the p-value of the cc.

figure . frequency of mutations on ace . x-axis indicates the aa position in isoform (uniprot q byf - ). y-axis indicates the allele frequency in the global population according to the gnomad v database. labels indicate aa changes observed in the human population with non-zero gbpm average score in the ace /spike interaction models. vertical dashed lines indicate the crystallized region analyzed in this study (aa – ).

table . gbpm scores, average values, and quartile distribution of spike relevant aas in three pdb models. gbpm scores and average values are reported in kcal/mol.
columns: residue #; gbpm score for pdb entries lzg, m j, vw -a, vw -b; gbpm average score; quartile
lys - . - . . . - . q
asn . . - . - . - . q
gly - . - . . - . - . q
gly - . . . . - . q
tyr - . - . - . - . - . q
tyr . . - . - . - . q
leu - . - . - . - . - . q
phe - . - . - . - . - . q
ala - . - . - . - . - . q
gly - . . - . - . - . q
ser - . . - . - . - . q
glu - . - . . . - . q
phe - . - . - . - . - . q
asn - . - . - . - . - . q
tyr - . - . - . - . - . q
phe - . - . - . - . - . q
gln - . - . - . - . - . q
gly - . - . - . - . - . q
phe - . . - . - . - . q
gln - . - . - . . - . q
pro . . . - . - . q
thr . - . - . - . - . q
asn - . - . - . - . - . q
gly - . - . - . - . - . q
val . - . - . - . - . q
tyr - . - . - . - . - . q

table . gbpm scores, average values, and quartile distribution of ace relevant aas in three pdb models. gbpm scores and average values are reported in kcal/mol.
columns: residue #; gbpm score for pdb entries lzg, m j, vw -a, vw -b; gbpm average score; quartile
ser - . - . - . - . - . q
gln - . - . - . - . - . q
thr - . - . - . - . - . q
phe - . - . - . - . - . q
asp . - . . . - . q
lys - . - . - . - . - . q
his . - . - . - . - . q
glu - . . . - . - . q
glu - . - . - . - . - . q
asp - . - . - . - . - . q
tyr - . - . - . - . - . q
gln - . - . - . - . - . q
leu - . - . . - . - . q
leu . . . - . - . q
met . . - . - . - . q
tyr - . - . - . - . - . q
glu . . . - . - . q
asn - . - . - . - . - . q
gly - . - . - . - . - . q
lys - . - . - . - . - . q
gly - . - . - . - . - . q
asp - . - . - . - . - . q
arg . - . . . - . q
ala . . - . . - . q
arg . . - . . - . q

table . composition of the gbpm models designed. hbd = hydrogen bond donor; hba = hydrogen bond acceptor; # = number of features; aie = average interaction energy (in kcal/mol).
columns: host/guest; gbpm feature; # and aie for pdb entries lzg, m j, vw -a, vw -b
spike/ace hydrophobic - . - . - . - .
spike/ace hbd - . - . - . - .
spike/ace hba - . - . - . - .
ace /spike hydrophobic - . - . - . - .
ace /spike hbd - . - . - . - .
ace /spike hba - . - . - . - .

table . spike mutations located within the rbd (aa - ) with at least two cases in the population and non-zero gbpm average score in the ace /spike interaction models. the asterisk (*) indicates a stop codon. a lower gbpm score indicates a stronger effect in the ace /spike interaction.
columns: mutation; position; abundance; frequency; gbpm average score; quartile
s n . - . q
n k . - . q
n y . - . q
y f . - . q
e k . - . q
k n . - . q
s i . - . q
g v . - . q
f s . - . q
s r . - . q
n t . - . q
l f . - . q
g s . - . q
e q . - . q
a v . - . q
f l . - . q
f l . e- - . q
yq wk . e- - . q
q l . e- - . q
v f . e- - . q
e a . e- - . q
g s . e- - . q
e d . e- - . q
q * . e- - . q
y w . e- - . q
g a . e- - . q
s g . e- - . q
f l . e- - . q
v i . e- - . q
y f . e- - . q

table . ace variants with non-zero gbpm score in the spike interaction model.
columns: variant; rsid; allele frequency; gbpm average score; quartile
s p rs . - . q
e g rs . e- - . q
m i rs . e- - . q
e k rs . e- - . q
g v rs . e- - . q

supplementary files description
supplementary file : table of sars-cov- spike mutations (source: gisaid database, december ), indicating position, frequency in the sequenced sars-cov- genome and gbpm score (lower: predicted stronger effect in the spike/ace interaction).
supplementary file : table of human ace variants (source: gnomad database, v , july ), indicating position, frequency in the sequenced sars-cov- genome and gbpm score (lower: predicted stronger effect in the spike/ace interaction).
supplementary file : supplementary figures focusing on the most relevant mutations described in this study, with structural, chemical and positional considerations.

title
taxonomy-aware, sequence similarity ranking reliably predicts phage-host relationships

authors
andrzej zielezinski ,*, jakub barylski , wojciech m.
karlowski

author affiliations:
department of computational biology, faculty of biology, adam mickiewicz university poznan, uniwersytetu poznanskiego , - , poznan, poland
molecular virology research unit, faculty of biology, adam mickiewicz university poznan, uniwersytetu poznanskiego , - , poznan, poland

* address correspondence to: andrzej zielezinski: andrzejz@amu.edu.pl

abstract

motivation: similar regions in virus and host genomes provide strong evidence for phage-host interaction, and blast is one of the leading tools to predict hosts from phage sequences. however, blast-based host prediction has three limitations: (i) top-scoring prokaryotic sequences do not always point to the actual host, (ii) mosaic phage genomes may produce matches to many, typically related, bacteria, and (iii) phage and host sequences may diverge beyond the point where their relationship can be detected by a blast alignment.

results: we created an extension to blast, named phirbo, that improves host prediction quality beyond what is obtainable from standard blast searches. the tool harnesses information concerning sequence similarity and bacteria relatedness to predict phage-host interactions. phirbo was evaluated on two benchmark sets of known phage-host pairs, and it improved precision and recall by percentage points, as well as the discriminatory power for the recognition of phage-host relationships by percentage points (area under the curve = . ). phirbo also yielded a mean host prediction accuracy of % and % at the genus and family levels, respectively, representing a % improvement over blast. when using only a fraction of phage genome sequences ( kb), the prediction accuracy of phirbo was - % higher than blast at all taxonomic levels.

conclusion: our results suggest that phirbo is an effective, unsupervised tool for predicting phage-host relationships.

availability: phirbo is available at https://github.com/aziele/phirbo.
keywords
phage-host prediction, phage, prokaryote, bacteria, virus, genome sequence

.cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . /

introduction

prokaryotic viruses (phages) are the most abundant entities across all habitats and represent a vast reservoir of genetic diversity [ ]. phages mediate horizontal gene transfer and constitute a major selection pressure that shapes the evolution of bacteria [ ]. prokaryotic viruses also affect biogeochemical cycles and ecosystem dynamics by controlling microbial growth rates and releasing the contents of microbial cells into the environment [ , ]. moreover, phages play a key role in shaping the composition and function of the human microbiome in health and disease [ – ]. recently, there has been renewed interest in phage therapy and phage-based biocontrol of harmful bacteria [ , ] in medical treatment [ , ] and the food industry [ , ]. hence, characterizing phage–host interactions is critical to understanding the factors that govern phage infection dynamics and their subsequent ecological consequences [ ].

the scope of phage-host interactions is poorly understood, although it has been hypothesized that all prokaryotic organisms fall prey to viral attacks [ ]. methods for studying phage-host interactions primarily rely on cultured virus-host systems; however, recent in silico approaches suggest a much broader range of hosts may be susceptible to viral infections [ ].
these methods predict prokaryotic hosts based on sequence composition [ , ], direct sequence similarity between phages and hosts [ ], analysis of crispr spacers or trnas [ , ], or supervised approaches that integrate several sequence-based methods [ , ].

despite significant progress in phage-host predictions, the classic blast [ ] algorithm is currently the most effective, unsupervised method for identifying phage-host interactions [ , ]. depending on the dataset, the tool finds the correct genus level host for - % of phages [ , ]. the task of finding a host for a given phage using blast is conceptualized as obtaining the host sequence with the highest similarity to the query phage sequence. however, restricting host predictions to the first top-scored prokaryotic sequence has three limitations. first, the true host may not be the top-scoring match in the blast results. second, selecting a prokaryotic host based on the first sequence assumes that a phage infects a single host. although phages are generally host-specific, some may infect multiple host species [ , ]. finally, many distantly-related prokaryotic species may obtain a comparable blast score for a query phage due to spurious alignments. these ambiguous host predictions require further manual curation of the taxonomic or phylogenetic relationship between the top-scored prokaryotic species to select the true host(s).

we have addressed these issues by developing a simple extension to blast, named phirbo, that exploits the information contained in the full blast results, rather than its top-ranking matches. phirbo improved the accuracy of finding hosts, beyond what is found from the best blast match, by relating phage and host sequences through intermediate, common reference sequences that are potentially homologous to both phage and host queries. subsequent quantification of the overlapping signals allows for the reliable prediction of phage-host interactions without the need for direct comparisons between the phage and host sequences and without any prior knowledge of their phylogenetic or taxonomic context.

results

phirbo algorithm overview

this algorithm is based on the assumption that the degree of similarity between phage and host sequences is proportional to the overlap between ranked similarity matches of each sequence to the same reference data set of prokaryotic sequences. specifically, to compare a pair of phage (p) and host (h) sequences, we first perform two independent blast searches against the reference database of prokaryotic genomes (d): one blast search for the phage and the other for the host query (fig. a). the two lists of blast results (fig. b), p → d and h → d, contain prokaryotic genomes ordered by decreasing sequence similarity (i.e., bit-score). to avoid a taxonomic bias due to multiple genomes of the same prokaryote species, we rank prokaryotic species according to their first appearance in the blast list (fig. c). in this way, both lists represent phage and host profiles consisting of the ranks of top-scoring prokaryotic species. the properties of these lists (fig.
c) closely resemble the outcome of an internet search and can be characterized by four features: (i) species listed at the top of each ranking are more important (similar) to the query than those listed at the bottom; (ii) the lists may not be conjoint (some species may appear in one ranking but not in the other); (iii) the ranking lists may vary in length (blast may return few prokaryotic matches in response to virus sequences in contrast to thousands of matches in cases of multiple-species prokaryotic families); (iv) two or more species from the database may achieve the same blast score and, therefore, occupy the same position on the ranking list (fig. c). a recently introduced similarity measure used for comparing the rankings of web search engine results [ ], the rank-biased overlap (rbo), satisfies these four conditions.

the rbo algorithm starts by scoring the overlap between the sub-lists containing the single top-ranked item of each list. it then proceeds by scoring the overlaps between sub-lists formed by the incremental addition of items further down the original lists. each consecutive iteration has less impact on the final rbo score: weights follow a geometric progression, which places heavier weights on higher-ranking items and down-weights the contribution of overlaps at lower ranks (see ‘methods’). an overall rbo score falls between 0 and 1, where 0 signifies that the lists are disjoint (have no items in common) and 1 means the lists are identical in content and order. our results indicate that the extent of the phage-host relationship can be estimated by the application of an rbo measurement to the ranking lists generated from blast results (fig. d).

phirbo differentiates between interacting and non-interacting phage-host pairs

to assess the discriminatory power of phirbo to recognize phage-host interactions, we used two published reference data sets: edwards et al. ( ) [ ], which contains , complete bacterial genomes and phages with reported hosts, and galiez et al.
( ) [ ] that has , complete prokaryotic genomes and , phage genomes. for each data set, we compared the distribution of phirbo scores between all known phage-host interaction pairs and the same number of randomly selected non-interacting phage-prokaryote pairs (fig. ). the scores obtained by phirbo in both data sets separated the interacting from non-interacting phage-host pairs more than the blast scores. the median phirbo score across interacting phage-host pairs was nearly , times greater than for non-interacting pairs, while the median blast score was three times higher for interacting pairs than non-interacting pairs (supplementary table ). both methods, however, differentiated between interacting and non-interacting phage-host pairs with higher accuracy than wish, the state-of-the-art, alignment-free, host prediction tool [ ].

to further examine the discriminatory power of phirbo across all possible phage-prokaryote pairs, we used receiver operating characteristic (roc) curves (fig. a,b). the area under the roc (auc), which measured the discriminative ability between interacting and non-interacting phage-host pairs, was higher for phirbo (auc = . ) in the edwards et al. and galiez et al. data sets than for blast (auc = . ) and wish (auc = . - . ). an additional advantage of phirbo was its capacity to score phage-host pairs whose sequence similarity could not be established by a direct blast comparison but, instead, through other, ‘intermediate’ prokaryotic sequences that were detectably similar to both phage and host query sequences.
for example, blast did not provide scores for % of the interacting phage-host pairs in the edwards et al. and galiez et al. data sets due to alignment score thresholds (supplementary table ). using the same blast lists, phirbo evaluated % of the interacting phage-host pairs. this high coverage indicated that nearly every pair of phage-prokaryote sequences could be related by at least one common prokaryotic sequence detectably similar to both the phage and host sequences.

phirbo has the highest host prediction performance

to evaluate host prediction performance, we used precision-recall (pr) curves, which provide more reliable information than roc when benchmarking imbalanced data sets for which the non-interacting pairs vastly outnumber the interacting pairs [ , ]. accordingly, we plotted pr curves for phirbo, blast, and wish predictions obtained from the edwards et al. (fig. a) and galiez et al. (fig. b) data sets. overall, phirbo performed better at host prediction at the species level than blast and wish, regardless of the data set. the area under the pr curve (aupr), which summarized overall performance, was higher in phirbo by percentage points (aupr = . - . ) than in blast (aupr = . - . ). phirbo also reported the highest f1 score (the harmonic mean of precision and recall [see ‘methods’]) in the edwards et al. and galiez et al. data sets (fig. ). specifically, the precision and recall of phirbo were - % and - %, respectively, while blast had precision and recall in the range of - % (fig. ). furthermore, phirbo yielded slightly higher specificity ( . - . %) and accuracy ( . - . %) than blast or wish.

phirbo preserves blast top-ranked host predictions

we further evaluated the host prediction accuracy of phirbo by selecting a top-scored prokaryotic sequence for each phage [ – , ]. briefly, host prediction accuracy is calculated as the percentage of phages whose predicted hosts have the same taxonomic affiliation as their respective known hosts (if multiple top-scoring hosts are present, the prediction is scored as correct if the true host is among the predicted hosts). phirbo recovered all hosts predicted by blast in the datasets by edwards et al. and galiez et al., achieving the same prediction accuracy as blast across all taxonomic levels (table ). of note, blast found multiple different host species with equal scores for phage genomes. this was observed in phages infecting bacteria from the enterobacteriaceae family and the rhodococcus and bacillus genera. however, phirbo assigned the highest score to the correct host species (supplementary table ). additionally, it refined the host prediction for the cronobacter phage ent sequence, which blast assigned to the escherichia coli genome. phirbo revealed cronobacter sakazakii as the primary host species, as the blast list of the cronobacter phage is more similar in content and order to the blast list of c. sakazakii (phirbo score = . ) than e. coli (phirbo score = . ) (figure s ).

as phirbo links phage to host through common sequences, the content of the sequence database was the main factor defining host prediction quality. since the similarity between viruses may indicate a common host [ , ], we expanded the two blast databases of prokaryotic sequences obtained from edwards et al. and galiez et al. with phage sequences (n = and n = , respectively), and recalculated phirbo scores between every phage-prokaryote pair.
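the accuracy measure defined earlier in this subsection (a prediction counts as correct when the known host taxon is among the top-scoring predicted hosts) can be sketched in python as follows. this is a minimal illustration, not the evaluation code used in the study: the function names, the dictionary layout, and the toy identifiers are assumptions made here.

```python
from typing import Dict, Set


def top_scoring_hosts(scores: Dict[str, float]) -> Set[str]:
    """Return every candidate host that shares the highest score."""
    if not scores:
        return set()
    best = max(scores.values())
    return {host for host, score in scores.items() if score == best}


def host_prediction_accuracy(
    predictions: Dict[str, Dict[str, float]],
    known_taxon: Dict[str, str],
    taxon_of_host: Dict[str, str],
) -> float:
    """Percentage of phages whose top-scoring host(s) include the known taxon.

    predictions   : phage -> {candidate host -> score} (hypothetical layout)
    known_taxon   : phage -> taxon of its known host at the chosen rank
    taxon_of_host : candidate host -> taxon at the same rank
    """
    if not predictions:
        return 0.0
    correct = 0
    for phage, scores in predictions.items():
        predicted_taxa = {taxon_of_host[h] for h in top_scoring_hosts(scores)}
        # A tie between several hosts is scored as correct if any of them
        # belongs to the known taxon.
        if known_taxon[phage] in predicted_taxa:
            correct += 1
    return 100.0 * correct / len(predictions)


# Toy data: phage_a has two tied top hosts, one of them in the known genus.
predictions = {
    "phage_a": {"host_1": 0.9, "host_2": 0.9, "host_3": 0.1},
    "phage_b": {"host_1": 0.2, "host_3": 0.8},
}
taxon_of_host = {"host_1": "escherichia", "host_2": "salmonella", "host_3": "bacillus"}
known_taxon = {"phage_a": "salmonella", "phage_b": "escherichia"}
accuracy = host_prediction_accuracy(predictions, known_taxon, taxon_of_host)
```

evaluating the same predictions at a different taxonomic level only requires swapping the two taxon dictionaries for ones built at that level.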
the phage-host linkage through homologous prokaryotic and phage sequences increased the host prediction accuracy of phirbo at all taxonomic levels, allowing correct identification of hosts at the genus level for - % of phages (table ). specifically, phirbo refined blast mis-predictions for phage genomes and showed which sequences demonstrated low similarity to the sequences of their host species. the direct blast alignments between these phage sequences and the sequences of their corresponding hosts obtained significantly lower scores than the alignments of the other known phage-host pairs (p = . × - , mann–whitney u test). notably, phirbo also assigned correct host species for phages whose hosts were not reported in the blast results, mainly chlamydia species, vibrio cholerae, and the opportunistic pathogen, acinetobacter baumannii.

phirbo is suitable for incomplete phage sequences

we tested the robustness of our host prediction algorithm to fragmentation of the phage sequence. following earlier studies [ , , ], phage genomes from edwards et al. and galiez et al. data sets were randomly subsampled to generate contigs of different lengths ( kb, kb, kb, kb, and kb) with replicates. host prediction accuracy was calculated as the mean percentage of phages whose predicted hosts had the same taxonomic affiliation as their respective known hosts (fig. ). although phirbo achieved the same host prediction accuracy as blast across all contig lengths, it had substantially higher overall performance in terms of auc and aupr (figure s ; p < − , wilcoxon signed-rank test). surprisingly, blast-based methods obtained higher host prediction accuracy across all contig lengths compared to wish, a tool designed to predict the hosts of short viral contigs (fig. ).

the host prediction accuracy of phirbo was examined using the expanded blast database of both prokaryotic and phage full-length sequences. to ensure fairness, for each tested phage contig we removed its corresponding full-length sequence from the blast database and recalculated phirbo scores between the phage contig and every prokaryotic sequence. this approach outperformed blast at every contig length across all taxonomic levels in both data sets (fig. ). generally, the host prediction accuracy of phirbo improved by - percentage points compared to the blast results. for example, when the contig length was kb, the prediction accuracy of phirbo was - % higher than blast at the family level, and - % higher than wish (fig. ; supplementary table ). phirbo also achieved the highest auc and aupr scores when discriminating between interacting and non-interacting phage-host pairs (figure s ).

phirbo uses multiple protein and non-coding rna signals for host prediction

we investigated the sequence information used by blast and phirbo for host prediction. for each phage that was correctly assigned to the host species by both tools (n = ), we calculated the fraction of the phage genome that was included in the segments aligned with prokaryotic sequences (sequence coverage). this analysis revealed that our tool used three times more phage sequence (median sequence coverage: %) than blast ( %) (figure s ; p < - , wilcoxon signed-rank test). this increased sequence coverage indicates that different genome regions of the phages map to the genomes of prokaryotic species other than the host species. for of the phages, more than half of their genomes were aligned to genomes of their host species (supplementary table ).
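the sequence coverage used in this analysis (the fraction of the phage genome that falls inside at least one aligned segment) can be computed by merging overlapping alignment intervals. the sketch below is a minimal illustration under stated assumptions: the function name `sequence_coverage`, the 1-based inclusive coordinates, and the toy intervals are hypothetical and are not taken from the phirbo code base.

```python
from typing import Iterable, Tuple


def sequence_coverage(segments: Iterable[Tuple[int, int]], genome_length: int) -> float:
    """Fraction of a genome covered by at least one aligned segment.

    Each segment is a (start, end) pair in 1-based, inclusive coordinates
    on the phage genome. Overlapping segments are counted only once.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    # Normalise orientation, then process segments from left to right.
    ordered = sorted((min(s, e), max(s, e)) for s, e in segments)
    covered = 0
    last_end = 0  # right-most position counted so far
    for start, end in ordered:
        start = max(start, last_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / genome_length


# Toy example: two overlapping segments and one separate segment
# on a 1,000 bp genome cover 150 + 101 = 251 positions.
coverage = sequence_coverage([(1, 100), (50, 150), (300, 400)], 1000)
```

pooling the aligned segments from every prokaryotic genome in a ranked list before calling the function gives the coverage over the whole list, whereas passing only the segments of the best match gives the coverage of a single top hit.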
such large regions of homology are likely prophages or phage debris left by large-scale recombination events during phage replication. the observed high sequence coverage points to virus taxa known for their temperate lifestyle and frequent recombination with host genomes (i.e., siphoviridae family as well as the peduovirinae and sepvirinae subfamilies).

to further examine the properties of sequences that may be exchanged between a phage and its host, we selected a population of phages with sequence coverage below % (n = ). these phages, which are less likely to represent complete prophages, belong to viral families (supplementary table ). next, we re-annotated the genomic sequences of the phages to find putative protein and non-coding rna (ncrna) genes. phage sequence regions used by phirbo for host predictions were significantly enriched (p < - ) in more than a hundred protein families of known or probable function. in contrast, only half of the protein families were used in blast-based host predictions (supplementary table ). the protein families used by phirbo covered most of the processes of the viral life cycle including dna replication, cell lysis, recombination, and packaging of the phage genome (fig. ). in contrast to blast, phirbo also exploited the information contained in phage ncrnas while assigning phages to host genomes. the vast majority of these ncrnas (> %) were trnas, which showed significant overrepresentation in the phage sequence fragments used by phirbo (p = × - ) (supplementary table ).
the remaining ncrnas belonged to group i introns ( %), rnas associated with genes associated with twister and hammerhead ribozymes ( %), skipping-rope rna motifs ( %), and less abundant rna families.

implementation and availability

predicting hosts from phage sequences using blast is accomplished by querying phage sequences against a database of candidate hosts. however, phirbo also uses information about sequence relatedness among prokaryotic genomes. therefore, it requires ranked lists of prokaryote species generated by blast for the phage and host genomes. the computational cost of querying every host sequence against the database of all candidate hosts using blast may still be a limiting factor. however, for mass host searches, the computational cost of all-versus-all host comparisons becomes marginal, as it must be done only once. after the relatedness among host genomes is established, the time required for phirbo host predictions is negligibly higher than the time for typical blast-based host predictions. for example, running phirbo between ranked lists of host species for , phages and , candidate hosts from galiez et al. (resulting in ~ . million phage-host comparisons) took minutes on a -core . ghz intel xeon.

as phirbo operates on rankings, blast can be replaced by an alternative sequence similarity search tool to reduce the time to estimate homologous relationships between host genomes. for instance, mash [ ] computed host relationships in minutes for the edwards et al. and galiez et al. data sets (see ‘methods’). the host prediction performance of phirbo using blast-based rankings for phages and mash-based rankings for host genomes is high compared to the performance of phirbo predictions using blast rankings for both phage and host genomes (supplementary table ).

we envisage phirbo as a natural extension to standard blast-based host predictions. the phirbo tool is written in python and freely available at https://github.com/aziele/phirbo/.
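the ranking and scoring steps described in the algorithm overview can be illustrated with a short python sketch. this is a minimal illustration and not the phirbo source code: the function names `rank_species` and `rbo`, the weighting parameter `p = 0.75`, and the toy bit-scores are assumptions made here. ties are not handled (species with equal scores simply keep their input order), and the sum is truncated at the length of the longer list without extrapolation, so two identical finite lists score below 1.

```python
from typing import Iterable, List, Sequence, Tuple


def rank_species(hits: Iterable[Tuple[str, float]]) -> List[str]:
    """Collapse BLAST hits into a ranking of species.

    `hits` holds (species, bit_score) pairs, one per matched genome.
    Hits are ordered by decreasing bit-score and each species is kept
    at its first appearance only.
    """
    ranking: List[str] = []
    seen = set()
    for species, _score in sorted(hits, key=lambda hit: hit[1], reverse=True):
        if species not in seen:
            seen.add(species)
            ranking.append(species)
    return ranking


def rbo(first: Sequence[str], second: Sequence[str], p: float = 0.75) -> float:
    """Truncated rank-biased overlap of two rankings without ties.

    At each depth d the agreement is the size of the intersection of the
    two top-d prefixes divided by d; agreements are weighted by the
    geometric series (1 - p) * p**(d - 1), so top ranks count the most.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    depth = max(len(first), len(second))
    score = 0.0
    for d in range(1, depth + 1):
        agreement = len(set(first[:d]) & set(second[:d])) / d
        score += p ** (d - 1) * agreement
    return (1.0 - p) * score


# Toy BLAST results (species, bit-score) for a phage and two candidate hosts.
phage_hits = [("escherichia coli", 910.0), ("shigella flexneri", 640.0),
              ("escherichia coli", 600.0), ("salmonella enterica", 220.0)]
host_hits = [("escherichia coli", 5000.0), ("shigella flexneri", 4100.0),
             ("klebsiella pneumoniae", 900.0)]
other_hits = [("bacillus subtilis", 4800.0), ("listeria monocytogenes", 700.0)]

phage_ranking = rank_species(phage_hits)
related_score = rbo(phage_ranking, rank_species(host_hits))
unrelated_score = rbo(phage_ranking, rank_species(other_hits))
```

in this toy case the phage and the first candidate host share their two top-ranked species, so `related_score` is well above `unrelated_score`, which is zero because the two lists have no species in common. the prefix sets are rebuilt at every depth for readability; an incremental overlap count would be preferable for rankings with thousands of entries.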
discussion

the identification of similar sequence regions between host and phage genomes using blast has been a baseline for the identification of putative virus-host connections in numerous metagenomic projects [ , , ]. however, a blast search requires regions with significant similarity between the query phage and host [ – ]. yet, many phage and host sequences lack sufficient similarity and escape detection with standard blast searches. to tackle this issue, alignment-free tools have been developed to predict hosts from phage sequences [ – , ]. the rationale behind these tools is based on the observation that viruses tend to share similar patterns in codon usage or short sequence fragments with their hosts [ – ]. as virus replication is dependent on the translational machinery of its host, some phages adapt their codon usage to match the availability of trnas during viral replication in the host cell [ – ]. similar oligonucleotide frequency use may be driven by evolutionary pressure on the virus to avoid recognition by host restriction enzymes and crispr/cas defense systems [ , ]. although state-of-the-art alignment-free tools (i.e., wish [ ] and virushostmatcher [ ]) can rapidly assess sequence similarity between any pair of phage and prokaryote sequences, they are less accurate for host prediction than blast [ , ]. the relatively high accuracy of blast suggests that localized similarities of genetic material may be a stronger indication of phage-host interactions than global convergence of their genomic composition.
this evidence comes in the form of protein-coding dna fragments and non-coding rnas. the latter group is dominated by trna genes, which are strongly over-represented in direct blast alignments between phages and their hosts, and are even more prevalent among indirect connections used by phirbo. this may be important, as previous studies have shown that not all phage trna genes come directly from their hosts. some appear to be derived from genomes of other, often distantly related, bacteria and may be the result of earlier evolutionary events [ ]. for protein-coding genes, a more diverse picture emerges. proteins rich in phage-host blast alignments can be assigned into different functional categories including phage virion components, replication-related proteins, regulatory factors, and proteins involved in the metabolism of the host. the transfer of some over-represented families in phages and/or prophages has been previously reported (e.g., lytic proteins, dna replication and recombination proteins, and enzymes involved in nucleotide and energy metabolisms [ ]) and some of these genes are connected with the phage-host range [ , ]. however, no clear pattern emerges after analyzing the functions of the remaining, over-represented proteins. in this study, we attempted to expand the information content of a single local alignment of phage and host sequences by incorporating the results of multiple local alignments between a phage sequence and different prokaryotic genomes. this approach may more closely resemble a manual assignment of phage-host pairs, where an expert analyst not only considers a top-ranked matching prokaryote in the blast results, but also uses the information contained in other, less significant, matches and their sequence and taxonomic similarity. through a taxonomically-aware stratification scheme, this approach tracks the multilateral dynamics of horizontal gene transfer. 
therefore, we propose to relate phage and host sequences through multiple intermediate sequences that are detectably similar to both the phage and host sequences. by linking phage and host sequences through similar sequences, phirbo achieved a more comprehensive list of phage-host interactions than blast. simultaneously, phirbo was capable of assessing almost all phage-host pairs, bringing the method closer to alignment-free tools, which compute scores between all possible phage and host pairs. thus, our approach can be directly applied to different phage and prokaryote data sets without training or optimizing the underlying rbo algorithm. we intentionally avoided machine learning components in phirbo to ensure the general applicability of the approach and avoid possible overfitting. our results show that expanding the information obtained from plain similarity comparisons by incorporating taxonomically-grounded measurements of phage-host similarity leads to improved accuracy of phage-host predictions. the phirbo method provides the phage research community with an easy-to-use tool for predicting the host genus and species of query phages, which is usable when searching for phages with appropriate host specificity and for correlating phages and hosts in ecological and metagenomic studies. methods virus and prokaryotic host data sets the data sets analyzed in this study were retrieved from two previously published phage-host studies [ , ]. the first set (edwards et al.
[ ]) contained , complete bacterial genomes obtained from ncbi refseq and refseq genomes of phages for which the host was reported. the data set encompassed , known virus-host interaction pairs and , , pairs for which interaction was not reported (non-interacting phage-host pairs). the second data set (galiez et al. [ ]) contained , complete prokaryotic genomes of the kegg database and phages for which host species were reported in the refseq virus database. the data set consisted of , interacting- and , , non-interacting virus-host pairs. phirbo score the interaction score for a given phage-host pair was calculated using the rbo metric. rbo [ ] is a measurement of rank similarity that compares two lists of different lengths (giving more attention to high ranks on the lists). rbo ranges from 0 to 1, where a greater value indicates greater similarity between lists. equation 1 was used for the calculation of the rbo value between two ranking lists, $S$ and $T$.

$$RBO(S, T, p) = (1 - p) \sum_{d=1}^{n} p^{d-1} A(S, T, d) \qquad (1)$$

where the parameter $p$ ($0 < p < 1$) determines how steeply the weight declines (the smaller the $p$, the more top results are weighted). when $p = 0$, only the top-ranked item is considered, and the rbo score is either zero or one. in this study, we set p to . , which assigned ~ % of the weight to the first hosts. $A(S, T, d)$ is the value of overlap between the two ranking lists, $S$ and $T$, up to rank $d$, calculated by eq. 2. $n$ is the number of distinct ranks on the ranking list.

$$A(S, T, d) = \frac{|S_{:d} \cap T_{:d}|}{|S_{:d} \cup T_{:d}|} \qquad (2)$$

where $S_{:d}$ and $T_{:d}$ represent the elements present in the first $d$ ranks of lists $S$ and $T$, respectively.
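the rbo definition above can be transcribed almost directly into python. the sketch below is an illustration under stated assumptions, not the released phirbo implementation: rankings are lists of sets (one set per rank, so ties are allowed), n is taken as the larger number of ranks in the two lists, and the sum is the plain truncated sum of the formula. the function names `overlap` and `rbo` are hypothetical. because the sum is truncated, two identical lists with n ranks score 1 - p^n rather than exactly 1; whether the released tool corrects for this is not stated in the text.

```python
def overlap(s, t, d):
    """A(S, T, d): shared items divided by all items in the first d ranks."""
    s_items = set().union(*s[:d])
    t_items = set().union(*t[:d])
    union = s_items | t_items
    return len(s_items & t_items) / len(union) if union else 0.0


def rbo(s, t, p):
    """Truncated rank-biased overlap between two rankings with ties.

    s, t: lists of sets, where index 0 is the top rank.
    p:    weight decay parameter, 0 <= p < 1.
    """
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    n = max(len(s), len(t))  # assumption: use the longer list's rank count
    total = sum(p ** (d - 1) * overlap(s, t, d) for d in range(1, n + 1))
    return (1 - p) * total


same = [{"a"}, {"b"}, {"c"}]
score_same = rbo(same, same, 0.5)          # 1 - 0.5**3
score_disjoint = rbo([{"a"}], [{"b"}], 0.5)
score_top_only = rbo([{"a"}, {"b"}], [{"a"}, {"c"}], 0)
```

with p = 0 only the first rank contributes, which matches the statement that the score is then either zero or one when each list has a single top-ranked item.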
host prediction tools the host prediction tools blast [ ], wish [ ], and phirbo were run separately on the edwards et al. and galiez et al. data sets. for each tool, sequence similarity scores were calculated across all combinations of phage-host pairs. blast . . + [ ] was run with default parameters (task: blastn, e-value threshold = ) to query each phage sequence against a database of candidate host genomes. for each phage-host pair, the highest bit-score among its blast alignments was reported (for phage-host pairs that were absent in the blast results, a bit-score of was assigned). for rbo host prediction, an additional blast search was performed to establish ranked lists of genetically similar host genomes. specifically, a nucleotide blast was run with default parameters to query each host sequence against a database of candidate host genomes. as an alternative to blast, mash . [ ] was used with default parameters (k-mer size = , sketch size = , ) to establish ranked lists for each host by comparing its sequence against the database of candidate host genomes. rbo scores were calculated between all pairwise combinations of phage and host ranking lists. wish . [ ] was used with default parameters to calculate log-likelihood scores between all pairwise combinations of phage-host sequences. evaluation metrics the metrics of host prediction performance (i.e., auc, aupr, recall, precision, specificity, and accuracy) were calculated using sklearn [ ]. the optimal score thresholds for calculating recall, precision, specificity, and accuracy were chosen to maximize the f score, which is the harmonic mean of precision and recall. host prediction accuracy was evaluated analogously to previous studies [ , , ]. specifically, for each query phage, the host with the highest score to the query virus was selected as the predicted host.
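the threshold selection described above (choosing the score cut-off that maximizes the harmonic mean of precision and recall) can be sketched without any dependencies. the study itself used sklearn; the function below is a hypothetical stand-in written in plain python, and the scores and labels are made-up illustration data.

```python
def best_f_threshold(scores, labels):
    """Return (threshold, f_score) maximizing the harmonic mean of
    precision and recall, predicting 'interacting' when score >= threshold.

    scores: similarity scores, one per phage-host pair.
    labels: 1 for interacting pairs, 0 for non-interacting pairs.
    """
    total_pos = sum(labels)
    best_thr, best_f = None, 0.0
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        if tp == 0:
            continue  # precision and recall are both zero or undefined
        precision = tp / (tp + fp)
        recall = tp / total_pos
        f_score = 2 * precision * recall / (precision + recall)
        if f_score > best_f:
            best_thr, best_f = thr, f_score
    return best_thr, best_f


# Made-up scores and labels, for illustration only.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]
threshold, f_score = best_f_threshold(scores, labels)
```

every distinct score is tried as a candidate cut-off, which is quadratic in the number of pairs; for data sets with millions of pairs a sorted single-pass variant or a library routine would be the practical choice.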
in cases where multiple hosts were predicted, the prediction was scored as correct if the correct host was among the predictions. the prediction accuracy was calculated at each taxonomic level as the percentage of viruses whose predicted hosts shared a taxonomic affiliation with known hosts. phage genome annotation to define phage genes potentially exchanged between phage and host genomes, we re-annotated phage genomes that were correctly assigned to host species by both phirbo and blast. the genes were classified into predefined pvogs (prokaryotic virus orthologous groups) [ ] and rna families [ ]. briefly, open reading frames (orfs) in the analyzed phage genomes were identified using transeq from emboss [ ]. the orfs were then assigned to the respective orthologue group by hmmsearch (e-value < - ) against the database of hidden markov models (hmms) created for each of the , pvog alignments using hmmbuild of hmmer v . . [ ]. non-coding rnas (ncrnas) were predicted in the phage genomes (e-value < - ) using rfam covariance models v . [ ] and the infernal tool v . . [ ]. we counted the number of times each pvog and rfam term was present in phage sequences used by blast and phirbo during host prediction. to determine whether the observed level of pvog/rfam counts was significant within the context of all the terms within the phage genome, we calculated the p-value using the hypergeometric distribution implemented in scipy [ ]. acknowledgments we thank bas dutilh, rob edwards, clovis galiez, and johannes söding for providing us with the benchmark data sets used in their studies.
we likewise acknowledge william webber for assistance with modifying the rbo formula to account for tied ranks. the computations were performed at the poznan supercomputing and networking center. author contributions az conceived the project and designed the experiments. az and jb wrote phirbo and tested its performance. wmk provided the conceptual framework for sequence comparisons through intermediate sequences and reviewed the software and manuscript. az and jb analyzed the results and wrote the paper. all authors read and approved the final manuscript. figure legends figure . calculation of the interaction score between phage and host sequences. a. the blast search of phage and prokaryote sequences against a reference dataset results in b. two blast lists containing prokaryote matches ordered by decreasing similarity (i.e., bit-score). c. blast lists were converted into rankings of prokaryote species. the ranked lists differ in content: yersinia rohdei and y. ruckeri are present in the first ranking list but absent in the second list, while shigella dysenteriae and erwinia toletana are only present in the second list. two species, y. rohdei and y. ruckeri, from the first blast search have the same scores and are consequently tied for the same rank. d. an interaction score was calculated between two ranking lists using rank-biased overlap. figure . discriminatory power of phirbo, blast, and wish scores to differentiate between interacting and non-interacting phage-host pairs. phage-host pairs were obtained from a. edwards et al. and b. galiez et al. data sets.
box plots show the distribution of scores for all interacting phage-host pairs (n = , and n = , in edwards et al. and galiez et al., respectively) and the same number of randomly selected, non-interacting phage-host pairs. the horizontal line in each box displays the median; boxes display the first and third quartiles; whiskers depict lowest and highest non-outlier scores (details of distributions including outliers are provided in supplementary table ). receiver operating characteristic curves and the corresponding area under the curve (auc) display the classification accuracy of phage–host predictions across all possible phage-host pairs. dashed lines represent the levels of discrimination expected by chance. figure . host prediction performance of phirbo, blast, and wish. the performance is provided by precision-recall (pr) curves and statistical measures (i.e., f score, precision, recall, specificity, and accuracy) separately for a. edwards et al. and b. galiez et al. data sets. dashed lines in the pr-curve plots represent the levels of discrimination expected by chance. score cut-offs for each tool were set to ensure the highest f score. figure . host prediction accuracy over phage contig length. prediction accuracy is provided separately for a. edwards et al. and b. galiez et al. data sets. each complete virus genome was randomly subsampled times for different sequence lengths (i.e., kb, kb, kb, kb, and kb). hosts were predicted on each subsampling replicate by selecting a prokaryotic sequence with the highest similarity to the query viral sequence. points indicate the average of the resulting accuracies for all the viruses at a given subsampling length and host taxonomic level (i.e., species, genus, and family). an extended version of this figure containing host prediction accuracy values is provided in supplementary table .
figure . functional classification of phage coding sequences used by phirbo for host prediction. protein families (pvogs) were classified into functions related to phage-cycle (e.g., dna replication, transcription). numbers in the dark circles indicate the number of different pvogs related to a given function. an extended version of this figure containing the list of pvogs is provided in supplementary table . tables table . host prediction accuracies (%) for phage and host genomes from the data sets by edwards et al. [ ] and galiez et al. [ ]. dataset method species genus family order class phylum edwards et al. ( ) wish blast phirbo* phirbo (+phages)† galiez et al. ( ) wish blast phirbo* phirbo (+phages)† the highest accuracies among the methods for each taxonomic level are in bold. * interaction scores were calculated using rank-biased overlap (rbo) between blast lists containing prokaryotic sequences. specifically, the blast database contained , sequences of bacterial genomes in the edwards et al. data set, and , sequences of bacterial and archaeal genomes in the galiez et al. data set. † interaction scores were calculated using rbo between blast lists containing both prokaryotic and phage sequences.
supplementary figures supplementary figure . host predictions for cronobacter phage ent (refseq accession: nc_ ) using a. blast and b. phirbo. querying the cronobacter phage sequence with a blast search against the host database returned the genomic sequence of escherichia coli (nc_ ) as the best match (bit-score = , ), and cronobacter sakazakii (nc_ ) as the second-best match (bit-score = , ). phirbo predicted cronobacter sakazakii as the top-scoring host for the cronobacter phage due to the highest extent of overlap between the top-ranking blast matches of each sequence (nc_ and nc_ ) of the same database. for clarity, only the first ten blast matches are shown. supplementary figure . host prediction performance of phirbo, blast and wish over phage contig length in terms of a. area under the curve (auc) and b. area under the precision-recall curve (aupr). bars indicate the auc or aupr averaged across replicates at a given subsampling length of phage sequence. supplementary figure . scatter plot of the phage sequence coverage used in host predictions of phirbo versus that of blast. each dot represents a phage genome. supplementary tables supplementary table .
distribution of phirbo, blast and wish scores among interacting and non-interacting phage-host pairs obtained from edwards et al. and galiez et al. data sets. score ranges were summarized separately for , interacting and non-interacting phage-host pairs from edwards et al., and , interacting and non-interacting phage-host pairs from galiez et al. supplementary table . number of phage-host pairs evaluated by phirbo, blast, and wish in edwards et al. and galiez et al. data sets. supplementary table . phages assigned by blast to multiple, equally-scored host species. phirbo differentiated between host species and provided the highest score to primary host species. supplementary table . host prediction accuracy of phirbo, blast, and wish over phage contig length. supplementary table . phage sequence coverage of phages correctly assigned by blast and phirbo to their host species. sequence coverage was calculated for each phage as the sum of the lengths of its non-overlapping high-scoring pairs to the genome of the correct host species, divided by the size of the query-phage genome. prophages were assumed to have sequence coverage greater than or equal to %. supplementary table . summary of taxonomic affiliations of phages that had sequence coverage < % with the host species genomes. supplementary table . protein families present in sequence regions of phage genomes that were used by blast and/or phirbo in host prediction. the table provides information on each protein family (prokaryotic virus orthologous group (pvog)) used by blast and phirbo, including: (i) pvog description and functional assignment (manually curated), (ii) pvog count (number of times a given pvog was present in the phage genome, as well as in sequences used by blast or phirbo), (iii) pvog percentage (pvog count divided by pvog count in the genome), and (iv) p-value of pvog enrichment. supplementary table .
rna families present in sequence regions of phage genomes that were used by blast and phirbo in host prediction. the table provides information on each rfam family used by blast and phirbo. supplementary table . comparison of phirbo’s host prediction performance between blast-based and mash-based rankings of prokaryotic species.
references
suttle ca. marine viruses--major players in the global ecosystem. nat rev microbiol. ; : – .
breitbart m, bonnain c, malki k, sawaya na. phage puppet masters of the marine microbial realm. nat microbiol. ; : – .
roux s, brum jr, dutilh be, sunagawa s, duhaime mb, loy a, et al. ecogenomics and potential biogeochemical impacts of globally abundant ocean viruses. nature. ; : – .
norman jm, handley sa, baldridge mt, droit l, liu cy, keller bc, et al. disease-specific alterations in the enteric virome in inflammatory bowel disease. cell. ; : – .
manrique p, bolduc b, walk st, van der oost j, de vos wm, young mj. healthy human gut phageome. proc natl acad sci u s a. ; : – .
meyer jr. sticky bacteriophage protect animal cells. proceedings of the national academy of sciences of the united states of america. proceedings of the national academy of sciences; . pp. – .
reardon s. phage therapy gets revitalized. nature. ; : – .
salmond gpc, fineran pc. a century of the phage: past, present and future. nat rev microbiol. ; : – .
svoboda e. bacteria-eating viruses could provide a route to stability in cystic fibrosis. nature. ; : s –s .
dedrick rm, guerrero-bustamante ca, garlena ra, russell da, ford k, harris k, et al. engineered bacteriophages for treatment of a patient with a disseminated drug-resistant mycobacterium abscessus. nat med. ; : – .
samson je, moineau s. bacteriophages in food fermentations: new frontiers in a continuous arms race. annu rev food sci technol. ; : – .
sulakvelidze a. using lytic bacteriophages to eliminate or significantly reduce contamination of food by foodborne bacterial pathogens. j sci food agric. ; : – .
paez-espino d, eloe-fadrosh ea, pavlopoulos ga, thomas ad, huntemann m, mikhailova n, et al. uncovering earth’s virome. nature. ; : – .
edwards ra, mcnair k, faust k, raes j, dutilh be. computational approaches to predict bacteriophage–host relationships. fems microbiol rev. ; : – .
ahlgren na, ren j, lu yy, fuhrman ja, sun f. alignment-free d_ ^* oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. nucleic acids res. ; : – .
galiez c, siebert m, enault f, vincent j, söding j. wish: who is the host? predicting prokaryotic hosts from metagenomic phage contigs. bioinformatics. ; : – .
andersson af, banfield jf. virus population dynamics and acquired virus resistance in natural microbial communities. science. ; : – .
wang w, ren j, tang k, dart e, ignacio-espinoza jc, fuhrman ja, et al. a network-based integrated framework for predicting virus-prokaryote interactions. nar genom bioinform. ; : lqaa .
zhang m, yang l, ren j, ahlgren na, fuhrman ja, sun f. prediction of virus-host infectious association by supervised learning methods. bmc bioinformatics. ; : .
altschul sf, madden tl, schäffer aa, zhang j, zhang z, miller w, et al. gapped blast and psi-blast: a new generation of protein database search programs. nucleic acids res. ; : – .
lima-mendez g, faust k, henry n, decelle j, colin s, carcillo f, et al. ocean plankton. determinants of community structure in the global plankton interactome. science. ; : .
flores co, meyer jr, valverde s, farr l, weitz js. statistical structure of host-phage interactions. proc natl acad sci u s a. ; : e - .
webber w, moffat a, zobel j. a similarity measure for indefinite rankings. acm trans inf syst. ; : – .
saito t, rehmsmeier m. the precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets. plos one. ; : e .
davis j, goadrich m. the relationship between precision-recall and roc curves. proceedings of the rd international conference on machine learning - icml ’ . new york, new york, usa: acm press; . doi: . / .
villarroel j, kleinheinz ka, jurtz vi, zschach h, lund o, nielsen m, et al. hostphinder: a phage host prediction tool. viruses. ; . doi: . /v
ondov bd, treangen tj, melsted p, mallonee ab, bergman nh, koren s, et al. mash: fast genome and metagenome distance estimation using minhash. genome biol. ; . doi: . /s - - -x
gao nl, zhang c, zhang z, hu s, lercher mj, zhao x-m, et al. mvp: a microbe–phage interaction database. nucleic acids res. ; : d –d .
paez-espino d, roux s, chen i-ma, palaniappan k, ratner a, chu k, et al. img/vr v. . : an integrated data management and analysis system for cultivated and environmental viral genomes. nucleic acids res. ; : d –d .
roux s, hallam sj, woyke t, sullivan mb. viral dark matter and virus-host interactions resolved from publicly available microbial genomes. elife. ; . doi: . /elife.
lawrence jg, ochman h. amelioration of bacterial genomes: rates of change and exchange. j mol evol. ; : – .
pride dt, wassenaar tm, ghose c, blaser mj. evidence of host-virus co-evolution in tetranucleotide usage patterns of bacteriophages and eukaryotic viruses. bmc genomics. ; : .
carbone a. codon bias is a major factor explaining phage evolution in translationally biased hosts. j mol evol. ; : – .
sharp pm, rogers ms, mcconnell dj. selection pressures on codon usage in the complete genome of bacteriophage t . j mol evol. ; : – .
morgado s, vicente ac. global in-silico scenario of trna genes and their organization in virus genomes. viruses. ; : .
sousa jam de, pfeifer e, touchon m, rocha epc. genome diversification via genetic exchanges between temperate and virulent bacteriophages. biorxiv. biorxiv; . doi: . / . . .
shapiro jw, putonti c. gene co-occurrence networks reflect bacteriophage ecology and evolution. mbio. ; . doi: . /mbio. -
hernandes coutinho f, zaragosa-solas a, lópez-pérez m, barylski j, zielezinski a, dutilh be, et al. rafah: a superior method for virus-host prediction. biorxiv. biorxiv; . doi: . / . . .
camacho c, coulouris g, avagyan v, ma n, papadopoulos j, bealer k, et al. blast+: architecture and applications. bmc bioinformatics. ; : .
pedregosa f, varoquaux g, gramfort a, michel v, thirion b, grisel o, et al. scikit-learn: machine learning in python. j mach learn res. ; : – .
grazziotin al, koonin ev, kristensen dm. prokaryotic virus orthologous groups (pvogs): a resource for comparative genomics and protein family annotation. nucleic acids res. ; : d –d .
kalvari i, nawrocki ep, ontiveros-palacios n, argasinska j, lamkiewicz k, marz m, et al. rfam : expanded coverage of metagenomic, viral and microrna families. nucleic acids res. . doi: . /nar/gkaa
rice p, longden i, bleasby a. emboss: the european molecular biology open software suite. trends genet. ; : – .
finn rd, clements j, eddy sr. hmmer web server: interactive sequence similarity searching. nucleic acids res. ; : w - .
nawrocki ep, eddy sr. infernal . : -fold faster rna homology searches. bioinformatics. ; : – .
virtanen p, gommers r, oliphant te, haberland m, reddy t, cournapeau d, et al. scipy . : fundamental algorithms for scientific computing in python. nat methods. ; : – .
[figure: calculation of the interaction score, panels a–d: blast of the phage dna sequence (p) and host dna sequence (h) against the reference prokaryote dna database (d), match lists with scores, species rankings, and rank-biased overlap (rbo) between the compared rankings]
[figure: similarity score distributions (interaction vs non-interaction) and roc curves (true positive rate vs false positive rate, with auc) for phirbo, blast, and wish, panels a and b]
[figure: precision-recall curves (precision vs recall, with aupr) and f score, recall, precision, specificity, and accuracy at the chosen score cut-offs for phirbo, blast, and wish, panels a and b]
[figure: prediction accuracy (%) vs sequence length (kb) at species, genus, and family levels for phirbo (+phages), blast, phirbo, and wish, panels a and b]
[figure: functional categories of phage coding sequences: capsid, head, collar, tail, baseplate, fiber, spike, full phage assembly, dna replication, genome packaging, transcription, cell lysis, host defence systems, amino acid metabolism, energy metabolism, nucleotide metabolism, bacterial chromosome integration / recombination, antibiotic resistance, other functions]
human cell-dependent, directional, time-dependent changes in the mono- and oligonucleotide compositions of sars-cov- genomes

yuki iwasaki , takashi abe , toshimichi ikemura
. department of bioscience, nagahama institute of bio-science and technology, shiga, japan
. graduate school of science and technology, niigata university, niigata, japan

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc . international license (http://creativecommons.org/licenses/by-nc/ . /).

abstract

background
when a virus that has grown in a nonhuman host starts an epidemic in the human population, human cells may not provide growth conditions ideal for the virus. therefore, the invasion of severe acute respiratory syndrome coronavirus- (sars-cov- ), which is usually prevalent in the bat population, into the human population is thought to have necessitated changes in the viral genome for efficient growth in the new environment. in the present study, to understand host-dependent changes in coronavirus genomes, we focused on the mono- and oligonucleotide compositions of sars-cov- genomes and investigated how these compositions changed over time in the human cellular environment. we also compared the oligonucleotide compositions of sars-cov- and other coronaviruses prevalent in humans or bats to investigate the causes of changes in the host environment.

results
time-series analyses of changes in the nucleotide compositions of sars-cov- genomes revealed a group of mono- and oligonucleotides whose compositions changed in a common direction for all clades, even though viruses belonging to different clades should evolve independently. interestingly, the compositions of these oligonucleotides changed towards those of coronaviruses that have been prevalent in humans for a long period and away from those of bat coronaviruses.

conclusions
clade-independent, time-dependent changes are thought to have biological significance and should relate to viral adaptation to a new host environment, providing important clues for understanding viral host adaptation mechanisms.

keywords
“covid- ”, “sars-cov- ”, “oligonucleotide composition”, “time-series analysis”, “big data”, “zoonotic virus”, “rna virus”, “viral adaptation”, “coronavirus”

background
severe acute respiratory syndrome coronavirus- (sars-cov- ), an rna virus belonging to the betacoronavirus genus, began to spread in the human population in . this viral strain is believed to have been originally prevalent in bats and transferred to the human population through intermediate hosts [ ]. viral growth requires a wide variety of host factors (nucleotide pools, proteins, rna, etc.), and the virus must evade the diverse antiviral mechanisms of host cells (antibodies, killer t cells, interferon, rna interference, etc.) [ - ]. since ancestral sars-cov- strains are thought to be endemic in bats, they should be well adapted to their host environment; when the virus invades the human population, human cells may not provide growth conditions ideal for the virus. for efficient growth and rapid spread of the infection, changes in the viral genome are likely to be required. analyses of time-dependent changes in sars-cov- in the human population can be used to characterize how and why viral genomes change to adapt to a new host environment. due to the great threat of covid- and the remarkable development of sequencing technology, a massive number of sars-cov- genome sequences are
available in databases, even though the epidemic has lasted for only approximately months. these sequence data have provided a wide range of insights into sars-cov- [ , ]. phylogenetic methods based on sequence alignment have been widely used in molecular evolution studies [ , ]; these methods are well refined and essential for studying phylogenetic relationships between different viral species and variations within the same viral species at the single-nucleotide level. however, when dealing with a massive number of genome sequences, methods based on sequence alignment become problematic because they require a large amount of computational resources. we have continued to develop sequence alignment-free methods focused on the oligonucleotide compositions of genome sequences [ - ]. notably, oligonucleotide composition varies widely among species, including viruses, and is therefore referred to as a genome signature [ ]. these compositions can be treated as numerical data, so a massive amount of sequence data can easily be subjected to various statistical analyses. furthermore, even genomic fragments without orthologous and/or paralogous pairs can be compared [ , , - ].
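as an illustration of how an oligonucleotide composition can be treated as numerical data, the following is a minimal sketch rather than the authors' implementation. the function name `oligo_composition`, the use of overlapping windows, the conversion of t to u and the skipping of windows that contain ambiguous bases are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"


def oligo_composition(seq, k):
    """Return the percentage of every k-mer over A/C/G/U in one genome.

    Assumptions (not taken from the paper): windows overlap, T is read
    as U, and windows containing any other symbol (e.g. N) are skipped.
    """
    seq = seq.upper().replace("T", "U")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(ALPHABET):
            counts[kmer] += 1
    total = sum(counts.values())
    composition = {}
    for letters in product(ALPHABET, repeat=k):
        kmer = "".join(letters)
        # Report 0.0 for every k-mer when no valid window exists.
        composition[kmer] = 100.0 * counts[kmer] / total if total else 0.0
    return composition
```

calling `oligo_composition(genome, 1)` gives the mononucleotide composition (%), and `k = 2` or `k = 3` gives the di- and trinucleotide compositions; because every genome maps to a vector of the same length, genomes can be compared without alignment.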
specifically, our previous work on influenza a-type virus genomes found that the oligonucleotide compositions of the viral genomes differed between hosts (e.g., humans and birds), even for viruses within the same subtype (e.g., h n and h n of type a) [ , , ]; we also examined changes in the oligonucleotide compositions of influenza h n / , which has been epidemic in humans since , and found that its compositions changed to approach those of the seasonal flu strains h n and h n [ ]. furthermore, although epidemics of the h n and h n strains began several decades apart, these strains showed highly similar chronological changes from the start of these epidemics. these evolutionary yet reproducible changes suggest that mutations that adapt a virus to a new host environment inevitably accumulate when its host species changes, and that these changes can be efficiently detected by analyzing oligonucleotide compositions.

several groups, including ours, have examined changes in sars-cov- genomes during the early stages of the sars-cov- epidemic and found clear directional changes in a group of mono- and oligonucleotides that are detectable on even a monthly basis [ , , ]. these directional changes should allow us to predict changes in the near future. notably, near-future prediction and verification should be the most direct ways to test the reliability of the obtained results, models and ideas (e.g., those discovered for influenza viruses), providing a new paradigm for molecular evolutionary studies.
in this context, the present study analyzed the genome sequences of over seventy thousand sars-cov- strains isolated from december to september .

results

directional changes in the mononucleotide compositions (%) of sars-cov-

for fast-evolving rna viruses, diversity within the viral population arises rapidly as the epidemic progresses and subpopulation structure forms; the gisaid consortium has defined at least seven main clades (g, gh, gr, l, v, s and others). notably, the elementary processes of molecular evolution are based on random mutations, and strains belonging to different clades are thought to have evolved independently. therefore, highly similar time-dependent changes that occur independently of clade are likely to have biological significance and may be inevitable for efficient growth in human cells. from this perspective, we first examined time-dependent changes in the mononucleotide compositions (%) of sars-cov- strains isolated from december to september . among the seven clades (g, gh, gr, l, v, s and others) reported by the gisaid consortium, we used six clades (g, gh, gr, l, v and s), excluding others, in the analysis. for the time-series analysis, we calculated the average mononucleotide compositions (%) of the genomes in each clade collected monthly; in fig. a, the mononucleotide composition of each clade is shown as a colored line, while that for the monthly collected genomes belonging to all clades is shown as a dashed line.
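the monthly averaging described here can be sketched as follows. this is an illustration under stated assumptions, not the authors' code: the record layout `(clade, month_index, percent)`, the function name `monthly_clade_means` and the pooled key `"all"` (standing in for the dashed line) are hypothetical, and the minimum number of strains per month is left as a required argument because its value is not recoverable from this text. applying that threshold to the pooled series as well is also an assumption.

```python
from collections import defaultdict


def monthly_clade_means(records, min_strains):
    """Average a composition value per clade and elapsed month.

    records: iterable of (clade, month_index, percent) tuples, one per genome.
    min_strains: months with fewer genomes than this are dropped
                 (the threshold used in the paper is not given here).
    Returns {clade: {month_index: mean_percent}}, plus a pooled "all" series.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (clade, month) -> [sum, count]
    for clade, month, value in records:
        for key in ((clade, month), ("all", month)):
            sums[key][0] += value
            sums[key][1] += 1
    means = defaultdict(dict)
    for (clade, month), (total, n) in sums.items():
        if n >= min_strains:
            means[clade][month] = total / n
    return dict(means)
```

each per-clade series corresponds to one colored line in the figure, and the `"all"` series corresponds to the dashed line.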
regardless of clade, the composition of c decreased, while that of u increased in a time-dependent manner, but the changes in a and g composition were less clear (fig. a). correlation coefficients between the mononucleotide composition and the number of months elapsed since the start of the epidemic showed a high negative correlation for c and a high positive correlation for u for all clades, but there was no clear directionality for a and g (fig. a and tables , ). these results indicate either that the mononucleotide composition of this virus is prone to biased mutations that reduce c and increase u, or that the mutated strains tend to be more favorable for growth in human cells.

directional changes in short oligonucleotide compositions

oligonucleotides are known to act as functional motifs, such as binding sites for a wide variety of proteins and target sites for rna modifications. therefore, directional changes in some oligonucleotides independent of clade may relate to certain processes for adaptation to the new host environment. our previous work on influenza a viruses found that their oligonucleotide compositions varied among prevalent hosts [ , ]; notably, although influenza viruses isolated from humans tended to prefer a and u (but not g and c) more than viruses isolated from birds, the human viruses showed a preference for ggcg and gggg, which are g- or c-rich. importantly, there are various examples of oligonucleotides whose changes in composition cannot be explained by changes in mononucleotide composition alone, and these changes may relate to the molecular mechanisms of viral adaptation to a new host. from this perspective, we next analyzed time-dependent changes in di- and trinucleotide compositions and found that a group of di- and trinucleotides showed a highly positive or negative correlation (figs. b, s and tables , ). interestingly, a
group of a- or g-rich oligonucleotides, such as gag and gga, showed a high positive correlation independent of clade, which was not expected from the changes in mononucleotide compositions alone. to confirm the extent of these changes, we also calculated the fold change in composition between the first isolated month and the last examined month (fig. ) and found clear increases and decreases in mono- and oligonucleotide compositions common among the six clades, which supports the results presented in fig. and tables and .

changes towards the sequences of other coronaviruses prevalent in humans

in a previous study of sars-cov- [ ], we analyzed mono- and dinucleotide compositions for the first four epidemic months without separating the sequences by clade. notably, the directional changes shown in figs. and and tables and were fully consistent with the previous results, even when the six clades were separately analyzed. in the previous study, time-series analysis of ebolavirus at the beginning of the epidemic in west africa in also showed directional changes in a group of mono- and dinucleotide compositions, but these directional increases/decreases tended to slow approximately months after the start of the epidemic. the increase/decrease trend for sars-cov- is far from slowing after months, and the next important questions are how long these directional changes in this virus will last and whether there are possible goals for these changes. to conduct this near-future prediction, the following information concerning influenza viruses should be useful.
as mentioned before, mono- and oligonucleotide compositions in influenza h n / changed towards those of seasonal influenza strains such as the h n and h n subtypes [ ]. furthermore, all the human subtypes showed directional changes away from the compositions of all avian influenza a subtypes and closer to those of the human influenza b type, which has been prevalent only in humans [ ]. if we assume that changes similar to those in the influenza virus will occur, the mono- and oligonucleotide compositions of interest for sars-cov- are expected to change towards those of other coronaviruses that have been prevalent in humans and away from those of coronaviruses prevalent in bats. to test this hypothesis, we analyzed the following coronaviruses: human-cov strains (alphacoronaviruses e and nl ; betacoronaviruses hku and oc ) and bat-cov strains (alphacoronaviruses and betacoronaviruses, including the sars virus). as shown in fig. a, we compared the mononucleotide compositions of sars-cov- with those of the human- and bat-cov strains; the data for bat sars among bat-cov strains, which is thought to be the original strain that caused the current covid- pandemic, are marked in pink. interestingly, concerning the human- and bat-cov strains, differences in mononucleotide composition were more pronounced between hosts than between the alpha and beta lineages, and the levels for all six clades of sars-cov- were between those for the two hosts. fig. b shows the results for di- and trinucleotides, for which the directional, time-dependent changes were primarily common among the six clades.
the increases and decreases in nucleotide composition observed for sars-cov- in figs. and are indicated by hollow up and down arrows, respectively. interestingly, all changes of interest tended to move away from the compositions of bat sars and approach those of human-cov, supporting the view that the directional changes of interest have biological significance and are possibly inevitable, as observed for influenza viruses. assuming that approaching the levels in human-cov strains is the hypothetical goal of the directional change of sars-cov- , the current compositions are far from this hypothetical goal (fig. ); therefore, we predict that directional changes of interest will continue in the near future.

then, assuming that the average value for all human-cov strains is a hypothetical goal, we investigated how sars-cov- has approached this possible goal. specifically, we calculated the square of the difference between the composition of each nucleotide in sars-cov- and the average value for human-cov strains and plotted the values of the difference according to the elapsed month for each nucleotide. changes in the compositions of both c and u clearly reduced this difference, as the compositions of these nucleotides approached the hypothetical goal (fig. a); their linear reduction supports the prediction that directional changes in the composition of c and u will continue for the foreseeable future.
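the distance-to-goal calculation described above can be sketched as follows. the helper names `squared_difference` and `pearson` are hypothetical, the use of the pearson coefficient is an assumption (the text says only "correlation coefficients"), and the numbers in the usage note are invented purely to show the calling pattern, not taken from the paper.

```python
import math


def squared_difference(sars_value, human_values):
    """Square of the gap between one monthly SARS-CoV composition value
    and the mean over human-CoV strains (the hypothetical goal)."""
    goal = sum(human_values) / len(human_values)
    return (sars_value - goal) ** 2


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.
    Assumes neither sequence is constant (otherwise the denominator is 0)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

for example, with made-up monthly values `[18.4, 18.3, 18.2, 18.1]` for one nucleotide and made-up human-cov values `[17.0, 17.4]`, the squared differences shrink month by month, so `pearson(months, differences)` is negative; a negative coefficient corresponds to movement towards the human-cov level and a positive one to movement away from it.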
in contrast, a and g did not show directional changes in composition, which is most likely due to the absence of clear differences in the a and g compositions of human- and bat-cov, i.e., there is no possible target (fig. a). fig. b shows examples of di- and trinucleotides whose compositions have moved towards the hypothetical goal, whereas fig. c shows a few exceptional nucleotides whose compositions have not changed towards the hypothetical goal but have changed with a common directionality among the six clades. in fig. d, correlation coefficients between the above difference and the elapsed month are presented. most nucleotides of interest showed a negative coefficient (i.e., a directional change towards human-cov), but three oligonucleotides, gg, agc and cau, showed positive coefficients, indicating an increase in the difference (i.e., movement away from the human-cov level). for these opposing directional changes, certain causes specific to sars-cov- may be assumed.

motifs for rna-binding proteins

next, we considered the mechanisms that move oligonucleotide compositions away from those of bat coronaviruses and closer to those of human coronaviruses. certain human cellular factors involved in viral growth may be candidates in such mechanisms. when considering possible protein factors, oligonucleotides longer than trinucleotides should be a focus. as a first attempt, we focused on host rna-binding proteins because their binding to hepatitis c virus is known to be involved in the growth of this rna virus [ ]. we thus searched for motifs for human rna-binding proteins in coronavirus genomes (see methods section) and found multiple loci with binding motifs
for each protein. table (and table s ) lists the motifs for which a directional time-dependent change was primarily common among six clades. table and fig. a show that only elavl showed a positive correlation, but the other nine proteins in table showed a negative correlation for almost all clades; the results for other motifs are presented in table s . we next compared the numbers of these motifs in sars-cov- with the numbers of human- and bat-cov motifs (fig. b). of the ten proteins shown in table , the only elevated motif, that for elavl binding, was found in a significantly higher number of loci in human-cov than in bat-cov, but motifs for pcbp and srsf binding, which tended to decrease (table ), were found in significantly fewer loci in human-cov. these observations appear to be consistent with the features found in the mono-, di- and trinucleotide compositions of interest. however, unlike these changes, there was significant diversity within even a single clade, which appears to be greater than the differences between hosts, with the possible exception of elavl . long oligonucleotides should carry out a variety of functions, and mutations that accumulate in their functional motifs may have complex effects on the presence of functional motif sequences, so an analysis from a new perspective is likely to become important.

discussion

we first discuss possible molecular mechanisms related to time-dependent directional changes in mononucleotide composition. fig. a shows that the frequency of c tended to decrease in sars-cov- , while that of u tended to increase. since a similar change was previously found for mers and all a-type influenza subtypes [ , ], these changes may have biological significance for a wide range of rna viruses that invade from nonhuman hosts.
one possible mechanism is the host rna-editing function; simmonds ( ) proposed that the c→u hypermutation in sars-cov- may be due to the influence of apobec family proteins in humans [ ]. apobec is an antiviral protein in various animal species, including humans, that can convert c to u by the deamination of c [ - ]. such rna editing is also known to act as a defense mechanism against various viruses, including retroviruses [ ]. the apobec gene family has generated various paralogs during mammalian evolution, with seven known apobec genes in humans and ten in bat families [ - ]. the prevalence of c→u changes in sars-cov- upon transfer of its host environment from bats to humans suggests that these changes may be due to human-specific apobec genes.

we next discuss changes in short oligonucleotides. directional changes in some oligonucleotides, such as gag and gga, cannot be explained by apobec-induced c→u mutations alone. although the evidence is weak, these oligonucleotides are part of the binding motifs of several rna-binding proteins, such as srsf and pcbp (table s ); the number of loci for these motifs has decreased independently of clade. in contrast, among the ten proteins listed in table , only elavl showed an increase in the number of motif loci independently of clade. elavl is an rna-binding protein that binds a- or u-rich elements, and its binding to mrna is known to contribute to rna stability [ , ]; sars-cov- and human-cov, which are prevalent in humans, may contain increased numbers of binding motifs for elavl for efficient growth in the human cellular environment.
however, for further analysis, information on rna-binding proteins in bat cells is needed.

conclusions

in the present study, we found that the compositions of a group of mono- and oligonucleotides in sars-cov- genomes have changed in a host cell-dependent manner. this is fully consistent with our previous findings for influenza a and b viruses [ , , ], supporting the previous prediction that host-dependent directional changes of various mono- and oligonucleotides should inevitably occur in zoonotic rna viruses that have invaded from nonhuman hosts. phylogenetic methods based on sequence alignment [ , ] are well refined and undoubtedly essential for studying the phylogenetic relationships between viruses. the present alignment-free method to analyze mono- and oligonucleotide compositions can also serve as a powerful tool for molecular evolutionary studies of viruses, revealing directional changes in viruses and predicting the possible goals of these changes.

methods

sars-cov- genome sequences

human sars-cov- genome sequences were downloaded from the gisaid database (https://www.gisaid.org/); sequences that were complete, showed high coverage and had been isolated from humans were downloaded on sep , . among the acquired sequences, strains with an unknown isolation month were excluded from the analysis, and the polya tail was removed. a list of all , strains used is provided in table s .
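the two preprocessing steps named above (dropping strains with an unknown isolation month and removing the polya tail) might look like the sketch below. it is not the authors' pipeline: the helper names, the `YYYY-MM-DD` date layout with possibly missing parts, and the `min_run` parameter are assumptions for illustration, and the paper does not state how the tail was delimited.

```python
import re

# Trailing run of A at the very end of the sequence.
_TRAILING_A = re.compile(r"[Aa]+\Z")


def strip_polya_tail(seq, min_run=1):
    """Remove a trailing run of A of at least min_run bases.

    min_run is a hypothetical knob; the criterion used in the paper
    is not given in this text.
    """
    match = _TRAILING_A.search(seq)
    if match and len(match.group()) >= min_run:
        return seq[:match.start()]
    return seq


def has_known_month(collection_date):
    """True if a date string such as '2020-03-15' or '2020-03' carries a month.

    Assumes dates are hyphen-separated with the year first; a bare
    year such as '2020' is treated as an unknown isolation month.
    """
    parts = collection_date.split("-")
    return len(parts) >= 2 and parts[1].isdigit() and 1 <= int(parts[1]) <= 12
```

a genome would then be kept only if `has_known_month(date)` is true, and its sequence passed through `strip_polya_tail` before compositions are computed.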
genome sequences of coronaviruses prevalent in humans or bats

the complete sequences of two types of human coronavirus (human-cov) strains, alphacoronaviruses ( e and nl strains) and betacoronaviruses ( hku and oc strains), were obtained from the ncbi virus database (https://www.ncbi.nlm.nih.gov/labs/virus/). the complete genome sequences of two types of bat coronavirus (bat-cov) strains, alphacoronaviruses ( strains) and betacoronaviruses ( strains, including sars-cov), isolated from three types of bats (chiroptera, vespertilionidae and rhinolophidae) were obtained from the ncbi virus database (https://www.ncbi.nlm.nih.gov/labs/virus/), and the polya tail of each sequence was removed. the strains are listed in table s .

time-series analysis of changes in oligonucleotide compositions

in the time-series analysis, the average mono- and oligonucleotide compositions (%) of viruses collected in each month were calculated for each clade. to avoid statistical fluctuations due to small sample sizes, months in which fewer than strains had been collected were excluded from the monthly analysis.

rna-binding motif analysis

rna-binding motifs were obtained from the attract database [ ]. in this database, multiple binding motifs are registered as corresponding to one rna-binding protein; we calculated the total number of loci containing the binding motifs for each protein in the viral genomes.
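the per-protein locus count described in the motif analysis can be sketched as below. this is an illustration only: the function name `count_motif_loci`, exact string matching, counting overlapping occurrences, and counting a locus once per motif that matches it are all assumptions, and the protein names and motifs in the usage note are placeholders rather than entries from the attract database.

```python
def count_motif_loci(genome, motifs_by_protein):
    """Total number of motif occurrences per RNA-binding protein.

    genome: nucleotide string (T is read as U).
    motifs_by_protein: {protein_name: [motif, ...]} with plain-string motifs.
    Overlapping occurrences are counted; duplicate motifs listed for the
    same protein are counted once.
    """
    genome = genome.upper().replace("T", "U")
    totals = {}
    for protein, motifs in motifs_by_protein.items():
        total = 0
        for motif in {m.upper().replace("T", "U") for m in motifs}:
            if not motif:
                continue  # ignore empty entries
            start = genome.find(motif)
            while start != -1:
                total += 1
                start = genome.find(motif, start + 1)
        totals[protein] = total
    return totals
```

for example, `count_motif_loci("UUUAUUUAUUU", {"protein_x": ["UUUAUUU"], "protein_y": ["GGGG"]})` counts two overlapping loci for the placeholder `protein_x` and none for `protein_y`; applied to each genome, these counts can be averaged per clade and month in the same way as the compositions.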
list of abbreviations
sars-cov- : severe acute respiratory syndrome coronavirus-
human-cov: human coronavirus
bat-cov: bat coronavirus

ethics approval and consent to participate
not applicable

consent for publication
not applicable

availability of data and materials
the sequence datasets analyzed in this study are stored in gisaid. other data are available from yi.

competing interests
the authors declare that there are no conflicts of interest.

funding
this work was supported by jsps kakenhi grant number k , by amed under grant number jp he and by covid- counterplan research project (supervised by prof. tatsumi hirata, nig) from the research organization of information and systems (rois).

authors' contributions
yi conceived the approach and conducted this analysis. ta developed the algorithm. ti supervised this study.

acknowledgements
we gratefully acknowledge the authors who submitted their sequences to gisaid’s database and also the valuable comments of dr. yashushi hiromi of national institute of genetics (mishima). we thank springer nature author services for editing this manuscript for english language.

figure legends

fig. . time-dependent directional changes in nucleotide compositions. (a) average mononucleotide compositions (%) in the sars-cov- genomes of each clade isolated in each month are plotted against the elapsed month. to compare the four
mononucleotides, the scale widths on the vertical axis are set to the same values. the colored lines distinguishing the clade (g, gh, gr, l, v and s) are shown at the bottom of the figure. the dashed line shows the averaged compositions for all strains isolated in each month. (b) the average di- and trinucleotide compositions that primarily undergo common directional changes among the six clades are plotted against the elapsed month.

fig. . fold changes in nucleotide composition between the epidemic start and the last month of analysis. a bar plot shows the fold change in composition of each mono- or oligonucleotide; this value was calculated by dividing the nucleotide composition in the last month of analysis by that at the start of the epidemic. each bar is colored to indicate the clade, as described in fig. . since we analyzed strains belonging to different clades separately, data from the first or last month differed among clades; see also the methods section.

fig. . nucleotide compositions of human and bat coronavirus sequences. a boxplot shows the nucleotide compositions in human-cov (alpha e, alpha nl , beta hku and beta oc ), bat-cov (bat sars, alphacoronavirus and betacoronavirus) and sars-cov- strains. bat sars are marked pink. a hollow arrow indicates the direction of change in oligonucleotide composition observed for sars-cov- in figs. and . (a) mononucleotides. to compare the four mononucleotides, the scale widths on the vertical axis are set to the same values. (b) di- and trinucleotides.

fig. . differences in nucleotide composition between sars-cov- and human-cov. (a) values for the square of the difference in mononucleotide composition between sars-cov- isolated in each month and human-cov are plotted against the
elapsed month. the data are presented as colored or dashed lines, as described in fig. . (b and c) oligonucleotide compositions that approach (b) and move away from (c) those of human-cov are presented. (d) the correlation coefficients between the elapsed month from the start of the epidemic and the above differences are presented for mono- and oligonucleotides whose directionality of change is common among six clades. the results for a and g mononucleotides, which show nondirectional change, are also presented.

fig. . time-dependent changes in the numbers of rna-binding motif loci. (a) the numbers of loci containing rna-binding motifs per genome are plotted against the elapsed month. here, we selected rna-binding proteins for which the number of motif loci increased or decreased by at least one for all six clades from the epidemic start. the data are presented as colored or dashed lines, as described in fig. a. (b) a boxplot shows the number of loci containing rna-binding motifs in human-cov (alpha e and nl ; beta hku and oc ), bat-cov (bat sars, alphacoronavirus and betacoronavirus) and sars-cov- strains. bat sars are marked pink. a hollow arrow indicates the direction, shown in fig. a, in which the number of motif loci in sars-cov- changed.

table . correlation coefficients for time-dependent changes in mono- and oligonucleotide compositions in sars-cov- that have increased.

table . correlation coefficients for time-dependent changes in mono- and oligonucleotide compositions in sars-cov- that have decreased.
table . the number of motif-containing loci for rna-binding proteins whose occurrences have increased or decreased between strains of the first and last month of the analysis.
additional file
fig. s : average di- and trinucleotide compositions (a and b) of sars-cov- strains collected in each elapsed month.
fig. s : oligonucleotide compositions of human and bat coronavirus sequences.
fig. s : differences in oligonucleotide composition between sars-cov- and human-cov.
additional file
table s : list of sars-cov- strains used in the analysis.
table s : list of human- and bat-cov strains used in the analysis.
table s : number of sars-cov- strains in each clade isolated in each elapsed month.
table s : average oligonucleotide compositions for sars-cov- strains in each clade isolated in each elapsed month.
table s : correlation coefficients for time-dependent changes in oligonucleotide compositions of sars-cov- .
table s : fold change in compositions between strains of the first and last month of the analysis.
table s : distance between the oligonucleotide composition of sars-cov- isolated in each elapsed month and that of human-cov.
table s : correlation coefficients for time-series changes in the distance between oligonucleotide compositions of sars-cov- and human-cov.
table s : list of rna-binding motifs.
table s : numbers of motif-containing loci for rna-binding proteins whose abundance increases or decreases between strains of the first and last month of the analysis.
table s : p-value from t-test to analyze the number of rna-binding motif loci whose abundance increases or decreases between strains of the first and last month of the analysis.
table s : correlation coefficients for time-dependent changes in the number of loci containing rna-binding motifs.
table
        clade g   clade gh   clade gr   clade l   clade v   clade s
u       .         .          .          .         .         .
ua      .         .          .          .         .         .
auu     .         .          .          .         .         .
cau     .         .          .          .         .         .
ugu     .         .          .          .         .         .
uua     .         .          .          .         .         .
uug     .         .          .          .         .         .
uuu     .         .          .          .         .         .
table
        clade g   clade gh   clade gr   clade l   clade v   clade s
c       -.        -.         -.         -.        -.        -.
ag      -.        -.         -.         -.        -.        -.
ca      -.        -.         -.         -.        -.        -.
cc      -.        -.         -.         -.        -.        -.
cu      -.        -.         -.         -.        -.        -.
ga      -.        -.         -.         -.        -.        -.
gg      -.        -.         -.         -.        -.        -.
uc      -.        -.         -.         -.        -.        -.
agc     -.        -.         -.         -.        -.        -.
ccc     -.        -.         -.         -.        -.        -.
gac     -.        -.         -.         -.        -.        -.
gag     -.        -.         -.         -.        -.        -.
gga     -.        -.         -.         -.        -.        -.
table
        clade g   clade gh   clade gr   clade l   clade v   clade s
ptbp    -.        -.         -.         -.        -.        .
hnrnpl  -.        -.         -.         -.        -.        .
nova    -.        -.         -.         -.        -.        .
srsf    -.        -.         -.         -.        -.        .
zfp     .         -.         -.         -.        -.        .
hnrnpa  -.        -.         -.         -.        -.        .
elavl   .         .          .          .         .         .
tia     -.        -.         -.         -.        -.        .
pcbp    -.        -.         -.         -.        -.        .
srsf    -.        -.         -.         -.        -.        .
title: debar, a sequence-by-sequence denoiser for coi- p dna barcode data
authors: cameron m. nugent*, tyler a. elliott, sujeevan ratnasingham, paul d. n. hebert, sarah j. adamowicz
department of integrative biology, university of guelph, guelph, ontario, canada
centre for biodiversity genomics, university of guelph, guelph, ontario, canada
*corresponding author: nugentc@uoguelph.ca
biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc . international license (http://creativecommons.org/licenses/by-nc/ . /).
abstract
dna barcoding and metabarcoding are now widely used to advance species discovery and biodiversity assessments. high-throughput sequencing (hts) has expanded the volume and scope of these analyses, but elevated error rates introduce noise into sequence records that can inflate estimates of biodiversity.
denoising of barcode and metabarcode data, the separation of biological signal from instrument (technical) noise, currently employs abundance-based methods which do not capitalize on the highly conserved structure of the cytochrome c oxidase subunit i (coi) region employed as the animal barcode. this manuscript introduces debar, an r package that utilizes a profile hidden markov model to correct indel errors introduced into coi sequences by instrument error. in silico studies demonstrated that debar recognized % of artificially introduced indels in coi sequences. when applied to real-world data, debar reduced indel errors in circular consensus sequences obtained with the sequel platform by %, and those generated on the ion torrent s by %. the false correction rate was less than . %, indicating that debar is receptive to the majority of true coi variation in the animal kingdom. in conclusion, the debar package improves dna barcode and metabarcode workflows by aiding the generation of more accurate sequences, which in turn aids the characterization of species diversity.
keywords: coi, dna barcode, metabarcode, denoising, markov model, biodiversity
introduction
motivated by global biodiversity decline, conservation policies and strategies are being implemented to mitigate extinction rates (driscoll et al. ; baynham-herd et al. ). accurate assessments of biodiversity and its change over time are critical to support conservation strategies, to remediate environmental damage, and to manage natural resources, but this information is lacking for most ecosystems (sogin et al. ; hajibabaei et al. ; hebert et al. ; d’souza & hebert ).
dna barcoding provides a technological solution to the problem of identifying organisms and characterizing biodiversity (hebert et al. ; hubert & hanner ). instead of identifying specimens through morphological study, standardized dna regions, termed dna barcodes, are used to identify specimens belonging to known species and to recognize new taxa. reflecting advances in sequencing technology, dna barcode studies are expanding in scale from analyzing single specimens to characterizing bulk samples, an approach termed metabarcoding, as well as multi-marker and metagenomics approaches (taberlet et al. ; cristescu ; hajibabaei et al. ; wilson et al. ). these advances are providing newly detailed information on species diversity in different geographic regions and habitats (hajibabaei et al. ; hebert et al. ; delabye et al. ; lopez-vaamonde et al. ) while also aiding the identification of invasive species (brown et al. ; xu et al. ), food web analysis (wirta et al. ; kanuisto et al. ), and environmental monitoring (hajibabaei et al. ; stat et al. ; cordier et al. ).
despite the broad adoption of dna barcoding and metabarcoding, a fundamental problem persists. efforts to quantify biodiversity from barcode and metabarcode data can be strongly affected by analytical methodology (clare et al. ; braukmann et al. ). for example, if high-throughput sequence (hts) data are cleaned suboptimally, the estimated number of taxa may be grossly inflated as variation introduced by sequencing (technical) errors is interpreted as biological variation (hardge et al.
). to reduce the impact of technical errors, sequence reads are often clustered into operational taxonomic units (otus) at specific identity thresholds (elbrecht et al. ). several software packages have attempted to increase the accuracy of this otu method by separating biological signal from technical noise (rosen et al. ; callahan et al. ; edgar ; amir et al. ; elbrecht et al. ; kumar et al. ; nearing et al. ). many standard denoisers, such as dada (callahan et al. ), deblur (amir et al. ), and unoise (edgar ), utilize cluster-based approaches, custom error models, or pre-clustering algorithms to account for and correct technical errors. comparative studies have shown that all three of these methods outperform threshold-based otu-clustering approaches (nearing et al. ). it has also been shown that they produce similar estimates of species richness and relative abundance, but significantly different values for alpha diversity (intra-habitat diversity) and the number of unique exact sequence variants (esvs) (nearing et al. ).
when a highly conserved protein-coding region, such as cytochrome c oxidase subunit i (coi), is employed as the barcode, structural information can be leveraged to improve denoising. the adoption of this approach can improve the accuracy of alpha-diversity estimates and the quality of identified barcode sequences by ensuring barcodes conform to biological reality. additionally, rare sequences or important intra-species variants need not be discarded based solely on their abundance and can be retained with higher confidence if they conform to the expected gene structure. this latter benefit will be particularly valuable for work on hyper-diverse communities (e.g. tropical insects) and for analyses of metabarcode data, where uneven sampling is often the norm and the resolution of intra-species variation is challenging (elbrecht et al. ; nearing et al. ; braukmann et al. ; zizka et al. ).
hidden markov models (hmms) are probabilistic representations of sequences that allow unobserved (hidden) states to be inferred through the observation of a series of non-hidden states (durbin et al. ; wilkinson ). hmms have been applied widely in the analysis of biological sequences, in areas such as sequence alignment and annotation (durbin et al. ; eddy ). profile hidden markov models (phmms) are a variant well suited for the representation of biological sequences with a shared evolutionary origin (durbin et al. ; eddy , ). they are probabilistic models that contain position-specific information about the likelihood of potential characters (base pairs or amino acid residues) at the given position in the sequence (emission probabilities) and the likelihood of the observed character given the previously observed character in the sequence (transition probabilities). once a phmm is trained on a set of sequences, the viterbi algorithm can be used to obtain the path of hidden states that aligns a novel sequence to the phmm (durbin et al. ). the viterbi path consists of hidden match states (indicating that the observed character matches a position in the phmm) and non-match states: either inserts or deletions. in the context of error correction, hidden non-match states identify the most likely positions at which novel sequences deviate from the phmm’s statistical profile. in this manner, individual sequences can be queried for evidence of insertion or deletion (indel) errors and adjusted in a statistically informed manner.
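the decoding step described above can be illustrated with a minimal sketch of the viterbi algorithm. the sketch uses a generic two-state hmm, not debar's profile hmm: it has no position-specific states and no silent delete states, and the state names, probabilities and the function name `viterbi` are all assumptions chosen for illustration.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for `obs` (log-space Viterbi)."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # choose the predecessor state that maximises the path score
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical toy model: "match" strongly expects A, "insert" mostly emits C.
STATES = ("match", "insert")
START_P = {"match": 0.9, "insert": 0.1}
TRANS_P = {
    "match": {"match": 0.9, "insert": 0.1},
    "insert": {"match": 0.5, "insert": 0.5},
}
EMIT_P = {
    "match": {"A": 0.99, "C": 0.01},
    "insert": {"A": 0.1, "C": 0.9},
}

if __name__ == "__main__":
    print(viterbi("AACAA", STATES, START_P, TRANS_P, EMIT_P))
```

in this toy model the c at the third position is assigned to the insert state while the flanking positions stay in the match state, which mirrors how a non-match state in the viterbi path flags a candidate indel.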
the conserved protein-coding structure of the most common animal barcode gene, coi, and the wealth of available training sequences (ratnasingham & hebert ) for this region have allowed phmms to be successfully applied in the detection of technical errors in novel barcode sequences (nugent et al. ). correction of technical indel errors in data from protein-coding barcode sequences is an important development as it maximizes the likelihood that both the nucleotide and amino acid sequences correspond to the true biological sequence. mitigation of indels arising due to technical errors also makes sequence reads from a given specimen more directly comparable, allowing low-frequency point mutations to be eliminated when multiple reads are available for a given biological sequence.
here, we aim to extend the use of phmms in coi data processing to allow for the sequence-by-sequence correction (denoising) of technical errors.
this study had four primary goals: (i) design a denoising tool for coi barcode data that utilizes phmms to identify and correct insertion and deletion errors resulting from technical error; (ii) test the tool’s performance and optimize its default parameters by denoising a set of , barcode sequences with artificially introduced indel errors; (iii) develop, implement, and evaluate a workflow for denoising dna barcode data produced through single-molecule, real-time (smrt) sequencing of , specimens on the sequel platform (pacific biosciences); and (iv) denoise a dna metabarcode mock community data set using debar and evaluate the improvement in quality of consensus sequences and the ability to resolve intra-otu haplotype variation. the denoiser resulting from this work, debar (denoising barcodes), is a free, publicly available package written in r that is available through cran (https://cran.r-project.org/package=debar) and github (https://github.com/cnuge/debar).
materials and methods
implementation
the debar utility includes several customizable steps which denoise dna barcode and metabarcode data (figure ; supplementary file ). corrections with debar are based upon the comparison of input sequences with a nucleotide-based profile hidden markov model (phmm) (model training detailed in nugent et al. ) using the viterbi algorithm (durbin et al. ).
briefly, debar’s phmm was trained using a curated set of , coi- p barcode sequences obtained from the barcode of life data systems (bold: www.boldsystems.org) public database that were checked to ensure: (i) the sequence was > bp in length, (ii) taxonomy was known to a genus level, (iii) there were no missing base pairs, (iv) the amino acid sequence did not contain stop codons, and (v) bold’s internal check for contaminants was negative (nugent et al. ). the viterbi path produced through alignment of the sequence to the phmm is used to match the input sequence to the phmm (by finding the first set of consecutive match states which indicate the absence of indels for the given base pairs). the read is then adjusted to account for detected insertions or deletions (figure ). three consecutive nucleotide insertions or deletions are permitted (not adjusted) as sequences of this kind are more likely to reflect true biological variants than technical errors (they do not result in reading frame shifts and may reflect an insertion or deletion of an amino acid in a functional protein-coding gene). the probability of such changes through sequencing error is relatively low (e.g. for the pacific biosciences sequel platform the baseline probability of three consecutive deletions would be . % (baseline delete probability) cubed, or . %).
the denoising of sequences with debar is controlled using a suite of parameters (figure ). the censorship parameter is the most important, as it controls the size of the masks (replacement of nucleotides with placeholder n characters) applied around sequence adjustments. this option is designed to prevent the introduction of errors that would be caused if the denoising process deleted the wrong base pair or inserted a placeholder in the incorrect position. derivation of the default value for the censorship parameter is detailed in the methods and results sections. the package also enables the translation of denoised sequences to amino acids to confirm that denoised outputs conform to the expected properties of the protein-coding gene region. because debar can interface directly with fasta and fastq files, it enables file-to-file denoising in addition to denoising within an r programming environment.
the default phmm used for denoising by debar represents the complete bp barcode region of coi. the package also permits the use of customized phmms provided by a user, which allows the denoiser to be applied to data from other gene regions or to be targeted to a specific user-defined subsection of the coi barcode. training of a phmm for a new barcode or gene is supported by the r package aphid (wilkinson ), while sub-setting of debar’s default phmm is enabled by the r package coil (nugent et al. ). details of the package’s components, together with a demonstration of its implementation, are available in the package’s vignette (supplementary file ).
quantification of package performance
simulated error data
the debar package was tested using a phylogenetically stratified random sample of publicly available coi- p sequences with artificially introduced indels. this test was designed to assess the accuracy of sequence corrections and to obtain a quantitatively informed set of default parameters for the denoising process. a random sample of , animal coi- p sequences (excluding those used in phmm model training) was obtained from bold and cleaned using the steps described in nugent et al. (methods section: bold data acquisition).
errors were introduced into each sequence in accordance with the statistical error profile of the pacific biosciences sequel for the coi barcode region reported in hebert et al. ( ). this profile indicated a baseline indel rate of . % (insertions and deletions equally likely), a baseline substitution rate of . %, and an elevated indel rate for long homopolymers (repeat length of , , and + with indel probabilities of . %, . %, and . %, respectively) (hebert et al. ). the location of all errors was recorded so that the accuracy of subsequent corrections could be evaluated. sequences were iteratively processed, and errors were limited to a single insertion or deletion of one base pair in length (with the error introduction process being repeated for the original sequence when more than one indel occurred), which allowed the accuracy of corrections to be assessed without the need to consider interaction effects. the resultant sequences, each with one indel, were then denoised with debar (‘denoise’ function, using the parameter censor_length = ). the outputs of the denoise function were queried to determine the number and location of indel corrections applied by debar.
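the error-introduction procedure above can be sketched as follows. this is an illustrative sketch and not the authors' simulation code: the function names, the rate values and the homopolymer length thresholds are hypothetical (the values used in the study are not reproduced here), substitutions are omitted, and draws that produce zero indels are assumed to be repeated in the same way as draws that produce more than one.

```python
import random

def homopolymer_run_lengths(seq):
    """For each position, return the length of the homopolymer run containing it."""
    lengths = [0] * len(seq)
    i = 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        for k in range(i, j):
            lengths[k] = j - i
        i = j
    return lengths

def indel_rate(run_length, baseline, homopolymer_rates):
    """Per-base indel probability; long homopolymers use an elevated rate."""
    rate = baseline
    # thresholds are visited in ascending order, so the longest one met wins
    for min_len, elevated in sorted(homopolymer_rates.items()):
        if run_length >= min_len:
            rate = elevated
    return rate

def introduce_single_indel(seq, rng, baseline, homopolymer_rates, max_tries=100000):
    """Redraw errors on the original sequence until exactly one 1-bp indel occurs.

    Returns (mutated_sequence, position, kind), kind being "insertion" or "deletion".
    The position is recorded so that later corrections can be scored against it.
    """
    runs = homopolymer_run_lengths(seq)
    for _ in range(max_tries):
        hits = [i for i, r in enumerate(runs)
                if rng.random() < indel_rate(r, baseline, homopolymer_rates)]
        if len(hits) != 1:
            continue  # zero or several indels: start again from the original
        pos = hits[0]
        if rng.random() < 0.5:  # insertions and deletions equally likely
            return seq[:pos] + rng.choice("ACGT") + seq[pos:], pos, "insertion"
        return seq[:pos] + seq[pos + 1:], pos, "deletion"
    raise RuntimeError("no single-indel draw within max_tries")

if __name__ == "__main__":
    rng = random.Random(1)
    # hypothetical rates: 1% baseline, elevated for runs of 6+ and 8+ bases
    print(introduce_single_indel("ACGTTTTTTTACGATCGGGGGGGGATCA", rng, 0.01, {6: 0.03, 8: 0.06}))
```

passing a seeded `random.Random` instance keeps the simulated data set reproducible, and returning the position and kind of each error gives the ground truth against which corrections are evaluated.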
this information was compared to the recorded ground truth error locations to quantify the following: (i) the frequency with which debar located and exactly corrected indels, (ii) the miss distance (number of nucleotide positions) between introduced errors and corrections applied in instances where debar did not correct the indel errors in exactly the correct position, and (iii) the frequency at which debar applied an incorrect number of sequence corrections (i.e. correction or + corrections). if one correction was made and the distance between the correction and true indel position was , then the correction was considered accurate. corrections were also considered accurate if all base pairs between the correction location and the true indel position were the same (e.g. if base pair in the homopolymer "ttttt" was an insertion, but the th t in the sequence was removed by debar, this is functionally an exact correction as the true sequence is restored). all other corrections at inexact positions were considered inaccurate, and the distance (number of positions) between the correction and true indel location was recorded. the mean and standard deviation of the miss distance were determined and used to select the default censor_length parameter for the debar package, equal to the mean miss distance plus standard deviations (censor_length = ceiling(μ_miss_distance + ( × σ_miss_distance))). this value was selected as it would be expected to avoid the introduction of an error for > % of inexact corrections.
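the evaluation rules above can be sketched with three small helpers. this is a hedged illustration, not debar's code: the function names are hypothetical, the multiplier `k` stands in for the number of standard deviations used in the study, the use of the sample standard deviation is an assumption, and the symmetric mask in `mask_around` is an assumed reading of how the censorship parameter is applied.

```python
import math
import statistics

def is_functionally_exact(seq, true_pos, corrected_pos):
    """True if every base between the true indel and the correction is identical.

    Removing any one base of a homopolymer run restores the same sequence, so a
    correction elsewhere in the run that holds the error counts as accurate.
    """
    lo, hi = sorted((true_pos, corrected_pos))
    return len(set(seq[lo:hi + 1])) == 1

def default_censor_length(miss_distances, k):
    """ceiling(mean + k * sd) of the miss distances (sample sd is assumed)."""
    mu = statistics.mean(miss_distances)
    sigma = statistics.stdev(miss_distances)  # needs at least two values
    return math.ceil(mu + k * sigma)

def mask_around(seq, pos, censor_length):
    """Replace the bases within `censor_length` of `pos` with placeholder N characters."""
    lo = max(0, pos - censor_length)
    hi = min(len(seq), pos + censor_length + 1)
    return seq[:lo] + "N" * (hi - lo) + seq[hi:]

if __name__ == "__main__":
    print(is_functionally_exact("ACTTTTTG", 2, 5))
    print(default_censor_length([1, 2, 3], k=2))
    print(mask_around("ACGTACGTAC", 4, 2))
```

a wider mask hides more true bases but makes it less likely that an inexact correction leaves a wrong base exposed, which is the trade-off the default value is meant to balance.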
sequences where no corrections or multiple corrections were made had their outputs inspected further to determine if other parts of the denoising pipeline (e.g. the check for stop codons in the translated amino acid sequence or trimming of sequence edges in the framing process) removed the error or led to the complete rejection of the sequence.

false correction rate

the performance of debar on sequences with no indel errors was also quantified to determine the frequency and cause of erroneous corrections applied to cleaned, publicly available coi- p barcode sequences with no known technical errors. a random sample of , sequences meeting the following criteria was obtained from all the animal coi- p barcode sequences available on bold (supplementary file ): ) the barcode was publicly available on the bold database, ) the barcode was > bp in length, ) the barcode did not contain missing characters (“n”) in the folmer region, ) the corresponding amino acid sequence did not contain stop codons, ) the result of bold’s internal check for contaminants was negative, and ) the sequence was not used in phmm training or the simulated error dataset. sequences were processed using debar’s denoise function (censor_length = ). all sequences that had corrections applied, or that were flagged for rejection, were counted and examined in detail to search for evidence of the proximal cause of the false correction. to search for evidence of taxonomic bias, the taxonomy associated with all falsely corrected sequences was tallied at the order level and manually examined.
denoising pacbio sequel data

we quantified the performance of debar on raw dna barcode sequence data by interfacing with the existing mbrave workflow (http://www.mbrave.net) used to process dna barcode circular consensus sequences (ccs) obtained with the sequel platform. a custom analysis pipeline (supplementary file ) was constructed to analyze and denoise the final set of ccs barcodes produced by the mbrave workflow (one ccs per otu) (figure ). the pipeline was designed to search the final barcodes produced by mbrave for evidence of indel errors (by considering the translated amino acid sequence with the r package coil (nugent et al. )), denoise all the associated ccs with detected errors using the debar package, and then regenerate a consensus barcode sequence using the denoised data to produce a final, denoised barcode sequence for each specimen (figure ). the outputs of this analysis were examined to determine whether the debar pipeline decreased the number of technical errors in the barcode sequences and whether those barcode sequences resulted in likely amino acid sequences when translated. initial quantification of the improvement was conducted by comparing the number of barcode sequences whose amino acid sequences were flagged by the r package coil (nugent et al. , default parameters) before and after denoising. barcodes are flagged by coil when they possess a stop codon when translated to amino acids or when the resultant amino acid sequence is improbable, both indicating that the sequence likely possesses an indel error.
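the stop-codon signal used for flagging can be illustrated with a minimal frame scan. this sketch is not coil's implementation: it skips the phmm-based framing and the amino acid likelihood score, and the default stop set (taa and tag, the stop codons of the invertebrate mitochondrial code, ncbi translation table 5) is an assumption that should be replaced by the stop set of the table appropriate to the taxon.

```python
# Assumption: invertebrate mitochondrial stop codons (NCBI translation table 5).
STOP_CODONS = frozenset({"TAA", "TAG"})

def frames_with_stop(seq, stop_codons=STOP_CODONS):
    """Return the forward reading frames (0, 1, 2) that contain a stop codon."""
    seq = seq.upper()
    hits = []
    for frame in range(3):
        # Step through complete codons only; a trailing partial codon is ignored.
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        if any(codon in stop_codons for codon in codons):
            hits.append(frame)
    return hits

def stop_in_all_forward_frames(seq, stop_codons=STOP_CODONS):
    """True when no forward frame translates without a stop codon."""
    return len(frames_with_stop(seq, stop_codons)) == 3
```

a sequence with a stop codon in every forward frame cannot be rescued by choosing a different reading frame, which is why that subset gives the most conservative evidence of an indel.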
since the coil and debar packages both employ the same nucleotide profile hidden markov model (coil also utilizes an amino acid phmm), an independent test of pipeline effectiveness was also conducted. the effectiveness of the denoising pipeline was quantified by submitting both the original and denoised barcode sequences to bold. bold was used to determine the number of original barcodes and denoised barcodes with evidence of stop codons after aligning the sequences using bold’s hidden markov model (a model developed independently of the debar phmm) and translating the sequence using the appropriate translation table corresponding to the taxonomic information accompanying the sequence record. comparison of these numbers made it possible to quantify the increase in barcode-compliant sequences (i.e. those with no stop codon) produced by debar. additionally, the sequence quality report on bold was examined to determine the number of unknown nucleotides (“n”) in the barcode sequences after denoising. the report categorizes barcode quality as: high (< % ns), medium (< % ns), low (< % ns), or unreliable (> % ns), and the number of barcodes in these different categories was recorded.

denoising metabarcode data

to characterize debar’s performance on metabarcode data, we analyzed a metabarcode dataset for a mock arthropod community (braukmann et al. ). these data were derived from a single sequencing run on an ion torrent s of coi amplicons generated from pooled dna extracts from abdomens from single specimens of arthropod species (methods described in detail in braukmann et al. ). sequences were from a bp fragment of the coi barcode region targeted using the primers mlepf and lepr (hebert et al. ; braukmann et al. ). following amplification and sequencing on the ion s , quality control, sequence dereplication,
chimeric read filtering, matching to reference sequences, and clustering were performed on mbrave (braukmann et al. ). two sets of data resulted from this process: a set of , unique sequences that were assigned to different barcode index numbers (bins) (ratnasingham and hebert ) through comparison to reference sequences (matched at > % similarity), and a set of , unique sequences not matching available references that were clustered into an additional , otus at a % similarity threshold (using the clustering algorithm described in braukmann et al. ). all sequences were denoised using debar’s denoise_list function and a custom nucleotide phmm. the custom phmm was a bp subset of the complete coi phmm (phmm profile positions – ), corresponding to a segment of the folmer (folmer et al. ) region targeted by the metabarcoding primers. the phmm was created using coil’s ‘subsetphmm’ function (nugent et al. ). after denoising, two tests were conducted to determine if denoising improved the quality of the metabarcode pipeline’s output data. first, for each bin and otu, consensus sequences were generated using denoised sequences and the debar function ‘consensus_sequence’. these consensus sequences were assessed for evidence of stop codons using coil and the same custom phmms used in denoising (function coi p_pipe with the additional parameter: trans_table = ). this test revealed the number of denoised consensus sequences which contained a stop codon when translated to amino acids, indicating that an indel error persisted in the nucleotide sequence.
the centroid sequences for the bins and otus were used as a baseline metric for the number of barcode-compliant sequences. for each bin, centroid sequences were obtained by clustering the sequences in the group using the r package kmer’s ‘otu’ function (parameters: k = , threshold = . ) (wilkinson , version . . ). for the otus, centroids were obtained from data generated by mbrave. all centroids were assessed with coil (nugent et al. , version . ), and the number of barcode-compliant representative sequences for the original centroids and the final consensus sequences was compared. secondly, the individual sequences within each bin and otu were analyzed with coil to determine the number that were likely error free, as evidenced by the absence of stop codons after translation. this assessment was repeated on the denoised reads to determine the effectiveness of debar in correcting errors in individual sequences and to reveal if the denoising process improved the resolution of esvs for subsequent analysis of intra-species genetic variation by placing the esvs in reading frame and reducing the frequency of identified indel errors.

results

quantification of package performance

simulated error data

debar was used to correct , barcodes, each with a single indel error (supplementary file ). the denoised sequences and associated data were compared to the ground truth error locations to determine the accuracy of corrections applied by debar (figure ). for , sequences ( .
%), a single correction was applied by debar, indicating that the package correctly identified the type of error in these sequences. however, debar either failed to recognize an indel or made too many corrections ( +) in the other sequences. no correction was made for most ( ) of these sequences, meaning that debar’s phmm did not identify the indel error. the overlooked indels were largely restricted to the terminal regions of the sequence; % ( / ) of them were positioned within base pairs of the read termini (figure ), regions that only comprised % ( bp/ bp) of the sequences. this occurs because the debar denoising algorithm uses the first observation of consecutive bp matching to the phmm to establish the corrective window. errors on the periphery of sequences therefore lead to trimming of the sequence (via the keep_flanks function) instead of indel correction. a substantial fraction of the remaining uncorrected indel errors ( ) occurred between positions to (figure ), a region associated with a bp indel present in some animal groups and absent in others. its presence reduced the phmm’s indel detection ability in this region due to greater true variability. not all unidentified indels were retained in the final output sequences, as the double checks of debar (employing the keep_flanks and aa_check parameters) identified many ( / – %) of the uncorrected sequences and either omitted the problem region or flagged the sequence as likely to contain an error. therefore, debar’s double checks allow many false negatives to be trimmed or flagged as problematic. for sequences ( .
%), two or more corrections were applied by debar when only a single indel existed (figure ). in contrast to the false negatives, debar’s double checks only captured three of the false positives. many of the false corrections appeared to be caused by the presence of indels near codons that are not present in all animals. due to true biological variation in the training data, these regions of the phmm have higher probabilities of transitioning from a match state to an insert or delete state, and therefore indels in these locations are sometimes handled incorrectly (i.e. the sequence is characterized as having two deleted base pairs, when there was a bp insertion). because false corrections of this type result in sequences that conform to the structure of the protein-coding gene region (i.e. a lack of stop codons in the amino acid sequence), they are not identified by debar’s aa_check function. the , sequences for which the presence of a single indel was correctly identified were further analyzed to determine how accurately they were located (figure ). the analysis showed that debar was able to exactly locate and correct , ( . % of sequences in the single correction category) of the indel errors in the dataset. for the other , sequences ( . % of the single correction category), the indel corrections were not placed in exactly the correct position (figure ). for these sequences, the average distance between the true indel location and the applied correction was . base pairs (standard deviation = . ).
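the way a censorship length follows from these miss-distance statistics, and how it is then applied as a mask, can be sketched as below. the sketch makes several assumptions that the text does not settle: the sample standard deviation is used (rather than the population form), the corrected position itself is masked together with its flanks, and both function names are hypothetical rather than debar functions.

```python
import math
import statistics

def default_censor_length(miss_distances, n_sd=2):
    """Mean miss distance plus n_sd standard deviations, rounded up."""
    mu = statistics.mean(miss_distances)
    sigma = statistics.stdev(miss_distances)  # assumption: sample SD
    return math.ceil(mu + n_sd * sigma)

def censor(seq, correction_pos, censor_length):
    """Replace the corrected position and its flanks with 'N' placeholders."""
    start = max(0, correction_pos - censor_length)
    end = min(len(seq), correction_pos + censor_length + 1)
    return seq[:start] + "N" * (end - start) + seq[end:]
```

a wider mask hides more inexactly placed corrections but discards more true bases per read, which is the tradeoff the censorship parameter controls.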
these results were used to select a default censorship value for debar to ensure that inexactly identified indel errors are masked in most sequences (figure ). a default censorship length of (the average miss distance plus two times the standard deviation, rounded up) was selected in order to mask the true error in > % of instances where inexact corrections were applied, thereby successfully denoising sequences, albeit with some associated loss of information in the sequences, which can be overcome by building a consensus sequence when multiple reads are available for an individual. overall, denoising of the , barcodes with the default censorship parameter (censor_length = ) resulted in , / , ( . %) of sequences with errors being successfully denoised. the additional double check parameters (aa_check = true, keep_flanks = false) captured, but did not correct, ( . %) errors. the debar package thereby corrected or removed . % of sequences with indel errors (figure ).

false correction rate

a set of , barcode sequences with no known indel errors was analyzed with debar to determine the incidence of erroneous corrections. nearly all sequences ( . %) were neither altered nor flagged as erroneous. nine sequences were erroneously corrected, and none were flagged for rejection. these sequences included a single sequence from each of five orders and four sequences from the order diptera (flies). interestingly, the four diptera sequences that were incorrectly altered all belonged to the same genus: culicoides.
they represented / of all sequences from the family ceratopogonidae that were in the dataset, indicating that the performance issue was isolated to this single genus. these results indicate that debar deals well with variation in coi sequences across most of the animal kingdom, but that it displays some taxonomic bias in performance. this is a limitation of debar, as any genus with a coi profile that systematically deviates from the coi phmm used in debar will be erroneously denoised. the benefit of the conservative censorship approach used in the package is that although these reads are erroneously adjusted, the corrections made are masked by ns, and the entire sequence is not rejected. rather, only a small section of the sequence is lost, as if it contained an indel error. most of any falsely corrected sequence can thereby be recovered, and in most instances, this would be sufficient to identify associated taxonomy and inform biological conclusions.

denoising pacbio sequel data

we applied debar in the analysis of real dna barcode data by developing a processing pipeline (figure ; hereafter ‘the debar pipeline’) and compared the amount of technical noise in the barcodes before and after processing. a set of , consensus barcode sequences derived from processing data from four sequel runs was obtained from mbrave and re-processed with the debar pipeline (table ). analysis of the consensus barcodes with coil (step ii. of the debar pipeline) flagged , ( .
% of total) of consensus sequences due to the detection of a stop codon in the translated sequence or due to the presence of an unexpected amino acid (log likelihood score below the default threshold). the large number of flagged sequences likely includes false positives (sequences that lack indel errors but were flagged by coil due to the incorrect establishment of reading frame). in fact, , sequences ( . % of total, . % of flagged sequences) were flagged due to the presence of a stop codon, and , of them ( . % of total, . % of flagged sequences) contained a stop codon in all three forward reading frames, providing extremely strong evidence of an indel error (i.e. a low likelihood of being a false positive). after denoising, the output sequences were again assessed with coil (step viii. of the debar pipeline) and this analysis revealed that debar had corrected many indel errors (table , table ). only , ( . %) of the final barcode sequences were flagged by coil’s coi p_pipe function, suggesting that . % ( , ) of the flagged sequences were successfully denoised. when comparison was restricted to the , sequences with stop codons, only were still flagged as containing stop codons, indicating that . % ( , / , ) of the sequences in this subcategory were effectively denoised. a more conservative estimate of correction success was provided by the subset of flagged sequences with stop codons in all reading frames. of these sequences, / ( . %) passed the coil check following denoising, suggesting the successful correction of an indel error and improved representation of the true sequence. external quantification of the debar pipeline’s denoising ability was obtained by the submission of pre- and post-pipeline barcode sequences to bold (http://www.boldsystems.org). the sample size for this test was smaller as bold requires taxonomic designations and this information was only provided by mbrave for , sequences. the total number of original
sequences flagged by bold due to its detection of a stop codon was , ( . %), a considerably lower frequency than reported by coil on the initial pipeline inputs. of the , sequences with initial evidence of stop codons, were rejected outright by the debar pipeline, were flagged but not successfully corrected, were unflagged and not corrected, and , had no evidence of errors following denoising (table ). based on this assessment with bold, the debar pipeline produced a % reduction in the number of errors in the dataset from . % ( , ) to . % ( ). of the remaining errors, the majority ( ) were detected as problematic and flagged as erroneous by debar. as a consequence, the debar pipeline reduced the number of unidentified errors by > % (from , to ) in the barcode dataset (table ). the denoising of the barcodes with the debar pipeline did not result in sequences with large amounts of missing information. of the , output barcodes, , were high quality (< % ns), were medium quality (< % ns), were low quality (< % ns), and were unreliable (> % ns). there was a strong negative relationship between the number of ccs available for a sample and the amount of missing information in the final barcode sequence (figure ).

denoising metabarcode data

consensus sequence quality

metabarcode data from a mock arthropod community were also denoised, and the original sequences were compared to the denoised consensus sequences to determine if debar improved sequence quality (table ). of the original centroid sequences for the bins, / ( . %) contained evidence of indel errors when analyzed with coil.
following denoising and consensus sequence generation via debar, the number of barcode-compliant outputs was considerably higher with only / ( . %) displaying evidence of indel errors. four bins had all their component sequences rejected by debar so no consensus sequences were generated. the rate of apparent indel errors was higher in the centroids of the otus; ( %) displayed evidence of a stop codon when analyzed with coil, suggesting the presence of indels in more than half of the sequences representing each otu. the consensus sequences produced through denoising and consensus sequence generation with debar were of apparent higher quality as only ( . %) displayed evidence of a stop codon when analyzed with coil. an additional otus ( . %) failed to produce a valid consensus sequence after denoising because all their component sequences were rejected by debar. the corrections did cause some loss of information; / ( . %) of the consensus sequences for the bin groups contained at least one ‘n’ due to ambiguous or censored base pairs in their component reads, and / ( . %) of the otu consensus sequences contained at least one ‘n’. the number of ‘ns’ per sequence was generally low for the bins (median = ; sequences with or more ‘ns’) but was higher for the otus (median number of ‘ns’ = ), indicating there was on average one correction per otu (correction of an indel, plus the seven bp mask in either direction, results in (insertion) or (deletion) consecutive ‘ns’).
there was a positive relationship between the number of sequences within an otu and the completeness of information in the final consensus sequence.

esv data quality

data analysis on mbrave revealed bins represented by , unique dereplicated reads as well as otus lacking taxonomic assignment that were represented by unique sequence reads. when original sequences were checked with coil, it indicated that , / , ( . %) of bin sequences and / ( . %) of the otu sequences displayed strong evidence of an indel error as they contained a stop codon when translated. by contrast, following denoising with debar the incidence of stop codons was far lower as just / , ( . %) of the bin sequences and / , ( . %) of the otu sequences had evidence of indels. this result indicated that denoising of individual sequences reduced the incidence of apparent indel errors by over % for the bins ( , fewer indel errors) and by % for the otus ( fewer indel errors). most sequences were subjected to at least one indel correction by debar, with , / , ( . %) of the final bin sequences and / ( . %) of final otu sequences containing at least one ‘n’ character. low abundance otus in the data set represented by biologically valid sequences need not be discarded solely due to their low abundance and could be further inspected for putative evidence of rare community members.
discussion

this manuscript introduces debar, a phmm-based denoiser, and demonstrates how it can improve the quality of sequence data used for both dna barcode library construction and for metabarcode studies by correcting indels introduced by sequencing error. we first evaluated its effectiveness through an in silico study that tested its capacity to recognize and repair reference barcodes with artificially introduced indels. debar was shown to be effective, as it corrected > . % of the errors and applied erroneous adjustments to less than . % of correct sequences. this strong performance extended to real-world data sets. debar reduced the rate of frameshift indels by % in sequence records generated by the long-read sequel platform, generating more barcode-compliant sequences, most with little or no missing information. debar also improved the quality of metabarcode data generated by the ion s , allowing for esvs to be considered with higher confidence and for the recovery of higher-quality representative sequences for otus. denoising sequences with artificial errors and known ground truths showed that the corrections performed by debar were imperfect, with the exact indel location being identified only . % of the time. the application of a default bp censorship on both sides of putative indel corrections proved to be an effective means of masking most errors, improving the denoiser’s error removal rate to > . %. this high error removal rate involves a tradeoff, as sequence adjustments are accompanied by a loss of base pairs of information. this information loss is an acceptable cost, as it ensures that all remaining base pairs can be considered with high confidence.
the nature of high-throughput sequence data, namely that there are usually multiple sequencing reads for a given specimen available, can help mitigate the loss of information. corrected sequences from a specimen or otu can be used in conjunction with one another, filling in the different censored locations and overcoming the loss of information. the censorship of bases adjacent to indel corrections is an optional parameter that users may alter to suit their needs. smaller censorship values, or no censorship at all, would result in less loss of information per sequence, but would come at the cost of more errors remaining in the final data. denoising of real dna barcode data obtained from sequencing of specimens on the pacific biosciences sequel platform resulted in higher-quality output sequences. an exact metric quantifying the improvement is, however, difficult to state with certainty, as the ground truth of the sequences is not known. the independent tests of the sequences through submission of consensus sequences to bold before and after denoising provided a conservative estimate of the debar package’s effectiveness. conservatively, this test showed a % reduction in the number of barcode sequences with technical indel errors after application of the debar pipeline and a low false negative rate ( unidentified errors out of , total putative errors). this is an important improvement because the pacific biosciences sequel platform is used at the centre for biodiversity genomics to produce high-quality reference barcodes for the barcoding research community (hebert et al. ).
accuracy of these sequences is therefore important; the debar package is shown to improve sequence quality, yielding more biologically likely and therefore reliable outputs. the generation of barcode sequences is also made more efficient. by increasing the rate of barcode-compliant outputs from . % to %, fewer samples require reprocessing or resequencing. understanding within-species patterns of genetic diversity is essential for characterizing community health. high intra-species genetic diversity is assumed to indicate healthy ecosystems, composed of large and stable populations with the standing genetic variation needed to survive environmental stressors (zizka et al. ). the characterization of esvs within otus can provide intra-species diversity measures for member species of a community (frøslev et al. ). the initial check of the sub-otu sequence data from the mock community sequenced with iontorrent revealed a high rate of putative indel errors ( % of sequences), which would lead to a gross overestimation of the number of esvs within the otus. the reduction of the error rate after denoising with debar allows for a more accurate examination of intra-otu esvs and therefore allows for more accurate assessments of intra-species diversity and community health, despite the fact that debar is not capable of eliminating non-indel errors from sequences.
even with the improvements to esv quality by debar, intra-species diversity estimates will likely remain inflated to some extent, as the sequence-by-sequence corrections applied by debar exclusively account for indel errors while substitution errors could persist within the data. we have demonstrated that debar is an effective means of reducing technical errors in dna barcode and metabarcode data, but the package is not without limitations. the package is designed to correct insertion and deletion errors, but these are not the only technical issues that can lead to inflated biodiversity estimates. the program is not an effective means of identifying or correcting chimeric sequences or non-animal coi biological contaminants and should these exist within an input data set they are likely to go uncorrected. additionally, debar does not have the ability to correct substitution errors on a sequence-by-sequence basis. because of indel correction, denoised sequences are aligned, and nucleotide positions become directly comparable across different sequences from a given specimen or otu. random point substitution errors can thereby be corrected in consensus sequence generation, through the ‘majority rule’ approach debar uses in base calling. however, if systematic errors exist (i.e. most sequences possess the same substitution), few sequences are available for consensus sequence generation, or esvs are being examined, then substitution errors may persist in the data. an additional source of error unaccounted for by debar is contaminant sequences.
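the majority-rule idea mentioned in this discussion can be illustrated with a short column-wise vote over aligned, denoised reads. this is a generic sketch and not debar's ‘consensus_sequence’ implementation: the function name is hypothetical, reads are assumed to be of equal length and already in register, censored ‘n’ positions are excluded from the vote, and ties go to the base seen first in the column.

```python
from collections import Counter

def majority_consensus(aligned_reads):
    """Column-wise majority vote over equal-length reads, ignoring 'N'."""
    if not aligned_reads:
        raise ValueError("at least one read is required")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "N")
        # A column censored in every read stays unknown in the consensus.
        consensus.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(consensus)
```

because each read tends to be censored at a different location, the vote both fills masked positions and outvotes random substitutions; a substitution shared by most reads would still win the vote, which is the systematic-error case noted above.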
it has been demonstrated previously that the phmm utilized in debar is not an effective means of separating animal barcode sequences from off-target barcodes derived from bacteria, plants, fungi, or other origins (nugent et al. ). taken together, these limitations show that debar cannot single-handedly address the technical challenges associated with dna barcoding. the tool is likely most effective when applied in conjunction with existing barcode and metabarcode workflows, and it improves the quality of final sequences if the inputs have been filtered based on quality, had primers removed, and been cleaned of chimeric and contaminant sequences.

the sequence-by-sequence denoising approach of debar means that it is a flexible tool capable of integrating into analysis pipelines for sequencing data from various sources. application of debar in tandem with conventional, clustering-based denoising tools would likely lead to the highest quality assessment of biodiversity. following otu generation with other tools, using debar to denoise all reads within a given otu prior to consensus sequence generation would maximize accuracy of the consensus sequence while conforming to the conserved structure of the coi barcode region. the removal of intra-otu noise can also improve the accuracy of alpha-diversity estimates. additionally, application of debar in the denoising of rare, low-abundance sequences not present in the otus would allow these data to be further examined with higher confidence, revealing biological insights that would be overlooked in conventional workflows.

the phmm denoising technique used by debar is an effective barcode-focused framework that can be extended to fit a variety of needs. data from only two sequencing platforms were tested in this study: the pacific biosciences sequel and thermo iontorrent s . since the phmm used in debar is barcode specific and not sequencer specific, debar can be effectively applied in denoising of barcode data obtained from any sequencing platform. however, the effectiveness of the denoiser will depend on the types and rates of technical errors associated with a given platform. when applied to data from sequencers such as the illumina miseq, the rate of technical errors corrected by debar will be lower, as this platform is more prone to introduction of substitution, as opposed to indel, errors (schirmer et al. ). although the debar package contains a phmm for only the common animal barcode coi, the denoising algorithm can in the future be extended and applied in the correction of data for other dna barcodes with conserved structures.

conclusion

this study has described debar, an r package for denoising dna barcode data, and demonstrated its ability to correct indels caused by instrument error in both barcode and metabarcode sequences. in each dataset, debar improved sequence quality. it reduced the apparent number of indels by % in data generated by sequel, increasing the proportion of sequences that met the quality standards required to qualify as a reference barcode. the merits of debar for metabarcode analysis were twofold, allowing more likely consensus sequences to be obtained for otus, and for intra-otu variation to be quantified with higher confidence. overall, debar is a robust utility for identifying deviations from the highly conserved protein-coding sequence of the coi barcode region. corrections informed by its use improve the separation of true biological variation from technical noise, with low frequencies of false corrections.
integration of debar into the workflows for processing barcode and metabarcode data will allow biological variation to be characterized with higher accuracy.

acknowledgements

this research was supported by grants from genome canada through ontario genomics and from the ontario ministry of economic development, job creation and trade. the funders played no role in study design or decision to publish. this research was enabled in part by resources provided by compute canada (www.computecanada.ca). we thank tony kuo and thomas braukmann for aid with data acquisition and interpretation and tony for helpful comments on the manuscript.

references

amir, a., mcdonald, d., navas-molina, j. a., kopylova, e., morton, j. t., xu, z. z., ... & knight, r. ( ).
deblur rapidly resolves single-nucleotide community sequence patterns. msystems, ( ). baynham-herd, z., amano, t., sutherland, w. j., & donald, p. f. ( ). governance explains variation in national responses to the biodiversity crisis. environmental conservation, ( ), - . braukmann, t. w., ivanova, n. v., prosser, s. w., elbrecht, v., steinke, d., ratnasingham, s., ... & hebert, p. d. n. ( ). metabarcoding a diverse arthropod mock community. molecular ecology resources, ( ), - . brown e.a., chain, f. j., zhan, a., macisaac, h. j., & cristescu, m. e. ( ). early detection of aquatic invaders using metabarcoding reveals a high number of non‐indigenous species in canadian ports. diversity and distributions, ( ), - . callahan, b. j., mcmurdie, p. j., rosen, m. j., han, a. w., johnson, a. j. a., & holmes, s. p. ( ). dada : high-resolution sample inference from illumina amplicon data. nature methods, ( ), . clare, e. l., chain, f. j., littlefair, j. e., & cristescu, m. e. ( ). the effects of parameter choice on defining molecular operational taxonomic units and resulting ecological analyses of metabarcoding data. genome, ( ), - . cordier, t., lanzén, a., apothéloz-perret-gentil, l., stoeck, t., & pawlowski, j. ( ). embracing environmental genomics and machine learning for routine biomonitoring. trends in microbiology, ( ), - . cristescu, m. e. ( ). from barcoding single individuals to metabarcoding biological communities: towards an integrative approach to the study of global biodiversity. trends in ecology & evolution, ( ), - . delabye, s., rougerie, r., bayendi, s., andeime-eyene, m., zakharov, e. v., dewaard, j. r., ... & mavoungou, j. f. ( ). characterization and comparison of poorly known moth communities through dna barcoding in two afrotropical environments in gabon. genome, ( ), - . durbin, r., eddy, s. r., krogh, a., & mitchison, g. ( ). biological sequence analysis: probabilistic models of proteins and nucleic acids. cambridge university press. driscoll, d. 
a., bland, l. m., bryan, b. a., newsome, t. m., nicholson, e., ritchie, e. g., & doherty, t. s. ( ). a biodiversity-crisis hierarchy to evaluate and refine conservation indicators. nature ecology & evolution, ( ), - . eddy, s. r. ( ). profile hidden markov models. bioinformatics (oxford, england), ( ), - . eddy, s. r. ( ). a new generation of homology search tools based on probabilistic inference. in genome informatics : genome informatics series vol. (pp. - ). edgar, r. c. ( ). unoise : improved error-correction for illumina s and its amplicon sequencing. biorxiv. elbrecht, v., vamos, e. e., steinke, d., & leese, f. ( ). estimating intraspecific genetic diversity from community dna metabarcoding data. peerj, , e . folmer, o., black, m., hoeh, w., lutz, r., & vrijenhoek, r. ( ). dna primers for amplification of mitochondrial cytochrome c oxidase subunit i from diverse metazoan invertebrates. mol mar biol biotechnol, ( ), - . frøslev, t. g., kjøller, r., bruun, h. h., ejrnæs, r., brunbjerg, a. k., pietroni, c., & hansen, a. j. ( ). algorithm for post-clustering curation of dna amplicon data yields reliable biodiversity estimates. nature communications, ( ), - . hajibabaei, m., spall, j. l., shokralla, s., & van konynenburg, s. ( ). assessing biodiversity of a freshwater benthic macroinvertebrate community through non-destructive environmental barcoding of dna from preservative ethanol. bmc ecology, ( ), . hajibabaei, m., baird, d. j., fahner, n. a., beiko, r., & golding, g. b. ( ).
a new way to contemplate darwin’s tangled bank: how dna barcodes are reconnecting biodiversity science and biomonitoring. philosophical transactions of the royal society b: biological sciences, ( ), . hebert, p. d. n., cywinska, a., ball, s. l., & dewaard, j. r. ( ). biological identifications through dna barcodes. proceedings of the royal society of london. series b: biological sciences, ( ), - . hebert, p. d. n., ratnasingham, s., zakharov, e. v., telfer, a. c., levesque-beaudin, v., milton, m. a., ... & dewaard, j. r. ( ). counting animal species with dna barcodes: canadian insects. philosophical transactions of the royal society b: biological sciences, ( ), . hebert, p. d. n., braukmann, t. w., prosser, s. w., ratnasingham, s., dewaard, j. r., ivanova, n. v., ... & zakharov, e. v. ( ). a sequel to sanger: amplicon sequencing that scales. bmc genomics, ( ), . hubert, n., & hanner, r. ( ). dna barcoding, species delineation and taxonomy: a historical perspective. dna barcodes, ( ), - . kaunisto, k. m., roslin, t., sääksjärvi, i. e., & vesterinen, e. j. ( ). pellets of proof: first glimpse of the dietary composition of adult odonates as revealed by metabarcoding of feces. ecology and evolution, ( ), - . kumar, v., vollbrecht, t., chernyshev, m., mohan, s., hanst, b., bavafa, n., ... & golden, m. ( ). long-read amplicon denoising. nucleic acids research, ( ), e -e . lopez-vaamonde, c., sire, l., rasmussen, b., rougerie, r., wieser, c., allaoui, a. a., ... & lees, d. c. ( ). dna barcodes reveal deeply neglected diversity and numerous invasions of micromoths in madagascar.
genome, ( ), - . nearing, j. t., douglas, g. m., comeau, a. m., & langille, m. g. ( ). denoising the denoisers: an independent evaluation of microbiome sequence error-correction approaches. peerj, , e . nugent, c. m., elliott, t. a., ratnasingham, s., & adamowicz, s. j. ( ). coil: an r package for cytochrome c oxidase i (coi) dna barcode data cleaning, translation, and error evaluation. genome, ( ), - . ratnasingham, s., & hebert, p. d. n. ( ). a dna-based registry for all animal species: the barcode index number (bin) system. plos one, ( ). rosen, g., garbarine, e., caseiro, d., polikar, r., & sokhansanj, b. ( ). metagenome fragment classification using n-mer frequency profiles. advances in bioinformatics, . schirmer, m., ijaz, u. z., d’amore, r., hall, n., sloan, w. t., & quince, c. ( ). insight into biases and sequencing errors for amplicon sequencing with the illumina miseq platform. nucleic acids research, ( ), e -e . sogin, m. l., morrison, h. g., huber, j. a., welch, d. m., huse, s. m., neal, p. r., ... & herndl, g. j. ( ). microbial diversity in the deep sea and the underexplored “rare biosphere”. proceedings of the national academy of sciences, ( ), - . stat, m., huggett, m. j., bernasconi, r., dibattista, j. d., berry, t. e., newman, s. j., ... & bunce, m. ( ). ecosystem biomonitoring with edna: metabarcoding across the tree of life in a tropical marine environment. scientific reports, ( ), - . taberlet, p., coissac, e., hajibabaei, m., & rieseberg, l. h. ( ). environmental dna. molecular ecology, ( ), - . wilkinson, s. p. ( ). kmer: an r package for fast alignment-free clustering of biological sequences. r package version . . . https://cran.r-project.org/package=kmer wilkinson, s. p. ( ). aphid: an r package for analysis with profile hidden markov models. bioinformatics, ( ), - . wilson, j. j., brandon-mong, g. j., gan, h. m., & sing, k. w. ( ). high-throughput terrestrial biodiversity assessments: mitochondrial metabarcoding, metagenomics or metatranscriptomics?. mitochondrial dna part a, ( ), - . wirta, h. k., hebert, p. d. n., kaartinen, r., prosser, s. w., várkonyi, g., & roslin, t. ( ). complementary molecular information changes our perception of food web structure. proceedings of the national academy of sciences, ( ), - . zizka, v. m., weiss, m., & leese, f. ( ). can metabarcoding resolve intraspecific genetic diversity changes to environmental stressors? a test case using river macrozoobenthos. biorxiv.

data accessibility statement

dna barcode sequences used in training of the profile hidden markov models are available in the supplementary data of the following paper: https://doi.org/ . /gen- - . dna barcode sequences used in model testing are available in this manuscript’s supplementary files. the r source code for the debar package is available on github: https://github.com/cnuge/debar. additional data and code are available on request from the authors.

author contributions

the study was conceived and designed by sja, pdnh, sr, and cmn.
the programming of the debar package was performed by cmn. analyses of package performance were performed by cmn with resources, design, and other assistance provided by tae, sr, and sja. the initial draft of the manuscript was written by cmn and sja. all authors contributed to the editing of the manuscript.

tables and figures

table . summary of the results for the , barcode sequences (produced from pacbio sequel data analyzed using the mbrave platform) after processing with the debar pipeline. pacbio sequel run run run run run total consensus sequences generated , , , , , consensus sequences flagged by coil for indel error , ( . %) rejected by debar denoising ( . %) sequences flagged by coil post-denoising , ( . %) sequences corrected , ( . % of flagged sequences)

table . assessment of the correction ability of the debar pipeline for the subset of sequences in the high-confidence error set. this set of sequences was flagged by coil and produced a stop codon when translated within all reading frames.
the top half of the table indicates the number of sequences flagged by coil as likely to be erroneous, based on the log likelihood values of the sequences. results are shown for sequences both before and after the denoising process. the bottom half of the table contains the number of sequences flagged by coil as likely to be erroneous, based on the presence of a stop codon in the amino acid sequence resulting from the censored translation of the framed nucleotide sequence. this high success for the stop-codon metric ( . % of errors removed) indicates that the pipeline is an effective means of correcting frameshift-causing insertion or deletion errors. the relatively lower success at correcting sequences with low log likelihood values suggests that frameshift-causing errors are not the only set of errors being flagged by coil, and that non-frameshift errors are not effectively corrected by the debar pipeline. pacbio sequel run run run run run total original flagged , flagged post-denoising , corrected . % . % . % . % . % pacbio sequel run run run run run total original stop codon , stop codon post-denoising corrected stop codons . % . % . % . % . %

table . result of the bold data system evaluation of debar denoising workflow’s effectiveness. the number of sequences identified by bold as containing stop codons, before and after processing with the denoising pipeline (figure ).
only the , specimens with barcodes and taxonomic information produced through the processing of pacbio sequel data on the mbrave platform were considered, as bold requires taxonomic information for assessing the presence of stop codons. the rows break the sequences down into categories, which indicate the source of the post-denoising sequence that was submitted to bold for assessment. sequence category total sequence count stop codon count percent error reduction original post-denoising unaltered , † - denoised, altered , , † % flagged for potential error, unaltered * - flagged and rejected - labelled as wolbachia by mbrave - total , , ( . %) ( . %) . % total, non-flagged only , , ( . %) ( . %) . %

† the sum of these categories (shown in the final row of the column) represents the false negative rate for the denoising pipeline. these are the . % ( / , ) of sequences that appear to contain true stop codons that were not flagged for denoising, or that were denoised unsuccessfully and not flagged as potential errors.

* the false positive rate of the denoising pipeline is the number of sequences in this category that do not in fact contain a stop codon. there is a total of ( - ) false positives and an overall false positive rate of . % ( / , ). since this set of sequences is flagged for potential errors, as opposed to being outright rejected, additional inspection of sequences in this category can separate the unsuccessfully denoised sequences with true errors from those that do not contain an error.

table .
assessment of the sequence quality for data from a mock community of arthropods sequenced in bulk using a thermo fisher ion torrent and processed on the mbrave platform. sequencing and processing results in two sets of data: groups of sequences assigned to bins and groups of sequences clustered into otus. the representative sequences (centroids before denoising, consensus after denoising) and all individual sequences were checked with the r package coil for evidence of frameshifts (stop codons in amino acid sequence) before and after denoising to see if processing the data with the debar package resulted in higher quality barcode sequences. original after debar denoising sequences analyzed sequence data source total count stop codon count total count stop codon count representative sequences assigned to bins ( . %) ( . %) otus , ( %) , ( . %) esvs assigned to bins , ( . %) , ( . %) otus , ( . %) , ( . %)

figure . diagram demonstrating the debar package’s denoising workflow. blue indicates nucleotides that are part of the barcode region and orange nucleotides in bold font indicate technical errors or sequence from outside of the barcode region.

a. the debar package operates on a sequence-by-sequence basis, taking each input and constructing a custom dnaseq object. a dnaseq object can receive a dna sequence, an identifier, and optionally a sequence of corresponding phred quality scores. although the phred scores are not utilized in the denoising, indel-correcting adjustments to the sequence are applied to them as well, so that quality information can be carried from input to output.

b.
following dnaseq object construction, the sequence is compared to the phmm using the viterbi algorithm. by default, the full length ( bp) coi- p phmm contained in debar is used to evaluate the sequence. when required, a user may pass a custom phmm corresponding to a subsection of the coi- p region (specified using the coil package’s subsetphmm function) or a custom phmm trained on user-defined data (wilkinson ). the frame function isolates the correction window, which is the section of the sequence matching the phmm (the first consecutive base pairs matching to the phmm on the leading and trailing edges of the sequence establish the section of the input on which subsequent corrections are applied).

c. the adjust function traverses the section of the sequence and viterbi path defined by the frame function. when evidence of an inserted base pair (‘ ’ label in the viterbi path) is encountered, the corresponding base pair is removed. when evidence of a deleted base pair is encountered (a ‘ ’ label in the viterbi path) a placeholder ‘n’ nucleotide is inserted. exceptions are made for triple inserts or triple deletes (three consecutive ‘ ’ or ‘ ’ labels), which are skipped by the adjustment algorithm, as they are indicative of mutations that would not have a large impact on the structure of the protein-coding gene region and could reflect biological amino acid indels.
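the adjustment rule in step c can be illustrated with a minimal python sketch. this is not debar's r implementation: the function name `adjust_sketch` and the path labels `M`, `I` and `D` are hypothetical stand-ins for the match, insert and delete labels of the viterbi path, and the reading of "triple" as a run of exactly three labels is an assumption.

```python
from itertools import groupby

# Hypothetical path labels (assumptions, not the labels debar itself uses).
MATCH, INSERT, DELETE = "M", "I", "D"


def adjust_sketch(sequence, path):
    """Apply indel corrections to `sequence` following a Viterbi-style `path`.

    Match and insert labels each consume one base of the sequence; delete
    labels consume none. Returns the adjusted sequence and the number of
    adjustments made.
    """
    consumed = sum(1 for label in path if label in (MATCH, INSERT))
    if consumed != len(sequence):
        raise ValueError("path does not account for every base in the sequence")

    out = []
    pos = 0
    adjustments = 0
    for label, run in groupby(path):
        n = len(list(run))
        if label == MATCH:
            out.append(sequence[pos:pos + n])
            pos += n
        elif label == INSERT:
            if n == 3:
                # Triple insert: skipped, may be a biological amino acid indel.
                out.append(sequence[pos:pos + n])
            else:
                # Inserted bases are removed.
                adjustments += n
            pos += n
        elif label == DELETE:
            if n != 3:
                # Deleted bases are replaced by placeholder N nucleotides.
                out.append("N" * n)
                adjustments += n
            # Triple delete: skipped, nothing is added.
        else:
            raise ValueError(f"unknown path label: {label!r}")
    return "".join(out), adjustments


if __name__ == "__main__":
    print(adjust_sketch("ACGTTGA", "MMMIMMM"))
    print(adjust_sketch("ACGTGA", "MMMDMMM"))
```

returning the adjustment count alongside the sequence lets a caller compare it against a rejection threshold of its own choosing.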
the total number of adjustments made by debar is limited by the parameter ‘adjust_limit’ (default = ); sequences requiring adjustments in excess of this number are flagged for rejection, as this high frequency of indels is likely not the result of technical error, but rather of other sources of noise such as pseudogenes. following adjustment, a mask of placeholder ‘n’ nucleotides is applied to base pairs flanking the corrected indel (default is bp in each direction, see figure . for derivation of default). masking of bp flanks adjacent to each correction allows imprecise corrections to effectively correct sequence length and also mask true indel locations in the majority of instances.

d. following adjustment, the denoised sequences are output by debar. by default, the outputs will include trailing sequence outside of the correction window. leading information outside of the correction window is dropped, so that sequences are aligned with a common starting position. a user can choose to keep only the correction window, or have both flanking regions appended back on to the sequence output.

e. if multiple denoised sequences are available (for either a given specimen in the case of barcoding or a given otu in metabarcoding) then the consensus of the denoised sequences can be taken. the consensus function assumes the sequences have been denoised and their left flanks removed; as a result, they are aligned to one another. the modal base pair for each position is then taken to generate a consensus sequence, and in the case of ties, a placeholder “n” character is added to the consensus.
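the modal-base consensus described in step e can be sketched in python as follows. this is an illustrative sketch rather than debar's own consensus function: the name `consensus_sketch` is hypothetical, and two behaviours are assumptions not stated in the text, namely that masked "n" placeholders do not vote and that positions beyond the end of a shorter read are decided by the longer reads alone.

```python
from collections import Counter


def consensus_sketch(sequences):
    """Modal-base consensus of denoised reads that share a common start.

    Assumption: 'N' placeholders are ignored when counting votes. A position
    with no votes, or with a tie for the most common base, yields 'N'.
    """
    if not sequences:
        raise ValueError("at least one sequence is required")

    length = max(len(seq) for seq in sequences)
    consensus = []
    for i in range(length):
        # Count the bases observed at position i across all reads long enough.
        votes = Counter(
            seq[i] for seq in sequences if i < len(seq) and seq[i] != "N"
        )
        if not votes:
            consensus.append("N")
            continue
        ranked = votes.most_common()
        top_base, top_count = ranked[0]
        if len(ranked) > 1 and ranked[1][1] == top_count:
            # Tie between the most common bases: emit the placeholder.
            consensus.append("N")
        else:
            consensus.append(top_base)
    return "".join(consensus)


if __name__ == "__main__":
    reads = ["ACGTAC", "ACGTAC", "ACCTAC"]
    print(consensus_sketch(reads))
```

in this sketch a random substitution carried by a single read is outvoted, whereas a substitution shared by most reads would pass into the consensus unchanged.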
figure . diagram of the denoising workflow used to improve the quality of barcodes produced by processing pacific biosciences sequel circular consensus data on the mbrave platform. (i) pacific biosciences sequel data are processed on the mbrave platform, and an initial set of barcode sequences is produced. (ii) the set of consensus barcode sequences produced by the mbrave platform is obtained and analyzed with the coil package, using the ‘coi p_pipe’ function (default parameters). sequences displaying evidence of an indel (either the presence of a stop codon when translated to amino acids or an amino acid sequence with a low likelihood score) are retained for further denoising. (iii) for each barcode with evidence of an error, all component ccs reads (and associated metadata) derived from the given specimen are obtained from mbrave. (iv) based on the mbrave metadata, sequences are trimmed to remove primers, mid tags, and adapter sequence. the reverse complement of reads is taken when required. (v) the ‘denoise_list’ function of debar is used to denoise all ccs reads (options: dir_check = false, keep_flanks = ‘right’, censor_length = ). rejected reads (those flagged by the denoise_list function) are removed from the dataset. (vi) for each specimen, the reads are clustered into otus using the r package kmer (clustering threshold = . ). this is done to mitigate the influence of outlier ccs or contaminant sequences. (vii) for each otu, a consensus sequence is generated using debar’s ‘consensus’ function. for each specimen, otus are ranked based on the number of component ccs reads they contain. (viii) the consensus sequences are reassessed with coil. if the top-ranked consensus sequence now passes the coil check, it is deemed to have been successfully denoised, and it is selected as the output barcode. if not, the check is repeated for the second-ranked consensus sequence (when available), and this output is retained if it is barcode compliant.
if neither the first nor second highest ranked consensus sequence passes the coil check, then the original (pre-denoising process) barcode is retained, as no meaningful improvement was made. in this situation the barcode is flagged as likely to contain an error.

figure . the debar package’s denoising of , coi sequences containing single insertion or deletion errors. so that exact error positions were known, errors were artificially introduced in accordance with known probabilities for coi dna barcode data from the pacbio sequel platform (hebert et al. ). denoising was accomplished through altering sequences in accordance with the viterbi path yielded by comparison to the phmm. the correct number of adjustments was made for , sequences, and . % of these corrections located the indel exactly. masking of bp flanks adjacent to each correction allowed imprecise corrections to correct sequence length and mask the true indel location % of the time. for the instances where an incorrect number of adjustments were made, were caught through query of the amino acid sequence for stop codons and the trimming of spurious matches at the edge of sequences. overall, . % of errors were effectively corrected or identified as erroneous.
figure . histogram indicating the position in the coi- p region of the uncorrected indel errors from the , -sequence artificial error dataset. the x axis indicates the base pair position in the coi- p profile, and the y axis displays the number of sequences that contained an uncorrected error at the given range of positions (bins of base pair positions).

figure . histogram showing number of base pairs between inexact corrections applied by debar and the ground truth error location for the given sequence. in total , sequences ( . %) had errors that were denoised inexactly, and corrections were an average of . bp (sd = . ) away from the exact ground truth error location.

figure . relationship between the amount of missing data in the final denoised barcode sequences (number of ns divided by the total length of the sequence) and the number of ccs reads that contributed to the generation of the barcode. the figure displays only the , denoised barcode sequences submitted to bold that contained at least one “n” (the remaining , barcode sequences in the bold submission did not contain an “n”).
international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / supplementary information supplementary file ('s -single_errors_in_ k_sequences.csv') the , coi barcode sequences with single introduced indel errors that were used to test debar and calibrate the default parameters. supplementary file ('s -control_denoising_no_errors.csv') the , coi barcode sequences with no known indel errors used to assess the false correction rate of debar supplementary file ('s -single_file_pipeline') scripts and example data for the denoising pipeline developed to process coi dna barcode sequence data produced using the pacific biosciences sequel sequencer and mbrave platform supplementary file scripts and example data for the denoising pipeline developed to process coi dna metabarcode sequence data produced using the iontorrent s sequencer and the mbrave platform supplementary file vignette demonstrating the functionality of the debar package. the vignette is also available as part of the r package (https://github.com/cnuge/debar/tree/master/vignettes) .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://github.com/cnuge/debar/tree/master/vignettes https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . 
genetic epidemiology of variants associated with immune escape from global sars-cov- genomes
bani jolly , ,$, mercy rophina , ,$, afra shamnath , mohamed imran , , rahul c. bhoyar , mohit kumar divakar , , pallavali roja rani , gyan ranjan , , paras sehgal , , pulala chandrasekhar , s. afsar , j. vijaya lakshmi , a. surekha , sridhar sivasubbu , , vinod scaria , ,*
csir-institute of genomics and integrative biology (csir-igib), new delhi, india
academy of scientific and innovative research (acsir), csir-hrdc ghaziabad, uttar pradesh, india
kurnool medical college, kurnool, andhra pradesh, india
$authors contributed equally and would like to be known as joint first authors
*address for correspondence: vinod scaria, vinods@igib.in
abstract
many antibody and immune escape variants in sars-cov- are now documented in the literature. the availability of sars-cov- genome sequences enabled us to investigate the occurrence and genetic epidemiology of the variants globally. our analysis suggests that a number of genetic variants associated with immune escape have emerged in global populations.
keywords: covid- , sars-cov- , antibody, mutations, epidemiology
text
antibodies are one of the emerging therapeutic approaches being explored in covid- . these antibodies typically target the receptor-binding motif or structural domains of the spike protein of sars-cov- , in an attempt to inhibit binding of the spike protein to the host receptors. cocktails of antibodies that target distinct structural and functional domains of the spike protein are also being developed, with the aim of providing redundant mechanisms of targeting the virus and thereby minimising escape. genomic documentation of the spread of sars-cov- across the globe has provided unique insights into the genetic variability and variants of functional consequence.
in-depth studies in recent months have unravelled a wealth of information on the immune response in covid- and offered insights into the development of therapeutics. recent investigations suggest that a number of genetic variants in sars-cov- are associated with immune escape and/or resistance to antibodies. their structural and functional features and mechanisms of immune evasion are also being extensively studied ( ). the natural occurrence and genetic epidemiology of these variants across global populations are poorly understood. we were motivated by the wide availability of sars-cov- genomes from across the world and the increasing numbers of genetic variants suggested to contribute to escape from antibody inhibition. we analysed a comprehensive compendium of genetic variants associated with immune escape, curated by our group from literature and preprint servers ( ). this compendium included unique variants reported in literature. to understand the genetic epidemiology of these variants in the global compendium of genomes, we compiled a dataset of , sars-cov- genomes from gisaid (as of december ) ( ), in addition to , genomes sequenced in-house (bioproject id: prjna ).
the copyright holder for this preprint (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. this version posted january , . biorxiv preprint doi: https://doi.org/ . / . . .
genome sequences with more than % ns, more than ambiguous nucleotides, higher than expected divergence and mutation clusters were excluded from the analysis. after quality control, the final dataset encompassed , genomes from countries. only countries with at least good quality genome submissions were considered for the analysis. of the genetic variants associated with immune escape were found in a total of , genomes from countries (figure a), out of which variants had > % frequency in the respective countries.
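as an illustration of the quality-control step described above, the sketch below filters genome sequences by n content and by the number of ambiguous nucleotides. it is a minimal example under stated assumptions, not the authors' pipeline: the threshold values are hypothetical (the actual cutoffs are not recoverable from the text), the function names are invented for illustration, and the divergence and mutation-cluster checks are omitted.

```python
# Hypothetical thresholds: the source does not preserve the actual cutoffs.
MAX_N_FRACTION = 0.05
MAX_AMBIGUOUS = 10
AMBIGUOUS_BASES = set("RYSWKMBDHV")  # IUPAC codes other than A, C, G, T and N


def n_fraction(seq):
    """Fraction of positions called as N (an empty sequence counts as all N)."""
    seq = seq.upper()
    return seq.count("N") / len(seq) if seq else 1.0


def ambiguous_count(seq):
    """Number of IUPAC ambiguity codes in the sequence, excluding N."""
    return sum(1 for base in seq.upper() if base in AMBIGUOUS_BASES)


def passes_qc(seq, max_n_fraction=MAX_N_FRACTION, max_ambiguous=MAX_AMBIGUOUS):
    """True if the sequence is within both (assumed) thresholds."""
    return (n_fraction(seq) <= max_n_fraction
            and ambiguous_count(seq) <= max_ambiguous)


def filter_genomes(genomes):
    """genomes: dict of genome id -> sequence; keep only those passing QC."""
    return {gid: seq for gid, seq in genomes.items() if passes_qc(seq)}
```

in practice a filter like this would be applied before any frequency calculation, so that low-quality genomes do not inflate or mask variant counts.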
phylogenetic analysis was performed following the nextstrain protocol for a total of , genomes, including , randomly selected genomes having these variants (figure b) ( ). homoplasies were identified in the phylogeny using homoplasyfinder ( ). out of , variant sites were found to be homoplasic, suggesting they could emerge independently in different genetic lineages, out of which were found to be at > % frequency in at least one of the countries analysed. out of , genomes analysed from australia, immune escape associated variants mapped to , genomes ( %). of significant frequency was the s:s n variant, which was present in , genomes ( %) from australia. a high frequency of this variant was also found in a number of other countries, particularly in europe. s:n k was also found at high frequencies in genomes from a number of countries in europe ( ). s:n y, one of the variants in the recently reported emergent sars-cov- lineage from the united kingdom, was present in a total of genomes, including genomes from the united kingdom, australia, south africa, usa, denmark and brazil ( , ). all genomes from south africa having s:n y also had the s:e k variant, and s:k n was present in of these genomes ( ). the orf a:g v variant was also found to be prevalent across global genomes, with the highest frequencies in hong kong and south korea. this variant is also one of the defining variants for the nextstrain clade a a (gisaid clade v) (figure b). of the genetic variants were found in genomes from india (supplementary figure). the s:n k variant was found to have a frequency of . % in india and a high prevalence in the state of andhra pradesh ( . % of genomes). the variant site was homoplasic and the variant was found in genomes belonging to different clades and haplotypes. time-scale analysis suggested the variant emerged in recent months (figure c). the s:n k variant was also reported in a case of covid- reinfection from north india ( ).
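the per-country frequencies reported above amount to counting, for each country, the fraction of its genomes that carry a given variant. the sketch below shows one way to compute this; the record layout, the function names, and the `min_genomes` and `threshold` parameters are assumptions made for illustration and are not taken from the authors' code.

```python
from collections import Counter, defaultdict


def variant_frequencies(records, min_genomes=1):
    """records: iterable of (country, variants) pairs, one pair per genome.

    Returns {country: {variant: fraction of that country's genomes}} for
    countries with at least `min_genomes` genomes (a hypothetical cutoff).
    """
    totals = Counter()
    counts = defaultdict(Counter)
    for country, variants in records:
        totals[country] += 1
        counts[country].update(set(variants))  # count each variant once per genome
    return {
        country: {v: c / totals[country] for v, c in counts[country].items()}
        for country in totals
        if totals[country] >= min_genomes
    }


def frequent_variants(freqs, threshold):
    """Sorted (country, variant) pairs whose frequency exceeds `threshold`."""
    return sorted(
        (country, variant)
        for country, per_variant in freqs.items()
        for variant, freq in per_variant.items()
        if freq > threshold
    )
```

the same table of frequencies can then be filtered with `frequent_variants` to list the variants that exceed a chosen frequency in each country.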
put together, our analysis suggests that a number of genetic variants associated with immune escape have emerged in global populations; some of them are polymorphic in many global datasets, and a subset has become highly frequent in some countries. homoplasy of the variant sites suggests that these variants may carry a selective advantage. further data and analysis are needed to investigate the potential impact of such variants on the efficacy of different vaccines in these regions.
acknowledgements
authors acknowledge disha sharma and abhinav jain for the analysis of in-house genomes and the researchers, originating and submitting laboratories of the sequences retrieved from gisaid (https://doi.org/ . /m .figshare. .v ). bj and mkd acknowledge a research fellowship from the council of scientific and industrial research (csir india). the funders had no role in the study design or the decision to publish.
references
. weisblum y, schmidt f, zhang f, dasilva j, poston d, lorenzi jcc, et al. escape from neutralizing antibodies by sars-cov- spike protein variants. oct [cited dec ]; https://elifesciences.org/articles/
. rophina m, pandhare k, mangla m, shamnath a, jolly b, sethi m, et al. favicov - a comprehensive manually curated resource for functional genetic variants in sars-cov- . nov https://doi.org/ . /osf.io/wp tx
. yuelong shu jm. gisaid: global initiative on sharing all influenza data – from vision to reality. eurosurveillance [internet]. mar [cited dec ]; ( ). https://www.ncbi.nlm.nih.gov/pmc/articles/pmc /
. nextstrain [internet]. [cited dec ]. https://nextstrain.org/sars-cov- /
. crispell j, balaz d, gordon sv. homoplasyfinder: a simple tool to identify homoplasies on a phylogeny. microbial genomics [internet]. jan [cited dec ]; ( ). https://www.ncbi.nlm.nih.gov/pmc/articles/pmc /
. hodcroft eb, zuber m, nadeau s, crawford khd, bloom jd, veesler d, et al. emergence and spread of a sars-cov- variant through europe in the summer of . medrxiv : the preprint server for health sciences [internet]. nov [cited dec ]; https://pubmed.ncbi.nlm.nih.gov/ /
. rambaut a, loman n, pybus o, barclay w, barrett j, carabelli a, et al. preliminary genomic characterisation of an emergent sars-cov- lineage in the uk defined by a novel set of spike mutations [internet]. [cited dec ]. https://virological.org/t/preliminary-genomic-characterisation-of-an-emergent-sars-cov- -lineage-in-the-uk-defined-by-a-novel-set-of-spike-mutations/
. shang e, axelsen ph. the potential for sars-cov- to evade both natural and vaccine-induced immunity [internet]. cold spring harbor laboratory. [cited dec ]. p. . . . . https://www.biorxiv.org/content/ . / . . . v .abstract
. emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus (sars-cov- ) lineage with multiple spike mutations in south africa [internet]. [cited dec ]. https://www.krisp.org.za/publications.php?pubid=
. gupta v, bhoyar rc, jain a, srivastava s, upadhayay r, imran m, et al. asymptomatic reinfection in two healthcare workers from india with genetically distinct sars-cov- . clin infect dis [internet]. [cited dec ]; https://www.ncbi.nlm.nih.gov/pmc/articles/pmc /
figure . (a) variant frequencies of the immune escape variants in genomes of sars-cov- . the total number of genomes analyzed from each country is specified. variants with frequency > % in the respective countries are highlighted in red. (b) global phylogenetic context of the variants. the vertical bar indicates the clade assigned according to the nextstrain nomenclature. (c) time-series data on prevalence for the genetic variants showing the region-wise proportion of genomes per month for the variants.
supplementary figure. variant frequencies of the immune escape variants in genomes isolated from different states in india.
predicting chemotherapy response using a variational autoencoder approach
original paper. topic area: biomedical informatics
qi wei ∗ and stephen a. ramsey ∗
school of eecs, oregon state university, corvallis, oregon , usa
department of biomedical sciences and school of eecs, oregon state university, corvallis, oregon , usa.
∗to whom correspondence should be addressed.
abstract
motivation: multiple studies have shown the utility of transcriptome-wide rna-seq profiles as features for machine learning-based prediction of response to chemotherapy in cancer. while tumor transcriptome profiles are publicly available for thousands of tumors for many cancer types, a relatively modest number of tumor profiles are clinically annotated for response to chemotherapy. the paucity of labeled examples and high dimension of the feature data limit performance for predicting therapeutic response using fully-supervised classification methods. recently, multiple studies have established the utility of a deep neural network approach, the variational autoencoder (vae), for generating meaningful latent features from original data. here, we report the first study of a semi-supervised approach using vae-encoded tumor transcriptome features and regularized gradient boosted decision trees (xgboost) to predict chemotherapy drug response for five cancer types: colon adenocarcinoma, pancreatic adenocarcinoma, bladder carcinoma, sarcoma, and breast invasive carcinoma.
results: we found: ( ) vae-encoding of the tumor transcriptome preserves the cancer type identity of the tumor, suggesting preservation of biologically relevant information; and ( ) as a feature-set for supervised classification to predict response-to-chemotherapy, the unsupervised vae encoding of the tumor’s gene expression profile leads to better area under the receiver operating characteristic curve (auroc) classification performance than either the original gene expression profile or the pca principal components of the gene expression profile, in four out of five cancer types that we tested.
availability: github.com/athed/vae_for_chemotherapy_drug_response_prediction
contact: ramseyst@oregonstate.edu
supplementary information: supplementary data are available at bioinformatics online.
introduction
although chemotherapy is a mainstay of treatment for aggressive cancers, many agents have serious side effects (airley, ). whether or not chemotherapy will provide a net benefit to a patient depends in large part on whether the malignancy responds to the treatment. chemotherapy is often administered in cycles (skeel, ), leading to multiple opportunities where treatment appropriateness may be (re-)assessed (chabner and longo, ). currently, the medical cost-benefit of chemotherapy (versus a non-pharmaceutical approach) is assessed in light of patient health status, expected therapeutic tolerance, and tumor pathological classification (kaestner and sewell, ; gurney, ). for many cancer types, there is a broad spectrum of cases where the decision of whether or not to undergo or continue chemotherapy is difficult (corrie, ; whelan et al., ; malfuson et al., ). the development of a quantitative model that could predict, based on a specific tumor’s molecular signature, whether or not the tumor will respond to chemotherapy would have significant clinical utility and would potentially improve patient quality-of-life.
moreover, an advance in machine-learning methods for the response-to-chemotherapy prediction problem (chiu et al., ; geeleher et al., ) would have potential crossover benefits for other prediction problems in precision medicine. oncogenesis is driven by alterations in the somatic genome and epigenome in cancer cells (weir et al., ); however, the somatic genetic or epigenetic determinants of response to chemotherapy are also thought to exert measurable effects on gene expression in the tumor.
© the author . published by oxford university press. all rights reserved. for permissions, please e-mail: journals.permissions@oup.com
the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by . international license (http://creativecommons.org/licenses/by/ . /). this version posted january , . biorxiv preprint doi: https://doi.org/ . / . . .
consistent with this theory, studies of various cancer types have demonstrated that biomarkers identified from systematic measurement of the patient’s cancer transcriptome or proteome correlate with the probability that a tumor will respond to chemotherapy, for example, a five-protein signature in breast cancer (gámez-pozo et al., ), - and -gene signatures in rectal cancer (casado et al., ; del rio et al., ), and a -gene signature in liver cancer (kurokawa et al., ). taken together, the findings from such “omics” biomarker studies suggest that rna sequencing (rna-seq; wang et al., )-based transcriptome measurements of tumor samples labeled with clinical response can be used to train machine-learning classifiers for predicting response to chemotherapy.
however, the accuracy of such models is presently limited by the small number of available training cases that are labeled for clinical outcome, given the large size of the transcriptome (∼ k genes; frankish et al., ) and the significant intertumoral variance of gene expression. for typical cancers, most of the profiled tumor transcriptomes are not labeled for chemotherapeutic response; the ratio of such unlabeled to labeled tumor datasets in the cancer genome atlas (tcga) dataset (hutter and zenklusen, ) ranges from – , depending on the cancer type. while using (exclusively) supervised learning methods for the response-to-chemotherapy prediction problem has been a sensible first step, unlabeled data are a substantial resource that could, in the context of a semi-supervised approach, reveal multivariate structure or patterns that could ultimately improve predictive accuracy. semi-supervised approaches that fuse unsupervised data reduction methods (such as principal components analysis, or pca) for low-dimensional embedding with supervised methods (such as decision trees) for prediction have proved beneficial in problems where large unlabeled datasets are available; for example, a pca-xgboost method has been previously used in finance (wen and huang, ), and an independent components analysis-based method has been used to classify electroencephalographic signals (qin et al., ). multiple studies (an and cho, ; li and she, ; bouchacourt et al., ; kipf and welling, ) have established the power of the variational autoencoder (vae; kingma and welling ( ); jimenez rezende et al. ( )), an unsupervised nonlinear data embedding model with two deep neural networks oppositely connected through a low-dimensional probabilistic latent space, for finding meaningful and useful latent features in high-dimensional data.
in the context of cancer bioinformatics, vaes have been variously used to (i) model cancer gene expression and capture biologically-relevant features using the tcga pan-cancer project rna-seq dataset (way and greene, ); (ii) find encodings that correlate with biological features such as patient sex and tumor type (titus et al., ); (iii) find encodings that can be used to predict gene inactivation in cancer (way and greene, ); and (iv) obtain an encoding that is predictive of chemotherapy resistance (george and lio, ). based on their exploration of multiple vae architectures for predicting gene inactivation in a pan-cancer dataset, way & greene reported ( ) biological insights obtained from the latent-space embeddings learned by vaes. george and lio ( ) used a vae-based, fully unsupervised approach to encode ovarian tumor transcriptomes and obtained latent-space features that were associated with response to chemotherapy. these studies suggest that a tumor transcriptome vae may be broadly useful for the response-to-chemotherapy prediction problem and they set the stage for the present multi-cancer investigation of the utility of the tumor transcriptome vae in precision oncology. given previous reports of success using a vae to obtain useful low-dimensional encodings of transcriptome data (dong et al., ; way and greene, ; way and greene, ), in this work, we first sought to ascertain to what extent a vae encoding of tumor transcriptome data would preserve biological characteristics (spanning multiple genes at a time that have coordinated variation across tumors) that are associated with distinct cancer types. to answer this question, we trained a pan-cancer transcriptome vae and used it to encode tcga tumor rna-seq data from , tumors comprising different cancer types, focusing on the top , most variable genes.
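selecting the most variable genes, as mentioned above, can be sketched in a few lines. this is only an illustration: the text does not state which variability measure was used, so population variance is assumed here, and the function name and input layout are hypothetical.

```python
from statistics import pvariance


def top_variable_genes(expression, k):
    """expression: dict of gene name -> list of expression values (one per tumor).

    Returns the k gene names with the largest variance across tumors.
    Ties are broken alphabetically so the result is deterministic.
    """
    variances = {gene: pvariance(values) for gene, values in expression.items()}
    ranked = sorted(variances, key=lambda gene: (-variances[gene], gene))
    return ranked[:k]
```

for example, with three genes whose values across three tumors are constant, widely spread, and narrowly spread, `top_variable_genes(expression, 2)` returns the widely spread gene followed by the narrowly spread one.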
we trained the vae using an efficient contemporary optimization engine (adam) to find the vae coefficient values that together balance reconstruction loss and desired latent-space distributional shape. we applied an unsupervised two-dimensional embedding method (t-distributed stochastic neighbor embedding, or t-sne) directly to the tumor transcriptome data and to the vae-embedded tumor transcriptome data, and mapped clusters of tumors by cancer type across the two t-sne embeddings. we found (sec. . ) that the vae preserves the clustering of tumors of the same cancer type, suggesting biological fidelity in the components of the vae embedding. next, to set the stage for a semi-supervised approach for predicting cancer response to chemotherapy, we selected five cancer types (breast, bladder, colon, pancreatic, and sarcoma) based on sufficient availability of clinically labeled data and then defined three different vae architectures: vae- , which we used to obtain feature data for bladder, breast, and pancreatic cancer; vae- , for sarcoma; and vae- , for colon cancer. in order to train a vae, it is necessary to specify a reconstruction loss function; both l and l reconstruction loss have been used for training vaes in machine-learning, and we sought to clarify which is best for this application. thus, we trained each of the three vae architectures on , tumor transcriptomes from tcga, in an unsupervised fashion, separately using l loss and l loss. next, in order to label tumors for response to chemotherapy, we analyzed the available tcga clinical data regarding the outcome of pharmaceutical therapy (in most cases including chemotherapy) for each of the patients, and thereby assigned a label “responded” or “progressive” to out of the , tumors (sec. . ); the remainder of the tumors were unlabeled and thus used only during vae training.
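the balance between reconstruction loss and latent-space shape described above can be made concrete with a small numerical sketch. it assumes the standard vae formulation with a diagonal gaussian posterior and a standard normal prior, and it assumes that the two reconstruction losses compared in the text are the usual absolute-error and squared-error losses; the function names and the `beta` weight are hypothetical and are not taken from the authors' implementation.

```python
import math
import random


def reparameterize(mu, sigma, rng=random):
    """Sample z = mu + sigma * eps elementwise, with eps drawn from N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))


def l1_loss(x, x_hat):
    """Sum of absolute reconstruction errors."""
    return sum(abs(a - b) for a, b in zip(x, x_hat))


def l2_loss(x, x_hat):
    """Sum of squared reconstruction errors."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def vae_loss(x, x_hat, mu, sigma, reconstruction=l1_loss, beta=1.0):
    """Reconstruction term plus a beta-weighted KL term (beta is an assumed knob)."""
    return reconstruction(x, x_hat) + beta * kl_to_standard_normal(mu, sigma)
```

swapping `reconstruction=l2_loss` into `vae_loss` changes only the reconstruction term, which is the comparison the text sets up; an optimizer such as adam would then minimize this quantity averaged over the training tumors.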
for the labeled tumors, we used the vae-encoded latent vectors as feature data for supervised prediction of the binary label using gradient boosted decision trees (xgboost; chen and guestrin ( )). using this semi-supervised “vae-xgboost” approach, we found (sec. . ) that a vae trained using l reconstruction loss yields features that result in better classification performance (by area under the receiver operating characteristic, auroc) than a vae trained using l . in the main part of this work, using xgboost, we measured response-to-chemotherapy prediction performance for each of three tumor transcriptome-derived feature sets: (i) expression levels of the top % of genes, by intertumoral variance (a fully supervised approach); (ii) the first principal components of expression levels of “top %” genes (“semi-supervised pca-xgboost”); and (iii) vae-encoded expression levels of the top % genes (“semi-supervised vae-xgboost”, our new method, fig. ). within a cross-validation framework for auroc performance evaluation, we found (sec. . ) that for four out of five cancer types, the semi-supervised vae-xgboost approach outperformed the fully-supervised approach. moreover, for four out of the five cancer types, semi-supervised vae-xgboost outperformed semi-supervised pca-xgboost. finally, for the one cancer type for which pca-xgboost outperformed vae-xgboost, we investigated their relative performance through the lens of xgboost feature importance (sec. . ). below, we describe our results (sec. ) and the vae-xgboost method in detail (sec. ).
results
. vae encoding preserves cancer type features
given multiple reports (dolezal et al., ; esteva et al., ) that t-sne can be used to visualize the grouping of cancer types from high-dimensional molecular tumor data, we investigated the extent to which
vae encoding of tumor transcriptomes preserves data-space features that determine cancer type-specific groupings.
[fig. schematic; panel labels: gene expression data (original input, x); encoder network; mean vector (μ); variance vector (σ); reparameterize sampling; sampled latent vector (z); decoder network; reconstructed gene expression data (output, x̃); add labeled input (y); latent vector + label as input (z, y); xgboost classifier; probability of predicted label p(ỹ | z)]
fig. : overview of the vae-xgboost method that we used for predicting tumor response to chemotherapy. for each tumor t, the encoder’s input vector xt contains the levels of the top % of genes by intertumoral gene expression variance (sec. . ). each network has multiple fully connected dense layers (sec. . ). the encoder outputs two vectors of configurable latent variable dimension h ≪ m (sec. . ): a vector of means µ and a vector of standard deviations σ that parameterize the multivariate normal latent-space vector z|xt (sec. . ). the sampled encoding z|xt = zt is passed to the decoding neural network (decoder), whose architecture is identical to (with inversion) that of the encoder network. the sampled latent-space vector zt is passed to xgboost for supervised classification to predict response to chemotherapy (training label y, prediction ỹ).
in order to do so, we obtained (sec. . ) from the tcga data portal rna-seq transcriptome data for , tumors labeled for different cancer types (listed in fig. ). as a baseline view of transcriptome-based cancer type groupings, we generated a two-dimensional embedding of the , tumor samples by applying t-sne (sec. .
) to the expression levels of the top , most variable genes, yielding distinct clusters (fig. a). next, we trained (sec. . ) a vae to encode the expression levels of the , most variable genes in each of , tumors into , points in a -dimensional latent space. an unsupervised t-sne visualization (fig. b) of the vae-encoded tumor transcriptome data was remarkably similar in structure to the t-sne visualization of the , -dimensional original dataset, with intercluster distances for all pairs of clusters correlated between the two t-sne plots (r = . ; see fig. s ). this analysis indicated that the vae encoding preserves data-space features that distinguish individual cancer types.
. obtaining a labeled tumor transcriptome dataset
having demonstrated that the vae can efficiently encode tumor transcriptomes while preserving features that distinguish different cancer types, and to set the stage for implementing a semi-supervised approach for predicting response to chemotherapy, we obtained a five-cancer-type tumor transcriptome dataset with a significant subset of the tumors labeled for “response to chemotherapy”, as described below. we obtained transcriptomes of tumors across five cancer types [colon adenocarcinoma (coad), pancreatic adenocarcinoma (paad), bladder carcinoma (blca), sarcoma (sarc), and breast invasive carcinoma (brca); see table ] that we selected based on availability of a sufficient amount of labeled data in tcga (see sec. . ) and generated binary clinical labels for them corresponding to “responded” or “progressive” (see sec. . ). among these tumors, the class balance ratio, i.e., the ratio of responding tumors to progressive disease tumors, ranged from a low of . for pancreatic cancer to a high of . for breast cancer. .
l loss is better than l loss for this application
having obtained , tumor transcriptomes across five cancer types with of the tumors labeled for response to chemotherapy, we next sought to determine which type of vae reconstruction loss function (l loss or l loss) would yield transcriptome encodings that are most amenable to accurate xgboost-based prediction of response to chemotherapy. on the , tumor transcriptomes, we trained two sets of cancer type-specific vaes (see sec. . ) using l and l loss functions, respectively. we used the l and l vaes to encode the labeled tumor transcriptomes (the top % most variable genes in each cancer type, merged across the five cancers, for a total of , genes) spanning the five cancer types, yielding (for each cancer type) two feature matrices (one for l loss and one for l loss) that we separately evaluated for xgboost prediction (sec. . ) of the binary response-to-chemotherapy class label. by test-set area under the receiver operating characteristic (auroc; sec. . ), averaged across the five cancers, we found (fig. ) that the features that were generated by the l vaes led to . % better (p < − , welch’s t-test) classification performance than the features generated by the l vaes, and thus, for all subsequent analyses, we used vaes trained with l loss.
. chemotherapy drug response classification result
having selected l reconstruction loss for training vaes to encode tumor transcriptomes for predicting response-to-chemotherapy, we focused on the key question of whether (and to what extent) a semi-supervised approach using the vae can outperform (in terms of predictive accuracy) a fully supervised approach or a semi-supervised approach based on a traditional dimensional reduction technique (principal components analysis, pca).
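because classification performance throughout this section is reported as test-set auroc, a short sketch of how that quantity can be computed may be helpful. the function below uses the rank-based (mann-whitney) formulation, in which auroc is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. it is a minimal illustration with a hypothetical function name, not the evaluation code used in the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for the positive class, 0 for the negative class.
    scores: predicted scores, higher meaning more likely positive.
    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

in a cross-validation setting, a function like this would be applied to the held-out fold of each split, and the resulting per-fold values would then be averaged or compared between methods.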
in brief, our vae-based semi-supervised approach involves three steps: (i) training a vae to encode clinically unlabeled tumor transcriptomes (for the top % most variable genes) for a single cancer type into a low-dimensional space (sec. . ); (ii) using that vae to obtain latent-space encodings for the tumor transcriptomes that are labeled for a relevant clinical endpoint (in this work, response to chemotherapy); and (iii) training and testing a supervised classifier (in this work, xgboost binary classification) using the latent-space encodings as feature data. to address the question of whether this vae-based, semi-supervised (vae-xgboost) approach can outperform a fully supervised approach, we compared the performance of the vae-xgboost method to a fully supervised approach in which we applied xgboost directly to the tumor expression levels of the top % most variable genes ( , genes) as feature data. in the same analysis, to address the question of whether the vae-xgboost method could outperform a semi-supervised approach based on pca dimensional reduction, we compared the vae-xgboost method to the pca-xgboost method. we carried out this analysis for each of the five cancer types separately, using the set of cancer type-specific labeled tumors (totaling labeled tumors). we measured performance using test-set auroc in a cross-validation framework (sec. . ).

table . numbers of samples with a chemotherapy response record for each cancer type (n.b., the total number of labeled tumor samples exceeds the total number of patients because some patients had multiple tumors). after each cancer type, its tcga abbreviation is shown in parentheses.
columns: cancer type; total number of samples (labeled and unlabeled); number of labeled samples; proportion of labeled samples; class balance ratio (responding/progressive).
rows: breast invasive carcinoma (brca); colon adenocarcinomas (coad); bladder carcinoma (blca); pancreatic adenocarcinoma (paad); sarcoma (sarc); sum.

table . quantitative auroc performances of xgboost (“raw data”), pca-xgboost (“pca”), and vae-xgboost (“vae”), along with pairwise comparisons.
columns: cancer type; auroc (mean) for vae, pca, and raw data; p (welch’s t-test) for vae versus raw data and vae versus pca; p (wilcoxon signed-rank test) for vae versus raw data and vae versus pca.
rows: brca; coad; blca; paad; sarc.

for four out of five cancer types (breast, colon, pancreatic, and sarcoma), in terms of test-set auroc, the vae-xgboost approach outperformed the fully-supervised approach of applying xgboost directly to the expression levels of the tumors’ top % most variable genes (fig. ), by both welch’s t-test and wilcoxon’s signed-rank test (table ); for blca, the semi-supervised vae-xgboost and fully-supervised models’ performances were statistically indistinguishable. additionally, for four out of five cancer types (bladder, breast, pancreatic, and sarcoma), the semi-supervised vae-xgboost method significantly outperformed the semi-supervised pca-xgboost method (fig. and table ). the five-cancer average auroc for vae-xgboost was . , a performance gain of . % over the five-cancer average auroc for pca-xgboost ( . ) and a gain of . % over the fully-supervised model’s average ( . ). notably, a single deep vae architecture (vae- , which had a -dimensional latent space and six layers in the encoder; see sec. . 
) yielded latent-space encodings that outperformed semi-supervised pca-xgboost for three cancer types (bladder, breast, and pancreatic). . pca & vae feature importance scores, for coad having established that the semi-supervised vae-xgboost outperforms the semi-supervised pca-xgboost approach for tumor transcriptome- based prediction of response to chemotherapy for four out of five cancer types, we sought to understand the basis for the higher performance of pca-xgboost over vae-xgboost on the fifth cancer type, colon adenocarcinoma (coad). specifically, we investigated whether the strong performance of pca-xgboost on coad is attributable to differences in the distributions of xgboost feature importance scores (sec. . ) of the pca features versus vae latent-space features. we found that the distribution of feature importance scores (as a function of rank) was more sharply peaked at lowest-ranked features in the vae than in the pca (fig. ), suggesting that the performance gain with pca reflects a broader spectrum of informative features for that particular cancer type. discussion as far as we are aware, this work is the first report of a broad (five- cancer) investigation of the potential for a vae-based, semi-supervised approach for predicting response to chemotherapy. across the five cancer types that we studied, the ratio of responding tumors to progressive disease tumors ranged from a low of . for pancreatic cancer to a high of . for breast cancer, reflecting a broad range of resistances to standard-of-care chemotherapy. our results clearly demonstrate the utility of the vae for compressing high-dimensional data to a continuous, low-dimensional latent space while retaining features that are essential for distinguishing different cancer types and for predicting response to chemotherapy. nevertheless, three limitations of this work bear noting. the first limitation concerns the type(s) of tumor “omics” data from which features are derived for the predictive model. 
while in this work we focused on tumor transcriptome data, which can be measured with high precision over a wide dynamic range of transcript abundances by rna-seq, we note that tcga datasets of tumor somatic mutations and copy number alteration events are also available (hutter and zenklusen, ). given the voluminous literature on the use of tumor somatic genomic data for precision cancer diagnosis (mitchel et al., ; zhang et al., ; lee et al., ), tumor dna datasets are fertile ground for developing a semi-supervised, multi-omics model for predicting response to chemotherapy.

second, we noted that for decision tree-based response-to-chemotherapy prediction, the performance of vae-encoded transcriptome features is somewhat sensitive to the type of normalization used for the input data (data not shown). we explored various types of normalization for the rna-seq data, including standardization of log counts and using fpkm data; we ultimately chose min-max-normalized log total-count-normalized counts (sec. . ) for the gene expression levels to be used to derive features. however, there are additional transcript quantification methods (evans et al., ) that could be explored in the context of finding optimal tumor transcriptome vae encodings for precision oncology. a similar comment applies to the specific form of the reconstruction loss function: in our analysis, features from the vae trained with l1 loss clearly (across five cancers) outperformed those from the vae trained with l2 loss, and thus, consistent with way and greene ( ), we used l1 loss for the vae that we used to address the main question of this work (sec. . ) as well as the pan-cancer t-sne analysis (sec. . ).

fig. : marks represent tumor transcriptomes visualized using t-sne, with colors representing cancer types. (a) original gene expression data of the top , most variable genes. (b) vae compressed gene expression data. red rectangles denote the five cancer types selected for chemotherapy response classification (sec. . ).

fig. : average auroc results over five different types of cancer, by loss type (x-axis: loss type; y-axis: auroc). squares, mean values; bars, % confidence interval (c.i.).

the third limitation relates to the vae architecture. while it is promising that a single deep vae architecture (vae- , with a -dimensional latent space and six fully-connected layers) yielded features that outperformed pca and the original rna-seq feature data for three different cancer types (bladder, breast, and pancreatic), for colon cancer and sarcoma, it was necessary to use shallower (two-layer) vae architectures with bigger latent space dimensions ( and , respectively). of the five cancers studied, colon cancer and sarcoma had the lowest proportions of labeled samples ( . and . respectively; see table ). our findings suggest that for some cancers, a deep, low-latent-dimension vae architecture yields optimal features for predicting response, while for other cancers, a shallow, medium-sized-latent-dimension vae architecture is more effective. more study with larger datasets will be required in order to determine whether a single vae architecture could be successfully used for general-purpose tumor transcriptome feature extraction for precision oncology. 
while our results show promise for the vae in the context of a semi-supervised approach for response-to-chemotherapy prediction, for colon cancer, the vae-xgboost method did not outperform pca-xgboost (though it did outperform the fully supervised approach of xgboost trained on the unencoded gene expression data). one possible explanation for the colon cancer-specific superior performance of pca features over vae features for predicting response to chemotherapy is that, while (for coad) feature importance for the vae features is sharply peaked for the first few features and falls off fairly rapidly with feature rank, the pca features have a much flatter distribution of relative feature importance (fig. ). follow-on studies with larger datasets will be required to delineate under what circumstances transcriptome vae encodings will prove superior to linear principal components.

conclusions

for four of the five cancer types that we studied, the semi-supervised vae-xgboost approach significantly outperformed a semi-supervised pca-xgboost approach for tumor transcriptome-based prediction of response to chemotherapy, reaching a top auroc of . for pancreatic adenocarcinoma. for four out of five cancer types, the semi-supervised vae-xgboost approach significantly outperformed a fully-supervised approach consisting of xgboost applied to the expression levels of the top % most variably expressed genes. given high-dimensional “omics” data, the vae is a powerful tool for obtaining a nonlinear low-dimensional embedding; it yields features that retain biological patterns that distinguish between different types of cancer and that enable more accurate tumor transcriptome-based prediction of response to chemotherapy than would be possible using the original data or their principal components.

fig. : test-set performance of the three models for predicting response to chemotherapy, across five cancer types (panels: sarc, coad, paad, blca, brca; y-axis: auroc). group abbreviations: “pca( )”, the pca-xgboost semi-supervised method ( : number of principal components used as features); “raw( , )”, the fully-supervised xgboost method ( , : number of genes used as features); and “vae(n)”, the vae-xgboost semi-supervised method (n: dimension of the latent feature space). marks correspond to individual replications of five-fold cross-validation; solid squares denote mean; bars indicate % c.i.; colors denote the type of feature-set (sec. . ): red, “pca”; olive, “raw”; cyan, vae- ; magenta, vae- ; green, vae- .

methods

we carried out all data processing and machine-learning tasks on a dell xps workstation equipped with an nvidia titan rtx gpu and running the ubuntu gnu/linux operating system version . . all of the analysis code that we implemented was executed in python version . . , except that we used r version . . for statistical analysis of auroc values (sec. . ), gene-level mad calculations (sec. . ) and plotting (sec. . ). we carried out all statistical tests using the r computing environment (version . . ) and using the r software package stats version . . . . 
gene expression data

from the xena data portal (goldman et al., ), we obtained tcga level tumor rna-seq transcriptome data of cancer types (totaling , tumors) and, for the response-to-chemotherapy prediction problem, five cancer types [colon adenocarcinomas (coad), pancreatic adenocarcinoma (paad), bladder carcinoma (blca), sarcoma (sarc), and breast invasive carcinoma (brca)] totaling , tumors. we selected the five cancer types based on two criteria: (i) a sufficient number (at least ) of paired tumor-transcriptome and clinical data samples available for the cancer type; and (ii) a sufficient number (at least ) of tumor transcriptome samples available (regardless of the clinical data availability) for the cancer type.

fig. : bars indicate the sum (over replications) of xgboost feature importance scores (x-axis: sum of importance; y-axis: rank of features). “group” indicates the low-dimensional embedding method used (vae or pca). bars separately ordered from highest to lowest (only top most important features shown).

we obtained both the rna-seq (gene-level) total-read-count-normalized log ( +c) read counts and normalized (fragments per kilobase of transcript per million mapped reads, fpkm (dillies et al., )) expression data for , human genes. to focus the machine-learning on the portion of the tumor transcriptome that had the most variation from tumor to tumor, we identified the top % most variable genes as measured by the median absolute deviation (mad), across tumors, of gene expression in terms of fpkm (we used fpkm for this purpose in order to mitigate bias due to read length and tumor-specific depth of sequencing). for deriving feature-sets for xgboost prediction directly based on transcript abundances or based on vae or pca encoding, the % criterion applied to each of the five cancer types yielded a set of , genes. we computed mad using the r package stats version . . (r core team, ) with default parameters. 
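the mad-based gene filter described above can be sketched in a few lines. this is a minimal illustration only, not the analysis code used in this work (which computed mad in r): the function names, the toy expression values, and the 50% cutoff are all hypothetical, and the sketch uses the unscaled mad. r's `mad` multiplies by a constant (1.4826) by default, which rescales every gene equally and so leaves the ranking unchanged.

```python
import statistics


def median_absolute_deviation(values):
    """Unscaled MAD: median of absolute deviations from the median."""
    center = statistics.median(values)
    return statistics.median(abs(v - center) for v in values)


def top_variable_genes(expression, fraction):
    """Return the top `fraction` of gene names, ranked by MAD across tumors.

    expression: dict mapping gene name -> list of per-tumor expression values.
    fraction:   share of genes to keep (hypothetical parameter, 0 < fraction <= 1).
    """
    ranked = sorted(
        expression,
        key=lambda gene: median_absolute_deviation(expression[gene]),
        reverse=True,
    )
    n_keep = max(1, round(fraction * len(ranked)))
    return ranked[:n_keep]


# Hypothetical toy data: four genes measured in four tumors.
toy_expression = {
    "gene_a": [1.0, 1.1, 0.9, 1.0],
    "gene_b": [0.0, 5.0, 10.0, 15.0],
    "gene_c": [2.0, 2.5, 3.0, 3.5],
    "gene_d": [4.0, 4.0, 4.0, 4.0],
}
selected = top_variable_genes(toy_expression, fraction=0.5)
```

in this toy example the two genes with the widest spread across tumors (`gene_b` and `gene_c`) are retained; a constant gene such as `gene_d` has a mad of zero and is always ranked last.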
after the variance-filtering step, we used the log ( + c) of total-count-normalized count values for the top- % highest-variance genes (selected as described above) to obtain (or encode) feature values. we compared the performance (in terms of minimizing the vae reconstruction loss; see sec. . ) of different feature scaling methods (no scaling, min-max normalization, and standardization (kreyszig et al., )) and selected min-max normalization as the method that we used to rescale gene-level count data for input into the vae.

. t-distributed stochastic neighbor embedding (t-sne)

we computed t-sne embedding components of the tumors using the function sklearn.manifold.TSNE from the python software package scikit-learn version . . with parameters init = "pca", perplexity = , learning_rate = , and n_iter = . for plotting the tumor transcriptome t-sne embeddings, we used the r software package ggplot2 version . . .

. variational autoencoder (vae)

an autoencoder is a type of model that combines “encoder” and “decoder” neural networks to learn a low-dimensional continuous data encoding from which the input signal can be approximately reconstructed (kramer, ). a key advantage of an autoencoder is that it is unsupervised, i.e., it can be trained without labeled examples. unlike classical autoencoders (e.g., sparse or denoising autoencoders), the variational autoencoder (vae) is a generative probabilistic model which maps an input vector to a latent-space random variable (r.v.). below, we mathematically define the vae. 
let $T$ denote the set of tumors for which the vae is to be fit to the tumor transcriptomes (with $n \equiv |T|$) and let $m$ denote the number of genes for which transcript abundances are used to represent the tumor transcriptome. after min-max transformation of the tumor transcriptome measurements (sec. . ), each tumor’s transcriptome is represented as a vector $x \in [0,1]^m$. let $X$ denote the random variable representing the population distribution from which tumor transcriptomes are sampled, and let $\mathbf{X} \in [0,1]^{m \times n}$ represent the composite matrix of all sampled tumor transcriptomes. we aim to learn a vae that will comprise an encoder and decoder, with the encoder consisting of mean and variance functions $\mu : [0,1]^m \to \mathbb{R}^h$ and $\sigma : [0,1]^m \to \mathbb{R}^h_{+}$, respectively. together, $\mu$ and $\sigma$ map the tumor transcriptome vector $x_t$ to an $h$-dimensional r.v. $Z|x_t$,

$$Z|x_t \sim \mathcal{N}\big(\mu(x_t),\ \mathrm{diag}(\sigma(x_t))\big), \tag{1}$$

where $\mathrm{diag}(v)$ is a matrix whose diagonal elements are the elements of the vector $v$. the decoder is a function $g : \mathbb{R}^h \to [0,1]^m$ that, for an outcome $Z|x_t = z_t \in \mathbb{R}^h$, maps

$$g : z_t \mapsto g(z_t) \equiv \tilde{x}_t; \tag{2}$$

the tilde on $\tilde{x}_t$ denotes that it is the decoded data for the tumor transcriptome $x_t$. a good autoencoder should have low reconstruction error $\mathcal{L}$, which is convenient to define in terms of the p-norm of the difference between the tumor transcriptome data $x_t$ and the reconstructed data $\tilde{x}_t$, i.e., $\|x_t - \tilde{x}_t\|_p^p$, where $\|\cdot\|_p$ denotes the p-norm. however, this definition of the reconstruction error is only deterministic in the context of a specific outcome $Z|x_t = z_t$. thus, it is conventional to define the reconstruction error as an expectation value over outcomes of $Z|x_t$,

$$\mathcal{L}|(X = x_t) \equiv \mathbb{E}_{Z|x_t = z_t}\big(\|x_t - g(z_t)\|_p^p\big), \tag{3}$$

where $\mathbb{E}_{\Omega}$ represents an expectation value over a space of outcomes $\Omega$. it should be noted that the above representation of the reconstruction error is in terms of the outcome, $z_t$, of a r.v. ($Z|x_t$) whose distributional parameter functions $\mu$ and $\sigma$ have parameters (neural network coefficients) that will be fitted. 
because eq. 3 is ill-suited to backpropagation, it is helpful to recast it in terms of a new random variable $E_t$ that depends on $Z|x_t$ by

$$E_t \equiv \big(\mathrm{diag}(\sigma(x_t))\big)^{-1/2}\,\big(Z|x_t - \mu(x_t)\big). \tag{4}$$

it follows from eq. 1 and eq. 4 that $E_t$ is standard multivariate normal,

$$E_t \sim \mathcal{N}(0, I), \tag{5}$$

where $I$ is the $h \times h$ identity matrix, and thus, $E_t$ does not depend on $\mu$, $\sigma$, or $t$. we therefore drop the subscript $t$ and simply denote the rescaled latent-space random variable as $E$. solving eq. 4 for $Z|x_t$ and applying it to eq. 3, the reconstruction error $\mathcal{L}|(X = x_t)$ can be represented by

$$\mathcal{L}|(X = x_t) = \mathbb{E}_{E}\Big(\big\|x_t - g\big(\mu(x_t) + \sqrt{\mathrm{diag}(\sigma(x_t))}\,E\big)\big\|_p^p\Big), \tag{6}$$

which is amenable to backpropagation because the only r.v. in it is $E$, whose distributional parameters do not depend on the neural network coefficients that we will be varying. in practice, rather than computing the multivariate integral over outcomes of $E$, $\mathcal{L}|(X = x_t)$ is typically approximated by averaging over a limited number $J$ of samples from $E$,

$$\mathcal{L}|(X = x_t) \simeq \Big\langle \big\|x_t - g\big(\mu(x_t) + \sqrt{\mathrm{diag}(\sigma(x_t))}\,\epsilon_j\big)\big\|_p^p \Big\rangle_j, \tag{7}$$

where $\langle\,\rangle_j$ denotes the average over $j \in \{1, \ldots, J\}$ and $\epsilon_j$ is sample $j$ from $E$. following way and greene ( ), we used a number of samples that is equivalent to the dimension of the transcriptome, i.e., $J = m$. for the case of $p = 2$ (i.e., l2 norm), minimizing $\mathcal{L}|(X = x_t)$ as defined above is equivalent to maximizing the expectation value of the log-likelihood $\log(p(g(Z) = x_t \mid X = x_t))$. however, following way and greene ( ) and consistent with empirical evidence (sec. . ), for our five-cancer study of the utility of a vae-based approach for response-to-chemotherapy prediction, as well as for the pan-cancer t-sne analysis (sec. . ), we chose to use l1 reconstruction loss, i.e., $p = 1$ in the reconstruction error. the reconstruction loss measures bias error, whose minimization must be balanced against the simultaneous goal of controlling variance error through regularization. 
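the sampled approximation of the reconstruction error described above can be illustrated with a small, dependency-free sketch. it is not the implementation used in this work: the decoder here is a hypothetical stand-in (an element-wise logistic function, so the toy latent dimension equals the data dimension), and the names `decoder`, `reconstruction_loss`, `n_samples`, and the example vectors are assumptions made for illustration. as in the text, `var` holds the diagonal of the latent covariance, so each sample is drawn as mean plus square-root-of-variance times standard normal noise.

```python
import math
import random


def decoder(z):
    """Hypothetical stand-in for g: maps each latent coordinate into [0, 1]."""
    return [1.0 / (1.0 + math.exp(-zi)) for zi in z]


def reconstruction_loss(x, mu, var, p=1, n_samples=100, rng=None):
    """Monte Carlo estimate of E[ ||x - g(mu + sqrt(var) * eps)||_p^p ], eps ~ N(0, I).

    x:   observed vector with entries in [0, 1]
    mu:  encoder mean vector for this observation
    var: encoder variance vector (diagonal of the latent covariance)
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_samples):
        # Reparameterization: all randomness comes from eps, not from mu or var.
        eps = [rng.gauss(0.0, 1.0) for _ in mu]
        z = [m + math.sqrt(v) * e for m, v, e in zip(mu, var, eps)]
        x_tilde = decoder(z)
        total += sum(abs(xi - xt) ** p for xi, xt in zip(x, x_tilde))
    return total / n_samples


# Hypothetical example: a two-dimensional observation with an l1 (p = 1) loss.
estimate = reconstruction_loss(x=[0.2, 0.9], mu=[0.3, -0.1], var=[1.0, 0.5], p=1, n_samples=200)
```

because the noise is sampled independently of the mean and variance, the estimate is a smooth function of those two quantities for any fixed set of noise draws, which is the property that makes the recast form suitable for gradient-based fitting. with zero variance the estimate collapses to the deterministic p-norm error of the decoded mean.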
in the vae, regularization requires incentivizing (in the learning of $\mu$, $\sigma$, and $g$) the latent space distributions of $Z|X$ to be close to standard multivariate normal. this is accomplished by assigning a penalty based on the kullback-leibler divergence between the distribution of $Z|x_t$ and the target distribution of $E$, represented by $D_{KL}(p(Z|x_t)\,\|\,p(E))$. this regularization is analytically tractable (duchi, ), and for a given tumor $t$ yields (see supplementary note, eq. s ) the following regularization function:

$$D_{KL}\big(p(Z|x_t)\,\big\|\,p(E)\big) = \tfrac{1}{2}\Big(\|\mu(x_t)\|_2^2 + \|\sigma(x_t)\|_1 - \mathbf{1}^{\top}\log(\sigma(x_t)) - h\Big), \tag{8}$$

where $\log(\sigma(x_t))$ denotes an element-wise log, $\mathbf{1}^{\top}$ sums the elements of a vector, $\|\cdot\|_1$ is the l1 norm, and $\|\cdot\|_2$ is the l2 norm. fitting the vae to $\mathbf{X}$ requires selecting $\mu$, $\sigma$, and $g$ from their respective function spaces; in practice, we search over functions that can be represented using a neural network for $\mu$ and $\sigma$ (parameterized by the vector $\theta$) and a neural network for the function $g$ (parameterized by the vector $\phi$). exploring the space of functions $\mu_\theta$, $\sigma_\theta$, and $g_\phi$ corresponds to computationally searching for the vector pair $(\hat{\theta}, \hat{\phi})$ that together minimize the joint (over all tumors) sum of the tumor-specific reconstruction loss and the regularization penalty,

$$(\hat{\theta}, \hat{\phi}) = \operatorname*{argmin}_{(\theta,\phi)} \sum_{t \in T} \Big[\mathcal{L}|(X = x_t) + D_{KL}\big(p(Z|x_t)\,\big\|\,p(E)\big)\Big]. \tag{9}$$

applying eqs. 7, 8, and 9, and setting $p = 1$ as discussed above, we obtain the explicit formula for fitting a vae to $\mathbf{X}$,

$$(\hat{\theta}, \hat{\phi}) = \operatorname*{argmin}_{(\theta,\phi)} \sum_{t \in T} \Big[\frac{1}{J}\sum_{j=1}^{J} \big\|x_t - g_\phi\big(\mu_\theta(x_t) + \sqrt{\mathrm{diag}(\sigma_\theta(x_t))}\,\epsilon_j\big)\big\|_1 + \tfrac{1}{2}\Big(\|\mu_\theta(x_t)\|_2^2 + \|\sigma_\theta(x_t)\|_1 - \mathbf{1}^{\top}\log(\sigma_\theta(x_t)) - h\Big)\Big]. \tag{10}$$

we implemented eq. 10 in tensorflow version . . with keras version . . as the model-level library. we solved eq. 10 using the adam optimization algorithm (kingma and ba, ) (with batch normalization) from the python package keras-gpu version . . with parameters learning_rate = × − , beta_1 = . , and beta_2 = . , to obtain $(\hat{\theta}, \hat{\phi})$. 
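as a numerical companion to the kullback-leibler regularization term above, the sketch below evaluates the standard closed form for a diagonal gaussian against a standard normal target. it is offered under stated assumptions rather than as the code used in this work: the function name is hypothetical, `var` is taken to be the vector of latent variances (the diagonal of the covariance), and the factor of one half and the additive constant follow the textbook formula, which may differ from the exact constants used in the fitted objective (such constants do not change where the penalty is minimized).

```python
import math


def kl_to_standard_normal(mu, var):
    """Closed-form KL( N(mu, diag(var)) || N(0, I) ) for a diagonal Gaussian.

    mu:  mean vector of the latent distribution
    var: variance vector (all entries must be strictly positive)
    """
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mu, var))


# The penalty vanishes only when the latent distribution already equals N(0, I).
penalty_at_target = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])

# Shifting the mean or changing the variance makes the penalty positive.
penalty_shifted_mean = kl_to_standard_normal([1.0, 0.0], [1.0, 1.0])
penalty_wide_variance = kl_to_standard_normal([0.0, 0.0], [4.0, 1.0])
```

the three example values show the role the penalty plays during fitting: it is zero at the target distribution and grows as the encoder's mean moves away from the origin or its variance moves away from one, which pulls the latent encodings toward a standard multivariate normal.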
then, for each tumor $t$, we used a single sample $Z|x_t = z_t$ from the distribution $\mathcal{N}(\mu_{\hat{\theta}}(x_t),\ \mathrm{diag}(\sigma_{\hat{\theta}}(x_t)))$ as the final latent-space encoding of the tumor to be used for supervised learning (sec. . ).

note: the functions $\mu$ and $\sigma$ are just two different outputs of the encoding neural network, differing only at the final layer, and thus for simplicity of notation we represent them as having a common parameter vector $\theta$.

. labeling tumors based on response to chemotherapy

from xena and cbioportal (cerami et al., ; gao et al., ), we obtained and combined tcga clinical data (where available) for the patients whose tumor transcriptomes we acquired (see sec. . ). from xena, we extracted the variables submitter_id.samples, therapy_type, and measure_of_response; from cbioportal, we extracted the variables sample_id, disease.free.status, and pharmaceutical.therapy.indicator. we co-analyzed the xena- and cbioportal-obtained clinical data to assign each tumor a binary label y, “responded” or “progressive”. we labeled a tumor “responded” when its clinical record had the value complete response or partial response in the measure_of_response column of the xena clinical data, or the value diseasefree in the disease.free.status column of the cbioportal clinical data, with the therapy type recorded as chemotherapy in both. we labeled a tumor “progressive” when its clinical record had the value radiographic progressive disease, clinical progressive disease, or stable disease in the xena clinical data column measure_of_response, or the value recurred/progressed in the cbioportal data column disease.free.status, with the therapy_type recorded as chemotherapy in both files. this yielded labeled tumors out of , total. a total of different drugs were used to treat the patients (see supplementary note, table s ).

. vae model architectures

we trained six transcriptome-encoding vaes based on four vae architectures: the pan-cancer vae architecture (for the -cancer unsupervised analysis, see sec. . ) and three cancer type-specific vae architectures for response-to-chemotherapy prediction (sec. . ), one of which was used for three different cancer types (blca, brca, and paad), while the other two were specific to coad and sarc, respectively. for the pan-cancer vae, we used a latent space dimension h = and three fully connected layers each for the encoder and decoder. for the cancer type-specific vae architectures, we again used the same number of fully-connected layers in the encoder as in the decoder (table ).

table . vae architectures used for predicting chemotherapy response (h, latent space dimension; “layers”, # of layers used in the encoder/decoder).
name | cancer types | h | layers
vae- | blca, brca, paad | | six
vae- | coad | | two
vae- | sarc | | two

. regularized gradient boosted decision trees (xgboost)

for predicting whether or not a tumor would respond to chemotherapy (based on its transcriptome-derived feature-set: raw, pca, or vae), we used xgboost (chen and guestrin, ), an efficient implementation of regularized gradient boosted decision trees. we used the binary classifier function XGBClassifier from the python software package xgboost version . , with gamma= . 
we tuned eight hyperparameters (table ) by exhaustive grid-search with five-fold cross-validation, using sklearn.model_selection.GridSearchCV from scikit-learn version . . . to obtain feature importance scores, we used get_score with importance_type = cover.

. area under roc curve (auroc)

for computing the auroc (i.e., the area under the curve of sensitivity versus false positive error rate), we used the function metrics.roc_auc_score from the python software package scikit-learn version . . with parameter average="weighted". we logit-transformed auroc values before testing (using two-tailed welch’s t-test and the wilcoxon signed rank test). for the l1 vs. l2 analysis (fig. ), we carried out replications of five-fold cross-validation; within each replication, across the five folds, we obtained prediction scores for each tumor from the fold in which the tumor was in the test set, enabling us to compute an overall auroc within each replication. for each training data set, we carried out replications of five-fold cross-validation, altering the random seed used to assign the split of data during cross-validation. we followed the same procedure for each of the five cancer types (blca, brca, coad, paad, sarc), as shown in the panel names of figure .

. principal component analysis (pca)

for pca, we used the function decomposition.PCA (with parameters svd_solver = "full" and n_components = . , i.e., % variance, yielding components) from the python package scikit-learn version . . . for plotting, we used matplotlib version . . .

funding

sar acknowledges support from the animal cancer foundation.

references

airley, r. ( ). cancer chemotherapy. wiley-blackwell, ny, ny.
an, j. and cho, s. ( ). variational autoencoder based anomaly detection using reconstruction probability. technical report snudm-tr- - , seoul national university.
bouchacourt, d. et al. ( ). multi-level variational autoencoder: learning disentangled representations from grouped observations. arxiv: . .
casado, e. et al. ( ). 
a combined strategy of sage and quantitative pcr provides a -gene signature that predicts preoperative chemoradiotherapy response and outcome in rectal cancer. plos one, , – .
cerami, e. et al. ( ). the cbio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. cancer discovery, , .
chabner, b. a. and longo, d. l. ( ). cancer chemotherapy and biotherapy: principles and practice. lippincott williams & wilkins, philadelphia, pa, fourth edition.
chen, t. and guestrin, c. ( ). xgboost: a scalable tree boosting system. arxiv: . .
chiu, y.-c. et al. ( ). predicting drug response of tumors from integrated genomic profiles by deep neural networks. bmc medical genomics, ( ), .
corrie, p. g. ( ). cytotoxic chemotherapy: clinical aspects. medicine, ( ), – .
del rio, m. et al. ( ). gene expression signature in advanced colorectal cancer patients select drugs and response for the use of leucovorin, fluorouracil, and irinotecan. journal of clinical oncology: official journal of the american society of clinical oncology, ( ), – .
dillies, m.-a. et al. ( ). a comprehensive evaluation of normalization methods for illumina high-throughput rna sequencing data analysis. briefings in bioinformatics, ( ), – .
dolezal, j. m. et al. ( ). diagnostic and prognostic implications of ribosomal protein transcript expression patterns in human cancers. bmc cancer, ( ), .
dong, h. et al. ( ). variational autoencoder for anti-cancer drug response prediction. arxiv: . .
duchi, j. ( ). derivations for linear algebra and optimization. technical report, stanford university.

table . xgboost classification algorithm hyperparameters and hyperparameter ranges used in grid-search tuning.
hyperparameter name | hyperparameter description | hyperparameter range
n_estimators | number of trees to fit | ( , , , . . ., )
max_depth | maximum tree depth | ( , , , . . ., )
learning_rate | boosting learning rate | ( . , . , . , . , . , . )
min_child_weight | minimum sum of instance weight needed in a child | ( , , , . . ., )
subsample | sub-sample ratio of the training instance | ( . , . , . , . . ., . )
colsample_bytree | sub-sample ratio of columns when constructing each tree | ( . , . , . , . . ., . )
reg_alpha | coefficient of l1 regularization for the node weights | ( , , , )
reg_lambda | coefficient of l2 regularization for the node weights | ( , , . . ., )

esteva, a. et al. ( ). dermatologist-level classification of skin cancer with deep neural networks. nature, ( ), – .
evans, c. et al. ( ). selecting between-sample rna-seq normalization methods from the perspective of their assumptions. briefings in bioinformatics, ( ), – .
frankish, a. et al. ( ). gencode reference annotation for the human and mouse genomes. nucleic acids research, , d –d .
gao, j. et al. ( ). integrative analysis of complex cancer genomics and clinical profiles using the cbioportal. science signaling, , .
geeleher, p. et al. ( ). clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines. genome biology, ( ), r .
george, t. m. and lio, p. ( ). unsupervised machine learning for data encoding applied to ovarian cancer transcriptomes. biorxiv; doi: . / .
goldman, m. et al. ( ). the ucsc xena platform for public and private cancer genomics data visualization and interpretation. biorxiv; doi: . / .
gurney, h. ( ). how to calculate the dose of chemotherapy. british journal of cancer, , – .
gámez-pozo, a. et al. ( ). 
prediction of adjuvant chemotherapy response in triple negative breast cancer with discovery and targeted proteomics. plos one, , . hutter, c. and zenklusen, j. c. ( ). the cancer genome atlas: creating lasting value beyond its data. cell, ( ), – . jimenez rezende, d. et al. ( ). stochastic backpropagation and approximate inference in deep generative models. arxiv: . . kaestner, s. a. and sewell, g. j. ( ). chemotherapy dosing part i: scientific basis for current practice and use of body surface area. clinical oncology, , – . kingma, d. p. and ba, j. ( ). adam: a method for stochastic optimization. arxiv: . . kingma, d. p. and welling, m. ( ). auto-encoding variational bayes. arxiv, page arxiv: . . kipf, t. n. and welling, m. ( ). variational graph auto-encoders. arxiv: . . kramer, m. a. ( ). nonlinear principal component analysis using autoassociative neural networks. aiche journal, ( ), – . kreyszig, e. et al. ( ). advanced engineering mathematics. wiley, hoboken, nj, tenth edition. kurokawa, y. et al. ( ). molecular prediction of response to - fluorouracil and interferon-α combination chemotherapy in advanced hepatocellular carcinoma. aacr, ( ), – . lee, k. et al. ( ). cpem: accurate cancer type classification based on somatic alterations using an ensemble of a random forest and a deep neural network. scientific reports, ( ), . li, x. and she, j. ( ). collaborative variational autoencoder for recommender systems. in proceedings of the rd acm sigkdd international conference on knowledge discovery and data mining, pages – , new york, ny. acm. malfuson, j.-v. et al. ( ). risk factors and decision criteria for intensive chemotherapy in older patients with acute myeloid leukemia. haematologica, ( ), – . mitchel, j. et al. ( ). a translational pipeline for overall survival prediction of breast cancer patients by decision-level integration of multi- omics data. in ieee international conference on bioinformatics and biomedicine (bibm), pages – . qin, j. et al. ( ). 
ica based semi-supervised learning algorithm for bci systems. in j. rosca, d. erdogmus, j. c. príncipe, and s. haykin, editors, independent component analysis and blind signal separation, pages – , berlin. springer. r core team ( ). r: a language and environment for statistical computing. r foundation, vienna, austria. isbn - - - . skeel, r. t. ( ). handbook of cancer chemotherapy. lippincott williams & wilkins, philadelphia, pa, sixth edition. titus, a. j. et al. ( ). unsupervised deep learning with variational autoencoders applied to breast tumor genome-wide dna methylation data with biologic feature extraction. biorxiv; doi: . / . wang, z. et al. ( ). rna-seq: a revolutionary tool for transcriptomics. nature reviews genetics, ( ), – . way, g. p. and greene, c. s. ( ). evaluating deep variational autoencoders trained on pan-cancer gene expression. arxiv: . . way, g. p. and greene, c. s. ( ). extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. pacific symposium on biocomputing, , – . weir, b. et al. ( ). somatic alterations in the human cancer genome. cancer cell, ( ), – . wen, h. and huang, f. ( ). personal loan fraud detection based on hybrid supervised and unsupervised learning. in th ieee international conf. on big data analytics (icbda), pages – . whelan, t. et al. ( ). helping patients make informed choices: a randomized trial of a decision aid for adjuvant chemotherapy in lymph node-negative breast cancer. jnci: journal of the national cancer institute, ( ), – . zhang, y. et al. ( ). a novel xgboost method to identify cancer tissue- of-origin based on copy number variations. front genet, , . .cc-by . international licensereview) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a the copyright holder for this preprint (which was not certified by peerthis version posted january , . ; https://doi.org/ . / . . . 
introduction results vae encoding preserves cancer type features obtaining a labeled tumor transcriptome dataset l loss is better than l loss for this application chemotherapy drug response classification result pca & vae feature importance scores, for coad discussion conclusions methods gene expression data t-distributed stochastic neighbor embedding (t-sne) variational autoencoder (vae) labeling tumors based on response to chemotherapy vae model architectures regularized gradient boosted decision trees (xgboost) area under roc curve (auroc) principal component analysis (pca)

full-length de novo protein structure determination from cryo-em maps using deep learning

jiahua he and sheng-you huang∗
school of physics, huazhong university of science and technology, wuhan, hubei , p. r. china

abstract
advances in microscopy instruments and image processing algorithms have led to an increasing number of cryo-em maps. however, building accurate models for the em maps at - å resolution remains a challenging and time-consuming process. with the rapid growth of deposited em maps, there is an increasing gap between the maps and reconstructed/modeled 3-dimensional (3d) structures. therefore, automatic reconstruction of atomic-accuracy full-atom structures from em maps is pressingly needed. here, we present a semi-automatic de novo structure determination method using a deep learning-based framework, named deepmm, which builds atomic-accuracy all-atom models from cryo-em maps at near-atomic resolution. in our method, the main-chain and cα positions as well as their amino acid and secondary structure types are predicted in the em map using densely connected convolutional networks. deepmm was extensively validated on simulated maps at å resolution and experimental maps at . - . 
å resolution as well as an emdb-wide data set of experimental maps at . - . å resolution, and compared with state-of-the-art algorithms including rosettaes, mainmast, and phenix. overall, deepmm achieved a significant improvement over existing methods in both accuracy and coverage when building full-length protein structures on all test sets, demonstrating its efficacy and general applicability.
availability: https://github.com/jiahuahe/deepmm
supplementary information: supplementary data are available.
∗email: huangsy@hust.edu.cn; phone: + - - ; fax: + - -

.cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . /

introduction
cryo-electron microscopy (cryo-em) has become a widely used technique for determining macromolecular structures over the past decade – . advances in microscopy instruments and image processing algorithms have led to a rapid increase in the number of solved em maps – . the ‘resolution revolution’ in cryo-em has paved the way for the determination of high-resolution structures of previously intractable biological systems – . according to the statistics of the electron microscopy data bank (emdb) , there were maps deposited in , which is almost times the number of maps released in . with the rapid growth of deposited em maps, there is an increasing gap between the maps and reconstructed/modeled 3-dimensional (3d) structures. as of april , , there were emdb maps, but only associated structures were deposited in the protein data bank (pdb) . for those maps determined at near-atomic resolution ( . ∼ . 
å), it is difficult to build high-resolution models with conventional software designed for x-ray crystallography. because near-atomic resolution maps make up the majority of current and future released maps , tools that can reconstruct structures de novo from em maps without using known structures as templates , are pressingly needed. as such, algorithms like em-fold , gorgon , rosetta , , pathwalking – , phenix – , and mainmast , , have recently been presented for constructing and/or assembling structure fragments from cryo-em maps.
despite the progress in de novo structure building for cryo-em maps, current approaches have various limitations. they either build only structural fragments , , or have low accuracy in terms of coverage and/or sequence reproduction , , . it remains challenging to automatically build an accurate all-atom structure from em maps at near-atomic resolution. recently, machine learning has been actively applied in structure determination for em maps, such as single particle picking , tomogram annotation , secondary structure prediction , and backbone tracing . however, applying deep learning to build full-length protein structures for near-atomic resolution em maps remains challenging.
here, we have developed a semi-automatic de novo atomic-accuracy structure reconstruction method for em maps at near-atomic resolution through densely connected convolutional networks (densenets) using a deep learning-based framework, named deepmm.
instead of tracing the protein main-chain on the raw em density map, deepmm first predicted the probability of main-chain atoms (n, c, and cα) and cα positions near each grid point using one densenet . then, the method traced the main-chain according to the predicted main-chain probability map. the amino acid and secondary structure types were predicted by a second densenet. finally, the protein sequence was aligned to the main-chain according to the predicted cα probabilities, amino acid types, and secondary structure types for all-atom structure building.

methods

workflow of deepmm
the workflow of deepmm is illustrated in figure a. specifically, starting from a cryo-em map and the target protein sequence, deepmm first standardizes the order of the axes and interpolates the grid interval to . å. then, deepmm cuts the entire map into small voxels of size å× å× å. afterwards, one densenet (say densenet a) is used to predict the main-chain and cα probability on each of the voxels. all the predicted probability values form a 3d probability map. next, possible main-chain paths are generated in the predicted main-chain probability map using a main-chain tracing algorithm . the cα probability values of main-chain points are interpolated from the predicted 3d cα probability map. afterwards, the amino acid and secondary structure types are predicted for each main-chain point through the second densenet (say densenet b). with the predicted cα probability, amino acid type, and secondary structure type for each main-chain point, the target protein sequence is then aligned to the main-chain paths based on the smith-waterman dynamic programming (dp) algorithm . the resulting cα models are ranked by their alignment scores. finally, the all-atom structures are constructed from the top cα models using the ctrip program in the jackal modeling package , and refined by an energy minimization using amber .
training the densenets of deepmm
two densely connected convolutional networks (densenets) are embedded into our deepmm algorithm. figure b illustrates the architecture of the networks. densenet is a feed-forward multi-layer network which uses additional paths between earlier and later layers in a dense block. densenets have several compelling advantages. they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters . deepmm also employs a hard parameter-sharing multi-task learning method, which can greatly reduce the risk of overfitting . the first network (i.e. densenet a) is used to simultaneously predict the main-chain probability and cα probability of a grid point. the second network (i.e. densenet b) is used to predict the amino acid type and secondary structure type of a main-chain local dense point (ldp). the inputs to densenet a are voxels of size å × å × å. the second network (densenet b) takes voxels of size å × å × å as input because main-chain points are not always on the integer grid after mean shift. for each voxel, the density values are normalized to the range of [ , ] according to the maximum and minimum density values in the voxel. 3d convolutions and 3d pooling layers are used instead of the 2d counterparts used in traditional image processing because the density maps have three dimensions. several dense blocks are used in both networks, each of which consists of eight densely connected layers.
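as a concrete illustration of the per-voxel normalization described above, the following python sketch rescales one voxel box by its own minimum and maximum density. it assumes the target range is [0, 1] (the bounds are not legible in the text), and the function name `normalize_voxel` and the handling of a constant box are choices made for this sketch, not part of deepmm.

```python
import numpy as np


def normalize_voxel(voxel):
    """Min-max scale the density values of one voxel box.

    Assumption of this sketch: the target range is [0, 1].
    """
    voxel = np.asarray(voxel, dtype=np.float64)
    lo = voxel.min()
    hi = voxel.max()
    if hi == lo:
        # A constant box has no contrast to rescale; returning zeros is a
        # choice of this sketch, not something stated in the paper.
        return np.zeros_like(voxel)
    return (voxel - lo) / (hi - lo)


# Toy 3x3x3 box with increasing density values.
box = np.arange(27, dtype=float).reshape(3, 3, 3)
norm = normalize_voxel(box)
flat = normalize_voxel(np.ones((2, 2, 2)))
```

because each box is scaled independently, the network input does not depend on the absolute density scale of the map, which differs between depositions.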
for densenet a, the first two dense blocks are shared by both tasks, whereas for densenet b, only one shared block is adopted. after the shared blocks, each task employs two task-specific blocks and gives the final prediction. the details of the network architecture are provided in supplementary table .
all the training parameters and procedures used for simulated em maps are essentially the same as those used for experimental em maps unless otherwise specified. for densenet a, all the grid points above a density value d were used for training, where d was set to . for simulated maps at . å resolution. for experimental maps, d was set to / of its recommended contour level. the labels (main-chain probability and cα probability) of a grid point $\vec{a}$ were calculated as follows:
\[ p^{\vec{x}}_{\vec{a}} = \min\left\{ e^{-\|\vec{a}-\vec{x}\| / r},\ \forall\, \vec{x} \in \|\vec{a}-\vec{x}\| < r_{cut} \right\} \]
where $\vec{x}$ stands for the n, c, or cα atoms, and $r$ is the radius at which the probability drops to 1/e. if no atom is within $r_{cut}$ of a grid point, the corresponding probability is set to . a total of voxels were trained in one batch and epochs were trained for the whole data set. the adam optimizer with an initial learning rate of . was used to minimize the mean absolute error (mae). learning rate decay was adopted, where the learning rate was reduced to / of the current value after every epochs. to avoid over-fitting, the weight decay parameter of the adam optimizer was set to e- as the l2 regularization.
for densenet b, one point was randomly sampled within . å for every main-chain atom in the training set.
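the two label-generation steps above can be sketched in python as follows. this is a minimal illustration under stated assumptions, not the deepmm implementation: the label formula is taken exactly as printed (a minimum over the atoms within the cutoff, with the distance entering the exponent linearly), the probability for a grid point with no nearby atom is assumed to be 0, and the training point for densenet b is assumed to be drawn uniformly inside a sphere around each atom. the function names and all numeric values in the example are hypothetical.

```python
import numpy as np


def probability_label(grid_point, atoms, r, r_cut):
    """Label of one grid point, following the formula as printed in the text."""
    grid_point = np.asarray(grid_point, dtype=float)
    atoms = np.asarray(atoms, dtype=float).reshape(-1, 3)
    dist = np.linalg.norm(atoms - grid_point, axis=1)
    near = dist[dist < r_cut]
    if near.size == 0:
        # Assumed value when no atom lies within r_cut of the grid point.
        return 0.0
    return float(np.min(np.exp(-near / r)))


def sample_near_atoms(atoms, radius, rng):
    """Draw one point per atom, uniformly inside a sphere of the given radius.

    Uniform sampling in the ball is an assumption of this sketch.
    """
    atoms = np.asarray(atoms, dtype=float).reshape(-1, 3)
    direction = rng.normal(size=atoms.shape)
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    # The cube root makes the radial distance uniform in volume.
    dist = radius * rng.random(len(atoms)) ** (1.0 / 3.0)
    return atoms + direction * dist[:, None]


# Hypothetical toy values, chosen only to exercise the functions.
label_at_r = probability_label([0.0, 0.0, 0.0], [[1.0, 0.0, 0.0]], r=1.0, r_cut=2.0)
label_empty = probability_label([0.0, 0.0, 0.0], [[5.0, 0.0, 0.0]], r=1.0, r_cut=2.0)
atoms = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0]])
points = sample_near_atoms(atoms, radius=0.5, rng=np.random.default_rng(0))
offsets = np.linalg.norm(points - atoms, axis=1)
```

an atom at distance r from the grid point yields a label of 1/e, which matches the definition of r given with the formula.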
each point was assigned the corresponding amino acid type and the secondary structure type determined by stride . twenty types of amino acids were grouped into four classes according to their sizes, shapes and distributions in their em density maps , as illustrated in figure d. specifically, gly, ala, ser, cys, val, thr, ile and pro are grouped as class i; leu, asp, asn, glu, gln and met as class ii; lys and arg as class iii; and his, phe, tyr and trp as class iv. residues that have structure codes of h, g, or i by stride were labelled as “helix”, those with codes of b/b or e were labelled as “sheet”, and the other residues were labelled as “coil”. all the training parameters were identical to those for densenet a except that crossentropyloss was used as the loss function.

tracing the main-chain path
the main-chain tracing algorithm in mainmast was used to trace the main-chain path in our predicted main-chain probability map. in brief, local dense points (ldps) are first identified using the mean shift algorithm, which iteratively shifts the initial grid points towards the local highest probability by computing the weighted average of probability values. then, the shifted points that are within a threshold distance of . å are clustered, and the point with the highest probability in the cluster is chosen as the representative, called ldp. the next step is to connect ldps into a minimum spanning tree (mst) and iteratively refine the tree structure with a tabu search method. after multiple steps of tabu search, the longest path of the refined tree is traced as the main-chain path. the details of the algorithm can be found in the mainmast study .

aligning target sequence to main-chain path
the smith-waterman dynamic programming (dp) algorithm is used to align the target sequence to the predicted main-chain path. the predicted cα probability value, amino acid type, and secondary structure type are assigned to each point of the main-chain.
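the amino acid classes and secondary structure labels defined in the training section can be written down as simple lookup tables. the python sketch below encodes exactly the groupings listed in the text; the names `AA_CLASSES`, `AA_TO_CLASS` and `secondary_structure_label` are hypothetical, and the upper- and lower-case forms of the stride bridge code are assumed to correspond to the "b/b" codes mentioned above.

```python
# Four amino acid classes, as listed in the text (three-letter residue names).
AA_CLASSES = {
    "I": ("GLY", "ALA", "SER", "CYS", "VAL", "THR", "ILE", "PRO"),
    "II": ("LEU", "ASP", "ASN", "GLU", "GLN", "MET"),
    "III": ("LYS", "ARG"),
    "IV": ("HIS", "PHE", "TYR", "TRP"),
}

# Reverse lookup: residue name -> class label.
AA_TO_CLASS = {aa: cls for cls, members in AA_CLASSES.items() for aa in members}

HELIX_CODES = {"H", "G", "I"}
SHEET_CODES = {"B", "b", "E"}  # assumed reading of the "b/b or e" codes


def secondary_structure_label(stride_code):
    """Map a one-letter STRIDE code to helix, sheet, or coil."""
    if stride_code in HELIX_CODES:
        return "helix"
    if stride_code in SHEET_CODES:
        return "sheet"
    return "coil"
```

keeping the grouping in one table means the training labels and the alignment scoring matrices index the same four classes.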
instead of using the individual amino acid types, amino acids are grouped into four classes according to their sizes, shapes, and distributions in em density maps (figure d). secondary structures are categorized into three types: helix, sheet, and coil. the match between the target sequence and main-chain path is evaluated by two scoring matrices for amino acid and secondary structure, respectively (figure b). namely, a target residue is more likely to be aligned to a main-chain point with the same amino acid type, the same secondary structure type, and a higher cα probability, and vice versa.
the detailed alignment protocol is shown in figures a, b and c. the $n$ residues $\{a_i\ (i = 1, \dots, n)\}$ in the protein are aligned to $m$ ldps $\{l_j\ (j = 1, \dots, m)\}$ in the main-chain path. the matching score $m(i, j)$ for a pair of $a_i$ and $l_j$ is computed as follows.
\[ m(i,j) = w_{aa}\, m_{aa}\big(t_{aa}(a_i), t_{aa}(l_j)\big) + w_{ss}\, m_{ss}\big(t_{ss}(a_i), t_{ss}(l_j)\big) \]
where $m_{aa}$ and $m_{ss}$ are the scoring matrices for amino acid and secondary structure matching , , respectively. for a residue $a_i$, the amino acid type is one of the four amino acid classes ($t_{aa}(a_i)$ = , , , ). the predicted amino acid type for an ldp $l_j$ is also one of the four amino acid classes ($t_{aa}(l_j)$ = , , , ). similarly, the secondary structure matching score is calculated using the secondary structure type predicted from the sequence ($t_{ss}(a_i)$ = , , ) by spider and the secondary structure type predicted on ldps ($t_{ss}(l_j)$ = , , ). the scoring matrices $m_{aa}$ and $m_{ss}$ used in the alignment are shown in figure b. the $w_{aa}$ and $w_{ss}$ are the weights for the corresponding matching scores and are set to . and . 
, respectively. with the calculated matching score $m(i, j)$, an alignment is computed with the following rule to form a dp matrix $f$:
\[ f(i,j) = \max \begin{cases} f(i-1,\, j) + gap \\ f(i-1,\, j-1) - w_{c\alpha-c\alpha}\,|d_{std} - d| + w_{c\alpha}\, p_{c\alpha}(j) + m(i,j) \\ f(i,\, j-1) \end{cases} \]
where $gap$ is the gap penalty for unassigned residues in the protein sequence. to ensure a full-length structure reconstruction, $gap$ is set to − . so as to forbid skipped residues. the $|d_{std} - d|$ is the penalty score for the cα-cα distance, where $d_{std}$ is the standard cα-cα distance and $d$ is the distance between ldp $l_j$ and the last aligned ldp. the $p_{c\alpha}(j)$ is the predicted cα probability for ldp $l_j$. the $w_{c\alpha-c\alpha}$ and $w_{c\alpha}$ are the weights for the corresponding scores. here, $w_{c\alpha}$ is set to . , and $w_{c\alpha-c\alpha}$ is set to . , . , and . for “helix”, “sheet”, and “coil”, respectively. for each combination of parameters in the main-chain tracing procedure, cα models are generated. finally, all the generated cα models are ranked by their alignment scores.

parameter settings of deepmm
the parameters of mean-shift, mst construction, and tabu search are set to be the same as those in mainmast , unless otherwise specified. deepmm employs several parameter combinations to generate multiple cα models for one em map. for each combination of parameters, trajectories of tabu search are carried out, yielding main-chain paths. since deepmm starts from the main-chain probability map, fewer parameter combinations are needed to reconstruct reliable 3d structures.
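the dynamic programming recurrence used for the sequence-to-path alignment can be sketched in python as below. this is a simplified illustration rather than the deepmm code: it only fills the score matrix (no traceback), it uses a single scalar weight for the cα-cα distance term instead of one weight per secondary structure type, and it approximates the "distance to the last aligned ldp" by the distance between consecutive ldps on the path. the boundary conditions (free skipping of leading ldps, gap cost for leading residues), the function name, and all numeric values in the example are assumptions.

```python
import numpy as np


def fill_dp_matrix(match, p_ca, step_dist, d_std, gap, w_cc, w_ca):
    """Fill the DP matrix f for aligning n residues to m LDPs.

    match     : (n, m) array of matching scores m(i, j)
    p_ca      : length-m predicted C-alpha probabilities of the LDPs
    step_dist : length-m distances from each LDP to the previous LDP on the
                path (the first entry is unused); a simplification of the
                "last aligned LDP" distance in the text
    """
    match = np.asarray(match, dtype=float)
    n, m = match.shape
    f = np.zeros((n + 1, m + 1))
    # Assumed boundary: residues left unassigned before any LDP pay the gap cost.
    f[1:, 0] = gap * np.arange(1, n + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            penalty = w_cc * abs(d_std - step_dist[j - 1]) if j > 1 else 0.0
            diag = f[i - 1, j - 1] - penalty + w_ca * p_ca[j - 1] + match[i - 1, j - 1]
            f[i, j] = max(f[i - 1, j] + gap, diag, f[i, j - 1])
    return f


# Hypothetical two-residue, two-LDP example.
match = np.array([[1.0, 0.0], [0.0, 1.0]])
f = fill_dp_matrix(match, p_ca=[0.0, 0.0], step_dist=[0.0, 3.8],
                   d_std=3.8, gap=-10.0, w_cc=1.0, w_ca=1.0)
```

with a strongly negative gap value the first branch is almost never chosen, which is how skipped residues are effectively forbidden while ldps can still be skipped at no cost through the third branch.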
for both simulated and experimental maps, the thresholds of probability (Φthr) and normalized probability (θthr) are both set to . for the simulated maps, only one parameter combination is adopted. specifically, the maximum number of tabu search steps (nround) is set to , the sphere radius of local mst (rlocal) is set to . å, and the constraint for the length (dkeep) is set to . å. for the experimental maps, we employ the following combinations of parameters: the sphere radius of local mst (rlocal= . , . , . å), the edge weight threshold (dkeep= . , . , . å), and the maximum number of tabu search steps (nround= , , ). for the extended emdb-wide test set of maps, we employ fewer combinations of parameters so as to save computational cost: the edge weight threshold (dkeep= . , . å) and the maximum number of tabu search steps (nround= , ). the sphere radius of local mst (rlocal) is set to å. for each of the generated main-chain paths, cα models are generated using different standard cα-cα distances (dstd= . , . , . , . , . , . , . , . å) on two sequence directions. namely, models ( models for each of the trajectories) are constructed for each parameter combination. the cα models are ranked by their alignment scores, and an rmsd cutoff of å is then used to remove the lower-scoring model of any two similar structures. finally, the top scored protein cα models are selected to build the all-atom structures.

datasets used
training sets
two data sets, a simulated em map set and an experimental em map set, were used to train our deepmm method for simulated maps and experimental maps, respectively.
for simulated em maps, representative structures for different superfamilies in the scope database were taken from emap2sec as the training set. structures were removed from the training set if they had a tm-score of over . with any structure in the test set. to save computational cost, only randomly selected structures from the training set were retained. next, we used the e2pdb2mrc.py program from the eman2 package (version . ) to generate the simulated em maps at . å resolution and . å grid interval for each structure in the training and test sets. the training scope entries used in this study are listed in supplementary table .
for experimental em maps, all the em density maps at - å resolution that have associated pdb models were downloaded from the emdb. as of december , , em maps were collected. any pdb structure and its corresponding em map that met any of the following criteria were removed: (i) including nucleic acids, (ii) missing side-chain atoms, (iii) including “hetatm” residues, (iv) including “unk” residues, (v) including more than subunit (model), and (vi) including less than or more than residues. then, chains from the remaining experimental em maps were clustered with % sequence identity using cd-hit , yielding a total of chains. to ensure a valid evaluation, chains were removed from the training set if they had over % sequence identity with any chain in the test set. each protein chain was zoned out from the whole map using a distance of . å . for good quality maps, a protein chain and its associated map should have sufficient structural agreement. the cross-correlation between the experimental map and a map simulated from the structure at the same resolution was calculated using ucsf chimera .
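for readers who want to reproduce this kind of map-to-model agreement check outside of chimera, a normalized cross-correlation between two density grids can be sketched in python as follows. the text does not state which correlation variant chimera was asked to report, so the sketch offers both the plain form and the mean-subtracted form through the hypothetical `about_mean` flag; it assumes both maps are already sampled on the same grid, and the function name is illustrative.

```python
import numpy as np


def cross_correlation(map_a, map_b, about_mean=False):
    """Normalized cross-correlation of two density maps on the same grid."""
    a = np.asarray(map_a, dtype=float).ravel()
    b = np.asarray(map_b, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("maps must have the same number of grid points")
    if about_mean:
        a = a - a.mean()
        b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Undefined for an all-zero map; returning 0 is a choice of this sketch.
        return 0.0
    return float(np.dot(a, b) / denom)


# Toy density grids: a scaled copy correlates perfectly.
rng = np.random.default_rng(1)
density = rng.random((4, 4, 4))
cc_scaled = cross_correlation(density, 2.0 * density)
cc_orthogonal = cross_correlation([1.0, 0.0], [0.0, 1.0])
```

chains would then be kept or discarded by comparing this value with the chosen cutoff.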
only the chains with a cross-correlation of over . were kept . the final training set consists of non-redundant protein chains. the grid intervals for experimental maps were unified to . å using trilinear interpolation. the training em maps and their corresponding pdb chains used in this study are listed in supplementary table .

test sets
three test sets were used to evaluate the accuracy and general applicability of deepmm, including one simulated map set and two experimental map sets. the simulated map set was taken from the test set of simulated maps used by mainmast . the maps were generated at . å resolution with a grid spacing of . å using the e2pdb2mrc.py program in the eman2 package .
the first experimental test set is the benchmark of em maps at . - . å resolution, which has been used to evaluate mainmast . the corresponding em maps were downloaded from the emdb. for each em map, a single subunit was zoned out from the whole density map at a distance cutoff of . å.
in addition, to evaluate the accuracy and general applicability of deepmm, we have also constructed a large test set of emdb-wide experimental maps. the generation procedure of this set was similar to that for the experimental training set. specifically, for each chain of the em pdb structure at . - . å resolution and no more than one subunit (model) from the emdb, a single density patch was zoned out from the whole density map at a distance cutoff of . å.
any protein chain and its corresponding em map patch that met any of the following criteria were removed: (i) including nucleic acids, (ii) missing side-chain atoms, (iii) including “hetatm” residues, (iv) including “unk” residues, (v) including less than or equal or more than residues, (vi) having over % sequence identity to any chain in the training set. the cross-correlation between the experimental map and the simulated density map at the same resolution generated from the structure should be over . . each protein chain was zoned out from the whole map using a distance of . å . the final test set consists of protein chains, which are listed in supplementary table .

results

model reconstruction for simulated em maps
we first evaluated the performance of our deepmm algorithm on the test set of simulated density maps at å resolution. deepmm traced the main-chain of the protein on the predicted main-chain probability map rather than the raw em density map. thus, the cα models generated by deepmm are closer to the native structures with fewer search trajectories and steps compared to mainmast. for each of the maps, deepmm built cα models, which were ranked by their alignment scores. the top-ranked model was selected as the predicted structure.
figure shows a comparison of the predicted cα models for the protein chains of different lengths by deepmm and mainmast. the detailed results are provided in supplementary table . it can be seen from the figure that deepmm performed much better than mainmast. as shown in figure a, deepmm built significantly more accurate cα models, and achieved an average cα rmsd of . å when the top scored model was considered, compared to . å for mainmast. deepmm also generated high-quality models with less than . å cα rmsd for all of the maps, compared with only one such model by mainmast. moreover, deepmm produced high-accuracy models with less than . å rmsd for of maps, whereas mainmast failed to generate any model with < . å rmsd (figure a). the program click was also used to evaluate the accuracy of the cα models built by deepmm and mainmast. the corresponding results are shown in figure b. similar to the results of the cα rmsd comparison, deepmm generated many more high-quality models according to the click rmsd criterion and achieved an average click rmsd of . å when the top model was considered, compared to . å for mainmast.
in addition, deepmm also achieved a significantly higher structure overlap than mainmast (figure c). except for two top scored models with . % and . % structure overlap, the remaining top models generated by deepmm all have a % structure overlap. on average, deepmm obtained a high structure overlap of . %, compared to . % for mainmast. figure also reveals that deepmm generated consistently high-accuracy models for proteins of all lengths, whereas mainmast tended to perform worse with the increasing number of residues in the protein, suggesting that deepmm is more robust than mainmast.

model reconstruction for experimental em maps
our deepmm method was further tested on the benchmark of experimental density maps at . - . å resolution. for each of the experimental density maps, deepmm built protein cα models, which were then ranked by their alignment scores.
figure a shows a comparison of the cα rmsds for the models built by deepmm and mainmast. the corresponding data are provided in supplementary table . it can be seen from the figure that deepmm generated significantly more accurate models than mainmast. on average, deepmm obtained a cα rmsd of . å for the top scored models, which is much better than . å by mainmast. moreover, deepmm predicted a model of < å for out of top scored models, of which models are within . å cα rmsd. by contrast, only and models are within . å and . å for mainmast, respectively.
figure b shows a comparison of the results for the models predicted by deepmm and rosettaes. it can be seen from the figure that deepmm performed much better and generated many more accurate models than rosettaes. compared to models within å rmsd by deepmm, only six models were predicted within . å rmsd by rosettaes for the top predictions. on average, rosettaes obtained a cα rmsd of . å, which is much higher than . å for deepmm.
further examination of the predicted results also reveals that the model accuracy depends more on the quality than on the resolution of a map. namely, compared to maps with relatively higher resolution but lower quality like emd- a/b ( . å) and emd- ( . å), maps with relatively lower resolution but higher quality like emd- ( . å) and emd- ( . å) are more likely to yield a correct model (supplementary table ). this phenomenon can be attributed to the fact that resolution is a global estimate and resolvability is not necessarily uniform throughout the whole map .
figure gives two examples of successfully reconstructed structures by deepmm.
one example, emd- , a nucleoprotein at . å resolution, was successfully reconstructed by deepmm, as shown in figure a. the main chain predicted by deepmm overlaps well with that of the deposited structure, and the predicted model accordingly reaches atomic accuracy, with a cα rmsd of . å. figure b shows the results of another example, emd- , the bovine rotavirus vp at . å resolution. because of the high resolution of this map, deepmm predicted a highly accurate model with a small cα rmsd of . å. correspondingly, the full-atom model constructed by deepmm shows an excellent overlap with the deposited structure (figure b).

evaluation of deepmm on the emdb-wide data set

to investigate the accuracy and general applicability of our deepmm method, we further evaluated its performance on a large test set of emdb-wide experimental maps. this test set consists of diverse em maps with . - . å resolutions from the emdb that have associated structures in the pdb (see the methods section). for each of the test cases, deepmm was run to reconstruct structures using four combinations of parameters, yielding models for each case. figure shows a summary of the results predicted by deepmm. the corresponding data are provided in supplementary table . two metrics, rmsd and tm-score, were used to evaluate the overall accuracy of the predicted models. on average, deepmm achieved a cα rmsd of . å for the top prediction and . å for the top predictions on this test set of maps.
the corresponding average tm-scores are . and . for the top and top predictions, suggesting the high accuracy of our deepmm approach.

figure a shows the percentage of the predicted models at different cα rmsd cutoffs: . % of the top models built by deepmm are within å cα rmsd, and for the top scored predictions, . % of the cases have an rmsd of less than å. the percentages of the models at different tm-score cutoffs are shown in figure b: . % of the top models built by deepmm have a tm-score of > . , and when the top models were considered, the corresponding percentage increased to . %. comparing figures a and b also reveals that the percentages for tm-score are significantly higher than those for cα rmsd, suggesting that the models built by deepmm still share the same fold with the native structure even if they have a large cα rmsd.

figure c shows the percentage of correctly predicted top models (i.e. within å cα rmsd) at different resolutions. for em maps at . - . å resolution, deepmm performed very well, achieving a success rate of . % and . % for the top and scored models, respectively. the performance of deepmm decreased with decreasing map resolution. specifically, for the em maps with a resolution of . - . å, . - . å, and . - . å, deepmm obtained a success rate of . %/ . %, . %/ . %, and . %/ . % for the top / predictions, respectively. for em maps with a resolution of . å or worse, it is challenging for deepmm to build correct models. on average, for the maps at - å resolution, deepmm gave a success rate of . % and . % in reconstructing a correct model within å cα rmsd for the top and predictions, respectively.

figure d shows the percentage of correctly predicted top models using the criterion of tm-score > . in different resolution ranges. the trends in figure d are similar to those in figure c. specifically, for the maps with a resolution of . - . å, . - . å, . - . å, . - . å, and . - . å, deepmm achieved correct models with a tm-score of > . for . %/ . %, . %/ . %, . %/ . %, . %/ . %, and . %/ . % of the test cases when the top / predictions were considered, respectively. on average, for the maps at - å resolution, deepmm obtained a success rate of . % and . % in building a model with tm-score > . for the top and predictions, respectively.

next, deepmm was compared with phenix on this test set, where the phenix models were generated using the phenix.map_to_model tool in the phenix package (version . . - ). two metrics calculated by phenix.chain_comparison were used to evaluate the accuracy of a model. one is the fraction of the cα atoms in one model matching the cα atoms in another model within . å regardless of their residue names (i.e. coverage or residue match). the other is the percentage of the sequence in the target structure reproduced by the query model (i.e. specificity of sequence match). it should be mentioned that our sequence match is conducted using types of amino acids. a model with a high percentage of residue match may have a very low percentage of sequence match because of mismatched residue names. figures a and b show the percentages of protein residues and of the sequence reproduced by deepmm and phenix at different resolutions. figures c and d give the histograms of the corresponding average values at different resolutions.
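the two comparison metrics described above can be illustrated with a minimal python sketch. this is not the phenix.chain_comparison implementation: it assumes a simple nearest-neighbour match of each target cα atom to the closest model cα atom, takes the distance cutoff as a parameter, and reports both metrics as fractions of the target residues. the function name chain_match, the input layout and the cutoff used in the example are illustrative assumptions.

```python
import math


def chain_match(target, model, cutoff):
    """Return (residue_match, sequence_match) as fractions of target residues.

    target, model: lists of (residue_name, (x, y, z)) tuples, one per C-alpha atom.
    cutoff: maximum C-alpha distance (angstroms) for two atoms to count as matched.
    """
    if not target:
        raise ValueError("target must contain at least one residue")
    matched = 0
    same_name = 0
    for t_name, t_xyz in target:
        if not model:
            break
        # Assumption: nearest-neighbour matching, not a one-to-one assignment.
        m_name, m_xyz = min(model, key=lambda atom: math.dist(t_xyz, atom[1]))
        if math.dist(t_xyz, m_xyz) <= cutoff:
            matched += 1            # counts toward residue match
            if m_name == t_name:
                same_name += 1      # counts toward sequence match
    return matched / len(target), same_name / len(target)


# Toy example with made-up coordinates; the cutoff of 1.0 is illustrative only.
target = [("ALA", (0.0, 0.0, 0.0)), ("GLY", (3.8, 0.0, 0.0)),
          ("SER", (7.6, 0.0, 0.0)), ("VAL", (11.4, 0.0, 0.0))]
model = [("ALA", (0.5, 0.0, 0.0)), ("LEU", (3.9, 0.0, 0.0)),
         ("SER", (7.0, 0.0, 0.0)), ("VAL", (30.0, 0.0, 0.0))]
residue_match, sequence_match = chain_match(target, model, cutoff=1.0)
print(residue_match, sequence_match)  # 0.75 0.5
```

in this toy example three of the four target residues have a model cα atom within the cutoff, but only two of those carry the same residue name, so the sequence match is lower than the residue match. this mirrors the situation described above for models whose backbone is placed well but whose residue names are misassigned.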
as the figure shows, deepmm achieved significantly better performance than phenix in both residue match and sequence match, especially for maps at low resolutions. for the maps at resolutions better than . å, . % of protein residues in the deposited structures were reproduced by deepmm, compared to . % by phenix. the corresponding average sequence match is . % for deepmm, which is much higher than the . % for phenix. for the maps at - å resolution, the average residue match for deepmm is . %, compared with . % for phenix. the corresponding average sequence match is . % for deepmm, which is much higher than the . % for phenix. given that sequence match is much more challenging to achieve than residue match, the much better performance of deepmm than phenix in sequence match demonstrates the atomic accuracy of the models built by deepmm.

it is worth mentioning that deepmm can build fully connected, full-length all-atom protein models, whereas phenix is designed to build initial models of structure fragments. figure shows the protein models built by deepmm and phenix for one example, chain a of dw , part of a gabaa receptor at . å resolution. the deposited structure with its associated em density map (emd- ) is displayed in panel a. figures b and c show the phenix model and its superimposition with the deposited structure, respectively. as expected, the model built by phenix consists of multiple fragments without any secondary structures.
the model predicted by phenix for this map had a residue match of . % but a very low sequence match of . %. therefore, although phenix recovered most parts of the target protein structure from the em density map, it assigned wrong residue names to most of the modeled fragments, as reflected in its low sequence match and shown in figure c. in contrast, deepmm built an excellent all-atom structure for this map, with a near-perfect residue match of . % and a high sequence match of . %. the model predicted by deepmm reproduced most of the secondary structures and had an almost identical chain trace to the deposited structure (figure d). the corresponding amino acid names were also assigned correctly by deepmm (figure e).

conclusion

in summary, we have developed a semi-automatic de novo structure determination method for near-atomic-resolution cryo-em maps using a deep learning-based framework, named deepmm. deepmm can reconstruct complete all-atom protein structures for em maps with atomic accuracy. it was extensively validated on diverse benchmarks and compared with state-of-the-art approaches including rosettaes, mainmast, and phenix. deepmm has also been evaluated on an emdb-wide large test set of experimental maps at . - . å resolution. overall, deepmm was able to reconstruct protein models with tm-score > . for over % of the test cases. deepmm is fast and able to reconstruct an all-atom structure from an em map within hr on a single-gpu machine for an average-length protein chain of amino acids. given its high computational efficiency and all-atom accuracy, it is anticipated that deepmm will serve as an indispensable tool for semi-automatic atomic-accuracy structure determination for near-atomic-resolution cryo-em maps.

acknowledgements

the authors acknowledge professor daisuke kihara and his students genki terashi and sai raghavendra maddhuri venkata subramaniya from purdue university for providing their datasets. this work was supported by the national natural science foundation of china (grant nos. and ) and the startup grant of huazhong university of science and technology.

competing interests

the authors declare no competing interests.

figure , panel labels: (a) preprocess cryo-em map; cut map into voxels; densenet a: predict main-chain and cα probability of each voxel; main-chain tracing; densenet b: predict amino acid type and secondary structure of main-chain points; align protein sequence to cα main-chain path; construct all-atom protein model. (b) input voxel; shared layers (shared blocks); task-specific layers (task a and task b blocks); prediction for task a; prediction for task b.

figure : workflow of our deepmm method. (a) the flowchart of deepmm. deepmm first predicts the main-chain and cα probability of each density voxel using a densely connected convolutional network (densenet), and then traces the protein’s main-chain path on the predicted main-chain probability map. next, the amino acid and secondary structure types for each main-chain point are predicted by a second densenet. the cα models are generated by aligning the target sequence to the main-chain paths.
finally, the all-atom structures are constructed from the cα models using the ctrip program and refined by an amber energy minimization. (b) the multi-task deep densenet architecture used in deepmm. starting from an input em density voxel, two dense blocks are shared by both tasks in densenet a, while only one dense block is shared by both tasks in densenet b. each prediction task employs two task-specific dense blocks and gives the final prediction.

figure , panel labels: (a) target sequence with its secondary structure string (h, e, c) and the main-chain path, whose points are annotated with secondary structure (helix, sheet, coil) and amino acid class (i to iv); (b) scoring matrices for amino acid classes (aa: i, ii, iii, iv) and secondary structure (ss: helix, sheet, coil); (c) alignment results and cα models listed by score; (d) amino acid classes i to iv covering gly, ala, ser, cys, val, thr, ile, pro, leu, asp, asn, glu, gln, met, lys, arg, his, phe, tyr, trp.

figure : alignment protocol between the target sequence and the predicted main-chain for deepmm. (a) deepmm runs alignments of the target sequence of the em map against each candidate main-chain path. each sphere represents a predicted local dense point (ldp) on the main-chain path. predicted information including the cα probability (on the top), secondary structure (in the middle) and amino acid class (at the bottom) of ldps is utilized during alignment.
for the target sequence, its secondary structure is predicted by the spider program, as illustrated by the string colored in azure under the amino acid sequence. (b) scoring matrices for amino acid type matching and secondary structure matching. (c) the generated cα models are ranked by their alignment score. (d) twenty amino acids are grouped into four classes according to the similarity of their side-chain em densities.

figure , panel labels: (a) cα rmsd (å), (b) click rmsd (å) and (c) structure overlap (%), each plotted against protein length (aa) for deepmm and mainmast.

figure : comparison of the results by deepmm and mainmast for the protein chains with different lengths. (a) the cα rmsds of the top predicted models. (b) the rmsds of matched cα atoms within . å by the structure alignment tool click. (c) the structure overlap calculated by click, which is defined as the fraction of matched cα atoms.

figure , panel labels: (a) mainmast rmsd (å) versus deepmm rmsd (å); (b) rosetta rmsd (å) versus deepmm rmsd (å).

figure : comparison of the top models for deepmm and two other approaches on the test set of experimental maps.
the solid line in the figure is the plot of y = x, and the dashed line stands for y = . (a) comparison of the models by deepmm and mainmast in terms of cα rmsd. (b) comparison of the models by deepmm and rosetta in terms of cα rmsd.

figure : examples of the models generated by deepmm for experimental em maps. the em density map (transparent grey) and its associated native protein structure (green) are displayed on the left side. the cα chains of the deepmm model (red) and the native structure (green) are shown in ball-and-stick format on the predicted main-chain probability map (transparent yellow) in the middle. the full-atom structure generated by deepmm (red) and the native protein structure (green) are displayed on the right side. (a) the nucleoprotein at . å map resolution (emd- ). the top ranked model by deepmm has a cα rmsd of . å. (b) the bovine rotavirus vp at . å map resolution (emd- ). the top model by deepmm has a cα rmsd of . å.

figure , panel labels: (a) percentage (%) versus rmsd (å); (b) percentage (%) versus tm-score; (c) % of rmsd < å versus resolution (å); (d) % of tm-score > . versus resolution (å); each panel shows two series of top-ranked predictions.

figure : test results of deepmm on the experimental test cases. (a) the percentage of the top scored models at different cα rmsd cutoffs. (b) the percentage of the top scored models at different tm-score cutoffs. (c) the percentages of top scored models within å rmsd in different map resolution ranges. (d) the percentages of the top scored models with a tm-score above . in different map resolution ranges.

figure , panel labels: (a) residue match (%), (b) sequence match (%), (c) average residue match (%) and (d) average sequence match (%), each plotted against resolution (å) for deepmm and phenix.

figure : comparison of the models by deepmm and phenix on the large test set of experimental maps at different resolutions. the results for phenix are colored in orange, and those for deepmm are colored in royal blue. (a) percentages of the protein residues in the deposited structures reproduced by deepmm and phenix. (b) percentages of the sequence of the deposited structure reproduced by deepmm and phenix. (c) average percentage of residue match by deepmm and phenix. (d) average percentage of sequence match by deepmm and phenix.
figure : protein models reconstructed by deepmm and phenix for chain a of dw and its associated em density map at . å resolution (emd- ). (a) the native structure overlapped with its associated em density map. (b) the model predicted by phenix, which has a residue match of . % and a sequence match of . %. (c) the phenix model (orange) overlapped with the native structure (green). the enlarged box on the right side shows that the residue names assigned by the phenix model are different from those of the native structure. (d) the model predicted by deepmm, which has a residue match of . % and a sequence match of . %. (e) the deepmm model (royal blue) overlapped with the native structure (green). the enlarged view of the top region of the protein on the right side shows that the sequence assigned by deepmm is close to that of the native structure.

regtools: integrated analysis of genomic and transcriptomic data for the discovery of splicing variants in cancer

kelsy c. cotto†, yang-yang feng†, avinash ramu, zachary l. skidmore, jason kunisaki, megan richters, sharon freshour, yiing lin, william c. chapman, ravindra uppaluri, ramaswamy govindan, obi l. griffith*, malachi griffith*

† denotes co-first authors. * denotes corresponding authors. correspondence to obi l. griffith (obigriffith@wustl.edu) and malachi griffith (mgriffit@wustl.edu).

affiliations:
division of oncology, department of medicine, washington university school of medicine, st. louis, mo, usa
mcdonnell genome institute, washington university school of medicine, st. louis, mo, usa
department of genetics, washington university school of medicine, st. louis, mo, usa
department of surgery, washington university school of medicine, st. louis, mo, usa
department of surgery, brigham and women’s hospital, boston, ma, usa
department of medical oncology, dana-farber cancer institute, boston, ma, usa
siteman cancer center, washington university school of medicine, st. louis, mo, usa

abstract

somatic mutations in non-coding regions and even in exons may have unidentified regulatory consequences that are often overlooked in analysis workflows. here we present regtools (www.regtools.org), a free, open-source software package designed to integrate analysis of somatic variants from genomic data with splice junctions from transcriptomic data to identify variants that may cause aberrant splicing. regtools was applied to over , tumor samples with both tumor dna and rna sequence data. we discovered , events where a variant significantly increased the splicing of a particular junction, across , unique variants and , unique junctions. to characterize these somatic variants and their associated splice isoforms, we annotated them with the variant effect predictor (vep), spliceai, and genotype-tissue expression (gtex) junction counts and compared our results to other tools that integrate genomic and transcriptomic data.
while certain events can be identified by the aforementioned tools, the unbiased nature of regtools has allowed us to identify novel splice variants and previously unreported patterns of splicing disruption in known cancer drivers, such as tp , cdkn a, and b m, as well as in genes not previously considered cancer-relevant, such as rnf .

biorxiv preprint doi: https://doi.org/ . / ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license (http://creativecommons.org/licenses/by-nc-nd/ . /).

introduction

alternative splicing of messenger rna allows a single gene to encode multiple gene products, increasing a cell’s functional diversity and regulatory precision. however, splicing malfunction can lead to imbalances in transcriptional output or even the presence of novel oncogenic transcripts. the interpretation of variants in cancer is frequently focused on direct protein-coding alterations. however, most somatic mutations arise in intronic and intergenic regions, and exonic mutations may also have unidentified regulatory consequences. for example, mutations can affect splicing either in trans, by acting on splicing effectors, or in cis, by altering the splicing signals located on the transcripts themselves. increasingly, we are identifying the importance of splice variants in disease processes, including in cancer. however, our understanding of the landscape of these variants is currently limited, and few tools exist for their discovery.

one approach to elucidating the role of splice variants has been to predict the strength of putative splice sites in pre-mrna from genomic sequences, such as the method used by the spliceai tool.
with the advent of efficient and affordable rna-seq, we are also seeing the complementary approach of evaluating alternative splicing events (ases) directly from rna sequencing data. various tools exist which allow the identification of significant ases from transcript-level data within sample cohorts, including suppa and spladder. many of these tools have also evaluated the role of trans-acting splice mutations. however, few tools are directed at linking specific aberrant rna splicing events to specific genomic variants in cis to investigate the splice regulatory impact of these variants. those few relevant tools that do exist have significant limitations that preclude them from broad applications. the sqtl-based approach taken by leafcutter and other tools is designed for relatively frequent single-nucleotide polymorphisms. it is thus ill-suited to studying somatic variants, or any case in which the frequency of a particular variant is very low (often unique) in a given sample population. recent tools that have been created for large-scale analysis of cancer-specific data, such as misplice and veridical, ignore certain types of ases, are tailored to specific analysis strategies and sets of hypotheses, or are otherwise inaccessible to the end-user due to issues such as lack of documentation, difficulty with installation and integration with existing pipelines, limited computing efficiency, or licensing issues.

to address these needs, we have developed regtools, a free, open-source (mit license) software package that is well-documented, modularized for ease of use, and designed to efficiently identify potential cis-acting splice-relevant variants in tumors (www.regtools.org). regtools is a suite of tools designed to aid users in a broad range of splicing-related analyses.
at the highest level, it contains three sub-modules: a variants module to annotate variant calls with respect to their potential splicing relevance, a junctions module to analyze aligned rna-seq data and associated splicing events, and a cis-splice-effects module that integrates genomic variant calls and transcriptomic sequencing data to identify potential splice-altering variants. each sub-module contains one or more commands, which can be used individually or integrated into regulatory variant analysis pipelines.

to demonstrate the utility of regtools in identifying potential splice-relevant variants from tumor data, we analyzed a combination of data available from the mcdonnell genome institute (mgi) at washington university school of medicine and the cancer genome atlas (tcga) project. in total, we applied regtools to , samples across cancer types. we contrasted our results with other tools that integrate genomic and transcriptomic data to identify potential splice altering variants, specifically veridical, misplice, and savnet. novel junctions identified by regtools were compared to data from the genotype-tissue expression (gtex) project to assess whether these junctions are present in normal tissues. variants significantly associated with novel junctions were processed through vep and illumina’s spliceai tool to compare our findings with splicing consequences predicted based on the variant information alone.
with this additional analysis, we were able to more easily identify both variants in known cancer drivers, whose splicing consequences have not been previously reported in the literature, and potentially novel cancer drivers, whose disruption relies on splice-altering mutations.

results

the regtools tool suite supports splice regulatory variant discovery by the integration of genome and transcriptome data

regtools is a suite of tools designed to aid users in a broad range of splicing-related analyses. the variants module contains the annotate command. the variants annotate command takes a vcf of somatic variant calls and a gtf of transcriptome annotations as input. regtools does not have any particular preference for variant callers or reference annotations. each variant is annotated by regtools with known overlapping genes and transcripts, and is categorized into one of several user-configurable “variant types”, based on position relative to the edges of known exons. the variant type annotation depends on the stringency for splicing-relevance that the user sets with the “splice variant window” setting. by default, regtools marks intronic variants within bp of the exon edge as “splicing intronic”, exonic variants within bp as “splicing exonic”, other intronic variants as “intronic”, and other exonic variants simply as “exonic.” regtools considers only “splicing intronic” and “splicing exonic” as important. to allow for discovery of an arbitrarily expansive set of variants, regtools allows the user to customize the size of the exonic/intronic windows individually (e.g. -i -e for intronic variants bp from an exon edge and exonic variants bp from an exon edge) or even consider all exonic/intronic variants as potentially splicing-relevant (e.g. -e or -i) (figure a).

the junctions module contains the extract and annotate commands.
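the splice variant window logic described above can be sketched in python. this is a minimal illustration, not the regtools implementation (which is written in c++): the function name `classify_variant`, the single-exon simplification, the 1-based inclusive coordinates, and the window sizes passed in the usage lines are all assumptions made for the example.

```python
def classify_variant(pos, exon_start, exon_end, intronic_window, exonic_window):
    """Label a variant position relative to one exon.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    The window sizes are parameters; no default values are implied here.
    """
    if exon_start <= pos <= exon_end:
        # Distance to the nearest exon edge; the edge base itself counts as 1.
        distance = min(pos - exon_start, exon_end - pos) + 1
        return "splicing_exonic" if distance <= exonic_window else "exonic"
    # Outside the exon: number of bases between the variant and the exon edge.
    distance = exon_start - pos if pos < exon_start else pos - exon_end
    return "splicing_intronic" if distance <= intronic_window else "intronic"


if __name__ == "__main__":
    # Hypothetical exon at 100-200 with illustrative window sizes.
    for position in (97, 98, 102, 103, 150, 201):
        print(position, classify_variant(position, 100, 200,
                                         intronic_window=2, exonic_window=3))
```

a real annotation would also need to handle multiple overlapping transcripts, strand, and first or last exons whose outer edges are not splice sites; those cases are deliberately left out here.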
the junctions extract command takes an alignment file containing aligned rna-seq reads, infers the exon-exon boundaries based on the cigar strings, and outputs each “junction” as a feature in bed format. the junctions annotate command takes a file of junctions in bed format (such as the one output by junctions extract), a fasta file containing the reference genome, and a gtf file containing reference transcriptome annotations and generates a tsv file, annotating each junction with: the number of acceptor sites, donor sites, and exons skipped, and the identities of known overlapping transcripts and genes. we also annotate the “junction type”, which denotes if and how the junction is novel (i.e. different compared to provided transcript annotations). if the donor is known, but the acceptor is not or vice-versa, it is marked as “d” or “a”, respectively. if both are known, but the pairing is not known, it is marked as “nda”, whereas if both are unknown, it is marked as “n”. if the junction is not novel (i.e. it appears in at least one transcript in the supplied gtf), it is marked as “da” (figure b).

the cis-splice-effects module contains the identify command, which identifies potential splice-altering variants from sequencing data. the following are required as input: a vcf file containing variant calls, an alignment file containing aligned rna-sequencing reads, a reference genome fasta file, and a reference transcriptome gtf file.
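the two junction steps described above (inferring junctions from cigar strings and assigning a junction type) can be illustrated with a short python sketch. it is not the regtools code: the function names, the 0-based half-open coordinates, the plus-strand convention (donor at the intron start, acceptor at the intron end), and the upper-case type labels are assumptions made for the example, and the bed anchors and strand inference that a real extractor reports are omitted.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def extract_junctions(ref_start, cigar):
    """Return (intron_start, intron_end) for every N (skipped region) operation.

    ref_start is the 0-based leftmost aligned reference position of the read.
    """
    junctions = []
    pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((pos, pos + length))
        if op in REF_CONSUMING:
            pos += length
    return junctions


def junction_type(donor, acceptor, known_donors, known_acceptors, known_pairs):
    """Classify a junction as DA, NDA, D, A or N against a set of known junctions."""
    if (donor, acceptor) in known_pairs:
        return "DA"   # junction present in the annotation
    donor_known = donor in known_donors
    acceptor_known = acceptor in known_acceptors
    if donor_known and acceptor_known:
        return "NDA"  # both ends known, pairing not annotated
    if donor_known:
        return "D"    # known donor, novel acceptor
    if acceptor_known:
        return "A"    # known acceptor, novel donor
    return "N"        # neither end annotated


if __name__ == "__main__":
    known_pairs = {(150, 350), (400, 600)}
    known_donors = {d for d, _ in known_pairs}
    known_acceptors = {a for _, a in known_pairs}
    for donor, acceptor in extract_junctions(100, "50M200N50M"):
        print(donor, acceptor,
              junction_type(donor, acceptor, known_donors, known_acceptors, known_pairs))
```

keeping extraction and classification as separate functions mirrors the split between the extract and annotate commands, so either step can be exercised on its own.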
the identify pipeline internally relies on variants annotate, junctions extract, and junctions annotate to output a tsv containing junctions proximal to putatively splicing-relevant variants. the identify pipeline can be customized using the same parameters as in the individual commands. briefly, cis-splice-effects identify first performs variants annotate to determine the splicing-relevance of each variant in the input vcf. for each variant, a “splice junction region” is determined by finding the largest span of sequence space between the exons that flank the exon associated with the variant. from here, junctions extract identifies splicing junctions present in the rna-seq bam. next, junctions annotate labels each extracted junction with information from the reference transcriptome as described above and its associated variants based on splice junction region overlap (figure c). for our analysis, we annotated the pairs of associated variants and junctions identified by regtools, which we refer to as “events”, with additional information such as whether this association was identified by a comparable tool, the junction was found in gtex, and whether the event occurred in a cancer gene according to cancer gene census (cgc) (figure c) , . finally, we created igv sessions for each event identified by regtools that contained a bed file with the junction, a vcf file with the variant, and an alignment (bam) file for each sample that contained the variant . these igv sessions were used to manually review candidate events to assess whether the association between the variant and junction makes sense in a biological context. regtools is designed for broad applicability and computational efficiency. by relying on well- established standards for sequence alignments, annotation files, and variant calls and by remaining agnostic to downstream statistical methods and comparisons, our tool can be applied to a broad set of scientific queries and datasets. 
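the association between a variant and nearby junctions through a “splice junction region” can be sketched as below. this is an interpretation of the description above rather than the tool's code: the region is taken here to run from the end of the upstream flanking exon to the start of the downstream flanking exon, widened across all transcripts that contain the exon, and the function names and the fallback for first and last exons are assumptions of the example.

```python
def splice_junction_region(transcripts, exon):
    """Largest span between the exons flanking `exon` across transcripts.

    transcripts: list of transcripts, each a coordinate-sorted list of
                 (start, end) exon tuples.
    exon:        the (start, end) exon associated with the variant.
    Returns (region_start, region_end), or None if no transcript has the exon.
    """
    starts, ends = [], []
    for exons in transcripts:
        if exon not in exons:
            continue
        i = exons.index(exon)
        # Fall back to the exon's own edge when there is no flanking exon.
        starts.append(exons[i - 1][1] if i > 0 else exon[0])
        ends.append(exons[i + 1][0] if i + 1 < len(exons) else exon[1])
    if not starts:
        return None
    return min(starts), max(ends)


def junction_in_region(junction, region):
    """True if the junction interval overlaps the region interval."""
    j_start, j_end = junction
    r_start, r_end = region
    return j_start < r_end and j_end > r_start


if __name__ == "__main__":
    transcripts = [
        [(100, 200), (300, 400), (500, 600)],
        [(100, 200), (300, 400), (700, 800)],
    ]
    region = splice_junction_region(transcripts, (300, 400))
    print(region, junction_in_region((200, 300), region))
```

taking the minimum start and maximum end over transcripts is one way to obtain the “largest span”; other boundary conventions would change only the two lines that pick the flanking coordinates.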
moreover, performance tests show that cis-splice-effects identify can process a typical candidate variant list of , , variants and a corresponding rna-seq bam file of , , reads in just ~ minutes (supplementary figure ).

pan-cancer analysis of tumor types identifies somatic variants that alter canonical splicing

regtools was applied to , samples over cancer types. of these cohorts came from tcga while the remaining three were obtained from other projects being conducted at mgi. cohort sizes ranged from to , samples. in total, , , variants (figure a) and , , , junction observations (figure b) were analyzed by regtools. by comparing the number of initial variants per cohort to the number of statistically significant variants, we were able to show that regtools produces a prioritized list of potential splice relevant variants (supplementary figure ). additionally, when analyzing the junctions within each sample, we found that junctions present in the reference transcriptome are frequently seen within gtex data while junctions observed from a sample’s own transcriptome data that were not present in the reference are rarely seen within gtex (supplementary figure ).

, significant variant junction pairings were found for junctions that use a known donor and novel acceptor (d), novel donor and known acceptor (a), or novel combination of a known donor and a known acceptor (nda), with novel here meaning that the junction was not found in the reference transcriptome (methods, figure c, supplemental files and ).
while our analysis primarily focuses on variants in relation to novel splice events because of the potential importance of these events within tumor processes, we also wanted to assess how often a variant was significantly associated with a known junction. , variant junction pairings were found for junctions known to the reference (da junctions) (supplemental files and ). this finding indicates that while splice variants usually result in a novel junction occurring, they sometimes alter the expression of known junctions. generally, significant events were evenly split among each of the novel junction types considered (d, a, and nda). the number of significant events increased as the splice variant window size increased, with both the e and i results being comparable in number. notably, hepatocellular carcinoma (hcc) was the only cohort that had whole genome sequencing (wgs) data available and, as expected, it exhibited a marked increase in the number of significant events for its results within the “i” splice variant window. this observation highlights the low sequence coverage of intronic regions that occurs with wes which subsequently leads to underpowered discovery of potential splice altering variants within introns. variants were analyzed across tumor types for how often they result in either a single or multiple novel junctions (figure a). while a single variant resulting in a single novel junction is most commonly observed ( . - . %), a single variant also commonly results in multiple junctions being created, either of the same type ( . - . %) or of different types ( . - . %) (figure b). variants that are associated with multiple novel junctions of different types were further investigated to identify how often a particular junction type occurred with another (figure c). most commonly, we observed an alternate donor or acceptor site being used in conjunction with an exon skipping event. 
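the grouping of variants by how many novel junctions they are associated with, and of which types, amounts to a simple tally. the sketch below shows one way to compute it; the input layout (one `(variant_id, junction_id, junction_type)` tuple per significant event) and the category names are assumptions of the example, not the format of the supplemental files.

```python
from collections import Counter, defaultdict


def tally_variant_outcomes(events):
    """Count variants by the number and type of junctions associated with them.

    events: iterable of (variant_id, junction_id, junction_type) tuples,
            assumed to contain each variant-junction pair only once.
    """
    types_by_variant = defaultdict(list)
    for variant_id, _junction_id, jtype in events:
        types_by_variant[variant_id].append(jtype)

    outcomes = Counter()
    for jtypes in types_by_variant.values():
        if len(jtypes) == 1:
            outcomes["single junction"] += 1
        elif len(set(jtypes)) == 1:
            outcomes["multiple junctions, same type"] += 1
        else:
            outcomes["multiple junctions, different types"] += 1
    return outcomes


if __name__ == "__main__":
    # Hypothetical events for three variants.
    events = [
        ("v1", "j1", "D"),
        ("v2", "j2", "A"), ("v2", "j3", "A"),
        ("v3", "j4", "D"), ("v3", "j5", "NDA"),
    ]
    for category, count in tally_variant_outcomes(events).items():
        print(category, count)
```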
these events were particularly common within the default window ( intronic bases or exonic bases from the exon edge), as an snv or indel within these positions has a high probability of disrupting the natural splice site, thus causing the splicing machinery to use a cryptic splice site nearby or skip the splice site entirely. the next most common event was an alternate donor site and an alternate acceptor site both being used as the result of a single variant. the combination of a novel acceptor site and novel donor site being used in conjunction with an exon-skipping event occurred the least and occurrence of this type of event remains fairly low, even as the search space increases within the larger splice variant windows. this finding indicates the low likelihood of a single variant resulting in simultaneous disruption of a splice acceptor and donor as well as complete skipping of an exon. overall, this analysis highlights that there is evidence that a single variant can lead to multiple novel junctions being expressed. tools that only allow for a single junction to be predicted or associated with a variant therefore may not be completely describing the effect of the variant in question in up to ~ % of cases.

regtools identifies splice altering variants missed by other splice variant predictors and annotators

to evaluate the performance of regtools, we compared our results to those of savnet, misplice, veridical, vep, and spliceai. these tools vary in their inputs and methodology for identifying splice altering variants (figure a).
both vep and spliceai only consider information about the variant and its genomic sequence context and do not consider information from a sample’s transcriptome. a variant is considered to be splice relevant according to vep if it occurs within - bases on the exonic side or - bases on the intronic side of a splice site. spliceai does not have restrictions on where the variant can occur in relation to the splice site but by default, it predicts one new donor and acceptor site within bp of the variant, based on reference transcript sequences from gencode. like regtools, savnet, misplice, and veridical integrate genomic and transcriptomic data in order to identify splice altering variants. misplice only considers junctions that occur within bp of the variant. additionally, savnet, misplice, and veridical filter out any transcripts found within the reference transcriptome. savnet, misplice, and veridical employ different statistical methods for the identification of splice altering variants. in contrast to regtools, none of the mentioned tools allow the user to set a custom window in which they wish to focus splice altering variant discovery (e.g. around the splice site, all exonic variants, etc.). these tools have different levels of code availability. misplice is available via github as a collection of perl scripts that are built to run via load sharing facility (lsf) job scheduling. to run misplice without an lsf cluster, the authors mention code changes are required. veridical is available via a subscription through cytognomix’s mutationforecaster. similar to regtools, savnet is available via github or through a docker image. however, savnet relies on splicing junction files generated by star whereas regtools can use rna-seq alignment files from hisat , tophat , or star, thus allowing it to be integrated into bioinformatic workflows more easily. 
in their recent publications, savnet, misplice, and veridical also analyzed data from tcga, with only minor differences in the number of samples included for each study. vep and spliceai results were obtained by running each tool on all starting variants for the cohorts included in this study. to compare these data efficiently, an upset plot (figure b) was created. only variants are identified as splice altering by all six tools. comparatively, misplice and savnet find few splice altering variants, potentially indicating that these tools are overlooking the complete set of variants that have an effect on splicing. in contrast, veridical identifies by far the most splice altering variants across all tools, with . percent of its calls being found by it alone. spliceai and vep called a large number of variants, either alone or in agreement, that none of the tools that integrate transcriptomic data from samples identify. this highlights a limitation of using tools that only focus on genomic data, particularly in a disease context where transcripts are unlikely to have been annotated before. regtools addresses these shortcomings by identifying what pieces of information to extract from a sample’s genome and transcriptome in a very basic, unbiased way that allows for generalization. other tools either only analyze genomic data, focus on junctions where either the canonical donor or acceptor site is affected (missing junctions that result from complete exon skipping), or consider only those variants within a very narrow distance from known splice sites.
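the counts displayed in an upset plot are the sizes of the exclusive intersections between sets, that is, for each combination of tools, the number of variants called by exactly those tools and no others. a minimal python sketch of that computation follows; the function name, the variant identifiers, and the toy call sets are hypothetical and do not reproduce the data behind the figure.

```python
from collections import Counter


def exclusive_intersections(calls_by_tool):
    """Count variants by the exact set of tools that called them.

    calls_by_tool: dict mapping a tool name to the set of variant identifiers
                   it reported as splice altering.
    Returns a Counter keyed by frozenset of tool names.
    """
    tools_by_variant = {}
    for tool, variants in calls_by_tool.items():
        for variant in variants:
            tools_by_variant.setdefault(variant, set()).add(tool)
    return Counter(frozenset(tools) for tools in tools_by_variant.values())


if __name__ == "__main__":
    # Toy call sets; real inputs would hold one identifier per variant.
    calls = {
        "regtools": {"v1", "v2", "v3"},
        "vep": {"v2", "v3", "v4"},
        "spliceai": {"v3", "v5"},
    }
    counts = exclusive_intersections(calls)
    for tools, n in sorted(counts.items(), key=lambda item: -item[1]):
        print(sorted(tools), n)
```

because every variant is assigned to exactly one combination, the counts sum to the number of distinct variants called by any tool.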
regtools can include any kind of junction type, including exon-exon junctions that have ends that are not known donor/acceptor sites according to the gtf file (n junction according to regtools), any distance size to make variant-junction associations, and any window size in which to consider variants. due to these advantages, regtools identified events missed by one or multiple of the tools to which we compared (figure b; supplementary figures and ).

pan-cancer analysis reveals novel splicing patterns within known cancer genes and potential cancer drivers

while efforts have been made to associate variants with specific cancer types, there has been little focus on identifying such associations in splice-altering variants, even those in known cancer genes. tp is a rare example whose splice-altering variants are well characterized in numerous cancer types. as such, we further analyzed significant events to identify genes that had recurrent splice altering variants. within each cohort, we looked for recurrent genes using two separate metrics: a binomial test p-value and the fraction of samples (see methods). for ranking and selecting the most recurrent genes, each metric was computed by pooling across all cohorts. for assessing cancer-type specificity, each metric was then also computed using only results from a given cancer cohort. since the mechanisms underlying the creation of novel junctions versus the disruption of existing splicing patterns may be different, analysis was performed separately for d/a/nda junctions (figure , supplementary figure , supplementary file ) and da junctions (supplementary figure , supplementary file ), which allowed multiple test correction in accordance with the noise of the respective data. we identified , genes in which there was at least one variant predicted to influence the splicing of a d/a/nda junction.
the th percentile of these genes, when ranked by either metric, are significantly enriched for known cancer genes, as annotated by the cgc (p= . e- , ranked by binomial p-values, p= . e- , ranked by fraction of samples; hypergeometric test). we also identified , genes in which there was at least one variant predicted to influence the splicing of a da (known) junction. the th percentile of these genes, when ranked by either metric, are also significantly enriched for known cancer genes, as annotated by the cancer gene census (p= . e- , ranked by binomial p-values, p= . e- , ranked by fraction of samples; hypergeometric test). we also performed the same analyses using either the tcga or mgi cohorts alone. the tcga-only analyses gave very similar results to the combined analyses, with the th percentile of genes found in the d/a/nda and da analyses again being enriched for cancer genes (supplementary figures and ; supplemental files and ). due to small cohort sizes, in the mgi-only analyses, we identified only and genes in the d/a/nda and da analyses, respectively. the th percentile of genes from these analyses, respectively, were not significantly enriched for cancer genes (supplementary figures and ; supplemental files and ).

when analyzing d, a, and nda junctions, we saw an enrichment for known tumor suppressor genes among the most splice disrupted genes, including several examples where splice disruption is a known mechanism such as tp , pten, cdkn a, and rb .
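the enrichment of known cancer genes among the top-ranked genes is assessed above with a hypergeometric test. as a worked illustration of that kind of test, the sketch below computes the upper-tail probability directly from binomial coefficients. the choice of background population, the way the top-ranked gene set is selected, and the numbers in the usage lines are assumptions of the example and are not taken from the analysis.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, selected, overlap):
    """One-sided hypergeometric p-value, P(X >= overlap).

    population: number of genes in the background set
    annotated:  number of background genes that are known cancer genes
    selected:   number of top-ranked genes tested for enrichment
    overlap:    number of selected genes that are known cancer genes
    """
    total = comb(population, selected)
    upper = min(selected, annotated)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 10 genes, 4 annotated, 3 selected, 2 of them annotated.
    print(hypergeom_enrichment_p(10, 4, 3, 2))
```

for genome-scale inputs the same quantity is available from statistical libraries; the explicit sum is used here only to keep the example self-contained.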
specifically, in the case of tp , we identified variants that were significantly associated with at least one novel splicing event. one such example is the intronic snv (grch , chr :g. c>a) that was identified in an oscc sample and was associated with an exon skipping event and an alternate acceptor site usage event, with and reads of support, respectively (supplemental figure ). the cancer types in which we find splice disruption of tp and other known cancer genes are in concordance with associations between genes and cancer types described by cgc and chasmplus. our analysis’s recovery of known drivers, many of which have known susceptibilities to splicing dysregulation in cancer, indicates the ability of our method to identify true splicing effects that are likely cancer-relevant.

another cancer gene that we found to have a recurrence of splice altering variants was b m. specifically, we identified six samples with intronic variants on either side of exon (figure ). while mutations have been identified and studied within exon , we did not find literature that specifically identified intronic variants near exon as a mechanism for disrupting b m. these mutations were identified by vep to be either a splice acceptor variant or a splice donor variant and were also identified by veridical. misplice was able to predict one of the novel junctions for each variant but failed to predict additional novel junctions due to the limitation of that tool to only predict one novel acceptor and donor site per variant. notably, out of the samples that these variants were found in are msi-h (microsatellite instability-high) tumors. mutations in b m, particularly within colorectal msi-h tumors, have been identified as a method for tumors to become incapable of hla class i antigen-mediated presentation. furthermore, in a study of patients treated with immune checkpoint blockade (icb) therapy, defects to b m were observed in . % of patients with progressing disease.
in the same study, b m mutations were exclusively seen in pretreatment samples from patients who did not respond to icb or in post-progression samples after initial response to icb. several genes that are responsible for the processing, loading, and presentation of antigens have been shown to be mutated in cancers. however, no proteins can be substituted for b m in hla class i presentation, thus making the loss of b m a particularly robust method for icb resistance. we also observe exonic variants and variants further in intronic regions that disrupt canonical splicing of b m. these findings indicate that intronic variants that result in alternative splice products within b m may be a mechanism for immune escape within tumor samples.

we also identify recurrent splice altering variants in genes not known to be cancer genes (according to cgc), such as rnf . regtools identified a recurrent single base pair deletion that results in an exon skipping event of exon (supplementary figure ). this gene is a paralog of rnf , which has been found to be mutated in several cancer types. this variant junction association was found in stad, ucec, coad, and esca tumors, all of which are considered to be msi-h tumors. after analyzing the effect of the exon skipping event on the mrna sequence, we concluded that the reading frame remains intact, possibly leading to a gain of function event. additionally, the skipping of exon leads to the removal of a transmembrane domain and a phosphorylation site, s , which could be important for the regulation of this gene. based on these findings, rnf may play a role similar to rnf and may be an important driver event in certain tumor samples.

while most of our analysis focused on splice altering variants that resulted in d, a, or nda junctions, we also wanted to investigate variants that shifted the usage of known donor and acceptor sites. through this analysis, we identified cdkn a, a tumor suppressor gene that is frequently mutated in numerous cancers, to have several variants that led to alternate donor usage (supplementary figure ). when these variants are present, an alternate known donor site is used that leads to the formation of the transcript enst . instead of enst . , the transcript that encodes p ink a, a known tumor suppressor. the transcript that results from use of this alternate donor site is missing the last twenty-eight amino acids that form the c-terminal end of p ink a. notably, this removes two phosphorylation sites within the p protein, s and s , which when phosphorylated promote the association of p ink a with cdk . this finding highlights the importance of including known transcripts in alternative splicing analyses as variants may alter splice site usage in a way that results in a known but pathogenic transcript product.

discussion

splice associated variants are often overlooked in traditional genomic analysis. to address this limitation, we created regtools, a software suite for the analysis of variants and junctions in a splicing context. by relying on well-established standards for analyzing genomic and transcriptomic data and allowing flexible analysis parameters, we enable users to apply regtools to a wide set of scientific methodologies and datasets. to ease the use and integration of regtools into analysis workflows, we provide documentation and example workflows at regtools.org and provide a docker image with all necessary software installed.
in order to demonstrate the utility of our tool, we applied regtools to , tumor samples across tumor types to profile the landscape of this category of variants. from this analysis, we report , variants that cause novel splicing events that were missed by vep or spliceai. only . percent of these mutations were previously discovered by similar attempts, while . percent are novel findings. we demonstrate that there are splice altering variants that occur beyond the splice site consensus sequence, shift transcript usage between known transcripts, and create novel exon-exon junctions that have not been previously described. specifically, we describe notable findings within b m, rnf , and cdkn a. these results demonstrate the utility of regtools in discovering novel splice-altering mutations and confirm the importance of integrating rna and dna sequencing data in understanding the consequences of somatic mutations in cancer. to allow further investigation of these identified events, we make all of our annotated result files (supplemental files - ) and recurrence analysis files (supplemental files - ) available. understanding the splicing landscape is crucial for unlocking potential therapeutic avenues in precision medicine and elucidating the basic mechanisms of splicing. the exploration of novel tumor-specific junctions will undoubtedly lead to translational applications, from discovering novel tumor drivers, diagnostic and prognostic biomarkers, and drug targets, to identifying a previously untapped source of neoantigens for personalized immunotherapy. while our analysis .cc-by-nc-nd . international licensea certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under the copyright holder for this preprint (which was notthis version posted january , . ; https://doi.org/ . / doi: biorxiv preprint https://doi.org/ . / http://creativecommons.org/licenses/by-nc-nd/ . 
focuses on splice altering variants within cancers, we believe regtools will play an important role in answering this broad range of questions by helping users extract splicing information from transcriptome data and linking it to somatic (or germline) variant calls. the computational efficiency of regtools and increasing availability and size of such datasets may also allow for improved understanding of splice regulatory motifs that have proven difficult to accurately define, such as exonic and intronic splicing enhancers and silencers. any group with paired dna and rna-seq data for the same samples stands to benefit from the functionality of regtools.

methods

software implementation

regtools is written in c++. cmake is used to build the executable from source code. we have designed the regtools package to be self-contained in order to minimize external software dependencies. a unix platform with a c++ compiler and cmake is the minimum prerequisite for installing regtools. documentation for regtools is maintained as text files within the source repository to minimize divergence from the code. we have implemented common file handling tasks in regtools with the help of open-source code from samtools/htslib and bedtools in an effort to ensure fast performance, consistent file handling, and interoperability with any aligner that adheres to the bam specification. statistical tests are conducted within regtools using the rmath framework. travis ci and coveralls are used to automate and monitor software compilation and unit tests to ensure software functionality. we utilized the google test framework to write unit tests.

regtools consists of a core set of modules for variant annotation, junction extraction, junction annotation, and gtf utilities. higher level modules such as cis-splice-effects make use of the lower level modules to perform more complex analyses.
we hope that bioinformaticians familiar with c/c++ can re-use or adapt the regtools code to implement similar tasks.

benchmarking

performance metrics were calculated for all regtools commands. each command was run with default parameters on a single blade server (intel(r) xeon(r) cpu e - v @ . ghz) with gb of ram and replicates for each data point (supplementary figure ). specifically for cis-splice-effects identify, we started with random selections of somatic variants, ranging from , - , , , across data subsets. using the output from cis-splice-effects identify, variants annotate was run on somatic variants from the subsets (range: - , ) predicted to have a splicing consequence. the function junctions extract was performed on the hcc tumor rna-seq data aligned with hisat to grch and randomly downsampled at intervals ranging from - %. using output from junctions extract, junctions annotate was performed for data subsets ranging from , - , randomly selected junctions.

benchmark tests revealed an approximately linear performance for all functions. variance between real and cpu time is highly dependent on the i/o speed of the write-disk and could account for artificially inflated real time values given multiple jobs writing to the same disk at once. the most computationally expensive function in a typical analysis workflow was junctions extract, which on average processed , reads/second (cpu) and took an average of . real vs . cpu minutes to run on a full bam file ( , , reads total).
the function junctions annotate was the next most computationally intensive function and took an average of . real/ . cpu minutes to run on , junctions, processing junctions/second (cpu). the other functions were comparatively faster with cis-splice-effects identify and variants annotate able to process , and variants per second (cpu), respectively. to process a typical candidate variant list of , , variants and a corresponding rna-seq bam file of , , reads with cis-splice-effects identify takes ~ . real/ . cpu minutes (supplementary figure ).

performance metrics were also calculated for the statistics script and its associated wrapper script that handles dividing the variants into smaller chunks for processing to limit ram usage. this command, compare_junctions, was benchmarked in january using amazon web services (aws) on a m . xlarge instance, based on the amazon linux ami, with gb of ram, vcpus, and a mounted tb ssd ebs volume with iops. these data were generated from running compare_junctions on each of the included cohorts, with the largest being our brca cohort ( samples), which processed . events per second (cpu).

using regtools to identify cis-acting, splice altering variants

regtools contains three sub-modules: “variants”, “junctions”, and “cis-splice-effects”. for complete instructions on usage, including a detailed workflow for how to analyze cohorts using regtools, please visit regtools.org.

variants annotate

this command takes a list of variants in vcf format. the file should be gzipped and indexed with tabix. the user must also supply a gtf file that specifies the reference transcriptome used to annotate the variants. the info column of each line in the vcf is populated with comma-separated lists of the variant-overlapping genes, variant-overlapping transcripts, the distance between the variant and the associated exon edge for each transcript (i.e.
each start or end of an exon whose splice variant window included the variant) defined as min(distance_from_start_of_exon, distance_from_end_of_exon), and the variant type for each transcript. internally, this function relies on htslib to parse the vcf file and search for features in the gtf file which overlap the variant. the splice variant window size (i.e. the maximum distance from the edge of an exon used to consider a variant as splicing-relevant) can be set by the options “-e” and “-i” for exonic and intronic variants, respectively. the variant type for each variant thus depends on the options used to set the splice variant window size. variants captured by the window set by “-e” or “-i” are annotated as “splicing_exonic” and “splicing_intronic”, respectively. alternatively, to analyze all exonic or intronic variants, the “-E” and “-I” options can be used. the “-E” and “-I” options themselves do not change the variant type annotation, and the additional variants they capture are labeled simply as “exonic” or “intronic”. by default, single exon transcripts are ignored, but they can be included with the “-s” option. by default, output is written to stdout in vcf format. to write to a file, use the option “-o”.

junctions extract

this command takes an alignment file containing aligned rna-seq reads and infers junctions (i.e. exon-exon boundaries) based on skipped regions in alignments as determined by the cigar string operator codes. these junctions are written to stdout in bed format. alternatively, the output can be redirected to a file with the “-o” option.
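To illustrate how junctions can be inferred from cigar strings, the sketch below walks a single alignment and reports each skipped region (cigar operator N) together with the matched bases flanking it. This is a minimal Python illustration, not the regtools C++ implementation: the function name `junctions_from_cigar`, the 0-based half-open coordinates, and the simplified anchor definition (matched bases between neighbouring skips) are assumptions made for this example.

```python
import re

# One CIGAR element: a length followed by an operator code from the SAM specification.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def junctions_from_cigar(pos, cigar):
    """Return (intron_start, intron_end, left_anchor, right_anchor) tuples.

    pos is the 0-based leftmost reference coordinate of the alignment.
    intron_start/intron_end are 0-based, half-open reference coordinates
    of each skipped region (operator N). Anchors are counted here as the
    matched bases (M, =, X) between neighbouring skips, a simplification
    chosen for this sketch.
    """
    introns = []   # (start, end) of every skipped region
    blocks = []    # matched bases in each aligned block between skips
    ref = pos
    matched = 0
    for length_text, op in CIGAR_RE.findall(cigar):
        length = int(length_text)
        if op == "N":                 # skipped region: a candidate junction
            introns.append((ref, ref + length))
            blocks.append(matched)
            matched = 0
            ref += length
        elif op in ("M", "=", "X"):   # aligned bases consume the reference
            matched += length
            ref += length
        elif op == "D":               # deletions consume the reference only
            ref += length
        # I, S, H and P do not consume the reference
    blocks.append(matched)
    return [
        (start, end, blocks[i], blocks[i + 1])
        for i, (start, end) in enumerate(introns)
    ]


if __name__ == "__main__":
    for junction in junctions_from_cigar(100, "20M500N30M"):
        print(junction)
```

A read with several N operators yields one tuple per skipped region, so a single alignment can support more than one junction.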
regtools ascertains strand information based on the xs tags set by the aligner, but can also determine the inferred strand of transcription based on the bam flags if a stranded library strategy was employed. in the latter case, the strand specificity of the library can be provided using “-s ” where = unstranded, = first-strand/rf, = second-strand/fr. we suggest that users align their rna-seq data with hisat, tophat, or star, as these are the aligners we have tested to date. if rna-seq data is unstranded and aligned with star, users must run star with the --outSAMattributes option to include xs tags in the bam output.

users can set thresholds for minimum anchor length and minimum/maximum intron length. the minimum anchor length determines how many contiguous, matched base pairs on either side of the junction are required to include it in the final output. the required overlap can be observed amongst separated reads, whose union determines the thickstart and thickend of the bed feature. by default, a junction must have bp anchors on each side to be counted but this can be set using the option “-a”. the intron length is simply the end coordinate of the junction minus the start coordinate. by default, the junction must be between bp and , bp, but the minimum and maximum can be set using “-i” and “-I”, respectively.

for efficiency, this tool can be used to process only alignments in a particular region as opposed to analyzing the entire bam file. the option “-r :-” can be used to set a single contiguous region of interest. multiple jobs can be run in parallel to analyze separate non-contiguous regions.

junctions annotate

this command takes a list of junctions in bed format as input and annotates them with respect to a reference transcriptome in gtf format. the observed splice-sites used are recorded based on a reference genome sequence in fasta format.
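The anchor and intron length thresholds described above for junctions extract can be pictured as a filter over read-level junction observations. The following Python sketch is illustrative only: the function name `summarize_junctions` is hypothetical, it assumes that anchors are pooled across reads by taking the longest left and right anchor seen for each junction, that every supporting read is counted, and that the thresholds are passed explicitly. The numeric values in the usage example are arbitrary and are not the tool's defaults.

```python
from collections import defaultdict


def summarize_junctions(read_junctions, min_anchor, min_intron, max_intron):
    """Pool read-level observations and keep junctions that pass the thresholds.

    read_junctions: iterable of (chrom, start, end, left_anchor, right_anchor),
    one tuple per read supporting a junction.
    Returns a dict mapping (chrom, start, end) to the number of supporting reads.
    """
    # For each junction: [longest left anchor, longest right anchor, read count]
    pooled = defaultdict(lambda: [0, 0, 0])
    for chrom, start, end, left, right in read_junctions:
        entry = pooled[(chrom, start, end)]
        entry[0] = max(entry[0], left)
        entry[1] = max(entry[1], right)
        entry[2] += 1

    kept = {}
    for (chrom, start, end), (left, right, count) in pooled.items():
        intron_length = end - start  # end coordinate minus start coordinate
        has_anchors = left >= min_anchor and right >= min_anchor
        if has_anchors and min_intron <= intron_length <= max_intron:
            kept[(chrom, start, end)] = count
    return kept


if __name__ == "__main__":
    reads = [
        ("chr1", 100, 300, 3, 12),   # short left anchor on this read
        ("chr1", 100, 300, 10, 2),   # short right anchor on this read
        ("chr1", 500, 520, 20, 20),  # intron too short for the thresholds below
    ]
    # Arbitrary illustrative thresholds, not the tool's defaults.
    print(summarize_junctions(reads, min_anchor=8, min_intron=50, max_intron=1000))
```

In this example neither read supporting the first junction has sufficient anchors on both sides by itself, but together they do, which mirrors the statement that the required overlap can be observed amongst separated reads.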
the output is written to stdout in tsv format, with separate columns for the number of splicing acceptors skipped, number of splicing donors skipped, number of exons skipped, the junction type, whether the donor site is known, whether the acceptor site is known, whether this junction is known, the overlapping transcripts, and the overlapping genes, in addition to the chromosome, start, stop, junction name, junction score, and strand taken from the input bed file. this output can be redirected to a file with “-o /path/to/file”. by default, single exon transcripts are ignored in the gtf but can be included with the option “-s”.

cis-splice-effects identify

this command combines the above utilities into a pipeline for identifying variants which may cause aberrant splicing events by altering splicing motifs in cis. as such, it relies on essentially the same inputs: a gzipped and tabix-indexed vcf file containing a list of variants, an alignment file containing aligned rna-seq reads, a gtf file containing the reference transcriptome of interest, and a fasta file containing the reference genome sequence of interest.

first, the list of variants is annotated. the splice variant window size is set using the options “-e”, “-i”, “-E”, and “-I”, just as in variants annotate. the splice junction region size (i.e. the range around a particular variant in which an overlapping junction is associated with the variant) can be set using “-w”. by default, this range is not a particular number of bases but is calculated individually for each variant, depending on the variant type annotation.
for “splicing_exonic”, “splicing_intronic”, and “exonic” variants, the region extends from the ’ end of the exon directly upstream of the variant-associated exon to the ’ end of the exon directly downstream of it. for “intronic” variants, the region is limited to the intron containing the variant. single-exon transcripts can be kept with the “-s” option. the annotated list of variants in vcf format (analogous to the output of variants annotate) can be written to a file with “-v /path/to/file”.

the bam file is then processed in the splice junction regions to produce the list of junctions. a file containing these junctions in bed format (analogous to the output of junctions extract) can be written using “-j /path/to/file”. the minimum anchor length, minimum intron length, and maximum intron length can be set using the “-a”, “-i”, and “-I” options, just as in junctions extract.

the list of junctions produced by the preceding step is then annotated with the information presented in junctions annotate. additionally, each junction is annotated with a list of associated variants (i.e. variants whose splice junction regions overlapped the junction). the final output is written to stdout in tsv format (analogous to the output of junctions annotate) or can be redirected to a file with “-o /path/to/file”.

cis-splice-effects associate

this command is similar to cis-splice-effects identify, but takes the bed output of junctions extract in lieu of an alignment file with rna alignments. as with cis-splice-effects identify, each junction is annotated with a list of associated variants (i.e. variants whose splice junction regions overlapped the junction). the resulting output is then the same as cis-splice-effects identify, but limited to the junctions provided as input.
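The association step shared by cis-splice-effects identify and cis-splice-effects associate pairs each junction with the variants whose splice junction regions overlap it. A minimal Python sketch of that pairing follows. It assumes that the splice junction regions have already been computed and are supplied as 0-based half-open intervals; the function name `associate_variants` and the variant identifiers are hypothetical, and the quadratic scan stands in for whatever indexing the actual tool uses.

```python
def associate_variants(junctions, variant_regions):
    """Map each junction to the variants whose regions overlap it.

    junctions: list of (chrom, start, end) tuples.
    variant_regions: dict of variant id -> (chrom, region_start, region_end),
    where the region is the variant's splice junction region.
    Junctions without any overlapping region are omitted from the result.
    """
    associations = {}
    for junction in junctions:
        chrom, j_start, j_end = junction
        hits = sorted(
            variant_id
            for variant_id, (v_chrom, r_start, r_end) in variant_regions.items()
            # Two half-open intervals overlap when each starts before the other ends.
            if v_chrom == chrom and j_start < r_end and r_start < j_end
        )
        if hits:
            associations[junction] = hits
    return associations


if __name__ == "__main__":
    junctions = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
    regions = {
        "v1": ("chr1", 150, 400),
        "v2": ("chr1", 190, 550),
        "v3": ("chr2", 300, 400),
    }
    for junction, variants in associate_variants(junctions, regions).items():
        print(junction, variants)
```

Because a junction can fall inside the regions of several variants, each junction carries a list of variants, and each variant-junction pair can then be assessed separately.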
analysis

dataset description

cancer cohorts were analyzed from tcga. these cancer types are adrenocortical carcinoma (acc), bladder urothelial carcinoma (blca), brain lower grade glioma (lgg), breast invasive carcinoma (brca), cervical squamous cell carcinoma and endocervical adenocarcinoma (cesc), cholangiocarcinoma (chol), colon adenocarcinoma (coad), esophageal carcinoma (esca), glioblastoma multiforme (gbm), head and neck squamous cell carcinoma (hnsc), kidney chromophobe (kich), kidney renal clear cell carcinoma (kirc), kidney renal papillary cell carcinoma (kirp), liver hepatocellular carcinoma (lihc), lung adenocarcinoma (luad), lung squamous cell carcinoma (lusc), lymphoid neoplasm diffuse large b cell lymphoma (dlbc), mesothelioma (meso), ovarian serous cystadenocarcinoma (ov), pancreatic adenocarcinoma (paad), pheochromocytoma and paraganglioma (pcpg), prostate adenocarcinoma (prad), rectum adenocarcinoma (read), sarcoma (sarc), skin cutaneous melanoma (skcm), stomach adenocarcinoma (stad), testicular germ cell tumors (tgct), thymoma (thym), thyroid carcinoma (thca), uterine carcinosarcoma (ucs), uterine corpus endometrial carcinoma (ucec), and uveal melanoma (uvm). three cohorts were derived from patients at washington university in st. louis. these cohorts are hepatocellular carcinoma (hcc), oral squamous cell carcinoma (oscc), and small cell lung cancer (sclc).

sample processing

we applied regtools to tumor cohorts. genomic and transcriptomic data for cohorts were obtained from the cancer genome atlas (tcga). information regarding the alignment and variant calling for these samples is described by the genomic data commons data harmonization effort.
whole exome sequencing (wes) mutation calls for these samples from muse, mutect, varscan, and somaticsniper were left-aligned, trimmed, and decomposed to ensure the correct representation of the variants across the multiple callers.

samples for the remaining three cohorts, hcc, sclc, and oscc, were sequenced at washington university in st. louis. genomic data were produced by wes for sclc and oscc and whole genome sequencing (wgs) for hcc. normal genomic data of the same sequencing type and tumor rna-seq data were also available for all subjects. sequence data were aligned using the genome modeling system (gms) using tophat for rna and bwa-mem for dna. hcc and sclc were aligned to grch while oscc was aligned to grch . somatic variant calls were made using samtools v . . , somaticsniper v . . , strelka v . . . , and varscan v . . , through the gms. high-quality mutations for all samples were then selected by requiring that a variant be called by two of the four variant callers.

candidate junction filtering

to generate results for splice variant window sizes, we ran cis-splice-effects identify with sets of splice variant window parameters. for our “i e ” window (regtools default), to examine intronic variants within bases and exonic variants within bases of the exon edge, we set “-i -e ”. similarly, for “i e ”, to examine intronic variants within bases and exonic variants within bases of the exon edge, we set “-i -e ”. to view all exonic variants, we simply set “-E”, without “-i” or “-e” options. to view all intronic variants, we simply set “-I”, without “-i” or “-e” options.
tcga samples were processed with grch .d .vd .fa (downloaded from the gdc reference file page at https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files) as the reference fasta file and gencode.v .annotation.gtf (downloaded via the gencode ftp) as the reference transcriptome. oscc was processed with homo_sapiens.grch .dna_sm.primary_assembly.fa and homo_sapiens.grch . .gtf (both downloaded from ensembl). hcc and sclc were processed with homo_sapiens.grch .dna_sm.primary_assembly.fa and homo_sapiens.grch . .gtf (both downloaded from ensembl).

statistical filtering of candidate events

we refer to a statistical association between a variant and a junction as an “event”. for each event identified by regtools, a normalized score (norm_score) was calculated for the junction of the event by dividing the number of reads supporting that junction by the sum of all reads for all junctions within the splice junction region for the variant of interest. this metric is conceptually similar to a “percent-spliced in” (psi) index, but measures the presence of entire exon-exon junctions, instead of just the inclusion of individual exons. if there were multiple samples that contained the variant for the event, then the mean of the normalized scores for the samples was computed (mean_norm_score). if only one sample contained the variant, its mean_norm_score was thus equal to its norm_score. this value was then compared to the distribution of samples which did not contain the variant to calculate a p-value as the percentage of the norm_scores from these samples which are at least as high as the mean_norm_score computed for the variant-containing samples. we performed separate analyses for events involving canonical junctions (da) and those involving novel junctions which used at least one known splice site (d/a/nda), based on annotations in the corresponding reference gtf.
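The scoring just described, together with the Benjamini-Hochberg correction that is applied to the filtered events, can be sketched in a few lines of Python. This is an illustrative sketch rather than the analysis scripts distributed with regtools: the function names are hypothetical, the p-value is expressed as a fraction rather than a percentage, and a region with no junction reads is assumed to score zero.

```python
def norm_score(junction_reads, region_reads_total):
    """Reads supporting one junction divided by reads for all junctions in the region."""
    if region_reads_total == 0:
        return 0.0  # assumption for this sketch: no reads in the region scores zero
    return junction_reads / region_reads_total


def empirical_p_value(variant_scores, non_variant_scores):
    """Return (mean_norm_score, p_value) for one event.

    The p-value is the fraction of norm_scores from samples without the
    variant that are at least as high as the mean norm_score of the
    samples with the variant.
    """
    mean_norm_score = sum(variant_scores) / len(variant_scores)
    at_least_as_high = sum(1 for s in non_variant_scores if s >= mean_norm_score)
    return mean_norm_score, at_least_as_high / len(non_variant_scores)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    print(norm_score(5, 20))
    print(empirical_p_value([0.5, 0.25], [0.125, 0.375, 0.5, 0.0]))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

When only one sample carries the variant, `variant_scores` has a single element and the mean reduces to that sample's norm_score, as stated above.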
for this study, we filtered out any junctions which did not use at least one known splice site (n) and junctions which did not have at least reads of evidence across variant-containing samples. the benjamini-hochberg procedure was then applied to the remaining events. following correction, an event was considered significant if its adjusted p-value was ≤ . .

annotation with gtex junction data and other splice prediction tools

events identified by regtools as significant were annotated with information from gtex, vep, spliceai, misplice, and veridical. gtex junction information was obtained from the gtex portal. specifically, the exon-exon junction read counts file from the v release was used for data aligned to grch while the same file from the v release was used for the data aligned to grch . mappings between tumor cohorts and gtex tissues can be found in supplemental file . we annotated all starting variants with vep in the “per_gene” and “pick” modes. the “per_gene” setting outputs only the most severe consequence per gene while the “pick” setting picks one line or block of consequence data per variant. we considered any variant with at least one splicing-related annotation to be “vep significant”. all variants were also processed with spliceai using the default options. a variant was considered to be “spliceai significant” if it had at least one score greater than . , the developers’ value for high recall of their model. variants identified by misplice were obtained from the paper supplemental tables and were lifted over to grch .
variants identified by savnet were obtained from the paper supplemental tables and were lifted over to grch . variants identified by veridical were obtained via download from the link referenced within the manuscript and lifted over to grch .

visual exploration of statistically significant candidate events

igv sessions were created for each event identified by regtools that was statistically significant. each igv session file contained a bed file with the junction, a vcf file with the variant, and an alignment file for each sample that contained the variant. additional information, such as the splice sites predicted by spliceai, was also added to these session files to enhance the exploration of these events. events of interest were manually reviewed in igv to assess whether the association between the variant and junction made sense in a biological context (e.g. affected a known splice site, altered a genomic sequence to look more like a canonical splice site, or the novel junction disrupted active or regulatory domains of the protein product). an extensive review of literature and visualizations of junction usage in the presence and absence of the variant were also used to identify novel, biologically relevant events.

identification of genes with recurrent splice altering variants

for each cohort, we calculated a p-value to assess whether the splicing profile from a particular gene was significantly more likely to be altered by somatic variants. specifically, we performed a -tailed binomial test, considering the number of samples in a cohort as the number of attempts. success was defined by whether the sample had evidence of at least one splice-altering variant in that gene.
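The binomial test described above can be sketched as follows, treating the null probability of success as an input. This Python sketch makes two assumptions that are not taken from the text: that the tail of interest is the upper one (at least as many successes as observed), and that an exact sum over the binomial probability mass function is acceptable for the cohort sizes involved. The function name `binomial_upper_tail` and the numbers in the usage example are hypothetical.

```python
from math import comb


def binomial_upper_tail(successes, trials, p_null):
    """P(X >= successes) for X ~ Binomial(trials, p_null).

    trials is the number of samples in the cohort, successes the number of
    samples with at least one splice-altering variant in the gene, and
    p_null the null probability of success for a single sample.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


if __name__ == "__main__":
    # Hypothetical cohort: 3 samples, 1 with an event, null probability 0.5.
    print(binomial_upper_tail(1, 3, 0.5))
```

A small upper-tail probability indicates that more samples carry a splice-altering variant in the gene than the null probability would predict.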
the null probability of success, pnull was calculated as where s is the total number of base positions residing in any of the gene’s splice variant windows, v is the event that a somatic variant occurred at such a base position, and a is the event that this variant was deemed to be significantly associated with at least one junction in our analysis. the joint probability that both v and a occurred was estimated by dividing the total of events across all samples in which each junction was detected by s. the value of s was computed based on the exon and transcript definitions in the reference gtf used for performing regtools analyses on a given cohort.

we also calculated overall metrics, in order to rank genes. for each set of cohorts (e.g. tcga-only, mgi-only, combined), an overall p-value was computed for each gene according to the above formula, pooling all of the samples across the included cohorts, and the fraction of samples was simply calculated by dividing the number of samples in which an event occurred within the given gene by the total number of samples, pooled across the included cohorts. the reference gtf used for analyzing the tcga samples (i.e. gencode.v .annotation.gtf) was used for all sets of cohorts.

code availability

regtools is open source (mit license) and available at https://github.com/griffithlab/regtools/. all scripts used in the analyses presented here are also provided. for ease of use, a docker container has been created with regtools, r, and python installed (https://hub.docker.com/r/griffithlab/regtools/).
this docker container allows a user to run the workflow we outline at https://regtools.readthedocs.io/en/latest/workflow/. docker is an open-source software platform that enables applications to be readily installed and run on any system. the availability of regtools with all its dependencies as a docker container also facilitates the integration of the regtools software into workflow pipelines that support docker images.

data availability

sequence data for each cohort analyzed in this study are available through dbgap at the following accession ids: phs for tcga cohorts, phs for hcc, phs for sclc, and phs for oscc. statistically significant events for d, a, and nda junctions across the four variant splicing windows used are available via supplemental files and . statistically significant events for da junctions are available as supplemental files and . complete results of gene recurrence analysis are available as supplemental files and .

acknowledgments

we thank the patients and their families for donation of their samples and participation in clinical trials. we would like to thank donald conrad for his initial idea to compare to variant effect predictor tools. kelsy cotto was supported by siteman cancer center under fund number # - and t ca . avinash ramu was supported by the ‘burroughs wellcome fund institutional program unifying population and laboratory based sciences award’ at washington university. malachi griffith was supported by the national human genome research institute (nhgri) of the national institutes of health (nih) under award number r hg . malachi griffith and obi griffith were supported by the nih national cancer institute (nci) under award numbers u ca , u ca , u ca u ca . malachi griffith and megan richters were supported by the v foundation for cancer research under award number v - . the results published here are in whole or part based upon data generated by the tcga research network: https://www.cancer.gov/tcga.

contributions

k.c.c.
and y.-y.f. were involved in all aspects of this study, including designing methodology, developing and testing the tool software, analyzing and interpreting data, and writing the manuscript, with input from a.r., z.l.s., m.r., s.f., j.k., o.l.g., and m.g. a.r. designed the tool and led software development efforts. y.l., w.c.c., r.u., and r.g. provided unpublished tumor datasets and provided critical feedback on the manuscript. o.l.g. and m.g. supervised the study. all authors read and approved the final manuscript.

conflicts of interest

w. chapman serves on the advisory board for novartis pharmaceutical and reports intellectual property with pathfinder therapeutics. r. uppaluri reports grants and personal fees from merck inc. r. govindan served as consultant for horizon pharmaceuticals and geneplus.

references

chabot, b. & shkreta, l. defective control of pre-messenger rna splicing in human disease. j. cell biol. , – ( ).
vogelstein, b. et al. cancer genome landscapes. science , – ( ).
soemedi, r. et al. pathogenic variants that alter protein code often disrupt splicing. nat. genet. , – ( ).
supek, f., miñana, b., valcárcel, j., gabaldón, t. & lehner, b. synonymous mutations frequently act as driver mutations in human cancers. cell , – ( ).
jung, h. et al. intron retention is a widespread mechanism of tumor-suppressor inactivation. nat. genet. , – ( ).
venables, j. p. aberrant and alternative splicing in cancer. cancer res. , – ( ).
climente-gonzález, h., porta-pardo, e., godzik, a. & eyras, e. the functional impact of alternative splicing in cancer. cell rep. , – ( ).
chen, j. & weiss, w. a. alternative splicing in cancer: implications for biology and therapy. oncogene , – ( ).
xiong, h. y. et al. rna splicing. the human splicing code reveals new insights into the genetic determinants of disease. science , ( ).
yeo, g. & burge, c. b. maximum entropy modeling of short sequence motifs with applications to rna splicing signals. j. comput. biol. , – ( ).
fairbrother, w. g., yeh, r.-f., sharp, p. a. & burge, c. b. predictive identification of exonic splicing enhancers in human genes. science , – ( ).
wang, z. et al. systematic identification and analysis of exonic splicing silencers. cell , – ( ).
jaganathan, k. et al. predicting splicing from primary sequence with deep learning. cell , – .e ( ).
kahles, a., ong, c. s., zhong, y. & rätsch, g. spladder: identification, quantification and testing of alternative splicing events from rna-seq data. bioinformatics , – ( ).
trincado, j. l. et al. suppa : fast, accurate, and uncertainty-aware differential splicing analysis across multiple conditions. genome biol. , ( ).
kahles, a. et al. comprehensive analysis of alternative splicing across tumors from , patients. cancer cell , – .e ( ).
li, y. i. et al. annotation-free quantification of rna splicing using leafcutter. nat. genet. , – ( ).
monlong, j., calvo, m., ferreira, p. g. & guigó, r. identification of genetic variants associated with alternative splicing using sqtlseeker. nat. commun. , ( ).
li, y. i. et al. rna splicing is a primary link between genetic variation and disease. science , – ( ).
jayasinghe, r. g. et al.
systematic analysis of splice-site-creating mutations in cancer. cell rep. , – .e ( ).
viner, c., dorman, s. n., shirley, b. c. & rogan, p. k. validation of predicted mrna splicing mutations using high-throughput transcriptome data. f res. , ( ).
shirley, b. c., mucaki, e. j. & rogan, p. k. pan-cancer repository of validated natural and cryptic mrna splicing mutations. f res. , ( ).
shiraishi, y. et al. a comprehensive characterization of cis-acting splicing-associated variants in human cancer. genome res. , – ( ).
gtex consortium. the genotype-tissue expression (gtex) project. nat. genet. , – ( ).
mclaren, w. et al. the ensembl variant effect predictor. genome biol. , ( ).
li, h. et al. the sequence alignment/map format and samtools. bioinformatics , – ( ).
sondka, z. et al. the cosmic cancer gene census: describing genetic dysfunction across all human cancers. nat. rev. cancer , – ( ).
robinson, j. t. et al. integrative genomics viewer. nat. biotechnol. , – ( ).
dobin, a. et al. star: ultrafast universal rna-seq aligner. bioinformatics , – ( ).
kim, d., langmead, b. & salzberg, s. l. hisat: a fast spliced aligner with low memory requirements. nat. methods , – ( ).
kim, d. et al. tophat : accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. genome biol. , r ( ).
conway, j. r., lex, a. & gehlenborg, n. upsetr: an r package for the visualization of intersecting sets and their properties. bioinformatics , – ( ).
surget, s., khoury, m. p. & bourdon, j.-c.
uncovering the role of p splice variants in human malignancy: a clinical perspective. onco. targets. ther. , – ( ). . tokheim, c. & karchin, r. chasmplus reveals the scope of somatic missense mutations driving human cancers. cell syst , – .e ( ). . bicknell, d. c., kaklamanis, l., hampson, r., bodmer, w. f. & karran, p. selection for β - microglobulin mutation in mismatch repair-defective colorectal carcinomas. curr. biol. , – ( ). . bonneville, r. et al. landscape of microsatellite instability across cancer types. jco precis oncol , ( ). . kloor, m. et al. immunoselective pressure and human leukocyte antigen class i antigen .cc-by-nc-nd . international licensea certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under the copyright holder for this preprint (which was notthis version posted january , . ; https://doi.org/ . / doi: biorxiv preprint https://doi.org/ . / http://creativecommons.org/licenses/by-nc-nd/ . / machinery defects in microsatellite unstable colorectal cancers. cancer res. , – ( ). . sade-feldman, m. et al. resistance to checkpoint blockade therapy through inactivation of antigen presentation. nat. commun. , ( ). . seliger, b., maeurer, m. j. & ferrone, s. antigen-processing machinery breakdown and tumor growth. immunol. today , – ( ). . güssow, d. et al. the human beta -microglobulin gene. primary structure and definition of the transcriptional unit. j. immunol. , – ( ). . wang, l., yin, w. & shi, c. e ubiquitin ligase, rnf , inhibits the progression of tongue cancer. bmc cancer , ( ). . hornbeck, p. v. et al. phosphositeplus, : mutations, ptms and recalibrations. nucleic acids res. , d – ( ). . zhao, r., choi, b. y., lee, m.-h., bode, a. m. & dong, z. implications of genetic and epigenetic alterations of cdkn a (p (ink a)) in cancer. ebiomedicine , – ( ). . gump, j., stokoe, d. & mccormick, f. phosphorylation of p ink a correlates with cdk association. j. 
biol. chem. , – ( ). . quinlan, a. r. bedtools: the swiss-army tool for genome feature analysis. curr. protoc. bioinformatics , . . – ( ). . li, h. tabix: fast retrieval of sequence features from generic tab-delimited files. bioinformatics , – ( ). . gdc data processing. https://gdc.cancer.gov/about-data/gdc-data-processing. . fan, y. et al. accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling for sequencing data. biorxiv ( ) doi: . / . . cibulskis, k. et al. sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. nat. biotechnol. , – ( ). .cc-by-nc-nd . international licensea certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under the copyright holder for this preprint (which was notthis version posted january , . ; https://doi.org/ . / doi: biorxiv preprint https://doi.org/ . / http://creativecommons.org/licenses/by-nc-nd/ . / . koboldt, d. c. et al. varscan : somatic mutation and copy number alteration discovery in cancer by exome sequencing. genome res. , – ( ). . larson, d. e. et al. somaticsniper: identification of somatic point mutations in whole genome sequencing data. bioinformatics , – ( ). . griffith, m. et al. genome modeling system: a knowledge management platform for genomics. plos comput. biol. , e ( ). . li, h. & durbin, r. fast and accurate short read alignment with burrows-wheeler transform. bioinformatics , – ( ). . saunders, c. t. et al. strelka: accurate somatic small-variant calling from sequenced tumor–normal sample pairs. bioinformatics , – ( ). .cc-by-nc-nd . international licensea certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under the copyright holder for this preprint (which was notthis version posted january , . ; https://doi.org/ . 
main figures
figure : flexible, streamlined discovery of cis-acting splice variants with regtools modules and cis-splice-effects identify workflow. a) by default, variants annotate marks variants within bp on the exonic side and bp on the intronic side of an exon edge as potentially splicing-relevant. this “splice variant window” can be modified individually for the exonic side and intronic side using the “-e” and “-i” options, respectively. with cis-splice-effects identify, for each variant in the splice variant window, a “splice junction region” is determined by finding the largest span of sequence space between the exons which flank the exon associated with the splicing-relevant variant. the splice junction region can also be set manually to contain the entire sequence space n bases upstream and downstream of the variant using the “-w” option. junctions overlapping the splice junction region are associated with the variant. using the “-E” option considers all exonic variants as potentially splicing-relevant, but is otherwise the same. the “-I” option considers all intronic variants and also limits the splice junction region to the intronic region in which the variant is found, excluding the flanking exons. b) cis-splice-effects identify and the underlying junctions annotate command annotate splicing events based on whether the donor and acceptor site combination is found in the reference transcriptome gtf. in this example, there are two known transcripts (shown in blue) which overlap a set of junctions from rna-seq data (depicted as junction-supporting reads in red). comparing the observed junctions to the reference junctions in the first transcript (top panel), regtools checks whether the observed donor and acceptor splice sites are found in any of the reference exons and also counts the number of exons, acceptors, and donors skipped by a particular junction. double arrows represent matches between observed and reference acceptor/donor sites, while single arrows show novel splice sites. these steps are repeated for the rest of the relevant transcripts, keeping track of whether there are known acceptor-donor combinations. junctions with a known donor but novel acceptor or vice-versa are annotated as “d” or “a”, respectively. if both sites are known but do not appear in combination in any transcript, the junction is annotated as “nda”, whereas if both sites are unknown, the junction is annotated as “n”. if the junction is known to the reference gtf, it is marked as “da”. c) the cis-splice-effects identify command relies on the variants annotate, junctions extract, and junctions annotate submodules. this pipeline takes variant calls and rna-seq alignments along with genome and transcriptome references and outputs information about novel junctions and associated potential cis splice-altering sequence variants. regtools is agnostic to downstream research goals; its output can be filtered through user-specific methods and thus can be applied to a broad set of scientific questions.
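The window test from panel a and the junction labels from panel b can be sketched in a few lines of Python. This is an illustration of the rules as the caption states them, not the regtools implementation: the function names, the representation of a junction as a (donor, acceptor) coordinate pair, the inclusive exon coordinates, and the window sizes in the example are all assumptions made here, and strand, exon-skipping counts, and the splice junction region are ignored.

```python
from typing import Iterable, Set, Tuple

# Assumption: a junction is a (donor_position, acceptor_position) pair; strand is ignored.
Junction = Tuple[int, int]


def in_splice_variant_window(variant_pos: int, exon_start: int, exon_end: int,
                             exonic_bp: int, intronic_bp: int) -> bool:
    """Return True if the variant lies within `exonic_bp` bases on the exonic
    side or `intronic_bp` bases on the intronic side of either exon edge.
    Exon coordinates are treated as inclusive (a hypothetical convention)."""
    near_start = (exon_start - intronic_bp
                  <= variant_pos
                  <= min(exon_start + exonic_bp - 1, exon_end))
    near_end = (max(exon_end - exonic_bp + 1, exon_start)
                <= variant_pos
                <= exon_end + intronic_bp)
    return near_start or near_end


def annotate_junction(observed: Junction,
                      transcripts: Iterable[Iterable[Junction]]) -> str:
    """Label an observed junction against the reference junctions of all
    overlapping transcripts, using the categories described in the caption."""
    known_donors: Set[int] = set()
    known_acceptors: Set[int] = set()
    known_pairs: Set[Junction] = set()
    for junctions in transcripts:
        for donor, acceptor in junctions:
            known_donors.add(donor)
            known_acceptors.add(acceptor)
            known_pairs.add((donor, acceptor))

    donor, acceptor = observed
    if (donor, acceptor) in known_pairs:
        return "da"   # donor-acceptor combination present in the reference
    if donor in known_donors and acceptor in known_acceptors:
        return "nda"  # both sites known, but never used together
    if donor in known_donors:
        return "d"    # known donor, novel acceptor
    if acceptor in known_acceptors:
        return "a"    # known acceptor, novel donor
    return "n"        # both sites novel


if __name__ == "__main__":
    # Two hypothetical reference transcripts, each given as a list of junctions.
    reference = [[(100, 200), (300, 400)], [(150, 400)]]
    for junction in [(100, 200), (100, 400), (100, 250), (120, 400), (120, 250)]:
        print(junction, annotate_junction(junction, reference))
    # Window sizes below are illustrative values, not defaults taken from the text.
    print(in_splice_variant_window(95, 100, 200, exonic_bp=5, intronic_bp=10))
```

Collecting donors, acceptors, and pairs across all transcripts before labelling mirrors the caption's point that a junction is only "nda" once every relevant transcript has been checked for the combination.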
figure . overview of input data considered and significant events identified by regtools for each tumor type. a) summary of initial variants considered for analysis by regtools per sample per tumor cohort. each sample’s variant count is plotted and violin plots are overlaid for each cohort. b) summary of unique exon-exon junction observations for each sample. each sample’s unique junction count is plotted and violin plots are overlaid for each cohort. c) summary of significant junction types for each cohort across each of the variant window sizes that were used in this analysis.
figure . splice regulatory variants often lead to the expression of multiple alternative junctions. a) a single variant can result in one or more alternatively spliced junctions. depicted are a variant resulting in a single novel transcript product (purple), a variant resulting in two novel transcript products that both use alternate donor sites (yellow), and a variant resulting in multiple junctions of different types (teal). b) stacked bar graph visualizing how often a variant leads to each of the categories mentioned above across the four regtools variant windows used. this analysis covers all variants that regtools identified as significant. c) bar chart showing how often each of the described junction combinations occurs when a single variant results in multiple junction types across each of the regtools splice variant windows used.
figure . comparison of regtools with other tools that identify potential splice-altering variants. a) conceptual diagram of the contrasting approaches used by tools/methods that identify splice regulatory variants. a red dot indicates that the source only considers genomic data for making its calls, as opposed to a combination of genomic and transcriptomic data. b) upset plot comparing splice-altering variants identified by regtools to those identified by other splice variant predictors and annotators. each tool and its total number of variant predictions are shown in the bar graph on the left side. the numbers of variants specific to each tool or shared between different combinations of tools are indicated by the bar graph along the top and the individual or connected dots.
figure . pan-cancer analysis of cohorts from tcga and mgi reveals genes recurrently disrupted by variants which cause non-canonical splicing patterns. results of analysis for recurrently disrupted genes in each cohort. columns correspond to the most frequently recurring genes, as ranked by fraction of samples. genes are clustered by whether they were annotated by the cgc as an oncogene (red), an oncogene and tumor suppressor gene (yellow), a tumor suppressor gene (green), or another type of cancer-relevant gene. shading corresponds to −log (p value) and columns represent cancer types. red marks within cells indicate that the gene was annotated by chasmplus as a driver within a given tcga cohort.
figure . several snvs in b m associated with alternate acceptor and alternate donor usage. a) igv snapshot of three intronic variant positions found to be associated with usage of an alternate acceptor and alternate donor site that leads to formation of novel transcript products. this result was found using the default splice variant window parameter (i e ). b) zoomed-in view of the variants identified by regtools that are associated with alternate acceptor and donor usage. two of these variant positions flank the acceptor site and one flanks the donor site that are being affected. c) sashimi plot visualizations for samples containing the identified variants that show alternate acceptor usage (red) or alternate donor usage (orange).
supplemental figures
supplementary figure . benchmarking of each regtools command. the total cpu time (system time + user time) and real time are plotted against the number of entries processed for each available regtools function using total replicates. for the cis-splice-effects identify/cis-splice-effects associate/variants annotate workflows, the number of entries corresponds to the number of somatic variants, whereas the number of entries in the junctions extract/junctions annotate/compare_junctions workflows corresponds to the number of reads processed from a downsampled bam file, the number of junctions processed, and the number of candidate variant junction pairings processed, respectively.
for compare_junctions, candidate variant junction pairings were compared across the number of samples in that cohort, with the largest being samples that comprise our brca cohort. loess curves are fitted onto each plot.
supplementary figure . summary of variants analyzed by regtools in each tumor cohort. summary of the starting number of high quality variants per sample, the number of initial variants considered for analysis by regtools for each variant window used per tumor cohort, and the number of significant variants for each variant window used per tumor cohort.
supplementary figure . visualization of junctions across cohorts. summary of the total junction read counts, unique junctions (all types), unique known (da) junctions, unique known (da) junctions not found in gtex, unique d, a, nda junctions, and unique d, a, nda junctions not found in gtex per sample per cohort.
supplementary figure : intronic snv in cttn associated with an exon skipping event. a) igv snapshot of a single nucleotide variant (grch , chr :g. g>c) within an intron of cttn in luad sample tcga- - - a. this variant is associated with an exon skipping event causing the formation of an nda junction, junc , which has reads of support. the variant was identified by regtools, vep, and veridical but no other tools. this result was found using the default splice variant window parameter (i e ). b) sashimi plot visualization of the novel junction.
supplementary figure : exonic snv in lztr associated with alternative donor usage. a) igv snapshot of a single nucleotide variant (grch , chr :g. g>c) within an exon of lztr in luad sample tcga- - - a. this variant is associated with the formation of an a junction, junc , which has reads of support. the variant was identified by regtools, vep, and spliceai but no other tools. this result was found using the default splice variant window parameter (i e ). b) sashimi plot visualization of the novel junction.
supplementary figure .
pan-cancer analysis of cohorts from tcga and mgi reveals genes recurrently disrupted by variants which cause non-canonical splicing patterns. results of analysis for recurrently disrupted genes in each cohort. a) rows correspond to the most frequently recurring genes, as ranked by binomial p-value. genes are clustered by whether they were annotated by the cgc as an oncogene (red), an oncogene and tumor suppressor gene (yellow), a tumor suppressor gene (green), or another type of cancer-relevant gene. shading corresponds to −log (p value) and columns represent cancer types. red marks within cells indicate that the gene was annotated by chasmplus as a driver within a given tcga cohort. b) rows correspond to the most frequently recurring genes, as ranked by fraction of samples. shading corresponds to the fraction of samples and columns represent cancer types. red marks within cells indicate that the gene was annotated by chasmplus as a driver within a given tcga cohort.
supplementary figure . pan-cancer analysis of cohorts from tcga and mgi reveals genes recurrently disrupted by variants which promote splicing of particular canonical junctions. results of analysis for recurrently disrupted genes in each cohort. a) rows correspond to the most frequently recurring genes, as ranked by binomial p-value. genes are clustered by whether they were annotated by the cgc as an oncogene (red), an oncogene and tumor suppressor gene (yellow), a tumor suppressor gene (green), or another type of cancer-relevant gene. shading corresponds to −log (p value) and columns represent cancer types.
red marks within cells indicate that the gene was annotated by chasmplus as a driver within a given tcga cohort. b) rows correspond to the most frequently recurring genes, as ranked by fraction of samples. shading corresponds to the fraction of samples and columns represent cancer types. red marks within cells indicate that the gene was annotated by chasmplus as a driver within a given tcga cohort.
supplementary figure . tcga pan-cancer analysis reveals genes recurrently disrupted by variants which cause non-canonical splicing patterns. results of analysis for recurrently disrupted genes in each tcga cohort. a) rows correspond to the most frequently recurring genes, as ranked by binomial p-value. genes are clustered by whether they were annotated by the cgc as an oncogene (red), an oncogene and tumor suppressor gene (yellow), a tumor suppressor gene (green), or another type of cancer-relevant gene. shading corresponds to −log (p value) and columns represent cancer types. red marks within cells indicate that the gene was annotated by chasmplus as a driver within a given tcga cohort. b) rows correspond to the most frequently recurring genes, as ranked by fraction of samples.
shading corresponds to the fraction of samples and columns represent cancer types. red marks within cells indicate that the gene was annotated by chasmplus as a driver within a given tcga cohort.
supplementary figure . tcga pan-cancer analysis reveals genes recurrently disrupted by variants which promote splicing of particular canonical junctions. results of analysis for recurrently disrupted genes in each tcga cohort. a) rows correspond to the most frequently recurring genes, as ranked by binomial p-value. genes are clustered by whether they were annotated by the cgc as an oncogene (red), an oncogene and tumor suppressor gene (yellow), a tumor suppressor gene (green), or another type of cancer-relevant gene. shading corresponds to −log (p value) and columns represent cancer types. red marks within cells indicate that the gene was annotated by chasmplus as a driver within a given tcga cohort. b) rows correspond to the most frequently recurring genes, as ranked by fraction of samples. shading corresponds to the fraction of samples and columns represent cancer types. red marks within cells indicate that the gene was annotated by chasmplus as a driver within a given tcga cohort.
supplementary figure . analysis of hcc, oscc, and sclc cohorts reveals genes recurrently disrupted by variants which cause non-canonical splicing patterns. results of analysis for recurrently disrupted genes in each mgi cohort. a) rows correspond to the most frequently recurring genes, as ranked by binomial p-value. shading corresponds to −log (p value) and columns represent cancer types.
b) rows correspond to the most frequently recurring genes, as ranked by fraction of samples. shading corresponds to the fraction of samples and columns represent cancer types.
supplementary figure . analysis of hcc, oscc, and sclc cohorts reveals genes recurrently disrupted by variants which promote splicing of particular canonical junctions. results of analysis for recurrently disrupted genes in each mgi cohort. a) rows correspond to the most frequently recurring genes, as ranked by binomial p-value. shading corresponds to −log (p value) and columns represent cancer types. b) rows correspond to the most frequently recurring genes, as ranked by fraction of samples. shading corresponds to the fraction of samples and columns represent cancer types.
supplementary figure : intronic snv in tp associated with alternative donor usage. a) igv snapshot of a single nucleotide variant (grch , chr :g. c>a) within an intron of tp in an oscc sample. this variant is associated with an exon skipping event with reads of support and an alternate acceptor site usage with reads of support. this result was found using the default splice variant window parameter (i e ). b) sashimi plot visualization of the novel junction.
supplementary figure : intronic deletion in rnf associated with alternative donor usage. a) igv snapshot of a single nucleotide variant (grch , chr :g. dela) within an intron of rnf in coad samples. this variant is associated with an exon skipping event with and reads of support for the samples shown. this result was found using the default splice variant window parameter (i e ). b) sashimi plot visualization of the novel junction.
supplementary figure : several snvs in cdkn a associated with alternate donor usage. a) igv snapshot of three variant positions in cdkn a found to be associated with usage of an alternate donor site that leads to formation of an alternate known transcript. this result was found using the default splice variant window parameter (i e ) for known (da) junctions.
b) zoomed-in view of the variants identified by regtools that are associated with alternate donor usage. two of these variant positions flank the donor site that is no longer being used. c) sashimi plot visualizations for samples containing the identified variants that show alternate donor usage.