Auto-CORPUS: Automated and Consistent Outputs from Research Publications

Yan Hu (a), Shujian Sun (a), Thomas Rowlands, Tim Beck (b), and Joram M. Posma (b)

Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, SW AZ, United Kingdom
Department of Genetics and Genome Biology, University of Leicester, LE RH, United Kingdom
Health Data Research (HDR) UK, United Kingdom
(a) These authors contributed equally. (b) These authors contributed equally.

Abstract

Motivation: The availability of improved natural language processing (NLP) algorithms and models enables researchers to analyse larger corpora using open source tools. Text mining of biomedical literature is one area in which NLP has been used in recent years and which retains large untapped potential. However, to generate corpora that can be analyzed using machine learning NLP algorithms, these need to be standardized. Summarizing data from literature to be stored in databases typically requires manual curation, especially for extracting data from result tables.

Results: We present here an automated pipeline that cleans HTML files from biomedical literature. The output is a single JSON file that contains the text for each section, table data in machine-readable format and lists of phenotypes and abbreviations found in the article. We analyzed a total of , open access articles from PubMed Central, from both genome-wide and metabolome-wide association studies, and developed a model to standardize the section headers based on the Information Artifact Ontology. Extraction of table data was developed on PubMed articles and fine-tuned using the equivalent publisher versions.

Availability: The Auto-CORPUS package is freely available with detailed instructions from GitHub at https://github.com/jmp /autocorpus/.
Information Artefact Ontology | natural language processing | text standardization
Correspondence: timbeck [at] leicester.ac.uk and jmp [at] ic.ac.uk

Introduction

Natural language processing (NLP) is a branch of artificial intelligence that uses computers to process, understand and use human language. NLP is applied in many different fields including language modelling, speech recognition, text mining and translation systems. In the biomedical realm, NLP has been applied to extract, for example, medication data from electronic health records and patient clinical history from clinical notes, significantly speeding up extraction that would otherwise be performed manually by experts ( ). Biomedical publications, unlike structured electronic health records, are semi-structured, and this makes it difficult to extract and integrate the relevant information ( ). The format of research articles differs between publishers, and sections describing the same entity, for example statistical methods, can be found in different locations in the document in different publications. Both unstructured text and semi-structured document elements, such as headings, main texts and tables, can contain important information that can be extracted using text mining ( ).

The development of the genome-wide association study (GWAS) has been driven by the ongoing revolution in high-throughput genomic screening and a deeper understanding of the relationship between genetic variations and diseases/traits ( ). In a typical GWAS, researchers collect data from study participants, use single nucleotide polymorphism (SNP) arrays to detect the common variants among participants, and conduct statistical tests to determine if the association between the variants and traits is significant.
The results are mostly represented in publication tables, but can also be found in the main text, and there are multiple community efforts to store these reported associations in queryable, online databases ( , ). These efforts involve time-intensive and costly manual data curation to transcribe results from the publications, and supplementary information, into databases. Summary-level GWAS results are generally reported in the literature according to community norms (e.g. a SNP associated with a phenotype with a probability value), hence NLP algorithms can be trained to recognize the formats in which data are reported to facilitate faster and scalable information extraction that is less prone to human error.

Development of effective automatic text mining algorithms for GWAS literature can also potentially benefit other fields in biomedical research, as the body of biomedical literature grows every day. Yet previous attempts at mining scientific literature focused mainly on information extraction from abstracts, and some on the main text, while for the most part ignoring tables.

Hu and Sun, et al. | bioRxiv | January , | – . bioRxiv preprint doi: https://doi.org/ . / . . . ; this version posted January , . The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND . International license (http://creativecommons.org/licenses/by-nc-nd/ . /).

To facilitate the process of preparing a corpus for NLP tasks such as named-entity recognition (NER), text classification or relationship extraction, we have developed an Automated pipeline for Consistent Outputs from Research Publications (Auto-CORPUS) as a Python package. The main aims of Auto-CORPUS are:
• To provide clean text outputs for each publication section with standardized section names
• To represent each publication’s tables in a JavaScript Object Notation (JSON) format to facilitate data import into databases
• To use the text outputs to find abbreviations used in the text

We exemplify the package on a corpus of , open access GWAS publications whose data have been manually added to the GWAS Central database to list phenotypes, SNPs and p-values found in the cleaned text (Figure ). In addition, we also include data on , + metabolome-wide association studies (MWAS) to ensure the methods are not biased towards one domain. MWAS focus on small molecules, some of which are end-products of cellular regulatory processes, that are the response of the human body to genetic or environmental variations ( ).

Materials and methods

Data. Hypertext Markup Language (HTML) files for , open access GWAS publications whose data exist in the GWAS Central database ( ) were downloaded from PubMed Central (PMC) in March . A further , open access publications of MWAS on cancer, gastrointestinal diseases, metabolic syndrome, sepsis and neurodegenerative, psychiatric, and brain illnesses were also downloaded in the same format. Publisher versions of ca. % of these publications were downloaded in July to test the algorithms on publications with different HTML formats. The GWAS dataset was randomly divided into training publications to develop algorithms, and a test set of the remaining publications.

Processing. HTML files were loaded using the BeautifulSoup HTML parser package (v . . ). BeautifulSoup was used to convert HTML files to tree-like structures, with each branch representing an HTML section and each leaf an HTML element. After HTML files were loaded, all superscripts, subscripts, and italics were converted to plain text. Using the default configuration, Auto-CORPUS extracts h1, h2 and h3 tags for titles and headings, and p tags for paragraph texts.
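As a minimal sketch of this extraction step, the example below collects heading tags (h1 to h3 here) and p tags while flattening inline markup such as italics and superscripts to plain text. It is not the Auto-CORPUS implementation: it uses Python's standard-library `html.parser` instead of BeautifulSoup so that it runs without dependencies, and the class name `SectionExtractor` and the output layout are hypothetical.

```python
import json
from html.parser import HTMLParser


class SectionExtractor(HTMLParser):
    """Hypothetical helper: group <p> texts under the preceding heading."""

    HEADING_TAGS = {"h1", "h2", "h3"}  # heading levels assumed for this sketch

    def __init__(self):
        super().__init__()
        self.sections = []        # [{"heading": str, "paragraphs": [str]}]
        self._current_tag = None  # heading or p tag currently being read
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Inline tags (i, sup, sub, ...) are ignored, so their text is kept
        # as plain text inside the enclosing heading or paragraph.
        if tag in self.HEADING_TAGS or tag == "p":
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current_tag:
            return
        text = " ".join("".join(self._buffer).split())
        if tag in self.HEADING_TAGS:
            self.sections.append({"heading": text, "paragraphs": []})
        elif self.sections:
            # Paragraphs that appear before any heading are skipped here.
            self.sections[-1]["paragraphs"].append(text)
        self._current_tag = None


if __name__ == "__main__":
    html = (
        "<h1>Title</h1><p>Intro text.</p>"
        "<h2>Methods</h2><p>The <i>P</i>-value was 5x10<sup>-8</sup>.</p>"
    )
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    print(json.dumps(parser.sections, indent=2))
```

Keeping each paragraph attached to its heading is what allows the later header-standardization steps to relabel whole sections at once.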
The headings and paragraphs are saved in a structured JavaScript Object Notation (JSON) file for each HTML file. Tables are extracted from the document using a different set of configuration files (separate configurations for different table structures can be defined and used) and saved in a new JSON model. This model ensures that tables of all formats and origins, not only those from GWAS publications, can be described in the same structured way, so that they can be used as input to rule-based or deep learning algorithms for data extraction. The data cells are stored in the “result” key, and their corresponding section name and header names are stored in the “section_name” and “columns” keys, respectively. Therefore, extracting relationships between cells only requires simple rules.

Fig. . Workflow of the Auto-CORPUS package.

Ontologies for entity recognition. The Information Artifact Ontology (IAO) was created to serve as a domain-neutral resource for the representation of types of information content entities such as documents, databases, and digital images ( ). We used the v - - model ( ), in which different terms exist that describe headers typically found in biomedical literature. The extracted headers in the JSON file were first mapped to the IAO terms using the Lexical OWL Ontology Matcher ( ). We then used fuzzy matching, with the fuzzywuzzy package (v . . ), to map headers to the preferred section header terms and synonyms, with a similarity threshold of . .
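The fuzzy-matching step can be illustrated with the following sketch. Python's standard-library `difflib` is swapped in for the fuzzywuzzy package so the example is self-contained; the synonym table is a small hypothetical excerpt rather than the IAO model, and the default threshold of 0.85 is an assumption made for illustration, not the value used in this work.

```python
from difflib import SequenceMatcher

# Hypothetical excerpt of section terms and synonyms (not the full IAO list).
SECTION_SYNONYMS = {
    "methods section": ["methods", "materials and methods", "experimental section"],
    "introduction section": ["introduction", "background"],
    "results section": ["results"],
}


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_header(header, threshold=0.85):
    """Return (term, score) for the best synonym, or (None, score) if below threshold."""
    best_term, best_score = None, 0.0
    for term, synonyms in SECTION_SYNONYMS.items():
        for synonym in synonyms:
            score = similarity(header, synonym)
            if score > best_score:
                best_term, best_score = term, score
    if best_score >= threshold:
        return best_term, best_score
    return None, best_score


if __name__ == "__main__":
    for header in ["Experemintal section", "Associated data"]:
        print(header, "->", map_header(header))
```

Headers that stay below the threshold are returned as unmapped and are left for a later, structure-based step.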
This threshold was validated by two independent researchers, who confirmed that all matches were accurate. After the direct IAO mapping and fuzzy matching, unmapped headers still exist. To map these headings, we developed a new method that represents headers as a directed graph (digraph), since headers are not repeated within a document, are sequential and have a set order that can be exploited. Digraphs consist of nodes (entities, headers) and edges (links between nodes), and the weight of the nodes and edges is proportional to the number of publications in which these are found. While digraphs from individual publications are acyclic, the combined graph can contain cycles, hence digraphs, as opposed to directed acyclic graphs, are used. Unmapped headers are assigned a section based on the digraph and the headers in the publication that could be mapped (anchor points). For example, at this point in this article the main headers are ‘abstract’ followed by ‘introduction’ and ‘materials and methods’, which could make up a digraph. Another article with headers ‘abstract’, ‘background’ and ‘materials and methods’ has two anchor points that match the digraph, and the unmapped header (‘background’) can be inferred to be ‘introduction’ because it appears between the anchor points (‘abstract’, ‘materials and methods’) in the digraph. We use this process to evaluate new potential synonyms for existing terms and to identify new potential terms for sections found in biomedical literature.

We used the Human Phenotype Ontology (HPO) to identify disease traits in the full texts. The HPO was developed with the goal of covering all common phenotypic abnormalities in human monogenic diseases ( ).

Use cases: regular expression algorithms. Abbreviations in the full text are found using an adaptation of a previously published methodology ( ) based on regular expressions, using the abbreviations package (v . . ). In brief, the algorithm first finds all brackets within a corpus.
If the number of words in a bracket is < , the algorithm considers whether the bracketed text could be an abbreviation. It searches, one by one, for the characters within the brackets in the text on either side of the brackets. The first character within the bracket must be the first character of one of these words, and the other characters within the bracket must be contained in the words that follow that word. We combine the output of the package with abbreviations defined in the abbreviations section (if found) from the IAO/digraph model.

For phenotype entity recognition, any abbreviations in paragraphs extracted from the full text are first replaced by their definition. This text is then tokenized using the spaCy package (v . ) (model en_core_web_sm) and compared against phenotypes and their synonyms defined by the HPO for disease trait matching. P-values and SNPs were identified in the full text and tables based on regular expressions, as they have a standard form. Pairs of p-value-SNP associations are found in the text using dependency parse trees ( ).

Use cases: deep learning-based named-entity recognition. The first example of a use case is to recognize the assay with which the data were acquired; however, no models exist for this purpose. We fine-tuned a pre-existing model trained for biomedical NER, the biomedical Bidirectional Encoder Representations from Transformers (BioBERT) ( ), using part of our corpus where only MWAS assays were tagged. We applied our fine-tuned model only on the paragraphs in the materials and methods sections to recognize the assays used. A second BioBERT-based model was fine-tuned on phenotypes, which already exist in the data, and enriched in phenotypes associated with the MWAS publications. This model was applied only on the abstract and paragraphs from the results section.
The third example was applied only on paragraphs from the results and discussion sections, using an existing model specifically trained to recognize chemical entities, ChemListem (v . . ) ( ).

Use cases: paragraph classification. Unmapped headers may be mapped to multiple sections if the anchor points are far apart. To test the applicability of a machine learning model to classify paragraphs, we trained a random forest classifier on a dataset consisting of , abstract paragraphs and non-abstract paragraphs. % of the data was used for training and the remainder as the test set.

Results

The order of sections in biomedical literature. A total of , headers were extracted from the , publications, mapped to IAO (v - - ) terms and visualized by means of a digraph with unique nodes and directed edges (Figure a). The major unmapped node is ‘associated data’, which is a header specific to PMC articles that appears at the beginning of each article before the abstract. The main structure of the biomedical articles that were analyzed is: abstract → introduction → materials → results → discussion → conclusion → acknowledgements → footnotes section → references. IAO has separate definitions for ‘materials’ (IAO: ), ‘methods’ (IAO: ) and ‘statistical methods’ (IAO: ) sections, hence they are separate nodes in the graph, and introduction is also often followed by headers that reflect the methods section (and synonyms). There is also a major directed edge from introduction directly to results, with materials and methods placed after the discussion and/or conclusion sections.
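The digraph of section order, and its use for assigning an unmapped header that sits between two anchor points, can be sketched as follows. This is a simplified illustration under stated assumptions rather than the Auto-CORPUS code: the function names are hypothetical, edge weights are plain publication counts, and a candidate is scored by the weaker of its two connecting edges, which is a choice made here for brevity.

```python
from collections import Counter


def build_digraph(documents):
    """Count, per directed edge, the number of documents containing it.

    Each document is a list of mapped section headers in reading order;
    headers are assumed not to repeat within a document.
    """
    edges = Counter()
    for headers in documents:
        for source, target in zip(headers, headers[1:]):
            edges[(source, target)] += 1
    return edges


def infer_header(edges, previous_anchor, next_anchor):
    """Return the section most often found between two anchor headers."""
    candidates = Counter()
    for (source, middle), weight in edges.items():
        if source == previous_anchor and (middle, next_anchor) in edges:
            # Score the path by its weaker edge (assumption of this sketch).
            candidates[middle] = min(weight, edges[(middle, next_anchor)])
    return candidates.most_common(1)[0][0] if candidates else None


if __name__ == "__main__":
    corpus = [
        ["abstract", "introduction", "materials and methods", "results"],
        ["abstract", "introduction", "materials and methods", "results", "discussion"],
        ["abstract", "introduction", "results", "materials and methods"],
    ]
    digraph = build_digraph(corpus)
    # An article with headers 'abstract', 'background', 'materials and methods':
    print(infer_header(digraph, "abstract", "materials and methods"))
```

When several candidate sections fit between the same anchors, a fuller version would return all of them rather than only the top-scoring one.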
All unmapped headers were investigated to evaluate whether some could be used as synonyms for existing categories. The digraph was also inspected by visualizing individual ego-networks, which show the edges around a specific node mapped to an existing IAO term. Figure b shows the ego-network for abstract; four main categories and one potential new synonym (precis, in red) were identified. The majority of unmapped headers (in purple) that follow the abstract relate to a document that is written as one coherent whole, with specific headers for each section or a general header for the full/main text. An additional four unmapped headers relate to ‘materials and methods’ in their broader sense, and these are data, data description, participants and sample. The remaining two categories of unmapped headers to/from abstract can be classified as new sections ‘graphical abstract’ and ‘highlights’. These headers were found alongside, and appear to be distinct from, the (textual) abstract.

Based on the digraph, we then assigned data and data description to be synonyms of the materials section, and participants and sample to a new category termed ‘participants’, which is related to, but deemed distinct from, the existing patients section (IAO: ). The same process was applied to ego-networks from other nodes linked to existing IAO terms to add additional synonyms and simplify the digraph. Figure c shows the resulting digraph with only existing and newly proposed section terms.

New proposed elements for the IAO. Each existing IAO term contains one or more synonyms, and extracted headers were first mapped directly to these terms. Any headers that could not be mapped directly are mapped in the second step using fuzzy matching (e.g. the typographical error ‘experemintal section’ in PMC is correctly mapped to the methods section).
The last step involves mapping the remaining unmapped headers to existing terms based on the digraph and using the structure (anchor headers) of the publication. Headers that can be mapped to existing terms in the second and third steps are included as synonyms in the model. The existing categories for which new potential synonyms were identified are listed in Table a and b with their existing synonyms and newly identified synonyms.

From the analysis of ego-networks, four new potential categories were identified: disclosure, graphical abstract, highlights and participants. Table details the proposed definition and synonyms for these categories. In the digraph in Figure c, the disclosure section is located towards the end of a publication and in some instances is followed by the conflict of interest section.

Table data extraction with different configurations. PMC articles are standardized, which makes data extraction more straightforward; however, some publications are not deposited into PMC or other repositories and can only be found via publisher websites. While the package has been developed using a large set of PMC articles, we compared the Auto-CORPUS output for PMC articles with the output for the equivalent articles made available by the publishers. We found no differences in how headers were extracted and paragraphs were classified based on the digraph. However, the representation of tables does differ substantially between publishers, hence a model developed on PMC articles alone will fail to extract the data. We circumvent this issue by defining configuration files for different table formats, and we compare the accuracy of the data represented in the JSON format (Figure ) between PMC and publisher versions of the same papers.

Using the default (PMC) configuration on non-PMC articles, none of the tables are represented accurately in the JSON.
Auto-CORPUS allows a variety of configuration files (a single file, or all as a batch) to be used to extract data from tables. One configuration file, different from the default, correctly represented the data in JSON format for % ( ) of tables. The remaining tables could be represented correctly using different configuration files. When the right configuration file is used for non-PMC articles, all tables ( %) are represented identically to the JSON output from the matching PMC version.

Use cases. The extracted paragraphs were classified into one (or more) categories based on the digraph. This is the purpose of the Auto-CORPUS package: to prepare a corpus for analysis so that different sections can be used for specific purposes. We detail how these standardized texts can be used for entity recognition.

Paragraph classification. While many headers can be mapped using fuzzy matching plus the digraph structure, some headers remain unmapped (e.g. the headers in purple in Figure b: full text, main text, etc.) while others can be assigned to multiple (possible) sections. The choice of assigning multiple categories to unmapped headers based on the digraph is deliberate, as it ensures the algorithm does not wrongly assign a header to only one category (e.g. ‘materials’ over ‘methods’). The next step is to perform the paragraph classification using NLP algorithms that learn from word usage and context. We show that random forests can be used to this end by training one to distinguish between abstracts and other paragraphs. Paragraphs from the test set were predicted using a random forest trained on , paragraphs. For the test set, we obtained an F1-score of . for classifying abstracts (precision = . , recall = . ) and . for classifying non-abstracts (precision = . , recall = . ).

Abbreviation identification. The abbreviation detection algorithm searches through each paragraph using a rule-based approach to find all abbreviations used.
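A rule-based search of this kind can be sketched as below. The example is a deliberately simplified variant, not the algorithm of the abbreviations package: it only accepts a bracketed short form when its characters match, in order, the initials of the words immediately before the bracket. The function name and the `max_words` parameter are hypothetical.

```python
import re


def find_abbreviations(text, max_words=2):
    """Return {short_form: long_form} for patterns like 'long form (SF)'."""
    found = {}
    for match in re.finditer(r"\(([^()]+)\)", text):
        short = match.group(1).strip()
        # Skip empty brackets, long bracketed phrases and non-alphabetic starts.
        if not short or len(short.split()) > max_words or not short[0].isalpha():
            continue
        letters = [c.lower() for c in short if c.isalnum()]
        words_before = text[:match.start()].split()
        candidate = words_before[-len(letters):]
        # Simplification: every character must be the initial of one word.
        if len(candidate) == len(letters) and all(
            word[0].lower() == letter for word, letter in zip(candidate, letters)
        ):
            found[short] = " ".join(candidate)
    return found


if __name__ == "__main__":
    sentence = (
        "We use natural language processing (NLP) and single nucleotide "
        "polymorphism (SNP) arrays (see below)."
    )
    print(find_abbreviations(sentence))
```

Because only initials are compared, short forms that draw on letters inside a word, or on hyphenated words such as "genome-wide", are missed by this sketch; matching characters within words, as described in the methods, is more permissive.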
Auto-CORPUS then investigates whether a paragraph is mapped to the abbreviations category and, if one is found, it combines the two lists of abbreviations found in the publication. For example, when applied to an MWAS publication ( ) which contains a header titled “abbreviations”, the algorithm combines the abbreviations listed by the authors with a further identified from the text (Figure ), including an abbreviation used with two spellings in the text.

Fig. . Digraph generated from analyzing section headers from , open access publications from PubMed Central. (a) Digraph of the v - - IAO model, consisting of unique nodes, of which could be directly mapped to section terms (in orange) and the remainder are unmapped headers (in grey), and directed edges. Relative node sizes and edge widths are directly proportional to the number of publications with these (subsequent) headers. Blue edges indicate the edge with the highest weight from the source node, edges that exist in fewer than % of publications are shown in light grey and the remainder in black. (b) Unmapped nodes connected to ‘abstract’ as ego node, excluding corpus-specific nodes, grouped into different categories. Unlabeled nodes are titles of paragraphs in the main text. (c) Final digraph model used in Auto-CORPUS to classify paragraphs after fuzzy matching. This model includes new (proposed) section terms and each section contains new synonyms identified in this analysis.
‘Associated data’ is included as this is a PMC-specific header found before abstracts and can be used to indicate the start of most articles.

Rule-based extraction of GWAS summary-level data. GWAS Central relies on curated data extracted manually from publications or other databases. We investigated whether a rule-based approach to recognize phenotypes, SNPs and p-values can correctly identify data from publications contained within the database. Applying the HPO in a rule-based approach to the GWAS publications from the test set identified a total of , unique disease traits (major and minor) in these publications. traits are recorded for these publications in GWAS Central and the rule-based approach found with a perfect match. For % of the publications all traits were correctly identified. SNPs have standardized formats, hence rule-based approaches are well suited for their identification. Likewise, p-values in GWAS publications are typically represented using scientific notation and can also be identified using rule-based methods. A total of , SNP/p-value pairs were found across the main text and tables of the publications. For . % of publications, all associations recorded in the GWAS Central database are also found using this approach. While . % of these publications present results (SNP/p-value pairs) only in tables, and . % of pairs are found in tables, associations were identified from the main text that are not represented in tables. , pairs match those recorded in the database (total of , pairs for these publications); however, many associations in the database are not represented in the main text/tables but in supplementary materials. Auto-CORPUS includes a separate function to convert CSV/TSV data to the table JSON format (Figure ), as summary-level results are often saved in these file formats as part of the supplementary information.

Named-entity recognition.
Three different deep learning models were used for NER on specific paragraphs of publications. A pre-trained biomedical entity recognition algorithm ( ) was fine-tuned using the results from the rule-based approach applied to GWAS data. Example sentences that contain HPO terms were used to fine-tune the transformer model, which was then applied to MWAS publications from four broad and distinct phenotypes (cancer, gastrointestinal diseases, metabolic syndrome, and neurodegenerative, psychiatric and brain illnesses). The fine-tuned deep learning algorithm obtained accuracies between . and . , averaging around . % (Table ).

We then fine-tuned the same base model for recognizing assays in text by training on sentences identified from the text that contain assays routinely used in MWAS. The first pass consisted of a rule-based approach, with fuzzy matching, to find sentences with terms, and these were then used to fine-tune the deep learning model. Figure shows the resulting output in JSON format for one MWAS publication ( ).
Lastly, we applied a domain-specific algorithm for recognizing chemical entities in the text and tables ( ) to identify metabolites in the same publication (Figure ).

Category (IAO identifier) | Existing synonyms (IAO v - - ) | New synonyms identified
Abstract (IAO: ) | abstract | precis
Acknowledgements (IAO: ) | acknowledgements, acknowledgments | acknowledgement, acknowledgment, acknowledgments and disclaimer
Author contributions (IAO: ) | author contributions, contributions by the authors | authors’ contribution, authors’ contributions, authors’ roles, contributorship, main authors by consortium and author contributions
Discussion (IAO: ) | discussion, discussion section | discussions
Footnote (IAO: ) | endnote, footnote | footnotes
Introduction (IAO: ) | background, introduction | introductory paragraph
Methods (IAO: ) | experimental, experimental procedures, experimental section, materials and methods, methods | analytical methods, concise methods, experimental methods, method, method validation, methodology, methods and design, methods and procedures, methods and tools, methods/design, online methods, star methods, study design, study design and methods
References (IAO: ) | bibliography, literature cited, references | literature cited, reference, references, reference list, selected references, web site references
Supplementary material (IAO: ) | additional information, appendix, supplemental information, supplementary material, supporting information | additional file, additional files, additional information and declarations, additional points, electronic supplementary material, electronic supplementary materials, online content, supplemental data, supplemental material, supplementary data, supplementary figures and tables, supplementary files, supplementary information, supplementary materials, supplementary materials figures, supplementary materials figures and tables, supplementary materials table, supplementary materials tables

Table a. Newly identified synonyms for existing IAO terms ( xx) from the digraph mapping of , publications. Elements in italics have previously been submitted by us for inclusion into IAO and added in the latest release (v - - ).

Discussion

The analysis of our corpus of , open access publications has resulted in identifying well over new synonyms for existing terms used in biomedical literature to indicate what a paragraph is about. In addition, we identified four new potential categories not previously included in the IAO. We previously submitted a subset of the synonyms reported here, and one of the new categories, for inclusion in the IAO. These have been accepted by the IAO and are included in the latest release (v - - ), hence we presented our analyses using the previous version of IAO, which does not include part of our work. In the latest release, the ‘graphical abstract’ section has been added (IAO: ) based on our contribution. A new ‘research participants’ (IAO: ) section has also been added in the same release as a contribution by others; therefore, synonyms found here for the new ‘participants’ category will in future be proposed as synonyms for the ‘research participants’ section. While the disclosure section appears to be distinct from the conflict of interest section due to a directed edge in the digraph, its synonyms could also be proposed to be part of the existing conflict of interest section in IAO.

Standardization of text for NLP is an important step in preparing a corpus. Auto-CORPUS outputs a JSON file of cleaned text, with standardized headers as well as all data presented in tables in JSON format. Standardizing headers is important because some sections are more important than
Category (IAO identifier) | Existing synonyms (IAO v - - ) | New synonyms identified
Abbreviations (IAO: ) | abbreviations, abbreviations list, abbreviations used, list of abbreviations, list of abbreviations used | abbreviation and acronyms, abbreviation list, abbreviations and acronyms, abbreviations used in this paper, definitions for abbreviations, glossary, key abbreviations, non-standard abbreviations, nonstandard abbreviations, nonstandard abbreviations and acronyms
Author information (IAO: ) | author information, authors’ information | biographies, contributor information
Availability (IAO: ) | availability, availability and requirements | availability of data, availability of data and materials, data archiving, data availability, data availability statement, data sharing statement
Conclusion (IAO: ) | concluding remarks, conclusion, conclusions, findings, summary | conclusion and perspectives, summary and conclusion
Conflict of interest (IAO: ) | competing interests, conflict of interest, conflict of interest statement, declaration of competing interests, disclosure of potential conflicts of interest | authors’ disclosures of potential conflicts of interest, competing financial interests, conflict of interests, conflicts of interest, declaration of competing interest, declaration of interest, declaration of interests, disclosure of conflict of interest, duality of interest, statement of interest
Consent (IAO: ) | consent | informed consent
Ethical approval (IAO: ) | ethical approval | ethics approval and consent to participate, ethical requirements, ethics, ethics statement
Funding source declaration (IAO: ) | funding, funding information, funding sources, funding statement, funding/support, source of funding, sources of funding | financial support, grants, role of the funding source, study funding
Future directions (IAO: ) | future challenges, future considerations, future developments, future directions, future outlook, future perspectives, future plans, future prospects, future research, future research directions, future studies, future work | outlook
Materials (IAO: ) | materials | data, data description
Statistical analysis (IAO: ) | statistical analysis | statistical methods, statistical methods and analysis, statistics
Study limitations (IAO: ) | limitations, study limitations | strengths and limitations, study strengths and limitations

Table b. Newly identified synonyms for existing IAO terms ( xx) from the digraph mapping of , publications. Elements in italics have previously been submitted by us for inclusion into IAO and added in the latest release (v - - ).
/ proposed category proposed definition proposed synonyms disclosure “a part of a document used to disclose any associations by authors that might be perceived as to potentially interfere with or prevent them from reporting research with complete objectivity.” author disclosure statement, declarations, disclosure, disclosure statement, disclosures graphical abstract “an abstract that is a pictorial summary of the main findings described in a document.” central illustration, graphical abstract, toc image, visual abstract highlights “a short collection of key messages that describe the core findings and essence of the article in concise form. it is distinct and separate from the abstract and only conveys the results and concept of a study. it is devoid of jargon, acronyms and abbreviations and targeted at a broader, non-technical audience.” author summary, editors’ summary, highlights, key points, overview, research in context, significance, toc participants “a section describing the recruitment of subjects into a research study. this section is distinct from the ‘patients’ section and mostly focusses on healthy volunteers.” participants, sample table . newly proposed categories of entities found in , publications in the biomedical literature that could not be mapped to existing terms in iao. elements in italics have previously been submitted by us for inclusion into iao and added in the latest release (v - - ). known phenotype papers accuracy cancer . gastrointestinal diseases . metabolic syndrome . neurodegenerative, psychiatric, brain illnesses . table . summary of results for named-entity recognition (ner) of phenotypes in mwas papers. others for specific tasks. 
for example, an introduction contains no new findings but is well suited to discovering the main phenotypes under study; details on how these phenotypes are studied, and with what technologies, are found only in materials/methods; and findings appear only in results (and discussion) sections. hence it is important to classify these paragraphs, and auto-corpus does this by using the structure of the publication and the digraph. we showed that the assignment can be improved further by training machine learning models that distinguish, with good accuracy, between different types of text in cases where there may be ambiguity; using a multi-class classifier and all paragraphs could improve this further. these data are then available for use in downstream analyses using dedicated algorithms for entity recognition or other methods.
auto-corpus is able to process all html formatted tables from both gwas and mwas corpora, as opposed to previous methods which could only operate on % of , tables ( ). it takes auto-corpus on average . seconds to process all tables within a publication, compared to several minutes if this is done manually. moreover, auto-corpus also supports parallel computing, thereby further reducing the time needed to process publications as these can be run in batch. the structured json output is machine readable and can be used to support data import into databases. here we used the json output of auto-corpus in several examples to demonstrate some potential use cases. we demonstrated that existing algorithms trained on biomedical data can be fine-tuned to recognize new entities such as assays and phenotypes, which also opens up the possibility of using these data to train new deep learning algorithms for recognizing new entities such as metabolites (as opposed to chemical entities), snps and p-values, as well as identifying the relationships between them from text.
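As a rough illustration of how the table JSON could feed a database import, the sketch below flattens the nested table structure into one record per data row. The field names ('status', 'error message', 'tables', 'identifier', 'title', 'columns', 'section', 'footer', 'section name', 'results') follow the figure caption in this paper, but their exact spelling, the list-of-rows layout of 'results', the "ok" status value and the helper name `iter_table_rows` are assumptions made for illustration; the toy input is hand-written and is not real auto-corpus output.

```python
import json


def iter_table_rows(tables_json):
    """Yield one record per data row, tagged with its table and section.

    Hypothetical helper: assumes 'section' is a list of objects and that
    'results' is a list of rows aligned with the table's 'columns'.
    """
    for table in tables_json.get("tables", []):
        columns = table.get("columns", [])
        for section in table.get("section", []):
            for row in section.get("results", []):
                yield {
                    "table": table.get("identifier"),
                    "section": section.get("section name"),
                    # Pair each cell with its column header.
                    "values": dict(zip(columns, row)),
                }


# Hand-written toy input with placeholder values (not real output).
example = json.loads("""
{
  "status": "ok",
  "error message": "",
  "tables": [
    {
      "identifier": "t1",
      "title": "toy table",
      "columns": ["phenotype", "papers"],
      "section": [
        {"section name": "group a",
         "results": [["cancer", "3"], ["asthma", "5"]]}
      ],
      "footer": ""
    }
  ]
}
""")

records = list(iter_table_rows(example))
```

Each record is a flat dictionary, so it could be passed to whatever insert routine a downstream database uses; the real schema should be checked against the package documentation before relying on these keys.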
ner algorithms have difficulty recognizing terms that are abbreviated; therefore the list of abbreviations found by auto-corpus can be used to replace all abbreviations in the text with their definitions.

conclusion
the auto-corpus package is freely available and can be deployed on local machines as well as using high-performance computing to process publications in batch. a step-by-step guide detailing how to use auto-corpus is supplied with the package. the key features of auto-corpus are that it: outputs all text and table data in a standardized json format; classifies each paragraph into separate categories of text; and is implemented in pure python code and does not have non-python dependencies.

fig. . example of json format for table data from this work (shown for table ). the auto-corpus output for tables consists of ‘status’, ‘error message’ and ‘tables’ as top level fields, ‘tables’ has fields ‘identifier’, ‘title’, ‘columns’, ‘section’ and ‘footer’, and ‘section’ contains ‘section name’ and ‘results’.
fig. . example of json output of abbreviation detection using a rule-based approach on an mwas publication ( ).
fig. . example of json output of named-entity recognition (ner) on an mwas publication ( ) using a fine-tuned transformer-based deep learning model for assays and bidirectional long-short term memory network for chemical entity recognition.
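The abbreviation replacement step mentioned in the discussion can be sketched as follows. This is a minimal illustration, not the package's own code: the function name `expand_abbreviations` and the plain short-form to long-form dictionary are assumptions, and the real abbreviation JSON may be laid out differently.

```python
import re


def expand_abbreviations(text, abbreviations):
    """Replace whole-token abbreviations in `text` with their definitions.

    `abbreviations` maps short forms to long forms (assumed layout).
    """
    if not abbreviations:
        return text
    # Longest short forms first, so overlapping candidates prefer the
    # longer match at any given position.
    shorts = sorted(abbreviations, key=len, reverse=True)
    # Lookarounds require that the match is not embedded in a longer word.
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(s) for s in shorts) + r")(?!\w)"
    )
    # A single pass avoids re-expanding text inside inserted definitions.
    return pattern.sub(lambda m: abbreviations[m.group(1)], text)


abbr = {
    "GWAS": "genome-wide association study",
    "MWAS": "metabolome-wide association study",
    "SNP": "single nucleotide polymorphism",
}
expanded = expand_abbreviations("GWAS and MWAS identify SNPs.", abbr)
```

Because only whole tokens are matched, a plural such as "SNPs" is left untouched in this sketch; handling inflected forms would need an extra rule.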
acknowledgements
we thank mohamed ibrahim (university of leicester) for identifying different configurations of tables for different html formats, and joy li and filip makraduli (imperial college london) for testing the package and providing feedback.

author contributions
tb and jmp designed and supervised the research. ss and yh developed the pipeline and analyzed data. ss developed the initial table extraction algorithm and implemented the phenotype recognition algorithm. yh developed the section header standardization algorithm and implemented the abbreviation recognition algorithm. ss fine-tuned the table extraction algorithm for use on non-pmc texts. tr refined standardization of full texts and contributed algorithms for utf- and utf- conversions of non-ascii characters to unicode. ss, yh, tb and jmp wrote the manuscript.

funding
this work has been supported by health data research (hdr) uk and the medical research council via an ukri innovation fellowship to tb (mr/s / ) and a rutherford fund fellowship to jmp (mr/s / ).

footnote
orcid: - - - (jmp).

bibliography
. seyedmostafa sheikhalishahi, riccardo miotto, joel t dudley, alberto lavelli, fabio rinaldi, and venet osmani. natural language processing of clinical notes on chronic diseases: systematic review. jmir med inform, ( ):e , . issn - . doi: . / .
. ramón a-a. erhardt, reinhard schneider, and christian blaschke. status of text-mining techniques applied to biomedical text. drug discovery today, ( ): – , . issn - . doi: https://doi.org/ . /j.drudis. . . .
. nikola milosevic, cassie gregson, robert hernandez, and goran nenadic. a framework for information extraction from tables in biomedical literature. international journal on document analysis and recognition (ijdar), ( ): – , . doi: . / s - - - .
. peter m. visscher, naomi r. wray, qian zhang, pamela sklar, mark i. mccarthy, matthew a. brown, and jian yang. years of gwas discovery: biology, function, and translation.
the american journal of human genetics, ( ): – , . issn - . doi: https://doi.org/ . /j.ajhg. . . .
. tim beck, tom shorter, and anthony j brookes. gwas central: a comprehensive resource for the discovery and comparison of genotype and phenotype data from genome-wide association studies. nucleic acids research, (d ):d –d , . issn - . doi: . /nar/gkz .
. annalisa buniello, jacqueline a l macarthur, maria cerezo, laura w harris, james hayhurst, cinzia malangone, aoife mcmahon, joannella morales, edward mountjoy, elliot sollis, daniel suveges, olga vrousgou, patricia l whetzel, ridwan amode, jose a guillen, harpreet s riat, stephen j trevanion, peggy hall, heather junkins, paul flicek, tony burdett, lucia a hindorff, fiona cunningham, and helen parkinson. the nhgri-ebi gwas catalog of published genome-wide association studies, targeted arrays and summary statistics . nucleic acids research, (d ):d –d , . issn - . doi: . /nar/gky .
. jeremy k. nicholson, elaine holmes, and paul elliott. the metabolome-wide association study: a new look at human disease risk factors. journal of proteome research, ( ): – , . doi: . /pr . pmid: .
. werner ceusters. an information artifact ontology perspective on data collections and associated representational artifacts. studies in health technology and informatics, : – , . issn - .
.
alan ruttenberg, adam goldstein, albert goldfain, barry smith, bjoern peters, carlo torniai, chris mungall, chris stoeckert, christian a. boelling, darren natale, david osumi-sutherland, gwen frishkoff, holger stenzhorn, james a. overton, james malone, jennifer fostel, jie zheng, jonathan rees, larisa soldatova, lawrence hunter, mathias brochhausen, matt brush, melanie courtot, michel dumontier, paolo ciccarese, pat hayes, philippe rocca-serra, randy dipert, ron rudnicki, satya sahoo, sivaram arabandi, werner ceusters, william duncan, william hogan, and yongqun (oliver) he. information artefact ontology (v - - ). https://raw.githubusercontent.com/information-artifact-ontology/iao/v - - /iao.owl, . accessed: - - .
. a. ghazvinian, n. f. noy, and m. a. musen. creating mappings for ontologies in biomedicine: simple methods work. amia annu symp proc, : – , .
. peter n. robinson, sebastian köhler, sebastian bauer, dominik seelow, denise horn, and stefan mundlos. the human phenotype ontology: a tool for annotating and analyzing human hereditary disease. the american journal of human genetics, ( ): – , . issn - . doi: https://doi.org/ . /j.ajhg. . . .
. ariel schwartz and marti hearst. a simple algorithm for identifying abbreviation definitions in biomedical text. pacific symposium on biocomputing. pacific symposium on biocomputing, : – , . doi: . / _ .
. katrin fundel, robert küffner, and ralf zimmer. relex—relation extraction using dependency parse trees. bioinformatics, ( ): – , . issn - . doi: . /bioinformatics/btl .
. jinhyuk lee, wonjin yoon, sungdong kim, donghyeon kim, sunkyu kim, chan ho so, and jaewoo kang. biobert: a pre-trained biomedical language representation model for biomedical text mining. bioinformatics, ( ): – , . issn - . doi: . /bioinformatics/btz .
. peter corbett and john boyle. chemlistem: chemical named entity recognition using recurrent neural networks. journal of cheminformatics, ( ), . doi: . / s - - - .
. charles r.
evans, alla karnovsky, melissa a. kovach, theodore j. standiford, charles f. burant, and kathleen a. stringer. untargeted lc–ms metabolomics of bronchoalveolar lavage fluid differentiates acute respiratory distress syndrome from health. journal of proteome research, ( ): – , . doi: . /pr .
. nikola milosevic, cassie gregson, robert hernandez, and goran nenadic. disentangling the structure of tables in scientific literature. in elisabeth métais, farid meziane, mohamad saraee, vijayan sugumaran, and sunil vadera, editors, natural language processing and information systems, pages – . springer international publishing, . isbn - - - - . doi: https://doi.org/ . / - - - - _ .

apobec mediated c-to-u rna editing: target sequence and trans-acting factor contribution to rna editing events in murine transcripts in-vivo.
saeed soleymanjahi , valerie blanc and nicholas o. davidson ,
division of gastroenterology, department of medicine, washington university school of medicine, st. louis, mo
to whom communication should be addressed: email: nod@wustl.edu
running title: apobec mediated c to u rna editing
keywords: rna folding; a cf; rbm ;
january ,
biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission.

abstract ( words)
mammalian c-to-u rna editing was described more than years ago as a single nucleotide modification in apob rna in small intestine, later shown to be mediated by the rna-specific cytidine deaminase apobec . reports of other examples of c-to-u rna editing, coupled with the advent of genome-wide transcriptome sequencing, identified an expanded range of apobec targets. here we analyze the cis-acting regulatory components of verified murine c-to-u rna editing targets, including nearest neighbor as well as flanking sequence requirements and folding predictions. we summarize findings demonstrating the relative importance of trans-acting factors (a cf, rbm ) acting in concert with apobec . using this information, we developed a multivariable linear regression model to predict apobec dependent c-to-u rna editing efficiency, incorporating factors independently associated with editing frequencies based on sanger-confirmed editing sites, which accounted for % of the observed variance. co-factor dominance was associated with editing frequency, with rnas targeted by both rbm and a cf observed to be edited at a lower frequency than rbm dominant targets. the model also predicted a composite score for available human c-to-u rna targets, which again correlated with editing frequency.

introduction
mammalian c-to-u rna editing was identified as the molecular basis for human intestinal apob production more than three decades ago (chen et al. ; hospattankar et al. ; powell et al. ).
a site-specific enzymatic deamination of c to u of apob mrna was originally considered the sole example of mammalian c-to-u rna editing, occurring at a single nucleotide in a kilobase transcript and mediated by an rna specific cytidine deaminase (apobec ) (teng et al. ). with the advent of massively parallel rna sequencing technology we now appreciate that apobec mediated rna editing targets hundreds of sites (rosenberg et al. ; blanc et al. ) mostly within ’ untranslated regions of mrna transcripts. this expanded range of targets of c-to-u rna editing prompted us to reexamine key functional attributes in the regulatory motifs (both cis-acting elements and trans-acting factors) that impact editing frequency, focusing primarily on data emerging from studies of mouse cell and tissue-specific c-to-u rna editing. earlier studies identified rna motifs (davies et al. ) contained within a -nucleotide segment flanking the edited cytidine base in vivo (in cell lines) or within nucleotides using s extracts from rat hepatoma cells (bostrom et al. ; driscoll et al. ). those, and other studies, established that apob rna editing reflects both the tissue/cell of origin as well as rna elements remote and adjacent to the edited base (bostrom et al. ; davies et al. ). a granular examination of the regions flanking the edited base in apob rna demonstrated a critical ’ sequence - , downstream of c , in which mutations reduced or abolished editing activity (shah et al. ). this ’ site, termed a “mooring sequence”, was associated with a s- “editosome” complex (smith et al. ), which was both necessary and sufficient for site-specific apob rna editing and editosome assembly (backus and smith ). other cis-acting elements include a nucleotide spacer region between the edited cytidine and the mooring sequence, and also sequences ’ of the editing site that regulate editing efficiency (backus and smith ; backus et al. ) along with au-rich regions both ’ and ’ of the edited cytidine that together function in concert with the mooring sequence (hersberger and innerarity ).
advances in our understanding of physiological apob rna editing emerged in parallel from both the delineation of key rna regions (summarized above) and the identification of components of the apob rna editosome (sowden et al. ). apobec , the catalytic deaminase (teng et al. ), is necessary for physiological c-to-u rna editing in vivo (hirano et al. ) and in vitro (giannoni et al. ). using the mooring sequence of apob rna as bait, two groups identified apobec complementation factor (a cf), an rna-binding protein sufficient in vitro to support efficient editing in the presence of apobec and apob mrna (lellek et al. ; mehta et al. ). those findings reinforced the importance of both the mooring sequence and an rna binding component of the editosome in promoting apob rna editing. however, while a cf and apobec are sufficient to support in vitro apob rna editing, neither heterozygous (blanc et al. ) nor homozygous genetic deletion of a cf impaired apob rna editing in vivo in mouse tissues (snyder et al. ), suggesting that an alternate complementation factor was likely involved. other work identified a homologous rna binding protein, rbm , that functioned to promote apob rna editing both in vivo and in vitro (fossat et al. ), and more recent studies utilizing conditional, tissue-specific deletion of a cf and rbm indicate that both factors play distinctive roles in apobec -mediated c-to-u rna editing, including apob as well as a range of other apobec targets (blanc et al. ).
these findings together establish important regulatory roles for both cis-acting elements and trans-acting factors in c-to-u mrna editing. however, the majority of studies delineating cis-acting elements reflect earlier, in vitro experiments using apob mrna and relatively little is known regarding the role of cis-acting elements in tissue-specific c-to-u rna editing of other transcripts, in vivo. here we use statistical modeling to investigate the independent roles of candidate regulatory factors in mouse c-to-u mrna editing using data from in vivo studies from over editing sites in transcripts (meier et al. ; rosenberg et al. ; gu et al. ; blanc et al. ; rayon-estrada et al. ; snyder et al. ; blanc et al. ; kanata et al. ). we also examined these regulatory factors in known human mrna targets (chen et al. ; powell et al. ; skuse et al. ; mukhopadhyay et al. ; grohmann et al. ; schaefermeier and heinze ).

results
descriptive data
c-to-u rna editing sites were identified based on eight studies that met inclusion and exclusion criteria (meier et al. ; rosenberg et al. ; gu et al. ; blanc et al. ; rayon-estrada et al. ; snyder et al. ; blanc et al. ; kanata et al. ), representing distinct rna editing targets. % ( / ) of rna targets were edited at one chromosomal location (figure c) and % ( / ) of mrna targets were edited at both a single chromosomal location and also within a single tissue (figure d).
the majority of editing sites occur in the ’ untranslated region ( / ; %), with exonic editing sites the next most abundant subgroup ( / ; %, figure e). chromosome x harbors the highest number of editing sites ( / ; %), followed by chromosomes and ( / ; . % for both, supplemental figure ). / editing sites were confirmed by sanger sequencing, with a mean editing frequency of ± %.

base content of sequences flanking edited and mutated cytidines
au content was enriched (~ %) in nucleotides both immediately upstream and downstream of the edited cytidine across mouse rna editing targets (figure a and c). the average au content across the region nucleotides upstream to nucleotides downstream of the edited cytidine was ~ % ( - %). because apobec has been shown to be a dna mutator (harris et al. ; wolfe et al. ; wolfe et al. ), we determined the au content of the mutated deoxycytidine region flanking human dna targets (nik-zainal et al. ) to be ~ % at a site one nucleotide downstream of the edited base (figure b, c). the average au content in the sequence nucleotides upstream and nucleotides downstream of mutated deoxycytidines is % ( - . %). the average au content was % and % in nucleotides immediately upstream and downstream, respectively, of the targeted deoxycytidine in a subgroup of over dna editing events of the c to t type (nik-zainal et al. ), which is closer to the distribution found in c to u rna editing targets. these features suggest that au enrichment is an important component of the editing function of apobec on both rna and dna targets, especially for the c/dc to u/dt change.
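The AU content of a flanking window, as used in the base-content analysis above, can be computed along these lines. This is a minimal sketch under stated assumptions: the function name `au_content`, the zero-based `center` index and the window sizes are illustrative choices, since the window lengths used in the study are not recoverable from this text. DNA input is handled by treating T as U.

```python
def au_content(seq, center, upstream, downstream):
    """Percent A/U among bases flanking position `center` (0-based).

    The window covers `upstream` bases before and `downstream` bases
    after the edited cytidine; the cytidine itself is excluded.
    """
    seq = seq.upper().replace("T", "U")  # treat DNA and RNA alike
    left = seq[max(0, center - upstream):center]
    right = seq[center + 1:center + 1 + downstream]
    flank = left + right
    if not flank:
        raise ValueError("window contains no flanking bases")
    return 100.0 * sum(base in "AU" for base in flank) / len(flank)


# Toy sequence (not a real editing site): the edited C is at index 4.
toy = "AAUUCGGCC"
both_sides = au_content(toy, center=4, upstream=4, downstream=4)
upstream_only = au_content(toy, center=4, upstream=4, downstream=0)
```

Calling the function separately for the upstream and downstream windows gives the per-side values reported in the text, while a single call with both windows gives the overall average.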
factors influencing editing frequency
regulatory-spacer-mooring cassette: we observed no significant associations between editing frequency and mismatches in motif a (r=- . , p=. ) or motif b (r=- . , p=. ) (supplemental figure ), while mismatches in motifs c and d negatively impacted editing frequency (motif c r=- . , p=. ; motif d r=- . , p=. , figure b). au content of motif b showed a trend towards negative association with editing frequency (r=- . , p=. figure c), but au contents of motifs a (r= . , p=. ), c (r=- . , p=. ), and d (r=- . , p=. ) did not impact editing frequency (supplemental figure ). the abundance of g in motif c (r= . , p=. ), abundance of c in motif b (r= . , p=. ), and g/c fraction in motif c (r= . , p=. ) showed either a significant association or a trend towards association with editing frequency. the spacer sequence averaged ± nucleotides, ranging from to , with a trend towards association between length and editing frequency (r=- . , p=. ). the mean spacer sequence au content was ± %, with no association between editing frequency and au content (r=- . , p=. , supplemental figure ). however, g abundance (r=- . , p=. ) and g/c fraction (r=- . , p=. ) of the spacer showed significant associations with editing frequency in sanger-confirmed targets. the mean number of mismatches in the first nucleotides of the spacer sequence was . ± , with a higher number of mismatches exerting a significant negative impact on editing frequency (r=- . , p=. ) (figure d). the mean number of mismatches in the mooring sequence was . ± . , ranging from to nucleotides. the number of mismatches showed a significant negative association with editing frequency (r=- . , p=. , figure e). the base content of individual nucleotides surrounding the edited cytidine showed significant associations with editing frequency, which were more pronounced in nucleotides closer to the edited cytidine (figure f, supplemental table ). furthermore, overall au content of downstream sequence + to + had a positive impact on editing frequency (r= . , p=. ) (supplemental figure ). however, g abundance in downstream nucleotides (r=- . , p=. ) and g/c fraction in downstream nucleotides (r=- . , p=. ) showed significant, or a trend towards significant, negative associations with editing frequency in sanger-confirmed targets.
secondary structure: we generated a predicted secondary structure for editing sites, with four subgroups based on overall structure and location of the edited cytidine: loop (cloop), stem (cstem), tail (ctail), and non-canonical structure (nc). the majority of editing sites were in the cloop subgroup ( %), followed by cstem ( %), ctail ( %), and nc ( %) subgroups (figure a). editing sites in the ctail subgroup exhibited lower editing frequencies compared to editing sites in cloop ( ± vs ± %, p=. ) or cstem ( ± %, p=. ) subgroups. no significant differences were detected in other comparisons (figure b). the edited cytidine was located in the loop, stem, and tail of the secondary structure in ( %), ( %), and ( %) of the edited rnas, respectively. editing sites with the edited cytidine within the loop exhibited significantly higher editing frequency compared to those with the edited cytidine in the tail ( ± % vs ± %, p=. ). other subgroups exhibited comparable editing frequencies (supplemental figure ). the majority ( %) of editing sites contained a mooring sequence located in the main stem-loop structure (figure c), with the remainder located in the tail or secondary loop. average editing efficiency was significantly higher in targets where the mooring sequence was located in the main stem-loop (figure d).
we also calculated the proportion of total nucleotides that constitute the main stem-loop in the secondary structure. the average ratio was . ± . ranging from . to (supplemental table ) with higher ratios associated with higher editing frequency of the corresponding editing site (r= . , p=. ) (figure e). finally, we considered the orientation of free tails in the secondary structure in terms of length and symmetry. symmetric free tails were observed in % of editing sites (supplemental figure ). the length of the ’ free tail showed a negative association with editing frequency (r=- . , p=. , figure f) while no significant associations were detected between either the length of the ’ tail or symmetry of tails and editing frequency (supplemental figure ).
trans-acting factors and tissue specificity: data for relative dominance of cofactors in apobec -dependent rna editing were available for editing sites for targets in small intestine or liver (blanc et al. ). rbm was identified as the dominant factor in / ( %) sites; a cf was the dominant factor in / ( %) editing sites, with the remaining sites ( / ; %) exhibiting equal co-dominance (figure a). the average editing frequencies at editing sites revealed differences across the groups with ± % in rbm -dominant targets, ± % in a cf-dominant, and ± % in the co-dominant group (p=. ) (figure b). the majority of rna editing targets were edited in one tissue ( / ; % figure c), while the maximum number of tissues in which an editing target is edited (at the same site) is (cd ). the small intestine harbors the highest number of verified editing sites ( / ; %), followed by liver ( / ; %), and adipose tissue ( / ; % figure d).
sites edited in brain tissue showed the highest average editing frequency ( ± %, n= ), followed by bone marrow myeloid cells ( ± %, n= ), and kidney ( ± %, n= figure e).
we then developed a multivariable linear regression model to predict apobec dependent c-to-u rna editing efficiency, incorporating factors independently associated with editing frequencies (table ). this model, based on sanger-confirmed editing sites with available data for all of the parameters mentioned, accounted for % of variance in editing frequency of editing sites included (r = . , p<. table ). the final multivariable model revealed several factors independently associated with editing frequency, specifically the number of mismatches in mooring sequence; regulatory sequence motif d; au content of regulatory sequence motif b; overall secondary structure for group ctail vs group cloop; location of mooring sequence in secondary structure; and a “base content score” parameter that represents base content of the sequences flanking the edited cytidine (table ). removing “base content score” from the model reduced the power from r = . to r = . . next, we added a co-factor dominance variable and fit the model using the editing sites with available data for cofactor dominance. along with other factors mentioned above, co-factor dominance showed significant association with editing frequency (table ) with rnas targeted by both rbm and a cf observed to be edited at a lower frequency than rbm dominant targets. factors associated with co-factor dominance (figure , supplemental table , supplemental figure ), included tissue-specificity, with higher frequency of rbm -dominant sites in small intestine compared to liver ( vs %, p=.
) and a cf-dominant and co-dominant editing sites more prevalent in liver. the number of mooring sequence mismatches also varied among the three subgroups: . ± . in the rbm -dominant subgroup; . ± . in the a cf-dominant subgroup; and . ± . in the co-dominant subgroup (p=. ). this was also the case regarding mismatches in the spacer: . ± . in the rbm -dominant subgroup; . ± . in the a cf-dominant subgroup; . ± . in the co-dominant subgroup (p=. ). au content (%) of downstream sequence + to + was higher in the rbm -dominant subgroup (p=. ). finally, the location of the edited cytidine in the secondary structure of the mrna strand differed across the three subgroups (p=. , figure ). we used pairwise multinomial logistic regression to determine factors independently associated with co-factor dominance (figure c, supplemental table ). ctail editing sites, those with more mismatches in mooring and regulatory motif c, lower au content in downstream sequence, and higher au content in regulatory motif d were more likely co-dominant. editing sites from small intestine and those with higher au content of downstream sequence were more likely rbm -dominant. editing sites from liver and those with more mismatches in regulatory motif b were more likely a cf-dominant (figure c). human mrna targets finally, we turned to an analysis of human c-to-u rna editing targets for which this same panel of parameters was available (table ). aside from apob rna, which is known to be edited in the small intestine (chen et al. ; powell et al. ), other targets have been identified in central or peripheral nervous tissue (skuse et al. ; mukhopadhyay et al. ; meier et al. ; schaefermeier and heinze ). 
the human targets were categorized into low editing (nf , glyrα , glyrα ) and high editing (apob, tph b exon , tph b exon ) subgroups using % as cut-off. a composite score (maximum= ) was generated based on six parameters introduced in the mouse model with notable variance between the two subgroups including mismatches in mooring sequence, spacer length, location of the edited cytidine, and relative abundance of stem-loop bases (table ). high editing targets exhibited a significantly higher composite score ( . vs , p=. ) compared to low editing targets and the composite score significantly correlated with editing frequency in individual targets (r= . , p=. ). the canonical editing target apob (chen et al. ; powell et al. ) achieved a score of (out of ), reflecting the observation that one of the six parameters (au% of regulatory motifs) in human apob is non-preferential compared to the editing-promoting features identified in the mouse multivariable model. discussion the current study reflects our analysis of c-to-u rna editing sites from target mrnas, with the majority residing within the ’ untranslated region. our multivariable model identified several key factors influencing editing frequency, including host tissue, base content of nucleotides surrounding the edited cytidine, number of mismatches in regulatory and mooring sequences, au content of the regulatory sequence, overall secondary structure, location of the mooring sequence, and co-factor dominance. these factors, each exerting independent effects, together accounted for % of the variance in editing frequency. 
our findings also showed that mismatches in the mooring and regulatory sequences, au content of regulatory and downstream sequences, host tissue and secondary structure of target mrna were associated with the pattern of co-factor dominance. several aspects of these primary conclusions merit further discussion. previous studies investigating the key factors that regulate c-to-u mrna editing were confined to in vitro studies and predicated on a single mrna target (apob) (backus and smith ; shah et al. ; smith et al. ; backus and smith ; hersberger and innerarity ). with the expanded range of verified c-to-u rna editing targets now available for interrogation, we revisited the original assumptions to understand more globally the determinants of c-to-u mrna editing efficiency. in undertaking this analysis, we were reminded that the requirements for c-to-u mrna editing in vitro often appear more stringent than in vivo (backus and smith ; shah et al. ), which further emphasizes the importance of our findings. in addition, our approach included both cis-acting sequence- and folding-related predictions along with the role of trans-acting factors and took advantage of statistical modeling to adjust for confounding or modifier effects between these factors to identify their role in editing frequency. we began with the assumptions established for apob rna editing which identified a nucleotide segment encompassing the edited base, spacer, mooring sequence, and part of regulatory sequence as the minimal sequence competent for physiological editing in vitro and in vivo (davies et al. ; shah et al. ; backus and smith ). 
those studies identified an -nucleotide mooring sequence as essential and sufficient for editosome assembly and site-specific c-to-u editing (backus and smith ; shah et al. ; backus and smith ) and established optimal positioning of the mooring sequence relative to the edited base in apob rna (backus and smith ). the current work supports the key conclusions of this original mooring sequence model as applied to the entire range of c-to-u rna editing targets. we observed that mismatches in either the mooring or regulatory sequences were independent factors governing editing frequency. by contrast, while mismatches in the spacer sequence also showed a negative association with editing frequency, the impact of spacer mismatches was not retained in the final model, nor was the length of the spacer associated with editing frequency. furthermore, we found mismatches in the regulatory sequence motif c to be more important than mismatches in motif b. these inconsistencies might conceivably reflect the context in which an rna segment is studied (backus and smith ). for example, our analysis reflects physiological conditions in which naturally occurring mrna targets are edited, while the aforementioned study used in vitro data based on varying lengths of apob mrna embedded within different mrna contexts (apoe rna) (backus and smith ). in addition to the components of the mooring sequence model, we examined variations in the base content in different segments/motifs as well as among individual nucleotides surrounding the edited cytidine. as expected, we found that sequences flanking the edited cytidine exhibited high au content. we further observed a similarly high au content in the flanking sequences of a range of proposed apobec-mediated dna mutation targets in human cancer tissues and cell lines (alexandrov et al. ; petljak et al. ), especially in targets with dc/dt change (nik-zainal et al. ). 
this observation implies that apobec-mediated dna and rna editing frequency may each be functionally modified by au enrichment in the flanking sequences surrounding modifiable bases. the base content in individual nucleotides surrounding the edited cytidine also exerted significant impact on editing frequency, particularly in a -nucleotide segment spanning the edited cytidine (supplemental table ), accounting for % of the variance in editing frequency independent of the mooring sequence model. our findings regarding individual nucleotides surrounding the edited cytidine are consistent with findings for both dna and rna editing targets, particularly in the setting of cancers (backus and smith ; conticello ; roberts et al. ; saraconi et al. ; gao et al. ; arbab et al. ). recent work examining the sequence-editing relationship of a large in vitro library of dna targets edited by different synthetic cytidine base editors (cbes) (arbab et al. ) showed that the base content of a -nucleotide window spanning the edited cytidine explained - % of the editing variance, in particular one or two nucleotides immediately ’ of the edited nucleotide. that study also demonstrated that occurrence of t and c nucleotides at the position - increased, while a g nucleotide at that position decreased editing frequency (arbab et al. ). however, in contrast to our findings, the presence of a at position - had either a negative or null effect on dna editing activity (arbab et al. ). this latter finding is consistent with the lower au content observed in nucleotides adjacent to the edited cytidine in apobec- dna targets compared to the au content in rna targets. 
our findings assign greater importance to adjacent nucleotides in rna editing frequency, similar to earlier reports that the five bases immediately ’ of the edited cytidine in apob mrna exert a greater impact on editing activity compared to nucleotides further upstream of this segment (backus and smith ; shah et al. ; backus and smith ). g/c fraction of a -nucleotide window spanning the edited cytidine in dna targets is associated with editing activity of the synthetic cbes (arbab et al. ). although we found significant associations of rna editing with g/c fraction in segments surrounding the edited cytidine in univariate analyses, these associations were not retained in the final model. in contrast, the au content of regulatory sequence motif b remained as an independent factor determining editing frequency in the final model. the conserved -nucleotide sequence around the edited c forms a stem-loop secondary structure, where the editing site is in an octa-loop (richardson et al. ) as predicted for the -nucleotide sequence of apob mrna (shah et al. ). this stem-loop structure is predicted to play an important role in recognition of the editing site by the editing factors (bostrom et al. ; davies et al. ; driscoll et al. ; chen et al. ). mutations resulting in loss of base pairing in peripheral parts of the stem did not impact the editing frequency (shah et al. ). editing sites with the cytidine located in central parts (e.g. loop) exhibited higher editing frequencies than those with the edited cytidine located in peripheral parts (e.g. tail). it is worth noting that the computer-based stem-loop structure was independently confirmed by nmr studies of a -nucleotide human apob mrna (maris et al. ). 
those studies demonstrated that the location of the mooring sequence in the apob mrna secondary structure plays a critical role in the rna recognition by a cf (maris et al. ). in line with those findings, the current findings emphasize that the location of the mooring sequence in secondary structure of the target mrna exerts significant independent impact on editing frequency. these predictions were confirmed in crystal structure studies of the carboxyl-terminal domain of apobec- and its interaction with cofactors and substrate rna (wolfe et al. ). our conclusions regarding murine c-to-u editing frequency, such as mooring sequence, base content, and secondary structure appear consistent with a similar regulatory role among the smaller number of verified human targets. that being said, further study and expanded understanding of the range of c-to-u editing targets in human tissues will be needed as recently suggested (destefanis et al. ), analogous to that for a-to-i editing (bahn et al. ; bazak et al. ). we recognize that other factors likely contribute to the variance in rna editing frequency not covered by our model. we did not consider the role of naturally occurring variants in apobec , for example, which may be a relevant consideration since mutations in apobec family genes were shown to modify the editing activity of related hybrid dna cytosine base editors (arbab et al. ). furthermore, genetic variants of apobec in humans were associated with altered frequency of glyr editing (kankowski et al. ). other factors not included in our approach included entropy-related features, tertiary structure of the mrna target and other regulatory co-factors. 
another limitation in the tissue-specific designation used to categorize editing frequency is that cell specific features of editing frequency may have been overlooked. for example, small intestinal and liver preparations are likely a blend of cell types (macparland et al. ; elmentaite et al. ) and tumor tissues are highly heterogeneous in cellular composition (barker et al. ). the current findings provide a platform for future approaches to resolve these questions. materials and methods search strategy we conducted a comprehensive literature review covering the period from (when apob rna editing was first reported (chen et al. ; powell et al. )) to november , , including studies published in english that reported c-to-u mrna editing frequencies of individual or transcriptome-wide target genes. databases searched included medline, scopus, web of science, google scholar, and proquest (for theses). the references of full texts retrieved were also scrutinized for additional papers not indexed in the initial search. study selection primary records (n= ) were screened for relevance, and in vivo studies reporting editing frequencies of individual or transcriptome-wide apobec -dependent c-to-u mrna targets were selected, using a threshold of % editing frequency. for analyses based on rna sequence information, only targets with available sequence information or chromosomal location for the edited cytidine were included. exclusion criteria included: studies that reported c-to-u mrna editing frequencies of target genes in other species, studies reporting editing frequencies of target genes in animal models overexpressing apobec , exclusively in vitro studies, and conference abstracts. 
human targets we included studies reporting human c-to-u mrna targets (chen et al. ; powell et al. ; skuse et al. ; mukhopadhyay et al. ; grohmann et al. ; schaefermeier and heinze ). we also included work describing apobec -mediated mutagenesis in human breast cancer (nik-zainal et al. ). data extraction two reviewers (ss and vb) conducted the extraction process independently and discrepancies were resolved by consensus with input from a third reviewer (nod). the parameters were categorized as follows: general parameters: gene name (rna target), chromosomal and strand location of the edited cytidine, tissue site, editing frequency determined by rna-seq or sanger sequencing as illustrated for apob (figure a). editing frequencies determined by the two approaches were highly correlated (r= . p< . ), and where both methodologies were available we used rna-seq. we also defined relative dominance of editing co-factors (a cf-dominant, rbm -dominant, or co-dominant), relative mrna expression (edited gene vs unedited gene) by rna-seq or quantitative rt-pcr, and abundance of corresponding protein (edited gene vs unedited gene) by western blotting or proteomic comparison. co-factor dominance was determined based on the relative contribution of each co-factor to editing frequency. for each editing site, editing frequencies in mouse tissues deficient in a cf or rbm were compared to that of wild-type mice. the relative contribution of each co-factor was calculated by subtracting the editing frequency for each target in a cf or rbm knockout tissue from the total editing frequency in wild-type control. editing sites with < % difference between contributions of rbm and a cf were considered co-dominant. 
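The contribution calculation described above, together with the dominance rule it feeds into, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the co-factor labels follow the spelling used in this text, and the cut-off is a required argument because its value is not legible here.

```python
def cofactor_contribution(wild_type: float, knockout: float) -> float:
    """Editing frequency (in percentage points) lost when one co-factor is knocked out."""
    return wild_type - knockout


def classify_dominance(wild_type: float, acf_knockout: float,
                       rbm_knockout: float, threshold: float) -> str:
    """Label an editing site by comparing the two co-factor contributions.

    `threshold` is the cut-off on the difference between contributions;
    it is deliberately left to the caller (hypothetical parameter).
    """
    acf = cofactor_contribution(wild_type, acf_knockout)
    rbm = cofactor_contribution(wild_type, rbm_knockout)
    if abs(rbm - acf) < threshold:
        return "co-dominant"
    return "rbm-dominant" if rbm > acf else "acf-dominant"


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the study.
    print(classify_dominance(80.0, 70.0, 20.0, threshold=20.0))
```

Keeping the threshold as an explicit argument makes it easy to check how sensitive the three-way grouping is to the chosen cut-off.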
sites with ≥ % difference were considered either rbm - or a cf-dominant, depending on the co-factor with higher contribution (blanc et al. ). sequence-related parameters: a sequence spanning nucleotides upstream and nucleotides downstream of the edited cytidine was extracted for each c-to-u mrna editing site. these sequences were extracted either directly from the full-text or using online ucsc genome browser on mouse (ncbi /mm ) and human (grch /hg ) (https://genome.ucsc.edu/cgi-bin/hggateway). using the mooring sequence model (backus and smith ), three cis-acting elements were considered for each site. these elements included 1) a -nucleotide segment immediately upstream of the edited cytidine as “regulatory sequence”; 2) a -nucleotide segment downstream of the edited cytidine with complete or partial consensus with the canonical “mooring sequence” of apob mrna; 3) the sequence between the edited cytidine and the ’ end of the mooring sequence, referred to as “spacer”. we used an unbiased approach to identify potential mooring sequences by taking the nearest segment to the edited cytidine with the lowest number of mismatch(es) compared to the canonical mooring sequence of apob rna. for each of the three segments, we investigated the number of mismatches compared to the corresponding segment of apob gene (blanc et al. ), as well as length of spacer, the abundance of a and u nucleotides (au content) and the g to c abundance ratio (g/c fraction (arbab et al. )). we also calculated relative abundance of a, g, c, and u individually across a region nucleotides upstream and nucleotides downstream of the edited cytidine across all editing sites. 
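The sequence-related parameters above lend themselves to a short sketch. The code below is one possible reading, not the authors' implementation: it scans the downstream sequence for the window with the fewest mismatches to a reference motif and, on ties, keeps the window nearest the edited cytidine. The function names are hypothetical, and the motif in the usage line (UGAUCAGUAUA, the apob mooring sequence as commonly reported) is an assumption, since the text here does not spell it out.

```python
def count_mismatches(segment: str, reference: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(segment) != len(reference):
        raise ValueError("segment and reference must have the same length")
    return sum(a != b for a, b in zip(segment, reference))


def normalize(seq: str) -> str:
    """Upper-case the sequence and express it as RNA (T -> U)."""
    return seq.upper().replace("T", "U")


def au_content(seq: str) -> float:
    """Percentage of A and U nucleotides in a sequence."""
    seq = normalize(seq)
    return 100.0 * sum(base in "AU" for base in seq) / len(seq)


def find_mooring(downstream: str, canonical: str) -> tuple[int, int]:
    """Return (spacer_length, mismatches) for the best candidate mooring site.

    `downstream` starts at the base immediately after the edited cytidine,
    so the window offset equals the spacer length. The strict '<' keeps the
    nearest window when several share the lowest mismatch count.
    """
    downstream, canonical = normalize(downstream), normalize(canonical)
    k = len(canonical)
    if len(downstream) < k:
        raise ValueError("downstream sequence is shorter than the motif")
    best_offset, best_mm = 0, k + 1
    for offset in range(len(downstream) - k + 1):
        mm = count_mismatches(downstream[offset:offset + k], canonical)
        if mm < best_mm:
            best_offset, best_mm = offset, mm
    return best_offset, best_mm


if __name__ == "__main__":
    # Toy downstream sequence: 4-nt spacer followed by an exact motif match.
    print(find_mooring("AUAUUGAUCAGUAUAUU", "UGAUCAGUAUA"))
```

The same `count_mismatches` helper can be applied to the regulatory and spacer segments once they have been cut out relative to the edited cytidine.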
for comparison, we examined the base content of a sequence spanning nucleotides upstream and downstream of mutated deoxycytidine for over proposed c to x (t, a, and g) dna mutation targets of apobec family in human breast cancer (nik-zainal et al. ) along with relative deoxynucleotide distribution in proximity to the edited site. secondary structure parameters: we used rna-structure (reuter and mathews ) and mfold (zuker ) to determine the secondary structure of an rna cassette consisting of regulatory sequence, edited cytidine, spacer, and mooring sequence. secondary structures similar to that of the cassette for apob chr : consisting of one loop and stem (with or without unassigned nucleotides with ≤ unpaired bases inside the stem) as the main stem-loop with or without free tail(s) in one or both ends of the stem were considered as canonical. two other types of secondary structure were considered as non-canonical structures (figure b), with ≥ loops located either at ends of the stem or inside the stem. loops inside the stem were circular open structures with ≥ unpaired bases. editing sites with canonical structure were further categorized into three subgroups based on location of the edited cytidine: specifically loop (cloop), stem (cstem), or tail (ctail). in addition to overall secondary structure, we considered location of the edited cytidine, location of mooring sequence, symmetry of the free tails, and proportion of the nucleotides in the target cassette that constitute the main stem-loop. this proportion is . in the case of apob chr : where all the bases are part of the main stem-loop structure. symmetry was defined based on existence of free tails in both ends of the rna strand. 
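The structure-derived parameters above can be illustrated on a predicted structure written in dot-bracket notation. This is a sketch under stated assumptions: the input format is assumed (the text does not say how structures were exported from the prediction tools), the function names are hypothetical, and the code only handles the canonical case of a single main stem-loop with optional free tails. Unpaired bases inside the stem are reported as "loop", which is a simplification.

```python
def tail_lengths(dot_bracket: str) -> tuple[int, int]:
    """Lengths of the unpaired 5' and 3' free tails flanking the main stem-loop."""
    paired = [i for i, ch in enumerate(dot_bracket) if ch in "()"]
    if not paired:
        raise ValueError("structure contains no base pairs")
    five_prime = paired[0]
    three_prime = len(dot_bracket) - 1 - paired[-1]
    return five_prime, three_prime


def stem_loop_proportion(dot_bracket: str) -> float:
    """Fraction of nucleotides that belong to the main stem-loop (tails excluded)."""
    five_prime, three_prime = tail_lengths(dot_bracket)
    return (len(dot_bracket) - five_prime - three_prime) / len(dot_bracket)


def has_symmetric_tails(dot_bracket: str) -> bool:
    """True when free tails exist at both ends of the strand."""
    five_prime, three_prime = tail_lengths(dot_bracket)
    return five_prime > 0 and three_prime > 0


def cytidine_location(dot_bracket: str, index: int) -> str:
    """Classify a 0-based position as 'tail', 'stem', or 'loop'."""
    five_prime, three_prime = tail_lengths(dot_bracket)
    if index < five_prime or index >= len(dot_bracket) - three_prime:
        return "tail"
    return "stem" if dot_bracket[index] in "()" else "loop"


if __name__ == "__main__":
    structure = "..((((....))))."  # toy structure, not a predicted one
    print(tail_lengths(structure), stem_loop_proportion(structure))
```

A structure with no free tails gives a proportion of 1.0, which corresponds to the case where all bases are part of the main stem-loop.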
statistical methodology continuous variables are reported as means ± sd, and binary and categorical variables as relative proportions. t-tests and anova were used to compare continuous parameters of interest between two groups or more than two groups, respectively. chi-squared testing was used to compare binary or categorical variables among different groups. pearson r testing was used to investigate correlation of two continuous variables. we used linear regression analyses to develop the final model of independent factors that correlate with editing frequency. we used the hosmer and lemeshow approach for model building (hosmer jr et al. ) to fit the multivariable regression model. in brief, we first used bivariate and/or simple regression analyses with p value of . as the cut-off point to screen the variables and detect primary candidates for the multivariable model. subsequently, we fitted the primary multivariable model using candidate variables from the screening phase. a backward elimination method was employed to reach the final multivariable model. parameters with p values < . or those that added to the model fitness were retained. next, the eliminated parameters were added back individually to the final model to determine their impact. plausible interaction terms between final determinants were also checked. the final model was screened for collinearity. we used the same approach to develop a multinomial logistic regression model to identify factors that were independently associated with co-factor dominance in rna editing sites. squared r and pseudo squared r were used to estimate the proportion of variance in the response parameter that could be explained by the multivariable linear regression and multinomial logistic regression models, respectively. the same screening and retaining methods were used to investigate association of base content in a sequence nucleotides upstream and nucleotides downstream of the edited cytidine, with editing frequency. however, after determining the nucleotides that were retained in the final regression model, a proxy parameter named “base content score” was calculated for each editing site based on the β coefficient values retrieved for individual nucleotides in the model. this parameter was used in the final model as a representative variable for base content of the aforementioned sequence in each editing site. 
acknowledgments this work was supported by grants from the national institutes of health grants dk- , dk- , washington university digestive diseases research core center p dk- (to nod) references ucsc genome browser on mouse (ncbi /mm ; ) and human (grch /hg ; ) assemblies. alexandrov lb, nik-zainal s, wedge dc, aparicio sa, behjati s, biankin av, bignell gr, bolli n, borg a, borresen-dale al et al. . signatures of mutational processes in human cancer. nature : - . arbab m, shen mw, mok b, wilson c, matuszek z, cassa ca, liu dr. . determinants of base editing outcomes from target library analysis and machine learning. cell : - e . backus jw, schock d, smith hc. . only cytidines ' of the apolipoprotein b mrna mooring sequence are edited. biochim biophys acta : - . backus jw, smith hc. . 
apolipoprotein b mrna sequences ' of the editing site are necessary and sufficient for editing and editosome assembly. nucleic acids res : - . -. . three distinct rna sequence elements are required for efficient apolipoprotein b (apob) rna editing in vitro. nucleic acids res : - . bahn jh, lee jh, li g, greer c, peng g, xiao x. . accurate identification of a-to-i rna editing in human by transcriptome sequencing. genome res : - . barker n, ridgway ra, van es jh, van de wetering m, begthel h, van den born m, danenberg e, clarke ar, sansom oj, clevers h. . crypt stem cells as the cells-of-origin of intestinal cancer. nature : - . bazak l, haviv a, barak m, jacob-hirsch j, deng p, zhang r, isaacs fj, rechavi g, li jb, eisenberg e et al. . a-to-i rna editing occurs at over a hundred million genomic sites, located in a majority of human genes. genome res : - . blanc v, henderson jo, newberry ep, kennedy s, luo j, davidson no. . targeted deletion of the murine apobec- complementation factor (acf) gene results in embryonic lethality. molecular and cellular biology : - . blanc v, park e, schaefer s, miller m, lin y, kennedy s, billing am, ben hamidane h, graumann j, mortazavi a et al. . genome-wide identification and functional analysis of apobec- -mediated c-to-u rna editing in mouse small intestine and liver. genome biol : r . blanc v, xie y, kennedy s, riordan jd, rubin dc, madison bb, mills jc, nadeau jh, davidson no. . apobec complementation factor (a cf) and rbm interact in tissue-specific regulation of c to u rna editing in mouse intestine and liver. rna : - . bostrom k, lauer sj, poksay ks, garcia z, taylor jm, innerarity tl. . apolipoprotein b rna editing in chimeric apolipoprotein eb mrna. 
j biol chem : - . chen sh, habib g, yang cy, gu zw, lee br, weng sa, silberman sr, cai sj, deslypere jp, rosseneu m et al. . apolipoprotein b- is the product of a messenger rna with an organ-specific in-frame stop codon. science : - . chen sh, li xx, liao ws, wu jh, chan l. . rna editing of apolipoprotein b mrna. sequence specificity determined by in vitro coupled transcription editing. j biol chem : - . conticello sg. . creative deaminases, self-inflicted damage, and genome evolution. annals of the new york academy of sciences : - . davies ms, wallis sc, driscoll dm, wynne jk, williams gw, powell lm, scott j. . sequence requirements for apolipoprotein b rna editing in transfected rat hepatoma cells. j biol chem : - . destefanis e, avsar g, groza p, romitelli a, torrini s, pir p, conticello sg, aguilo f, dassi e. . a mark of disease: how mrna modifications shape genetic and acquired pathologies. rna. driscoll dm, wynne jk, wallis sc, scott j. . an in vitro system for the editing of apolipoprotein b mrna. cell : - . elmentaite r, ross adb, roberts k, james kr, ortmann d, gomes t, nayak k, tuck l, pritchard s, bayraktar oa et al. . single-cell sequencing of developing human gut reveals transcriptional links to childhood crohn's disease. dev cell. fossat n, tourle k, radziewic t, barratt k, liebhold d, studdert jb, power m, jones v, loebel da, tam pp. . c to u rna editing mediated by apobec requires rna-binding protein rbm . embo rep : - . gao j, choudhry h, cao w. . apolipoprotein b mrna editing enzyme catalytic polypeptide-like family genes activation and regulation during tumorigenesis. cancer science : - . giannoni f, bonen dk, funahashi t, hadjiagapiou c, burant cf, davidson no. . 
complementation of apolipoprotein b mrna editing by human liver accompanied by secretion of apolipoprotein b . j biol chem : - . grohmann m, hammer p, walther m, paulmann n, buttner a, eisenmenger w, baghai tc, schule c, rupprecht r, bader m et al. . alternative splicing and extensive rna editing of human tph transcripts. plos one : e . gu t, buaas fw, simons ak, ackert-bicknell cl, braun re, hibbs ma. . canonical a-to-i and c-to-u rna editing is enriched at 'utrs and microrna target sites in multiple mouse tissues. plos one : e . harris rs, bishop kn, sheehy am, craig hm, petersen-mahrt sk, watt in, neuberger ms, malim mh. . dna deamination mediates innate immunity to retroviral infection. cell : - . hersberger m, innerarity tl. . two efficiency elements flanking the editing site of cytidine in the apolipoprotein b mrna support mooring-dependent editing. j biol chem : - . hirano k, young sg, farese rv, jr., ng j, sande e, warburton c, powell-braxton lm, davidson no. . targeted disruption of the mouse apobec- gene abolishes apolipoprotein b mrna editing and eliminates apolipoprotein b . j biol chem : - . hosmer jr dw, lemeshow s, sturdivant rx. . applied logistic regression. john wiley & sons. hospattankar av, higuchi k, law sw, meglin n, brewer hb, jr. . identification of a novel in-frame translational stop codon in human intestine apob mrna. biochem biophys res commun : - . kanata e, llorens f, dafou d, dimitriadis a, thune k, xanthopoulos k, bekas n, espinosa jc, schmitz m, marin-moreno a et al. . rna editing alterations define manifestation of prion diseases. proc natl acad sci u s a : - . kankowski s, forstera b, winkelmann a, knauff p, wanker ee, you xa, semtner m, hetsch f, meier jc. . 
a novel rna editing sensor tool and a specific agonist determine neuronal protein expression of rna-edited glycine receptors and identify a genomic apobec dimorphism as a new genetic risk factor of epilepsy. front mol neurosci : . lellek h, kirsten r, diehl i, apostel f, buck f, greeve j. . purification and molecular cloning of a novel essential component of the apolipoprotein b mrna editing enzyme- complex. j biol chem : - . macparland sa, liu jc, ma xz, innes bt, bartczak am, gage bk, manuel j, khuu n, echeverri j, linares i et al. . single cell rna sequencing of human liver reveals distinct intrahepatic macrophage populations. nat commun : . maris c, masse j, chester a, navaratnam n, allain fh. . nmr structure of the apob mrna stem-loop and its interaction with the c to u editing apobec complementary factor. rna : - . mehta a, kinter mt, sherman ne, driscoll dm. . molecular cloning of apobec- complementation factor, a novel rna-binding protein involved in the editing of apolipoprotein b mrna. mol cell biol : - . meier jc, henneberger c, melnick i, racca c, harvey rj, heinemann u, schmieden v, grantyn r. . rna editing produces glycine receptor alpha (p l), resulting in high agonist potency. nat neurosci : - . mukhopadhyay d, anant s, lee rm, kennedy s, viskochil d, davidson no. . c-->u editing of neurofibromatosis mrna occurs in tumors that express both the type ii transcript and apobec- , the catalytic subunit of the apolipoprotein b mrna-editing enzyme. am j hum genet : - . nik-zainal s, alexandrov lb, wedge dc, van loo p, greenman cd, raine k, jones d, hinton j, marshall j, stebbings la et al. . mutational processes molding the genomes of breast cancers. cell : - . 
petljak m, alexandrov lb, brammeld js, price s, wedge dc, grossmann s, dawson kj, ju ys, iorio f, tubio jmc et al. . characterizing mutational signatures in human cancer cell lines reveals episodic apobec mutagenesis. cell : - e . powell lm, wallis sc, pease rj, edwards yh, knott tj, scott j. . a novel form of tissue- specific rna processing produces apolipoprotein-b in intestine. cell : - . rayon-estrada v, harjanto d, hamilton ce, berchiche ya, gantman ec, sakmar tp, bulloch k, gagnidze k, harroch s, mcewen bs et al. . epitranscriptomic profiling across cell types reveals associations between apobec -mediated rna editing, gene expression outcomes, and cellular function. proc natl acad sci u s a : - . (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . reuter js, mathews dh. . rnastructure: software for rna secondary structure prediction and analysis. bmc bioinformatics : . richardson n, navaratnam n, scott j. . secondary structure for the apolipoprotein b mrna editing site. au-binding proteins interact with a stem loop. j biol chem : - . roberts sa, lawrence ms, klimczak lj, grimm sa, fargo d, stojanov p, kiezun a, kryukov gv, carter sl, saksena g et al. . an apobec cytidine deaminase mutagenesis pattern is widespread in human cancers. nat genet : - . rosenberg br, hamilton ce, mwangi mm, dewell s, papavasiliou fn. . transcriptome- wide sequencing reveals numerous apobec mrna-editing targets in transcript ' utrs. nat struct mol biol : - . saraconi g, severi f, sala c, mattiuz g, conticello sg. . the rna editing enzyme apobec induces somatic mutations and a compatible mutational signature is present in esophageal adenocarcinomas. genome biol : . schaefermeier p, heinze s. . 
hippocampal characteristics and invariant sequence elements distribution of glra and glra c-to-u editing. mol syndromol : - . shah rr, knott tj, legros je, navaratnam n, greeve jc, scott j. . sequence requirements for the editing of apolipoprotein b mrna. j biol chem : - . skuse gr, cappione aj, sowden m, metheny lj, smith hc. . the neurofibromatosis type i messenger rna undergoes base-modification rna editing. nucleic acids res : - . smith hc, kuo sr, backus jw, harris sg, sparks ce, sparks jd. . in vitro apolipoprotein b mrna editing: identification of a s editing complex. proc natl acad sci u s a : - . (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . snyder em, mccarty c, mehalow a, svenson kl, murray sa, korstanje r, braun re. . apobec complementation factor (a cf) is dispensable for c-to-u rna editing in vivo. rna : - . sowden m, hamm jk, spinelli s, smith hc. . determinants involved in regulating the proportion of edited apolipoprotein b rnas. rna : - . teng b, burant cf, davidson no. . molecular cloning of an apolipoprotein b messenger rna editing protein. science : - . wolfe ad, arnold db, chen xs. . comparison of rna editing activity of apobec -a cf and apobec -rbm complexes reconstituted in hek t cells. j mol biol : - . wolfe ad, li s, goedderz c, chen xs. . the structure of apobec and insights into its rna and dna substrate selectivity. nar cancer : zcaa . zuker m. . mfold web server for nucleic acid folding and hybridization prediction. nucleic acids res : - . (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . table . 
multivariable linear regression model for determinant factors of editing frequency in mouse apobec -dependent c-to-u mrna editing sites.
determinant of editing frequency | subgroup | β ( % ci) | p value
model without co-factor group (n= ; r = . ; p<. )
base content score | per unit increments | . [ . , . ] | < .
count of mismatches in mooring sequence | per unit increments | - . [- . , - . ] | <.
count of mismatches in regulatory sequence motif d (whole sequence) | per unit increments | - . [- . , - . ] | .
au content of regulatory sequence motif b | per % increments | - . [- . , - . ] | .
overall secondary structure | c loop | reference |
overall secondary structure | c stem | . [- . , . ] | .
overall secondary structure | c tail | - . [- . , - . ] | .
overall secondary structure | non-canonical | - . [- . , - . ] | .
location of mooring sequence | stem-loop | reference |
location of mooring sequence | other | - . [- . , - . ] | <.
after adding co-factor group to the model (n= ; r = . ; p<. )
co-factor group | rbm dominant | reference |
co-factor group | co-dominant | - . [- . , - . ] | .
co-factor group | a cf dominant | . [- . , . ] | .
β: represents average change (%) in the editing frequency compared to the reference group
ci: confidence interval
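How a coefficient in a table of this kind is read can be illustrated with a small least-squares fit. The following is a minimal sketch, not the authors' analysis: the data are synthetic, the variable names (`mismatches`, `mooring_other`, `editing_freq`) and all numeric values are hypothetical, and the categorical predictor is dummy-coded with the reference level set to 0, so its coefficient is the average difference from the reference group.

```python
import numpy as np

def fit_ols(predictors, response):
    """Least-squares fit of response = b0 + predictors @ b.

    Returns the coefficient vector [b0, b1, ...].
    """
    design = np.column_stack([np.ones(len(response)), predictors])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef

# Synthetic, noise-free data (hypothetical values, not taken from the study).
mismatches = np.array([0, 1, 2, 3, 0, 1, 2, 3], dtype=float)
# Categorical predictor: "stem-loop" is the reference level (coded 0),
# "other" is coded 1.
mooring_other = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
# Editing frequency (%) built from known coefficients so the fit can be checked.
editing_freq = 60.0 - 5.0 * mismatches - 20.0 * mooring_other

coef = fit_ols(np.column_stack([mismatches, mooring_other]), editing_freq)
intercept, beta_mismatch, beta_other = coef
print(f"per-mismatch change: {beta_mismatch:.1f}, other vs stem-loop: {beta_other:.1f}")
```

In this toy setting each additional mismatch lowers the fitted editing frequency by a fixed amount, and the dummy coefficient is the fitted gap between the "other" group and the reference group with the remaining predictor held constant.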
table : characteristics of human c-to-u mrna editing targets
editing group | low editing | low editing | low editing | low editing | low editing | high editing
parameter | nf | glycra | glycra | tph b | tph b | apob
editing location | c | c | c | c (exon ) | c (exon ) | c
tissue | neural sheath / cns tumor | hippocampus | hippocampus | amygdala | amygdala | small intestine
editing frequency (%) | >
mismatches in regulatory motif a
mismatches in regulatory motif b
mismatches in regulatory motif c
mismatches in regulatory motif d
au content (%) in regulatory motif a
au content (%) in regulatory motif b
au content (%) in regulatory motif c*
au content (%) in regulatory motif d
spacer length*
spacer au content (%)
mismatches in spacer
mismatches in mooring*
au content (%) of downstream bases*
au content (%) of downstream bases
overall secondary structure | canonical | canonical | canonical | canonical | canonical | canonical
location of edited c* | loop | tail | tail | stem | loop | loop
location of mooring sequence | stem-loop | stem-loop | stem-loop | stem-loop | stem-loop | stem-loop
ratio of stem-loop bases* | . | . | . | . | . | .
free tail orientation | symmetric | symmetric | asymmetric | symmetric | asymmetric | asymmetric
composite score
cns: central nervous system
* these items were used to calculate the composite score (total score = ) as follows:
au content (%) in regulatory motif c: < %: , ≥ %:
spacer length: ≤ : , > :
mismatches in mooring: < : , ≥ :
au content (%) of downstream bases: > %: , ≤ %:
location of edited c in secondary structure: stem-loop: , tail:
ratio of stem-loop bases: > %: , ≤ %:

figure legends
figure . characteristics of murine apobec -mediated c-to-u mrna editing sites. a: schematic presentation of mrna target, chromosomal editing location, and editing sites considered.
each mrna target could be edited at one or more chromosomal location(s) (blue boxes). each editing location could be edited in one or more tissues giving rise to one or more editing site(s) per location (green boxes). editing site(s) of each mrna target are the sum of editing sites from all editing locations reported for that target. b: examples of canonical (apob chr : , top) and two types of non-canonical (kctd chr : and dcn chr : ) secondary structures. c: distribution of number of chromosomal editing location(s), or targeted cytidine(s), per mrna target. d: distribution of number of total editing sites per mrna target considering all chromosomal location(s) edited at different tissue(s). e: distribution of location of editing sites within gene structure.
figure . base content of sequences flanking modified cytidine in rna editing and dna mutation targets. a: base content of nucleotides upstream and nucleotides downstream of edited cytidine in mouse apobec -mediated c-to-u mrna editing targets. b: base content of nucleotides upstream and nucleotides downstream of mutated cytidine in proposed human apobec-mediated dna mutation targets in patients with breast cancer. c: comparison of au base content (%) of nucleotides flanking modified cytidine in rna editing targets and dna mutation targets in mouse and human breast cancer patients, respectively.
figure . characteristics of regulatory-spacer-mooring cassette and base content of individual nucleotides flanking edited cytidine in association with editing frequency. a: schematic illustration of regulatory-spacer-mooring cassette. four motifs were defined for
regulatory sequence: motif a for nucleotides - to - ; motif b for nucleotides - to - ; motif c for nucleotides - to - ; motif d representative of the whole sequence. b: association of the mismatches in motif d of regulatory sequence with editing frequency. c: association between the au content (%) of regulatory sequence (motif b) and editing frequency. d: association of the mismatches in spacer (nucleotides + to + downstream of the edited cytidine) with editing frequency. e: association of the mismatches in mooring sequence with editing frequency. f: heatmap plot illustrating the association between base content of nucleotides flanking the edited cytidine and editing frequency. red color density in each cell represents the beta coefficient value of the corresponding base in the multivariable linear regression model fit including that nucleotide. the asterisks refer to the nucleotides that were retained in the final model. mismatches in regulatory, spacer, and mooring sequences were determined in comparison to the corresponding sequences in apob mrna (as reference). r: pearson correlation coefficient.
figure . secondary structure-related features in association with editing frequency. a: distribution of different types of overall secondary structure in editing sites. c loop, c stem, c tail are three subtypes of canonical secondary structure based on the location of the edited cytidine. b: association between type of secondary structure and editing frequency. c: distribution of the mooring sequence location in editing sites. “other” refers to mooring sequences located in tail or stem/loop and not part of the main stem-loop structure. d: association of mooring sequence location with editing frequency. e: association between ratio of main stem-loop bases to total bases count and editing frequency. f: association of the ’ free tail length with editing frequency. * p<. ; ** p<. . r: pearson correlation coefficient.
figure . dominance and tissue-specific cofactor patterns among editing sites. a: distribution of dominant co-factor in editosomes of editing sites. b: association of dominant co-factor with editing frequency. c: distribution of number of editing tissue(s) per mrna target. d: tissue distribution of editing sites. e: average editing frequency of editing sites edited at different tissues. si, small intestine.
figure . co-factor pattern and tissue-specific role in murine c-to-u mrna editing sites. a: distribution of editing tissue across subgroups of editing sites with different dominant co-factor patterns. b: location of edited cytidine in secondary structure of editing sites with different dominant co-factor patterns. c: schematic presentation of factors that correlate with dominant co-factor pattern in editing sites. this graph is based on the findings derived from pairwise multinomial logistic regression models.

supplemental figure legends
supplemental figure . chromosomal distribution of murine apobec -mediated c-to-u mrna editing sites. the black curve corresponds to left y-axis and represents average editing frequencies of editing sites related to each chromosome. the blue curve corresponds to right y-axis and represents number of editing sites related to each chromosome.
supplemental figure . association of editing frequency with characteristics of regulatory sequence in murine apobec -mediated c-to-u mrna editing sites. a-c.
association of editing frequency with number of mismatches and au content (%). d-f. association of editing frequency with different regulatory sequence motifs. mismatches were determined in comparison to the same regulatory sequence motif in apob mrna (as reference).
supplemental figure . association of editing frequency with characteristics of downstream sequence in murine apobec -mediated c-to-u mrna editing sites. a. association of editing frequency with spacer length. b. association of editing frequency with spacer au content (%). c-f. association of editing frequency with and au content of successive segments downstream of the edited cytidine.
supplemental figure . association of editing frequency with secondary structure-related characteristics in c-to-u mrna editing sites. a: distribution of edited cytidine location in secondary structure regardless of the overall secondary structure. b: association of editing frequency with edited cytidine location in secondary structure. c: distribution of free tail orientation in editing sites. d: association of editing frequency with free tail orientation in editing sites. e: association of editing frequency with ’ free tail length. * p<. ; *** p<. . r: pearson correlation coefficient.
supplemental figure . association of secondary structure-related characteristics with dominant co-factor pattern in apobec -mediated c-to-u mrna editing sites. a. distribution of mooring sequence location presented in the context of different dominant co-factor patterns. b. distribution of free tail orientation in secondary structure among editing sites, presented in the context of different dominant co-factor patterns.
supplemental table . multivariable linear regression model for individual nucleotides surrounding edited cytosine (- to + ) in mouse apobec -dependent c-to-u mrna editing sites.
location of nucleotide relative to edited c | base preference | β ( % ci) | p value
nucleotide - | gu | . [ . , . ] | .
nucleotide - | c | . [ . , . ] | .
nucleotide - | g | . [ . , . ] | .
nucleotide - | u | . [ . , . ] | .
nucleotide - | auc | . [ . , . ] | < .
nucleotide - | au | . [ . , . ] | .
nucleotide + | agu | . [ . , . ] | < .
nucleotide + | g | . [ . , . ] | < .
nucleotide + | g | . [ . , . ] | < .
nucleotide + | c | . [ . , . ] | .
nucleotide + | g | . [ . , . ] | .
nucleotide + | auc | . [ . , . ] | .
nucleotide + | ac | . [ . , . ] | .
nucleotide + | au | . [ . , . ] | .
nucleotide + | au | . [ . , . ] | .
nucleotide + | ac | . [ . , . ] | .
β: represents average change (%) in the editing frequency compared to the reference group (non-preferred group)
ci: confidence interval

supplemental table . descriptive data of regulatory-spacer-mooring cassette in mouse apobec -dependent c-to-u mrna editing sites.
parameter | n mean sd min max
sequence-related features
mismatches in regulatory (motif a) | . .
mismatches in regulatory (motif b) | . .
mismatches in regulatory (motif c) | . .
mismatches in regulatory (motif d) | . .
au content (%) of regulatory (motif a) | . .
au content (%) of regulatory (motif b) | . .
au content (%) of regulatory (motif c) | . .
au content (%) of regulatory (motif d) | . .
spacer length | . .
mismatches in spacer | . .
au content (%) of spacer | . .
mismatches in mooring | . .
au content (%) of downstream sequence + to + | . .
au content (%) of downstream sequence + to + | . .
au content (%) of downstream sequence + to + | . .
au content (%) of downstream sequence + to + | . .
secondary structure-related features
proportion of the bases that constitute main stem-loop | . . .
length of ’ free tail | . .
length of ’ free tail | . .
sd: standard deviation

supplemental table . comparing three subgroups of mouse apobec -dependent c-to-u mrna editing sites based on co-factor dominance.
parameter | rbm -dominant (n, mean, sd) | a cf-dominant (n, mean, sd) | co-dominant (n, mean, sd) | p value
mismatches in regulatory (motif a) | . . | . . | . . | .
mismatches in regulatory (motif b) | . . | . . | . . | .
mismatches in regulatory (motif c) | . . | . . | . . | .
mismatches in regulatory (motif d) | . . | . . | . . | .
au content (%) of regulatory (motif a) | . . | . . | . . | .
au content (%) of regulatory (motif b) | . . | . . | . . | .
au content (%) of regulatory (motif c) | . . | . . | . . | .
au content (%) of regulatory (motif d) | . . | . . | . . | .
spacer length | . . | . . | . . | .
mismatches in spacer (in -base cassette) | . . | . . | . . | .
mismatches in spacer (relative abundance (%)) | . . | . . | . . | .
au content (%) of spacer | . . | . . | . . | .
mismatches in mooring | . . | . . | . . | .
au content (%) of downstream sequence + to + | . . | . . | . . | .
au content (%) of downstream sequence + to + | . . | . . | . . | .
au content (%) of downstream sequence + to + | . . | . . | . . | .
au content (%) of downstream sequence + to + | . . | . . | . . | .
proportion of the bases that constitute main stem-loop | . . | . . | . . | .
length of ’ free tail | . . | . . | . . | .
length of ’ free tail | . . | . . | . . | .
sd: standard deviation
supplemental table . multinomial logistic regression model for determinant factors of co-factor dominancy in mouse apobec -dependent c-to-u mrna editing sites.
determinant of co-factor dominancy | subgroup | coefficient ( % ci) | p value
a cf-dominant vs rbm -dominant
tissue | small intestine | reference |
tissue | liver | . [ . , . ] | .
location of edited cytosine | loop | reference |
location of edited cytosine | stem | - . [- . , . ] | .
location of edited cytosine | tail | - . [- . , - . ] | < .
mismatches in mooring sequence | per unit increments | . [- . , . ] | .
mismatches in regulatory sequence motif b | per unit increments | . [ . , . ] | .
mismatches in regulatory sequence motif c | per unit increments | . [- . , . ] | .
au content (%) of regulatory sequence motif d | per unit increments | . [- . , . ] | .
au content (%) of downstream sequence + to + | per unit increments | - . [- . , . ] | .
au content (%) of downstream sequence + to + | per unit increments | - . [- . , - . ] | .
au content (%) of downstream sequence + to + | per unit increments | - . [- . , . ] | .
co-dominant vs rbm -dominant
tissue | small intestine | reference |
tissue | liver | - . [- . , . ] | .
location of edited cytosine in secondary structure | c loop | reference |
location of edited cytosine in secondary structure | c stem | . [- . , . ] | .
location of edited cytosine in secondary structure | c tail | . [ . , . ] | .
mismatches in mooring sequence | per unit increments | . [ . , . ] | .
mismatches in regulatory sequence motif b | per unit increments | - . [- . , - . ] | .
mismatches in regulatory sequence motif c | per unit increments | . [ . , . ] | .
au content (%) of regulatory sequence motif d | per unit increments | . [ . , . ] | .
au content (%) of downstream sequence + to + | per unit increments | - . [- . , - . ] | .
au content (%) of downstream sequence + to + | per unit increments | - . [- . , . ] | .
au content (%) of downstream sequence + to + | per unit increments | - . [- . , - . ] | .
co-dominant vs a cf -dominant
tissue | small intestine | reference |
tissue | liver | - . [- . , - . ] | .
location of edited cytosine in secondary structure | c loop | reference |
location of edited cytosine in secondary structure | c stem | . [ . , . ] | .
location of edited cytosine in secondary structure | c tail | . [ . , . ] | < .
mismatches in mooring sequence | per unit increments | . [- . , . ] | .
mismatches in regulatory sequence motif b | per unit increments | - . [- . , - . ] | .
mismatches in regulatory sequence motif c | per unit increments | . [ . , . ] | .
au content (%) of regulatory sequence motif d | per unit increments | - . [- . , . ] | .
au content (%) of downstream sequence + to + | per unit increments | - . [- . , . ] | .
au content (%) of downstream sequence + to + | per unit increments | - . [- . , . ] | .
au content (%) of downstream sequence + to + | per unit increments | - . [- . , . ] | .
model parameters: n= ; pseudo r = . ; p<.
ci: confidence interval
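The three pairwise blocks of a multinomial logistic table can be related to one another under the usual baseline-category formulation. The sketch below assumes that formulation (the source only states that pairwise multinomial logistic regression models were used); the class labels `rbm_dominant`, `acf_dominant` and `co_dominant`, the predictor layout and every coefficient value are hypothetical placeholders, not results from the study.

```python
import numpy as np

def class_probabilities(x, coefs, reference="rbm_dominant"):
    """Softmax probabilities with the reference class fixed at logit 0.

    x     : feature vector with a leading 1 for the intercept
    coefs : dict mapping each non-reference class to its coefficient vector
    """
    logits = {reference: 0.0}
    for name, beta in coefs.items():
        logits[name] = float(np.dot(beta, x))
    shift = max(logits.values())  # subtract the maximum for numerical stability
    weights = {name: np.exp(value - shift) for name, value in logits.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# Hypothetical coefficients: [intercept, liver indicator, mooring mismatches].
coefs = {
    "acf_dominant": np.array([-0.5, 1.2, 0.3]),
    "co_dominant": np.array([0.2, -0.4, 0.5]),
}
x = np.array([1.0, 1.0, 2.0])  # a liver site with two mooring mismatches

probs = class_probabilities(x, coefs)
# The log-odds of one non-reference class against another equals the
# difference of their coefficients against the shared reference class.
pairwise_log_odds = np.log(probs["co_dominant"] / probs["acf_dominant"])
coef_difference = float(np.dot(coefs["co_dominant"] - coefs["acf_dominant"], x))
```

Under this assumption a positive coefficient raises the odds of the named class relative to its comparator, and exponentiating a coefficient gives the corresponding odds ratio per unit increment.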
evaluating the transcriptional fidelity of cancer models
da peng*, rachel gleyzer*, wen-hsin tai, pavithra kumar, qin bian, bradley issacs, edroaldo lummertz da rocha, stephanie cai, kathleen dinapoli, franklin w huang, patrick cahan
department of biomedical engineering, johns hopkins university school of medicine, baltimore md usa
institute for cell engineering, johns hopkins university school of medicine, baltimore md usa
department of microbiology, immunology and parasitology, federal university of santa catarina, florianópolis sc, brazil
department of cell biology, johns hopkins university school of medicine, baltimore, md usa
department of electrical and computer engineering, johns hopkins university, baltimore md usa
division of hematology/oncology, department of medicine; helen diller family cancer center; bakar computational health sciences institute; institute for human genetics; university of california, san francisco, san francisco, ca
department of molecular biology and genetics, johns hopkins university school of medicine, baltimore md usa
* these authors made equal contributions.
correspondence to: patrick.cahan@jhmi.edu
article type: research
website: http://www.cahanlab.org/resources/cancercellnet_web
code: https://github.com/pcahan /cancercellnet
the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license (http://creativecommons.org/licenses/by-nc-nd/ . /). this version posted january , . biorxiv preprint doi: https://doi.org/ . / . . .

abstract
background: cancer researchers use cell lines, patient derived xenografts, engineered mice, and tumoroids as models to investigate tumor biology and to identify therapies. the generalizability and power of a model derive from the fidelity with which it represents the tumor type under investigation; however, the extent to which this is true is often unclear. the preponderance of models and the ability to readily generate new ones has created a demand for tools that can measure the extent and ways in which cancer models resemble or diverge from native tumors.
methods: we developed a machine learning based computational tool, cancercellnet, that measures the similarity of cancer models to naturally occurring tumor types and subtypes in a platform- and species-agnostic manner. we applied this tool to cancer cell lines, patient derived xenografts, distinct genetically engineered mouse models, and tumoroids. we validated cancercellnet by application to independent data, and we tested several predictions with immunofluorescence.
results: we have documented the cancer models with the greatest transcriptional fidelity to natural tumors, we have identified cancers underserved by adequate models, and we have found models with annotations that do not match their classification. by comparing models across modalities, we report that, on average, genetically engineered mice and tumoroids have higher transcriptional fidelity than patient derived xenografts and cell lines in four out of five tumor types.
however, several patient derived xenografts and tumoroids have classification scores that are on par with native tumors, highlighting both their potential as faithful model classes and their heterogeneity.
conclusions: cancercellnet enables the rapid assessment of transcriptional fidelity of tumor models. we have made cancercellnet available as freely downloadable software and as a web application that can be applied to new cancer models and that allows direct comparison to the cancer models evaluated here.

introduction
models are widely used to investigate cancer biology and to identify potential therapeutics. popular modeling modalities are cancer cell lines (ccls), genetically engineered mouse models (gemms), patient derived xenografts (pdxs), and tumoroids. these classes of models differ in the types of questions that they are designed to address. ccls are often used to address cell intrinsic mechanistic questions, gemms to chart progression of molecularly defined disease, and pdxs to explore patient-specific response to therapy in a physiologically relevant context. more recently, tumoroids have emerged as relatively inexpensive, physiological, in vitro 3d models of tumor epithelium with applications ranging from measuring drug responsiveness to exploring tumor dependence on cancer stem cells. models also differ in the extent to which they represent specific aspects of a cancer type.
even with this intra- and inter-class model variation, all models should represent the tumor type or subtype under investigation, and not another type of tumor or a non-cancerous tissue. therefore, cancer models should be selected not only based on the specific biological question but also based on the similarity of the model to the cancer type under investigation. various methods have been proposed to determine the similarity of cancer models to their intended subjects. domcke et al devised a 'suitability score' as a metric of the molecular similarity of ccls to high grade serous ovarian carcinoma based on a heuristic weighting of copy number alterations, mutation status of several genes that distinguish ovarian cancer subtypes, and hypermutation status. other studies have taken analogous approaches by focusing on either transcriptomic or ensemble molecular profiles (e.g. transcriptomic and copy number alterations) to quantify the similarity of cell lines to tumors. these studies were tumor-type specific, focusing on ccls that model, for example, hepatocellular carcinoma or breast cancer. notably, yu et al compared the transcriptomes of ccls to the cancer genome atlas (tcga) by correlation analysis, resulting in a panel of ccls recommended as most representative of tumor types. most recently, najgebauer et al and salvadores et al have developed methods to assess ccls using molecular traits such as copy number alterations (cna), somatic mutations, dna methylation and transcriptomics.
while all of these studies have provided valuable information, they leave two major challenges unmet. the first challenge is to determine the fidelity of gemms, pdxs, and tumoroids, and whether there are stark differences between these classes of models and ccls. the other major unmet challenge is to enable the rapid assessment of new, emerging cancer models. this challenge is especially relevant now because technical barriers to generating models have been substantially lowered, and because new models such as pdxs and tumoroids can be derived on a patient-specific basis and therefore each should be considered a distinct entity requiring individual validation. to address these challenges, we developed cancercellnet (ccn), a computational tool that uses transcriptomic data to quantitatively assess the similarity between cancer models and naturally occurring tumor types and subtypes in a platform- and species-agnostic manner. here, we describe ccn’s performance, and the results of applying it to assess ccls, pdxs, gemms, and tumoroids. this has allowed us to identify the most faithful models currently available, to document cancers underserved by adequate models, and to find models with inaccurate tumor type annotation. moreover, because ccn is open-source and easy to use, it can be readily applied to newly generated cancer models as a means to assess their fidelity.

results
cancercellnet classifies samples accurately across species and technologies
previously, we had developed a computational tool using the random forest classification method to measure the similarity of engineered cell populations to their in vivo counterparts based on transcriptional profiles. more recently, we elaborated on this approach to allow for classification of single cell rna-seq data in a manner that allows for cross-platform and cross-species analysis. here, we used an analogous approach to build a
international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / platform that would allow us to quantitatively compare cancer models to naturally occurring patient tumors (fig a). in brief, we used tcga rna-seq expression data from solid tumor types to train a top-pair multi-class random forest classifier (fig b). we combined training data from rectal adenocarcinoma (read) and colon adenocarcinoma (coad) into one coad_read category because read and coad are considered to be virtually indistinguishable at a molecular level . we included an ‘unknown’ category trained using randomly shuffled gene-pair profiles generated from the training data of tumor types to identify query samples that are not reflective of any of the training data. to estimate the performance of ccn and how it is impacted by parameter variation, we performed a parameter sweep with a -fold / cross-validation strategy (i.e. / of the data sampled across each cancer type was used to train, / was used to validate) (fig c). the performance of ccn, as measured by the mean area under the precision recall curve (auprc), did not fall below . and remained relatively stable across parameter sets (supp fig a). the optimal parameters resulted in , features. the mean auprcs exceeded . in most tumor types with this optimal parameter set (fig d, supp fig b). the auprcs of ccn applied to independent data rna-seq data from tumors across five tumor types from the international cancer genome consortium (icgc) ranged from . to . , supporting the notion that the platform is able to accurately classify tumor samples from diverse sources (fig e). 
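the gene-pair features and the ‘unknown’ training category mentioned above can be illustrated with a minimal python sketch. it rests on assumptions that are not spelled out in this section: that the top-pair transform encodes, for each selected gene pair, whether the first gene is more highly expressed than the second within the same sample, and that an ‘unknown’ profile is made by permuting the feature values of a randomly chosen training profile. the gene names, expression values and function names are hypothetical, and this is not the ccn implementation itself.

```python
import random


def pair_transform(expression, gene_pairs):
    """Encode a sample as one binary feature per gene pair.

    A feature is 1 when the first gene of the pair is more highly
    expressed than the second in this sample, otherwise 0. Only the
    within-sample ordering matters, not the measurement scale.
    """
    return [1 if expression[a] > expression[b] else 0 for a, b in gene_pairs]


def shuffled_unknown_profiles(profiles, n_unknown, seed=0):
    """Build 'unknown' training profiles (assumed scheme).

    Each profile is a copy of a randomly chosen training profile whose
    feature values have been permuted, so it no longer matches any
    tumor type.
    """
    rng = random.Random(seed)
    unknown = []
    for _ in range(n_unknown):
        profile = list(rng.choice(profiles))
        rng.shuffle(profile)
        unknown.append(profile)
    return unknown


# Hypothetical gene pairs and two samples measured on different scales.
gene_pairs = [("geneA", "geneB"), ("geneC", "geneA"), ("geneB", "geneC")]
rnaseq_sample = {"geneA": 120.0, "geneB": 15.0, "geneC": 48.0}
microarray_sample = {"geneA": 9.1, "geneB": 6.2, "geneC": 7.7}

rnaseq_features = pair_transform(rnaseq_sample, gene_pairs)
microarray_features = pair_transform(microarray_sample, gene_pairs)
unknown = shuffled_unknown_profiles(
    [rnaseq_features, microarray_features], n_unknown=4
)
```

because the two toy samples rank the three genes in the same order, they map to the same feature vector even though their raw values differ. this is the sense in which pair-based features can be compared across platforms and species.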
as one of the central aims of our study is to compare distinct cancer models, including gemms, our method needed to be able to classify mouse and human samples equivalently. we used the top-pair transform to achieve this and we tested the feasibility of this approach by assessing the performance of a normal (i.e. non-tumor) cell and tissue classifier trained on human data as applied to mouse samples. consistent with prior applications, we found that the cross-species classifier performed well, achieving a mean auprc of . when applied to mouse data (supp fig c).

to evaluate cancer models at a finer resolution, we also developed an approach to perform tumor subtype classifications (supp fig d). we constructed different cancer subtype classifiers based on the availability of expression or histological subtype information. we also included non-cancerous, normal tissues as categories for several subtype classifiers when sufficient data was available: breast invasive carcinoma (brca), coad_read, head and neck squamous cell carcinoma (hnsc), kidney renal clear cell carcinoma (kirc) and uterine corpus endometrial carcinoma (ucec). the subtype classifiers all achieved high overall average auprcs ranging from . to . (supp fig e).

fidelity of cancer cell lines

having validated the performance of ccn, we then used it to determine the fidelity of ccls.
we mined rna-seq expression data of different cell lines across cancer types from the cancer cell line encyclopedia (ccle) and applied ccn to them, finding a wide classification range for cell lines of each tumor type (fig a, supp tab ). to verify the classification results, we applied ccn to expression profiles from ccle generated through microarray expression profiling. to ensure that ccn would function on microarray data, we first tested it by applying a ccn classifier created to test microarray data to expression profiles of tumor types. the cross-platform ccn classifier performed well, based on the comparison to study-provided annotation, achieving a mean auprc of . (supp fig a). next, we applied this cross-platform classifier to microarray expression profiles from ccle (supp fig b). from the classification results of cell lines that have both rna-seq and microarray expression profiles, we found a strong overall positive association between the classification scores from rna-seq and those from microarray (supp fig c). this comparison supports the notion that the classification scores for each cell line are not artifacts of profiling methodology. moreover, this comparison shows that the scores are consistent between the times that the cell lines were first assayed by microarray expression profiling in and by rna-seq in . we also observed a high level of correlation between our analysis and the analysis done by yu et al (supp fig d), further validating the robustness of the ccn results.
next, we assessed the extent to which ccn classifications agreed with their nominal tumor type of origin, which entailed translating quantitative ccn scores to classification labels. to achieve this, we selected a decision threshold that maximized the macro f measure, the harmonic mean of precision and recall, across cross validations. then, we annotated cell lines based on their ccn score profile as follows. cell lines with ccn scores > threshold only for the tumor type of origin were annotated as 'correct'. cell lines with ccn scores > threshold in the tumor type of origin and at least one other tumor type were annotated as 'mixed'. cell lines with ccn scores > threshold only for tumor types other than that of the cell line's origin were annotated as 'other'. cell lines that did not receive a ccn score > threshold for any tumor type were annotated as 'none' (fig b). we found that the majority of cell lines originally annotated as breast invasive carcinoma (brca), cervical squamous cell carcinoma and endocervical adenocarcinoma (cesc), skin cutaneous melanoma (skcm), colorectal cancer (coad_read) and sarcoma (sarc) fell into the 'correct' category (fig b). on the other hand, no esophageal carcinoma (esca), pancreatic adenocarcinoma (paad) or brain lower grade glioma (lgg) cell lines were classified as 'correct', demonstrating the need for more transcriptionally faithful cell lines that model those general cancer types. there are several possible explanations for cell lines not receiving a 'correct' classification. one possibility is that the sample was incorrectly labeled in the study from which we harvested the expression data. consistent with this explanation, we found that colorectal cancer line nci-h , a cell line labelled as liver hepatocellular carcinoma (lihc) by ccle, was classified strongly as coad_read (supp tab ). another possibility to explain low ccn scores is that cell lines were derived from subtypes of tumors that are not well-represented in tcga.
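the threshold selection and the four annotation categories described in this section can be sketched in python as follows. this is a simplified illustration under stated assumptions, not the ccn code: the macro f measure is computed on a single toy validation set rather than across cross validations, a sample is treated as called for a tumor type whenever its score exceeds the threshold, and the scores, candidate thresholds and the 0.25 cutoff used in the annotation examples are hypothetical placeholders.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (0.0 when nothing is recovered)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f(score_rows, true_labels, classes, threshold):
    """Average the per-class F measure at a given score threshold."""
    per_class = []
    for cls in classes:
        tp = fp = fn = 0
        for scores, truth in zip(score_rows, true_labels):
            called = scores[cls] > threshold
            if called and truth == cls:
                tp += 1
            elif called:
                fp += 1
            elif truth == cls:
                fn += 1
        per_class.append(f_measure(tp, fp, fn))
    return sum(per_class) / len(per_class)


def best_threshold(score_rows, true_labels, classes, candidates):
    """Pick the candidate threshold with the highest macro F measure."""
    return max(
        candidates,
        key=lambda t: macro_f(score_rows, true_labels, classes, t),
    )


def annotate(scores, origin, threshold):
    """Map a score profile to 'correct', 'mixed', 'other' or 'none'."""
    called = {tumor for tumor, score in scores.items() if score > threshold}
    if not called:
        return "none"
    if origin not in called:
        return "other"
    return "correct" if called == {origin} else "mixed"


# Toy validation data (hypothetical scores, two tumor types only).
classes = ["BRCA", "SARC"]
validation_scores = [
    {"BRCA": 0.90, "SARC": 0.05},
    {"BRCA": 0.60, "SARC": 0.30},
    {"BRCA": 0.20, "SARC": 0.70},
    {"BRCA": 0.35, "SARC": 0.55},
]
validation_labels = ["BRCA", "BRCA", "SARC", "SARC"]
chosen = best_threshold(
    validation_scores, validation_labels, classes, candidates=[0.1, 0.4, 0.8]
)

# Annotation examples with a placeholder threshold of 0.25.
labels = [
    annotate({"BRCA": 0.8, "SARC": 0.1}, "BRCA", 0.25),
    annotate({"BRCA": 0.5, "SARC": 0.4}, "BRCA", 0.25),
    annotate({"BRCA": 0.1, "SARC": 0.6}, "BRCA", 0.25),
    annotate({"BRCA": 0.1, "SARC": 0.1}, "BRCA", 0.25),
]
```

in this sketch the 'correct' label requires that the tumor type of origin is the only type above the threshold, which keeps the four categories mutually exclusive.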
to explore this hypothesis, we first performed tumor subtype classification on ccls from tumor types for which we had trained subtype classifiers (supp tab ). we reasoned that if a cell line was a good model for a rarer subtype, then it would receive a poor general classification but a high classification for the subtype that it models well. therefore, we counted the number of lines that fit this pattern. we found that of the lines with no general classification, ( %) were classified as a specific subtype, suggesting that derivation from rare subtypes is not the major contributor to the poor overall fidelity of ccls. another potential contributor to low scoring cell lines is intra-tumor stromal and immune cell impurity in the training data. if impurity were a confounder of ccn scoring, then we would expect a strong positive correlation between mean purity and mean ccn classification scores of ccls per general tumor type. however, the pearson correlation coefficient between the mean purity of general tumor type and mean ccn classification scores of ccls in the corresponding general tumor type was low ( . ), suggesting that tumor purity is not a major contributor to the low ccn scores across ccls (supp fig e).

comparison of skcm and gbm ccls to scrna-seq

to more directly assess the impact of intra-tumor heterogeneity in the training data on evaluating cell lines, we constructed a classifier using cell types found in human melanoma and glioblastoma scrna-seq data. previously, we have demonstrated the feasibility of using our classification approach on scrna-seq data.
our scrna-seq classifier achieved a high average auprc ( . ) when applied to held-out data and high mean auprc ( . ) when applied to few purified bulk testing samples (supp fig a-b). comparing the ccn score from the bulk rna-seq general classifier and the scrna-seq classifier, we observed a high level of correlation (pearson correlation of . ) between the skcm ccn classification scores and scrna-seq skcm malignant ccn classification scores for skcm cell lines (fig c, supp fig c). of the skcm cell lines that were classified as skcm by the bulk classifier, were also classified as skcm malignant cells by the scrna-seq classifier. interestingly, we also observed a high correlation between the sarc ccn classification score and scrna-seq cancer associated fibroblast (caf) ccn classification scores (pearson correlation of . ). six of the seven skcm cell lines that had been classified as exclusively sarc by ccn were classified as caf by the scrna-seq classifier (fig d, supp fig c), which suggests the possibility that these cell lines were derived from caf or other mesenchymal populations, or that they have acquired a mesenchymal character through their derivation. the high level of agreement between scrna-seq and bulk rna-seq classification results shows that heterogeneity in the training data of the general ccn classifier has little impact on the classification of skcm cell lines. in contrast, we observed a weaker correlation between gbm ccn classification scores and scrna-seq gbm neoplastic ccn classification scores (pearson correlation of . ) for gbm cell lines (fig e, supp fig d).
of the gbm lines that were not classified as gbm with ccn, were classified as gbm neoplastic cells with the scrna-seq classifier. among the gbm lines that were classified as sarc with ccn, cell lines were classified as caf (fig f), which were classified as both gbm neoplastic and caf in the scrna-seq classifier. similar to the situation with skcm lines that classify as caf, this result is consistent with the possibility that some gbm lines classified as sarc by ccn could be derived from mesenchymal subtypes exhibiting both strong mesenchymal signatures and glioblastoma signatures, or that they have acquired a mesenchymal character through their derivation. the lower level of agreement between scrna-seq and bulk rna-seq classification results for gbm models suggests that the heterogeneity of glioblastomas can impact the classification of gbm cell lines, and that the use of the scrna-seq classifier can resolve this deficiency.

immunofluorescence confirmation of ccn predictions

to experimentally explore some of our computational analyses, we performed immunofluorescence on three cell lines that were not classified as their labelled categories: the ovarian cancer line sk-ov- had a high ucec ccn score ( . ), the ovarian cancer line a had a high testicular germ cell tumors (tgct) ccn score ( . ), and the prostate cancer line pc- had a high bladder cancer (blca) score ( . ) (supp tab ). we reasoned that if sk-ov- , a and pc- were classified most strongly as ucec, tgct and blca, respectively, then they would express proteins that are indicative of these cancer types.
first, we measured the expression of the uterine-associated transcription factor hoxb , and the ucec serous ovarian tumor biomarker wt in sk-ov- , in the ov cell line caov- , and in the ucec cell line hec- . we chose caov- as our positive control for ov biomarker expression because it was determined by our analysis and others to be a good model of ov. likewise, we chose hec- to be a positive control for ucec. we found that sk-ov- has a small percentage ( %) of cells that expressed the uterine marker hoxb and a large proportion ( %) of cells that expressed wt (fig a). in contrast, no caov- cells expressed hoxb , whereas % of cells expressed wt . this suggests that sk-ov- exhibits biomarkers of both ovarian tumor and uterine tissue. from our computational analysis and experimental validation, sk-ov- is most likely an endometrioid subtype of ovarian cancer. this result is also consistent with prior classification of sk-ov- , and the fact that sk-ov- lacks p mutations, which are prevalent in high-grade serous ovarian cancer, and that it harbors an endometrioid-associated mutation in arid a . next, we measured the expression of markers of ov and germ cell cancers (lin a ) in the ov-annotated cell line a , which received a high tgct ccn score. we found that % of a cells expressed lin a whereas it was not detected in caov- (fig b). the ov marker wt was also expressed in fewer a cells as compared to caov- ( % vs %), which suggests that a could be a germ cell derived ovarian tumor. taken together, our results suggest that sk-ov- and a could represent ov subtypes that are not well represented in tcga training data, which resulted in a low ov score and higher ccn score in other categories. lastly, we examined pc- , annotated as a prad cell line but classified to be most similar to blca. we found that % of the pc- cells expressed pparg, a contributor to urothelial differentiation that is not detected in the prad vcap cell line but is highly expressed in the blca rt cell line (fig c). pc- cells also expressed the prad biomarker folh , suggesting that pc- has a prad origin and gained urothelial or luminal characteristics through the derivation process. in short, our limited experimental data support the ccn classification results.

subtype classification of cancer cell lines

next, we explored the subtype classification of ccls from three general tumor types in more depth. we focused our subtype visualization (fig a-c) on ccl models with general ccn score above . in their nominal cancer type as this allowed us to analyze those models that fell below the general threshold but were classified as a specific sub-type (supp tab - ). focusing first on ucec, the histologically defined subtypes of ucec, endometrioid and serous, differ in prevalence, molecular properties, prognosis, and treatment. for instance, the endometrioid subtype, which accounts for approximately % of uterine cancers, retains estrogen receptor and progesterone receptor status and is responsive towards progestin therapy. serous, a more aggressive subtype, is characterized by the loss of estrogen and progesterone receptor and is not responsive to progestin therapy. ccn classified the majority of the ucec cell lines as serous except for jhuem- , which is classified as mixed, with similarities to both endometrioid and serous (fig a). the preponderance of ccle lines of serous versus endometrioid character may be due to properties of serous cancer cells that promote their in vitro propagation, such as upregulation of cell adhesion transcriptional programs.
some of our subtype classification results are consistent with prior observations. for example, hec- a, hec- b, and kle were previously characterized as type ii endometrial cancer, which includes a serous histological subtype. on the other hand, our subtype classification results contradict prior observations in at least one case. for instance, the ishikawa cell line was derived from type i endometrial cancer (endometrioid histological subtype), however ccn classified a derivative of this line, ishikawa er-, as serous. the high serous ccn score could result from a shift in phenotype of the line concomitant with its loss of estrogen receptor (er), as this is a distinguishing feature of type ii endometrial cancer (serous histological subtype). taken together, these results indicate a need for more endometrioid-like ccls. next, we examined the subtype classification of lung squamous cell carcinoma (lusc) and lung adenocarcinoma (luad) cell lines (fig b-c). all the lusc lines with at least one subtype classification had an underlying primitive subtype classification. this is consistent either with the ease of deriving lines from tumors with a primitive character, or with a process by which cell line derivation promotes similarity to the more primitive subtype, which is marked by increased cellular proliferation. some of our results are consistent with prior reports that have investigated the resemblance of some lines to lusc subtypes. for example, hcc- , previously characterized as classical, had a maximum ccn score in the classical subtype ( . ).
similarly, ludlu- and eplc- h, previously reported as classical and basal respectively, had maximal tumor subtype ccn scores for these sub-types ( . and . ) (fig b, supp tab ) despite being classified as unknown. lastly, the luad cell lines that were classified as a subtype were either classified as proximal inflammation or proximal proliferation (fig c). rerf-lc-ad had the highest general classification score and the highest proximal inflammation subtype classification score. taken together, these subtype classification results have revealed an absence of cell line models for basal and secretory lusc, and for the terminal respiratory unit (tru) luad subtype.

cancer cell lines’ popularity and transcriptional fidelity

finally, we sought to measure the extent to which cell line transcriptional fidelity related to model prevalence. we used the number of papers in which a model was mentioned, normalized by the number of years since the cell line was documented, as a rough approximation of model prevalence. to explore this relationship, we plotted the normalized citation count versus general classification score, labeling the highest cited and highest classified cell lines from each general tumor type (fig d). for most of the general tumor types, the highest cited cell line is not the highest classified cell line except for hep g , ags and ml- , representing liver hepatocellular carcinoma (lihc), stomach adenocarcinoma (stad), and thyroid carcinoma (thca), respectively.
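the prevalence proxy described above (papers mentioning a model divided by years since the line was documented) and the comparison of the most cited line with the highest classified line can be sketched as below. the cell line names, paper counts, years and scores are invented for illustration, and the guard against a zero-year denominator is an added assumption rather than something stated in the text.

```python
def normalized_citations(n_papers, year_documented, current_year):
    """Papers mentioning a cell line per year since it was documented."""
    # Assumption: lines documented in the current year count as one year old.
    years = max(current_year - year_documented, 1)
    return n_papers / years


def most_cited_and_best_classified(lines):
    """Return the names of the most cited and the highest scoring line."""
    most_cited = max(lines, key=lambda line: line["norm_citations"])
    best = max(lines, key=lambda line: line["ccn_score"])
    return most_cited["name"], best["name"]


# Hypothetical cell lines for one tumor type.
raw = [
    {"name": "line_a", "papers": 4000, "documented": 1980, "ccn_score": 0.10},
    {"name": "line_b", "papers": 300, "documented": 2000, "ccn_score": 0.75},
    {"name": "line_c", "papers": 50, "documented": 2010, "ccn_score": 0.40},
]
lines = [
    {
        "name": r["name"],
        "ccn_score": r["ccn_score"],
        "norm_citations": normalized_citations(r["papers"], r["documented"], 2020),
    }
    for r in raw
]
most_cited, best_classified = most_cited_and_best_classified(lines)
```

in this toy example the most cited line (`line_a`) differs from the highest classified line (`line_b`), which is the pattern reported for most general tumor types.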
on the other hand, the general scores of the highest cited cell lines representing blca (t ), brca (mda-mb- ), and prad (pc- ) fall below the classification threshold of . . notably, each of these tumor types has other lines with scores exceeding . , which should be considered as more faithful transcriptional models when selecting lines for a study (supp tab and http://www.cahanlab.org/resources/cancercellnet_results/).

evaluation of patient derived xenografts

next, we sought to evaluate a more recent class of cancer models: pdx. to do so, we subjected the rna-seq expression profiles of pdx models from different cancer types generated previously to ccn. similar to the results of ccls, the pdxs exhibited a wide range of classification scores (fig a, supp tab ). by categorizing the ccn scores of pdxs based on the proportion of samples associated with each tumor type that were correctly classified, we found that sarc, skcm, coad_read and brca have a higher proportion of correctly classified pdxs than other cancer categories (fig b). in contrast to ccls, we found a higher proportion of correctly classified pdxs in stad, paad and kirc (fig b). however, similar to ccls, no esca pdxs were classified as such. this held true when we performed subtype classification on pdx samples: none of the pdxs in esca were classified as any of the esca subtypes (supp tab ). ucec pdxs included endometrioid, serous, and mixed subtypes, which provided a broader representation than ccls (fig c). several lusc pdxs that were classified as a subtype were also classified as head and neck squamous cell carcinoma (hnsc) or a mix of hnsc and lusc (fig d). this could be due to the similarity in expression profiles of basal and classical subtypes of hnsc and lusc, which is consistent with the observation that these pdxs were also subtyped as classical. no lusc pdxs were classified as the secretory subtype. in contrast to luad ccls, four of the five luad pdxs with a discernible sub-type were classified as proximal inflammatory (fig e). on the other hand, similar to the ccls, there were no tru subtypes in the luad pdx cohort. in summary, we found that while individual pdxs can reach extremely high transcriptional fidelity to both general tumor types and subtypes, many pdxs were not classified as the general tumor type from which they originated.

evaluation of gemms

next, we used ccn to evaluate gemms of six general tumor types from nine studies for which expression data was publicly available. as was true for ccls and pdxs, gemms also had a wide range of ccn scores (fig a, supp tab ). we next categorized the ccn scores based on the proportion of samples associated with each tumor type that were correctly classified (fig b). in contrast to lgg ccls, lgg gemms, generated by nf mutations expressed in different neural progenitors in combination with pten deletion, were consistently classified as lgg (fig a-b). the gemm dataset included multiple replicates per model, which allowed us to examine intra-gemm variability. both at the level of ccn score and at the level of categorization, gemms were invariant. for example, replicates of ucec gemms driven by prg(cre/+)pten(lox/lox) received almost identical general ccn scores (fig c, supp tab ). gemms sharing genotypes across studies, such as luad gemms driven by kras mutation and loss of p , also received similar general and subtype classification scores (fig a,b,e). next, we explored the extent to which genotype impacted subtype classification in ucec, lusc, and luad.
prg(cre/+)pten(lox/lox) gemms had a mixed subtype classification of both serous and endometrioid, consistent with the fact that pten loss occurs in both subtypes (albeit more frequently in endometrioid). we also analyzed prg(cre/+)pten(lox/lox)csf r-/- gemms. polymorphonuclear neutrophils (pmns), which play anti-tumor roles in endometrioid cancer progression, are depleted in these animals. interestingly, prg(cre/+)pten(lox/lox)csf r-/- gemms had a serous subtype classification, which could be explained by differences in pmn involvement in endometrioid versus serous uterine tumor development that are reflected in the respective transcriptomes of the tcga ucec training data. we note that the tumor cells were sorted prior to rna-seq and thus the shift in subtype classification is not due to contamination of gemms with non-tumor components. in short, this analysis supports the argument that tumor-cell extrinsic factors, in this case a reduction in anti-tumor pmns, can shift the transcriptome of a gemm so that it more closely resembles a serous rather than endometrioid subtype. the lusc gemms that we analyzed were lkb fl/fl and they either overexpressed sox (via two distinct mechanisms) or were also ptenfl/fl . we note that the eight lenti-sox - cre-infected;lkb fl/fl and rosa lsl-sox -ires-gfp;lkb fl/fl samples that classified as 'unknown' had lusc ccn scores only modestly lower than the decision threshold (fig d) (mean ccn score = . ). thirteen out of the sox gemms classified as the secretory subtype of lusc.
the consistency is not surprising given that both models overexpress sox and lose lkb . on the other hand, the lkb fl/fl;ptenfl/fl gemms had substantially lower general lusc ccn scores and our subtype classification indicated that this gemm was mostly classified as 'unknown', in contrast to prior reports suggesting that it is most similar to a basal subtype. none of the three lusc gemms have strong classical ccn scores. most of the luad gemms, which were generated using various combinations of activating kras mutation, loss of trp , and loss of smarca l , were correctly classified (fig e). those that were not classified had only modestly lower ccn scores than the decision threshold (mean ccn score = . ). there were no substantial differences in general or subtype classification across driver genotypes. although the sub-type of all luad gemms was 'unknown', the subtypes tended to have a mixture of high ccn proximal proliferation, proximal inflammation and tru scores. taken together, this analysis suggests that there is a degree of similarity, and perhaps plasticity, between the primitive and secretory (but not basal or classical) subtypes of lusc. on the other hand, while the luad gemms classify strongly as luad, they do not have a strong particular subtype classification, a result that does not vary by genotype.

evaluation of tumoroids

lastly, we used ccn to assess a relatively novel cancer model: tumoroids.
we downloaded and assessed distinct tumoroid expression profiles spanning cancer categories from the nci patient-derived models repository (pdmr) and from three individual studies (fig a, supp tab ). we note that several categories have three or fewer samples (brca, cesc, kirp, ov, lihc, and blca from pdmr). among the cancer categories represented by more than three samples, only lusc and paad have fewer than % classified as their annotated label (fig b). in contrast to gbm ccls, all three induced pluripotent stem cell-derived gbm tumoroids were classified as gbm with high ccn scores (mean = . ). to further characterize the tumoroids, we performed subtype classification on them (supp tab ). ucec tumoroids from pdmr contain a wide range of subtypes with two endometrioid, two serous and one mixed type (fig c). on the other hand, lusc tumoroids appear to be predominantly of classical subtypes with one tumoroid classified as a mix between classical and primitive (fig d). lastly, similar to the ccl and pdx counterparts, luad tumoroids are classified as proximal inflammatory and proximal proliferation with no tumoroids classified as tru subtype (fig e).

comparison of ccls, pdxs, gemms and tumoroids

finally, we sought to estimate the comparative transcriptional fidelity of the four cancer model modalities. we compared the general ccn scores of each model on a per tumor type basis (fig ). in the case of gemms, we used the mean classification score of all samples with shared genotypes. we also used the mean classification of technical replicates found in lihc tumoroids. we evaluated models based on both the maximum ccn score, as this represents the potential for a model class, and the median ccn score, as this indicates the current overall transcriptional fidelity of a model class. pdxs achieved the highest ccn scores in three (ucec, paad, luad) out of the five cancer categories in which all four modalities were available (fig ), despite having low median ccn scores. notably, pdxs have a median ccn score above the . threshold in paad while none of the other three modalities have any samples above the threshold. in lihc, the highest ccn score for pdx ( . ) is only slightly lower than the highest ccn score for tumoroid ( . ). this suggests that certain individual pdxs most closely mimic the transcriptional state of native patient tumors despite a portion of the pdxs having low ccn scores. similarly, while the majority of the ccls have low ccn scores, several lines achieve high transcriptional fidelity in lusc, luad and lihc (fig ). collectively, gemms and tumoroids had the highest median ccn scores in four of the five cancer categories (lusc and luad for gemms and ucec and lihc for tumoroids). notably, both of the lihc tumoroids achieved ccn scores on par with patient tumors (fig ). in brief, this analysis indicates that pdxs and ccls are heterogeneous in terms of transcriptional fidelity, with a portion of the models highly mimicking native tumors and the majority of the models having low transcriptional fidelity (with the exception of paad for pdxs). on the other hand, gemms and tumoroids displayed a consistently high fidelity across different models. because the ccn score is based on a moderate number of gene features (i.e. , gene pairs consisting of , unique genes) relative to the total number of protein-coding genes in the genome, it is possible that a cancer model with a high ccn score might not have a high global similarity to a naturally occurring tumor.
therefore, we also calculated the grn status, a metric of the extent to which a tumor-type specific gene regulatory network is established, for all models (supp fig ). we observed a high level of correlation between the two similarity metrics, which suggests that although ccn classifies on a selected set of genes, its scores are highly correlated with a global assessment of transcriptional similarity.

we also sought to compare model modalities in terms of the diversity of subtypes that they represent (supp fig ). as a reference, we also included in this analysis the overall subtype incidence, as approximated by incidence in tcga. replicates in gemms and tumoroids were averaged into one classification profile. in models of ucec, there is a notable difference between endometrioid incidence and the proportion of models classified as endometrioid, with only pdxs and tumoroids having any representatives (supp fig ). all of the ccl, gemm, and tumoroid models of paad have an unknown subtype classification and no correct general classification. however, the majority of pdxs are subtyped as either a mixture of basal and classical, or classical alone. luad has proximal inflammation and proximal proliferation subtypes modelled by ccls and pdxs (supp fig ). likewise, lusc has basal, classical and primitive subtypes modelled by ccls and pdxs, and the secretory subtype modelled by gemms exclusively (supp fig ). taken together, these results demonstrate the need to carefully select different model systems to more suitably model certain cancer subtypes.
discussion

a major goal in the field of cancer biology is to develop models that mimic naturally occurring tumors with enough fidelity to enable therapeutic discoveries. however, methods to measure the extent to which cancer models resemble or diverge from native tumors are lacking. this is especially problematic now because there are many existing models from which to choose, and it has become easier to generate new models. here, we present cancercellnet (ccn), a computational tool that measures the similarity of cancer models to naturally occurring tumor types and subtypes. while the similarity of ccls to patient tumors has already been explored in previous work, our tool introduces the capability to assess the transcriptional fidelity of pdxs, gemms, and tumoroids. because ccn is platform- and species-agnostic, it represents a consistent platform to compare models across modalities including ccls, pdxs, gemms and tumoroids. here, we applied ccn to cancer cell lines, patient derived xenografts, distinct genetically engineered mouse models and tumoroids. several insights emerged from our computational analyses that have implications for the field of cancer biology.

first, pdxs have the greatest potential to achieve transcriptional fidelity in three out of the five general tumor types for which data from all modalities was available, as indicated by the high scores of individual pdxs. notably, pdxs are the only modality with samples classified as paad. at the same time, the median ccn scores of pdxs were lower than those of gemms and tumoroids in the other four tumor types.
it is unclear what causes such a wide range of ccn scores within pdxs. we suspect that some pdxs might have undergone selective pressures in the host that distort the progression of genomic alterations away from what is observed in natural tumors. future work to understand this heterogeneity is important so as to yield consistently high fidelity pdxs, and to identify intrinsic and host-specific factors that so powerfully shape the pdx transcriptome.

second, in general gemms and tumoroids have higher median ccn scores than those of pdxs and ccls. this is consistent with the fact that gemms are typically derived by recapitulating well-defined driver mutations of natural tumors, and thus this observation corroborates the importance of genetics in the etiology of cancer. moreover, in contrast to most pdxs, gemms are typically generated in immune replete hosts. therefore, the higher overall fidelity of gemms may also be a result of the influence of a native immune system on gemm tumors. the high median ccn scores of tumoroids can be attributed to several factors, including the increased mechanical stimuli and cell-cell interactions that come from d self-organizing cultures.

third, we have found that none of the samples that we evaluated here are transcriptionally adequate models of esca. this may be due to an inherent lability of the esca transcriptome, which is often preceded by a metaplasia that has obscured determining its cell type(s) of origin. therefore, this tumor type requires further attention to derive new models.
fourth, we found that in several tumor types, gemms tend to reflect mixtures of subtypes rather than conforming strongly to single subtypes. the reasons for this are not clear, but it is possible that in the cases that we examined the histologically defined subtypes have a degree of plasticity that is exacerbated in the murine host environment.

lastly, we recognize that many ccls are not classified as their annotated labels. while we have suggested that the lack of an immune component is not a major confounder, we suspect that the ccls could undergo genetic divergence due to a high number of passages, chemotherapy before biopsy, culture conditions and genetic instability, which could all be factors that drive ccls away from their labelled tumors.

currently, there are several limitations to our ccn tool, and caveats to our analyses, which indicate areas for future work and improvement. first, ccn is based on transcriptomic data, but other molecular readouts of tumor state, such as profiles of the proteome, epigenome, non-coding rna-ome, and genome, may be equally, if not more, important to mimic in a model system. therefore, it is possible that some models reflect tumor behavior well, and because this behavior is not well predicted by the transcriptome alone, these models have lower ccn scores. to both measure the extent to which such situations exist, and to correct for them, we plan in the future to incorporate other omic data into ccn so as to make more accurate and integrated model evaluation possible. as a first step in this direction, we plan to incorporate dna methylation and genomic sequencing data as additional features for our random forest classifier, as these data are becoming more readily available for both training and cancer models. we expect that this will allow us both to refine our tumor subtype categories and to predict more accurately how models respond to perturbations such as drug treatment.
a second limitation is that in the cross-species analysis, ccn implicitly assumes that homologs are functionally equivalent. the extent to which they are not functionally equivalent determines how confounded the ccn results will be. this possibility seems to be of limited consequence based on the high performance of the normal tissue cross-species classifier and on the fact that gemms have the highest median ccn scores (in addition to tumoroids). a third caveat to our analysis is that there were far fewer distinct gemms and tumoroids than ccls and pdxs. as more transcriptional profiles for gemms and tumoroids emerge, this comparative analysis should be revisited to assess the generality of our results. finally, the tcga training data is made up of rna-seq from bulk tumor samples, which necessarily include non-tumor cells, whereas the ccls are by definition cell lines of tumor origin. therefore, ccls theoretically could have artificially low ccn scores due to the presence of non-tumor cells in the training data. this problem appears to be limited, as we found no correlation between tumor purity and ccn score in the ccle samples. however, this problem is related to the question of intra-tumor heterogeneity. we demonstrated the feasibility of using ccn and single cell rna-seq data to refine the evaluation of cancer cell lines, contingent upon the availability of scrna-seq training data. as more single cell rna-seq training data accrues, ccn would be able to evaluate models not only on a per cell type basis, but also based on cellular composition.
we have made the results of our analyses available online so that researchers can easily explore the performance of selected models or identify the best models for any of the general tumor types and the subtypes presented here. to ensure that ccn is widely available, we have developed a free web application, which performs ccn analysis on user-uploaded data and allows for direct comparison of their data to the cancer models evaluated here. we have also made the ccn code freely available under an open source license and as an easily installed r package, and we are actively supporting its further development. included in the web application are instructions for training ccn and reproducing our analysis. the documentation describes how to analyze models and compare the results to the panel of models that we evaluated here, thereby allowing researchers to immediately compare their models to the broader field in a comprehensive and standard fashion.

online methods

training general cancercellnet classifier

to generate training data sets, we downloaded , patient tumor rna-seq expression count matrices and their corresponding sample tables across different tumor types from tcga using the tcgaworkflowdata, tcgabiolinks and summarizedexperiment packages. we used all the patient tumor samples for training the general ccn classifier. we limited training and analysis of rna-seq data to the , genes in common between the tcga dataset and all the query samples (ccls, pdxs, gemms, and tumoroids). to train the top pair random forest classifier, we used a method similar to our previous method.
ccn first normalized the training counts matrix by down-sampling the counts to , counts per sample. to significantly reduce the execution time and memory needed to generate gene pairs for all possible genes, ccn then selected n up-regulated genes, n down-regulated genes and n least differentially expressed genes (ccn training parameter ntopgenes = n) for each of the cancer categories using template matching, and used these genes to generate top scoring gene pairs. in short, for each tumor type, ccn defined a template vector that labelled the training tumor samples in the cancer type of interest as 1 and all other tumor samples as 0. ccn then calculated the pearson correlation coefficient between the template vector and the gene expression values for all genes. genes with a strong match to the template, as either upregulated or downregulated, had a large absolute pearson correlation coefficient. ccn chose the upregulated, downregulated and least differentially expressed genes based on the magnitude of the pearson correlation coefficient.

after ccn selected the genes for each cancer type, ccn generated gene pairs among those genes. gene pair transformation was a method inspired by the top-scoring pair classifier to allow compatibility of the classifier with query expression profiles that were collected through different platforms (e.g. microarray query data applied to rna-seq training data). in brief, the gene pair transformation compares genes within an expression sample and encodes the “gene1_gene2” gene-pair as 1 if the first gene has higher expression than the second gene.
otherwise, gene pair transformation would encode the gene-pair as 0. using all the gene pair combinations generated through the gene sets per cancer type, ccn then selected the top m discriminative gene pairs (ccn training parameter ntopgenepairs = m) for each category using the template matching (with large absolute pearson correlation coefficient) described above. to prevent any single gene from dominating the gene pair list, we allowed each gene to appear a maximum of three times among the gene pairs selected as features per cancer type.

after the top discriminative gene pairs were selected for each cancer category, ccn grouped all the gene pairs together and gene pair transformed the training samples into a binary matrix with all the discriminative gene pairs as row names and all the training samples as column names. using the binary gene pair matrix, ccn randomly shuffled the binary values across rows then across columns to generate random profiles that should not resemble training data from any of the cancer categories. ccn then sampled random profiles, annotated them as “unknown” and used them as training data for the “unknown” category. using the gene pair binary training matrix, ccn constructed a multi-class random forest classifier of trees and used stratified sampling of sample size to ensure balance of training data in constructing the decision trees.

to identify the best set of genes and gene-pair parameters (n and m), we used a grid-search cross-validation strategy with cross-validations at each parameter set. the specific parameters for the final ccn classifier using the function “broadclass_train” in the package cancercellnet are in supp tab . the gene-pairs are in supp tab .

validating general cancercellnet classifier

two thirds of patient tumor data from each cancer type were randomly sampled as training data to construct a ccn classifier. based on the training data, ccn selected the classification genes and gene-pairs and trained a classifier.
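as an illustration of the gene pair transformation and gene pair selection described in the training section above, the following python sketch encodes each gene pair as 1 when the first gene is more highly expressed than the second and as 0 otherwise, ranks the pairs by the absolute pearson correlation between their binary pattern and a class template, and keeps the top m pairs while letting each gene appear at most three times. it is a simplified stand-in and not the cancercellnet r implementation: the function names (pair_transform, top_pairs), the use of unordered gene pairs, the handling of ties and the toy data are all assumptions made for this example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def pair_transform(expr, genes):
    """Encode every gene pair as 1 if the first gene is higher, else 0.

    expr maps gene -> list of expression values (one per sample).
    Returns a dict mapping (gene1, gene2) -> list of 0/1 values per sample.
    Assumption: each unordered pair is generated once, in the order given.
    """
    return {
        (g1, g2): [1 if a > b else 0 for a, b in zip(expr[g1], expr[g2])]
        for g1, g2 in combinations(genes, 2)
    }


def top_pairs(binary, labels, target, m, max_per_gene=3):
    """Select up to m pairs that best match the template for one class.

    The template is 1 for samples of the target class and 0 otherwise.
    Pairs are ranked by absolute Pearson correlation with the template,
    and a gene may appear in at most max_per_gene selected pairs.
    """
    template = [1 if lab == target else 0 for lab in labels]
    ranked = sorted(
        binary, key=lambda p: abs(pearson(binary[p], template)), reverse=True
    )
    used, chosen = {}, []
    for g1, g2 in ranked:
        if used.get(g1, 0) >= max_per_gene or used.get(g2, 0) >= max_per_gene:
            continue  # this gene already appears in enough selected pairs
        chosen.append((g1, g2))
        used[g1] = used.get(g1, 0) + 1
        used[g2] = used.get(g2, 0) + 1
        if len(chosen) == m:
            break
    return chosen


# toy data (hypothetical): three genes, two samples per class
expr = {"A": [5, 6, 1, 2], "B": [1, 2, 5, 6], "C": [3, 3, 3, 3]}
labels = ["x", "x", "y", "y"]
binary = pair_transform(expr, ["A", "B", "C"])
selected = top_pairs(binary, labels, target="x", m=2)
```

because only the within-sample ordering of two genes is used, the encoding is unchanged by any monotone rescaling of a sample, which is what makes the features portable across platforms.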
after the classifier was built, held-out samples from each cancer category were sampled and “unknown” profiles were generated for validation. the process of randomly sampling the training set from 2/3 of all patient tumor data, selecting features based on the training set, training the classifier and validating was repeated times to obtain a more comprehensive assessment of the classifier trained with the optimal parameter set. to test the performance of the final ccn classifier on independent testing data, we applied it to profiles from icgc spanning projects that do not overlap with tcga (brca-kr, liri-jp, ov-au, paca-au, paca-ca, prad-fr).

selecting decision thresholds

our strategy for selecting a decision threshold was to find the value that maximizes the average macro f1 measure for each of the cross-validations that were performed with the optimal parameter set, testing thresholds between and with a . increment. the f1 measure is defined as:

$$Macro\ F_1 = \frac{2 \times precision \times recall}{precision + recall}$$

we selected the most commonly occurring threshold above . that maximized the average macro f1 measure across the cross-validations as the decision threshold for the final classifier (threshold = . ). the same approach was applied for the subtype classifiers. the thresholds and the corresponding average precision, recall and f1 measures are recorded in supp tab .
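the threshold selection described above can be sketched in python as follows. this is a hedged illustration rather than the procedure as implemented in the cancercellnet package: it assumes that a sample is called as its top-scoring category when that score exceeds the threshold and as “unknown” otherwise, that macro precision and macro recall are unweighted means over the categories present in the true labels, and that ties between thresholds go to the first value in the grid. the function names, the grid and the toy scores are hypothetical.

```python
from collections import Counter


def macro_f1(true, pred):
    """Macro F1 from macro-averaged precision and recall."""
    classes = sorted(set(true))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(true, pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(true, pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(true, pred) if t == c and p != c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p = sum(precisions) / len(classes)
    r = sum(recalls) / len(classes)
    return 2 * p * r / (p + r) if p + r else 0.0


def call(scores, threshold):
    """Return the top-scoring category if it clears the threshold, else 'unknown'."""
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else "unknown"


def best_threshold(true, score_rows, grid):
    """Grid value giving the highest macro F1 (first value wins ties)."""
    return max(grid, key=lambda t: macro_f1(true, [call(s, t) for s in score_rows]))


def consensus_threshold(folds, grid):
    """Most commonly selected threshold across cross-validation folds."""
    picks = [best_threshold(true, rows, grid) for true, rows in folds]
    return Counter(picks).most_common(1)[0][0]


# toy fold (hypothetical scores): two tumor samples and one random profile
true = ["a", "b", "unknown"]
rows = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}, {"a": 0.4, "b": 0.3}]
grid = [0.2, 0.5, 0.9]
chosen = consensus_threshold([(true, rows), (true, rows)], grid)
```

a lower bound on admissible thresholds, as used in the text, can be applied by passing a grid that contains only values above that bound.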
classifying query data into general cancer categories

we downloaded the rna-seq cancer cell line expression profiles and sample table from (https://portals.broadinstitute.org/ccle/data), and microarray cancer cell line expression profiles and sample table from barretina et al. we extracted two wt control nccit rna-seq expression profiles from grow et al. we received pdx expression estimates and sample annotations from the authors of gao et al. we gathered gemm expression profiles from nine different studies. we downloaded tumoroid expression profiles from the nci patient-derived models repository (pdmr) and from three individual studies. to use the ccn classifier on gemm data, the mouse genes from gemm expression profiles were converted into their human homologs. the query samples were classified using the final ccn classifier. each query classification profile was labelled as one of four classification categories: “correct”, “mixed”, “none” and “other”. if a sample has a ccn score higher than the decision threshold in the labelled cancer category, we assigned it as “correct”. if a sample has a ccn score higher than the decision threshold in the labelled cancer category and in other cancer categories, we assigned it as “mixed”. if a sample has no ccn score higher than the decision threshold in any cancer category or has the highest ccn score in the “unknown” category, then we assigned it as “none”.
if a sample has a ccn score higher than the decision threshold in a cancer category or categories not including the labelled cancer category, we assigned it as “other”. we analyzed and visualized the results using r and the r packages pheatmap and ggplot.

cross-species assessment

to assess the performance of cross-species classification, we downloaded labelled human tissue/cell type and labelled mouse tissue/cell type rna-seq expression profiles from github (https://github.com/pcahan /cellnet). we first converted the mouse genes into human homologous genes. then we found the intersecting genes between the mouse tissue/cell expression profiles and the human tissue/cell expression profiles. limiting the input of human tissue rna-seq profiles to the intersecting genes, we trained a ccn classifier with all the human tissue/cell expression profiles. the parameters used for the function “broadclass_train” in the package cancercellnet are in supp tab . we randomly sampled samples from each tissue category in the mouse tissue/cell data and applied the classifier to those samples to assess performance.

cross-technology assessment

to assess the performance of ccn in applications to microarray data, we gathered , patient tumor microarray profiles across different cancer types from more than different projects (supp tab ). we found the intersecting genes between the microarray profiles and tcga patient rna-seq profiles.
limiting the input of rna-seq profiles to the intersecting genes, we created a ccn classifier with all the tcga patient profiles using the parameters for the function “broadclass_train” listed in supp tab . after the microarray specific classifier was trained, we randomly sampled microarray patient samples from each cancer category and applied the ccn classifier to them as an assessment of the cross-technology performance (supp fig a). the same ccn classifier was used to assess microarray ccl samples (supp fig b).

training and validating scrna-seq classifier

we extracted labelled human melanoma and glioblastoma scrna-seq expression profiles, and compiled the two datasets, excluding cell types t.cd , t.cd and myeloid due to the low number of cells available for training. cells from each of the cell types were sampled for training a scrna-seq classifier. the parameters for training a general scrna-seq classifier using the function “broadclass_train” are in supp tab . cells from each of the cell types from the held-out data were selected to assess the single cell classifier. using maximization of the average macro f1 measure, we selected the decision threshold of . . the gene-pairs that were selected to construct the classifier are in supp tab . to assess the cross-technology capability of applying the scrna-seq classifier to bulk rna-seq, we downloaded expression profiles spanning purified cell types (b cells, endothelial cells, monocyte/macrophage, fibroblast) from https://github.com/pcahan /cellnet.
training subtype cancercellnet

we found cancer types (brca, coad, esca, hnsc, kirc, lgg, paad, ucec, stad, luad, lusc) which have meaningful subtypes based on either histology or molecular profile and have sufficient samples to train a subtype classifier with high aupr. we also included normal tissue samples from brca, coad, hnsc, kirc and ucec to create a normal tissue category in the construction of their subtype classifiers. training samples were either labelled as a cancer subtype for the cancer of interest or as “unknown” if they belong to other cancer types. similar to general classifier training, ccn performed gene pair transformation and selected the most discriminative gene pairs for each cancer subtype. in addition to the gene pairs selected to discriminate cancer subtypes, ccn also performed general classification of all training data and appended the classification profiles of the training data to the gene pair binary matrix as additional features. the reason for using the general classification profile as additional features is that many general cancer types may share similar subtypes, and the general classification profile could provide important features to discriminate the general cancer type of interest from other cancer types before performing finer subtype classification. the specific parameters used to train individual subtype classifiers using the “subclass_train” function of the cancercellnet package can be found in supp tab and the gene pairs are in supp tab .

validating subtype cancercellnet

similar to validating the general classifier, we randomly sampled 2/3 of all samples in each cancer subtype as training data and sampled an equal amount across subtypes in the 1/3 held-out data for assessing subtype classifiers. we repeated the process times for a more comprehensive assessment of the subtype classifiers.
classifying query data into subtypes

we assigned a subtype to a query sample if the query sample has a ccn score higher than the decision threshold. the decision thresholds for the subtype classifiers are in supp tab . if no ccn scores exceed the decision threshold in any subtype or if the highest ccn score is in the ‘unknown’ category, then we assigned that sample as ‘unknown’. analysis was performed in r and visualizations were generated with the complexheatmap package.

cell culture, immunohistochemistry and histomorphometry

caov- (atcc® htb- ™), sk-ov- (atcc® htb- ™), rt (atcc® htb- ™), and nccit (atcc® crl- ™) cell lines were purchased from atcc. hec- (c ) and a ( - vl) were obtained from addexbio technologies and sigma-aldrich. vcap and pc- . sk-ov- , vcap, and rt were cultured in dulbecco's modified eagle medium (dmem, high glucose, , gibco) with % penicillin-streptomycin-glutamine ( , life technologies); caov- , pc- , nccit, and a were cultured using rpmi- medium ( , gibco) while hec- was in iscove's modified dulbecco's medium (imdm, , gibco). both media were supplemented with % penicillin-streptomycin ( , gibco). all media included % fetal bovine serum (fbs). cells cultured in -well plates were washed twice with pbs and fixed in % buffered formalin for  hrs at °c. immunostaining was performed using a standard protocol.
cells were incubated with primary antibodies to goat hoxb ( µg/ml, pa - , invitrogen), mouse wt ( µg/ml, ma - , invitrogen), rabbit pparg ( : , abn , millipore), mouse folh ( µg/ml, um , origene), and rabbit lin a ( : , # , cell signaling) in antibody diluent (s - , dako), at  °c overnight followed by three min washes in tbst. the slides were then incubated with fluorescence-conjugated secondary antibodies at room temperature for  h while avoiding light, followed by three min washes in tbst, and nuclei were stained with mounting medium containing dapi. images were captured by nikon eclipse ti-s, ds-u and ds-qi . histomorphometry was performed using imagej (version . . -rc- / . i). % n.positive cells was calculated as the number of positively stained cells divided by the number of dapi-positive nuclei, expressed as a percentage, within three randomly chosen areas. the data were expressed as means ± sd.

tumor purity analysis

we used the r package estimate to calculate the estimate scores from the tcga tumor expression profiles that we used as training data for the ccn classifier. to calculate tumor purity we used the equation described in yoshihara et al., :

tumour purity = cos ( . + . × estimate score)

extracting citation counts

we used the r package rismed to extract the number of citations for each cell line through a query search of “cell line name[text word] and cancer[text word]” on pubmed. the citation counts were normalized by dividing the citation counts by the number of years since first documented.
$$Normalized\ citation\ counts = \frac{citation\ counts}{\#\ years\ since\ first\ documented}$$

grn construction and grn status

grn construction was extended from our previous method. samples per cancer type were randomly sampled and normalized through down sampling as training data for the clr grn construction algorithm. cancer type specific grns were identified by determining the differentially expressed genes for each cancer type and extracting the subnetwork using those genes.

to extend the original grn status algorithm across different platforms and species, we devised a rank-based grn status algorithm. like the original grn status, rank based grn status is a metric for assessing the similarity of the cancer type specific grn between training data in the cancer type of interest and query samples. hence, a high grn status represents a high level of establishment or similarity of the cancer specific grn in the query sample compared to that of the training data. the expression profiles of training data and query data were transformed into rank expression profiles by replacing the expression values with the rank of the expression values within a sample (the highest expressed gene would have the highest rank and the lowest expressed genes would have a rank of ). the cancer type specific mean and standard deviation of every gene’s rank expression were learned from training data.
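the rank transformation and the per-gene rank statistics described above can be illustrated with a short python sketch. it is not the cancercellnet r code, and it rests on several assumptions that the text does not settle: ranks start at 1 for the lowest expressed gene, tied values share their average rank, and the standard deviation is the population form. the function names and toy values are hypothetical.

```python
from statistics import mean, pstdev


def rank_profile(sample):
    """Replace each gene's expression value with its rank within the sample.

    sample maps gene -> expression value. The lowest value gets rank 1
    (assumed) and tied values share their average rank (assumed).
    """
    ordered = sorted(sample, key=sample.get)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # extend j over the run of genes tied with ordered[i]
        while j + 1 < len(ordered) and sample[ordered[j + 1]] == sample[ordered[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for gene in ordered[i : j + 1]:
            ranks[gene] = shared
        i = j + 1
    return ranks


def rank_stats(training_samples):
    """Per-gene (mean, standard deviation) of ranks across training samples."""
    ranked = [rank_profile(s) for s in training_samples]
    stats = {}
    for gene in ranked[0]:
        values = [r[gene] for r in ranked]
        stats[gene] = (mean(values), pstdev(values))
    return stats


# toy training data (hypothetical): two samples from one cancer type
training = [{"g1": 10.0, "g2": 0.0}, {"g1": 1.0, "g2": 2.0}]
stats = rank_stats(training)
```

working on ranks rather than raw values means that a query profile from a different platform or species only needs to preserve the ordering of genes within a sample to be comparable with the training statistics.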
the modified z-score values for genes within the cancer type specific grn were calculated for the query sample’s rank expression profile to quantify how dissimilar the expression values of genes in the query sample’s cancer type specific grn are compared to those of the reference training data:

$$Zscore(gene\ i)_{mod} = \begin{cases} 0, & \text{if } Zscore \text{ is positive and the gene is found to be upregulated} \\ 0, & \text{if } Zscore \text{ is negative and the gene is found to be downregulated} \\ abs(Zscore), & \text{otherwise} \end{cases}$$

if a gene in the cancer type specific grn is found to be upregulated in the specific cancer type relative to other cancer types, then we consider the query sample’s gene to be similar if the ranking of the query sample’s gene is equal to or greater than the mean ranking of the gene in the training samples. as a result of this similarity, we assign that gene a z-score of 0. the same principle applies to cases where the gene is downregulated in the cancer specific subnetwork. grn status for a query sample is calculated as the weighted mean of ( − Zscore(gene i)mod) across genes in the cancer type specific grn. the constant from which the modified z-score is subtracted is an arbitrary large number, and larger dissimilarity between the query’s cancer type specific grn and that of the training data yields high z-scores for the grn genes and a low grn status.

RGS = Σ (i = 1 to n) ( − Zscore(gene i)mod) × weight(gene i)

GRN Status = RGS / Σ (i = 1 to n) weight(gene i)

the weight of individual genes in the cancer specific network is determined by the importance of the gene in the random forest classifier. finally, the grn status is normalized with respect to the grn status of the cancer type of interest and the cancer type with the lowest mean grn status.
$$\text{Normalized GRN status} = \frac{\text{GRN status}_{\text{query}} - \operatorname{avg}(\text{GRN status}_{\text{min cancer}})}{\operatorname{avg}(\text{GRN status}_{\text{cancer type interest}})}$$

where “min cancer” represents the cancer type whose training data have the lowest mean grn status in the cancer type of interest, and $\operatorname{avg}(\text{GRN status}_{\text{min cancer}})$ represents the lowest average grn status in the cancer type of interest. $\operatorname{avg}(\text{GRN status}_{\text{cancer type interest}})$ represents the average grn status of the cancer type of interest in the training data.

code availability

cancercellnet code and documentation are available at github: https://github.com/pcahan /cancercellnet

acknowledgements

this work was supported by the national institutes of health nci ovarian cancer spore p ca via a development research program award to pc. fwh was supported by a prostate cancer foundation young investigator award, department of defense w xwh- - pcrp-hd (f.w.h.), the national institutes of health/national cancer institute p ca - (f.w.h.) u ca (f.w.h.). we would like to thank john powers, hao zhu, tian-li wang, charles eberhart, and kaloyan tsanov for comments on the manuscript and helpful discussions. some figures were created in part with biorender.com.

figure legends

fig. cancercellnet (ccn) workflow, training, and performance. (a) schematic of ccn usage. ccn was designed to assess and compare the expression profiles of cancer models such as ccls, pdxs, gemms, and tumoroids with native patient tumors. to use the trained classifier, ccn takes the query samples as input (e.g. expression profiles from ccls, pdxs, gemms, tumoroids) and generates a classification profile for the query samples.
the column names of the classification heatmap represent sample annotation and the row names of the classification heatmap represent different cancer types. each grid cell is colored from black to yellow representing the lowest classification score (e.g. ) to highest classification score (e.g. ). (b) schematic of ccn training process. ccn uses patient tumor expression profiles of different cancer types from tcga as training data. first, ccn identifies n genes that are upregulated, n that are downregulated, and n that are relatively invariant in each tumor type versus all of the others. then, ccn performs a pair transform on these genes and subsequently selects the most discriminative set of m gene pairs for each cancer type as features (or predictors) for the random forest classifier. lastly, ccn trains a multi-class random forest classifier using gene-pair transformed training data. (c) parameter optimization strategy. cross-validations of each parameter set, in which / of tcga data was used to train and / to validate, were used to search for the values of n and m that maximized performance of the classifier as measured by area under the precision recall curve (auprc). (d) mean and standard deviation of classifiers based on cross-validations with the optimal parameter set. (e) auprc of the final ccn classifier when applied to independent patient tumor data from icgc.

fig. evaluation of cancer cell lines. (a) general classification heatmap of ccls extracted from ccle.
column annotations of the heatmap represent the labelled cancer category of the ccls given by ccle and the row names of the heatmap represent different cancer categories. ccls’ general classification profiles are categorized into four categories: correct (red), correct mixed (pink), no classification (light green) and other classification (dark green) based on the decision threshold of . . (b) bar plot represents the proportion of each classification category in ccls across cancer types, ordered from the cancer types with the highest proportion of correct and correct mixed ccls to the lowest proportion. (c) comparison between skcm general ccn scores from bulk rna-seq classifier and skcm malignant ccn scores from scrna-seq classifier for skcm ccls. (d) comparison between sarc general ccn scores from bulk rna-seq classifier and caf ccn scores from scrna-seq classifier for skcm ccls. (e) comparison between gbm general ccn scores from bulk rna-seq classifier and gbm neoplastic ccn scores from scrna-seq classifier for gbm ccls. (f) comparison between sarc general ccn scores and caf ccn scores from scrna-seq classifier for gbm ccls. the green lines indicate the decision threshold for scrna-seq classifier and general classifier.

fig. immunofluorescence of selected cell lines. (a) classification profiles (left) and if expression (middle) of caov- (ov positive control), hec- (ucec positive control) and sk-ov- for wt (ov biomarker) and hoxb (uterine biomarker). the bar plots quantify the average percentage of positive cells for wt (top-right) and hoxb (bottom-right). (b) classification profiles (left) and if expression (middle) of caov- , nccit (germ cell tumor positive control) and a for wt and lin a (germ cell tumor biomarker). classification of nccit was performed using rna-seq profiles of wt control nccit duplicate from grow et al. the bar plots quantify the average percentage of positive cells for wt (top-right) and lin a (bottom-right).
(c) classification profiles (left) and if expression (middle) of vcap (prad positive control), rt (blca positive control) and pc- for folh (prostate biomarker) and pparg (urothelial biomarker). the bar plots quantify the average percentage of positive cells for folh (top-right) and pparg (bottom-right).

fig. subtype classification of ccls and ccl prevalence. the heatmap visualizations represent subtype classification of (a) ucec ccls, (b) lusc ccls and (c) luad ccls. only samples with ccn scores > . in their nominal tumor type are displayed. (d) comparison of normalized citation counts and general ccn classification scores of ccls. labelled cell lines either have the highest ccn classification score in their labelled cancer category or highest normalized citation count. each citation count was normalized by number of years since first documented on pubmed.

fig. evaluation of patient derived xenografts. (a) general classification heatmap of pdxs. column annotations represent annotated cancer type of the pdxs, and row names represent cancer categories. (b) proportion of classification categories in pdxs across cancer types is visualized in the bar plot and ordered from the cancer type with highest proportion of correct and mixed correct classified pdxs to the lowest. subtype classification heatmaps of (c) ucec pdxs, (d) lusc pdxs and (e) luad pdxs. only samples with ccn scores > . in their nominal tumor type are displayed.

fig. evaluation of genetically engineered mouse models. (a) general classification heatmap of gemms.
column annotations represent annotated cancer type of the gemms, and row names represent cancer categories. (b) proportion of classification categories in gemms across cancer types is visualized in the bar plot and ordered from the cancer type with highest proportion of correct and mixed correct classified gemms to the lowest. subtype classification heatmap of (c) ucec gemms, (d) lusc gemms and (e) luad gemms. only samples with ccn scores > . in their nominal tumor type are displayed.

fig. evaluation of tumoroid models. (a) general classification heatmap of tumoroids. column annotations represent annotated cancer type of the tumoroids, and row names represent cancer categories. (b) proportion of classification categories in tumoroids across cancer types is visualized in the bar plot and ordered from the cancer type with highest proportion of correct and mixed correct classified tumoroids to the lowest. subtype classification heatmap of (c) ucec tumoroids, (d) lusc tumoroids and (e) luad tumoroids. only samples with ccn scores > . in their nominal tumor type are displayed.

fig. comparison of ccls, pdxs, and gemms. box-and-whiskers plot comparing general ccn scores across ccls, gemms, pdxs of five general tumor types (ucec, paad, lusc, luad, lihc).

supplementary information

supplementary figure assessment of ccn general classifier and subtype classifier. (a) mean auprc of repeated grid-search cross-validation for each parameter grid. (b) mean and range of ccn classifier’s pr curves from cross validations based on the optimal feature selection parameters n and m.
(c) auprc of ccn human tissue classifier when applied to mouse tissue data. (d) the schematic of training a subtype classifier in ccn. ccn uses patient tumor expression profiles from the cancer of interest as training data. ccn performs gene-pair transformation and selects the most discriminative gene pairs among the cancer subtypes from training data as features. ccn then applies the general classification on training data and uses the general classification profile as features in addition to gene pairs for training a random forest classifier. the weight of the general classification profiles as features can be tuned to improve auprc. (e) the mean and standard deviation of auprc for subtype classifiers based on iterations of random sampling of training and held-out data, training subtype classifier using training data, classification of held-out data, and calculation of recall and precision.

supplementary figure further validation of ccn and classification results. to validate the cross-platform classification performance of ccn, a new classifier for microarray data was trained using rna-seq data from tcga as training data and the intersecting genes between rna-seq data and microarray data. (a) auprc of ccn classifier when applied to tumor profiles assayed on microarrays. (b) classification heatmap of ccls using microarray expression data. (c) pearson correlation between ccn scores of ccle lines generated from rna-seq data and microarray data.
(d) comparison between ccls’ ccn scores and the similarity metric from yu et al., median correlations of transcriptional profiles between ccls and tcga tumors from ccls’ labelled cancer category. (e) comparison of mean tumor purity of training data and mean ccn scores of ccls for each cancer category.

supplementary figure single-cell classification of skcm and gbm cell lines. (a) auprc of the single-cell classifier when applied to scrna-seq held-out data. (b) auprc of the scrna-seq classifier when applied to purified bulk rna samples. (c) single-cell classification of skcm ccls. red bar-plot (top) represents general ccn scores in sarc and blue bar-plot (bottom) represents general ccn scores in skcm. (d) single-cell classification of gbm ccls. red bar-plot (top) represents general ccn scores in sarc and yellow bar-plot (bottom) represents general ccn scores in gbm.

supplementary figure correlation between cancer type specific network grn status and general ccn scores.

supplementary figure proportion of cancer subtypes in different cancer models and tcga tumor data across general cancer types.

supplementary table general classification profiles of ccls.
supplementary table subtype classification profiles of ccls.
supplementary table general classification profiles of pdxs.
supplementary table subtype classification profiles of pdxs.
supplementary table general classification profiles of gemms.
supplementary table subtype classification profiles of gemms.
supplementary table general classification profiles of tumoroids.
supplementary table subtype classification profiles of tumoroids.
supplementary table specific parameters used for training of all classifiers.
supplementary table gene-pairs selected for final training of ccn general, subtype classifiers and single-cell classifier.
supplementary table decision thresholds and the corresponding precision and recall for the general classifier and subtype classifier.
supplementary table accessions of tumor microarray data used in validation.

references

. sharma, s. v., haber, d. a. & settleman, j. cell line-based platforms to evaluate the therapeutic efficacy of candidate anticancer agents. nat. rev. cancer , – ( ). . kersten, k., de visser, k. e., van miltenburg, m. h. & jonkers, j. genetically engineered mouse models in oncology research and cancer medicine. embo mol. med. , – ( ). . hidalgo, m. et al. patient-derived xenograft models: an emerging platform for translational cancer research. cancer discov. , – ( ). . drost, j. & clevers, h. organoids in cancer research. nat. rev. cancer , – ( ). . klijn, c. et al. a comprehensive transcriptional portrait of human cancer cell lines. nat. biotechnol. , – ( ). . koren, s. et al. pik ca(h r) induces multipotency and multi-lineage mammary tumours. nature , – ( ). . derose, y. s. et al. tumor grafts derived from women with breast cancer authentically reflect tumor pathology, growth, metastasis and disease outcomes. nat. med. , – ( ). . sharpless, n. e. & depinho, r. a. the mighty mouse: genetically engineered mouse models in cancer drug development. nat. rev. drug discov. , – ( ). . mouradov, d.
et al. colorectal cancer cell lines are representative models of the main molecular subtypes of primary cancer. cancer res. , – ( ). . stuckelberger, s. & drapkin, r. precious gemms: emergence of faithful models for ovarian cancer research. j. pathol. , – ( ). . domcke, s., sinha, r., levine, d. a., sander, c. & schultz, n. evaluating cell lines as tumour models by comparison of genomic profiles. nat. commun. , ( ). . jiang, g. et al. comprehensive comparison of molecular portraits between cell lines and tumors in breast cancer. bmc genomics suppl , ( ). . chen, b., sirota, m., fan-minogue, h., hadley, d. & butte, a. j. relating hepatocellular carcinoma tumor samples and cell lines using gene expression data in translational research. bmc med. genomics suppl , s ( ). . vincent, k. m., findlay, s. d. & postovit, l. m. assessing breast cancer cell lines as tumour models by comparison of mrna expression profiles. breast cancer res. , ( ). . yu, k. et al. comprehensive transcriptomic analysis of cell lines as models of primary tumors across tumor types. nat. commun. , ( ). . najgebauer, h. et al. cellector: genomics-guided selection of cancer in vitro models. cell syst. , – .e ( ). . salvadores, m., fuster-tormo, f. & supek, f. matching cell lines with cancer type and subtype of origin via mutational, epigenomic, and transcriptomic patterns. sci. adv. , ( ). . guernet, a. & grumolato, l. crispr/cas editing of the genome for cancer modeling. methods - , – ( ). . gargiulo, g. next-generation in vivo modeling of human cancers. front. oncol. , ( ). . gao, h. et al. high-throughput screening using patient-derived tumor xenografts to predict clinical trial drug response. nat. med. , – ( ). . cahan, p. et al. cellnet: network biology applied to stem cell engineering. cell , – ( ). . radley, a. h. et al. assessment of engineered cells using cellnet and rna-seq. nat. protoc. , – ( ). . tan, y. & cahan, p. 
singlecellnet: a computational tool to classify single cell rna-seq data across platforms and across species. cell syst. , – .e ( ). . cancer genome atlas network. comprehensive molecular characterization of human colon and rectal cancer. nature , – ( ). . zhang, j. et al. international cancer genome consortium data portal--a one-stop shop for cancer genomics data. database (oxford) , bar ( ). . cancer genome atlas network. comprehensive molecular portraits of human breast tumours. nature , – ( ). . parker, j. s. et al. supervised risk predictor of breast cancer based on intrinsic subtypes. j. clin. oncol. , – ( ). . wilkerson, m. d. et al. lung squamous cell carcinoma mrna expression subtypes are reproducible, clinically important, and correspond to normal cell types. clin. cancer res. , – ( ). . cancer genome atlas research network. integrated genomic characterization of pancreatic ductal adenocarcinoma. cancer cell , – .e ( ). . cancer genome atlas research network et al. integrated genomic characterization of endometrial carcinoma. nature , – ( ). . cancer genome atlas research network et al. integrated genomic characterization of oesophageal carcinoma. nature , – ( ). . cancer genome atlas network. comprehensive genomic characterization of head and neck squamous cell carcinomas. nature , – ( ). . cancer genome atlas research network. comprehensive molecular characterization of clear cell renal cell carcinoma. nature , – ( ). . verhaak, r. g. w. et al.
integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in pdgfra, idh , egfr, and nf . cancer cell , – ( ). . cancer genome atlas research network. comprehensive molecular profiling of lung adenocarcinoma. nature , – ( ). . hu, b. et al. gastric cancer: classification, histology and application of molecular pathology. j. gastrointest. oncol. , – ( ). . barretina, j. et al. the cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. nature , – ( ). . medico, e. et al. the molecular landscape of colorectal cancer cell lines unveils clinically actionable kinase targets. nat. commun. , ( ). . park, j.-g. et al. characteristics of cell lines established from human colorectal carcinoma. cancer res. ( ). . jerby-arnon, l. et al. a cancer cell program promotes t cell exclusion and resistance to checkpoint blockade. cell , – .e ( ). . darmanis, s. et al. single-cell rna-seq analysis of infiltrating neoplastic cells at the migrating front of human glioblastoma. cell rep. , – ( ). . patel, a. p. et al. single-cell rna-seq highlights intratumoral heterogeneity in primary glioblastoma. science , – ( ). . xu, b. et al. regulation of endometrial receptivity by the highly expressed hoxa , hoxa and hoxd hox-class homeobox genes. hum. reprod. , – ( ). . raines, a. m. et al. recombineering-based dissection of flanking and paralogous hox gene functions in mouse reproductive tracts. development , – ( ). . netinatsunthorn, w., hanprasertpong, j., dechsukhum, c., leetanaporn, r. & geater, a. wt gene expression as a prognostic marker in advanced serous epithelial ovarian carcinoma: an immunohistochemical study. bmc cancer , ( ). . kelly, z. et al. the prognostic significance of specific hox gene expression patterns in ovarian cancer. int. j. cancer , – ( ).
. cancer genome atlas research network. integrated genomic analyses of ovarian carcinoma. nature , – ( ). . wiegand, k. c. et al. arid a mutations in endometriosis-associated ovarian carcinomas. n. engl. j. med. , – ( ). . murray, m. j. et al. lin expression in malignant germ cell tumors downregulates let- and increases oncogene levels. cancer res. , – ( ). . biton, a. et al. independent component analysis uncovers the landscape of the bladder tumor transcriptome and reveals insights into luminal and basal subtypes. cell rep. , – ( ). . fair, w. r., israeli, r. s. & heston, w. d. prostate-specific membrane antigen. prostate , – ( ). . black, j. d., english, d. p., roque, d. m. & santin, a. d. targeted therapy in uterine serous carcinoma: an aggressive variant of endometrial cancer. womens health (lond. engl.) , – ( ). . yang, s., thiel, k. w. & leslie, k. k. progesterone: the ultimate endometrial tumor suppressor. trends endocrinol. metab. , – ( ). . huszar, m. et al. up-regulation of l cam is linked to loss of hormone receptors and e-cadherin in aggressive subtypes of endometrial carcinomas. j. pathol. , – ( ). . kozak, j., wdowiak, p., maciejewski, r. & torres, a. a guide for endometrial cancer cell lines functional assays using the measurements of electronic impedance. cytotechnology , – ( ). . korch, c. et al. dna profiling analysis of endometrial and ovarian cell lines reveals misidentification, redundancy and contamination. gynecol. oncol. , – ( ). . wu, d. et al. gene-expression data integration to squamous cell lung cancer subtypes reveals drug sensitivity. br. j. cancer , – ( ).
. walter, v. et al. molecular subtypes in head and neck cancer exhibit distinct patterns of chromosomal gain and loss of canonical cancer genes. plos one , e ( ). . adeegbe, d. o. et al. bet bromodomain inhibition cooperates with pd- blockade to facilitate antitumor response in kras-mutant non-small cell lung cancer. cancer immunol res , – ( ). . blaisdell, a. et al. neutrophils oppose uterine epithelial carcinogenesis via debridement of hypoxic tumor cells. cancer cell , – ( ). . fitamant, j. et al. yap inhibition restores hepatocyte differentiation in advanced hcc, leading to tumor regression. cell rep. , – ( ). . jia, d. et al. crebbp loss drives small cell lung cancer and increases sensitivity to hdac inhibition. cancer discov. , – ( ). . kress, t. r. et al. identification of myc-dependent transcriptional programs in oncogene-addicted liver tumors. cancer res. , – ( ). . li, l. et al. gkap acts as a genetic modulator of nmdar signaling to govern invasive tumor growth. cancer cell , – .e ( ). . mollaoglu, g. et al. the lineage-defining transcription factors sox and nkx - determine lung cancer cell fate and shape the tumor immune microenvironment. immunity , – .e ( ). . pan, y. et al. whole tumor rna-sequencing and deconvolution reveal a clinically-prognostic pten/pi k-regulated glioma transcriptional signature. oncotarget , – ( ). . lissanu deribe, y. et al. mutations in the swi/snf complex induce a targetable dependence on oxidative phosphorylation in lung cancer. nat. med. , – ( ). . xu, c. et al.
loss of lkb and pten leads to lung squamous cell carcinoma with elevated pd-l expression. cancer cell , – ( ). . nci-frederick, frederick, md. national laboratory for cancer research. the nci patient-derived models repository (pdmr). ( ). at . broutier, l. et al. human primary liver cancer-derived organoid cultures for disease modeling and drug screening. nat. med. , – ( ). . lee, s. h. et al. tumor evolution and drug response in patient-derived organoid models of bladder cancer. cell , – .e ( ). . ogawa, j., pao, g. m., shokhirev, m. n. & verma, i. m. glioblastoma model using human cerebral organoids. cell rep. , – ( ). . ben-david, u. et al. patient-derived xenografts undergo mouse-specific tumor evolution. nat. genet. , – ( ). . stratton, m. r., campbell, p. j. & futreal, p. a. the cancer genome. nature , – ( ). . balkwill, f. r., capasso, m. & hagemann, t. the tumor microenvironment at a glance. j. cell sci. , – ( ). . lancaster, m. a. & knoblich, j. a. organogenesis in a dish: modeling development and disease using organoid technologies. science , ( ). . bregenzer, m. e. et al. integrated cancer tissue engineering models for precision medicine. plos one , e ( ). . wang, d. h. & souza, r. f. biology of barrett’s esophagus and esophageal adenocarcinoma. gastrointest endosc clin n am , – ( ). . lee, j. et al. tumor stem cells derived from glioblastomas cultured in bfgf and egf more closely mirror the phenotype and genotype of primary tumors than do serum-cultured cell lines. cancer cell , – ( ). . wenger, s. l. et al. comparison of established cell lines at different passages by karyotype and comparative genomic hybridization. biosci. rep. , – ( ). . ben-david, u. et al. genetic and transcriptional evolution alters cancer cell line drug response. nature , – ( ). . cooke, s. l. et al. genomic analysis of genetic heterogeneity and evolution in high- grade serous ovarian carcinoma. oncogene , – ( ). . hristova, v. a. & chan, d. w. 
cancer biomarker discovery and translation: proteomics and beyond. expert rev proteomics , – ( ). . dawson, m. a. & kouzarides, t. cancer epigenetics: from mechanism to therapy. cell , – ( ). . silva, t. c. et al. tcga workflow: analyze cancer genomics and epigenomics data using bioconductor packages. [version ; peer review: approved, approved with reservations]. f res. , ( ). . morgan, m., obenchain, v., hester, j. & pagès, h. summarizedexperiment: summarizedexperiment container. ( ). . pavlidis, p. & noble, w. s. analysis of strain and regional variation in gene expression in mouse brain. genome biol. , research ( ). . geman, d., d’avignon, c., naiman, d. q. & winslow, r. l. classifying gene expression profiles from pairwise mrna comparisons. stat appl genet mol biol , article ( ). . krstajic, d., buturovic, l. j., leahy, d. e. & thomas, s. cross-validation pitfalls when selecting and assessing regression and classification models. j. cheminform. , ( ). . lipton, z. c., elkan, c. & naryanaswamy, b. optimal thresholding of classifiers to maximize f measure. mach. learn. knowl. discov. databases , – ( ). . grow, e. j. et al. intrinsic retroviral reactivation in human preimplantation embryos and pluripotent cells. nature , – ( ). . kolde, r. pheatmap: pretty heatmaps. (cran, ). . wickham, h. ggplot - elegant graphics for data analysis . (springer-verlag new york, ). doi: . / - - - - . gu, z., eils, r. & schlesner, m. complex heatmaps reveal patterns and correlations in multidimensional genomic data. bioinformatics , – ( ). . yoshihara, k. et al.
inferring tumour purity and stromal and immune cell admixture from expression data. nat. commun. , ( ). . kovalchik, s. rismed: download content from ncbi databases. (cran.r-project, ).

figure (panels a-e). panel labels: heatmap of cancer types by cancer models (cancer cell lines (ccl), patient derived xenograft (pdx), genetically engineered mouse model (gemm), tumoroids), colored by classification score from low to high; training process (labeled rna-seq data, select n genes, gene pair transform, select m gene pairs, train random forest classifier); parameter search (set parameters n, m; randomly select tcga data and run training process; assess performance on held out data; repeat for each parameter set; select parameter set with maximum mean auprc; train on all tcga data).

figure (panels a-f). label: ccn score.
figure (panels a-c). label: ccn score.

figure (panels a-d). heatmap annotations for ucec, lusc and luad: general classification, general ccn score, sub-type classification (ucec: endometrioid, serous, normal, unknown; lusc: basal, classical, primitive, secretory, unknown; luad: prox.-inflam, prox.-prolif, tru, unknown).

figure (panels a-e). same heatmap annotations as above. label: ccn score.

figure (panels a-e). same heatmap annotations as above, with an additional genotype annotation. label: ccn score.

figure (panels a-e). same heatmap annotations as above. label: ccn score.

figure.

supplemental figure (panels a-e). schematic labels: training data (rna-seq, tcga; genes by samples); training process (gene pair transform, feature selection, train random forest classifier); cancercellnet broad class classification (ccn scores) added to gene pairs as additional features.

supplemental figure (panels a-e).

supplemental figure (panels a-d).

supplemental figure.
it is made the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / supplemental figure .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / biorxiv.org - the preprint server for biology skip to main content home about submit alerts / rss search for this keyword advanced search subject areas all articles animal behavior and cognition biochemistry bioengineering bioinformatics biophysics cancer biology cell biology clinical trials developmental biology ecology epidemiology evolutionary biology genetics genomics immunology microbiology molecular biology neuroscience paleontology pathology pharmacology and toxicology physiology plant biology scientific communication and education synthetic biology systems biology zoology view by month a global cancer data integrator reveals principles of synthetic lethality, sex disparity and immunotherapy. christopher yogodzinski , ,#*, abolfazl arab - , justin r. pritchard , hani goodarzi - , luke a. 
gilbert , , * department of urology, university of california, san francisco, san francisco, ca, usa helen diller family comprehensive cancer center, san francisco, san francisco, ca, usa department of biochemistry and biophysics, university of california, san francisco, ca, usa department of biomedical engineering, pennsylvania state university, university park, pa department of cellular & molecular pharmacology, university of california, san francisco, ca, usa # current address: university of north carolina chapel hill school of medicine, chapel hill, nc, usa *corresponding authors correspondence: cyogodzi@unc.edu (c.y.), luke.gilbert@ucsf.edu (l.a.g) (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . abstract advances in cancer biology are increasingly dependent on integration of heterogeneous datasets. large scale efforts have systematically mapped many aspects of cancer cell biology; however, it remains challenging for individual scientists to effectively integrate and understand this data. we have developed a new data retrieval and indexing framework that allows us to integrate publicly available data from different sources and to combine publicly available data with new or bespoke datasets. beyond a database search, our approach empowered testable hypotheses of new synthetic lethal gene pairs, genes associated with sex disparity, and immunotherapy targets in cancer. our approach is straightforward to implement, well documented and is continuously updated which should enable individual users to take full advantage of efforts to map cancer cell biology. introduction large scale but often independent efforts have mapped phenotypic characteristics of more than one thousand human cancer cell lines. 
despite this, static lists of univariate data generally cannot identify the underlying molecular mechanisms driving a complex phenotype. we hypothesized that a global cancer data integrator that could incorporate many types of publicly available data including functional genomics, whole genome sequencing, exome sequencing, rna expression data, protein mass spectrometry, dna methylation profiling, chip-seq, atac-seq, and metabolomics data would enable us to link disease features to gene products. we set out to build a resource that enables cross platform correlation analysis of multi-omic data, as this analysis is in and of itself a high-resolution phenotype. multi-omic analysis of functional genomics data with genomic, metabolomic or transcriptomic profiling can link cell state or specific signaling pathways to gene function. lastly, co-essentiality profiling across large panels of cell lines has revealed protein complexes and co-essential modules that can assign function to uncharacterized genes. problematically, in many cases publicly available data are poorly integrated when considering information on all genes across different types of data, and the existing data portals are inflexible. for example, lists of genes cannot be queried against groups of cell lines stratified by mutation status or disease subtype. furthermore, one cannot integrate new data derived from individual labs or other consortia. we created the cancer data integrator (candi), a series of python modules designed to seamlessly integrate genomic, functional genomic, rna, protein and metabolomic data into one ecosystem.
our python framework operates like a relational database without the overhead of running mysql or postgres and enables individual users to easily query this vast dataset and add new data in flexible ways. this was achieved by unifying the indices of these datasets via index tables that are automatically accessed through candi's biologically relevant python classes. we highlight the utility of candi through four types of analysis to demonstrate how complex queries can reveal previously unknown molecular mechanisms in synthetic lethality, sex disparity and immunotherapy. these data nominate new small molecule and immunotherapy anti-cancer strategies in kras-mutant colon, lung and pancreatic cancers.

results

candi is a global cancer data integrator. we set out to integrate three types of data by creating programmatic and biologically relevant abstractions that allow for flexible cross referencing across all datasets. data from the cancer cell line encyclopedia (ccle) for rna expression, dna mutation, dna copy number and chromosome fusions across more than cancer cell lines was integrated into our database with the functional genomics data from the cancer dependency map (depmap) (fig. a,b and supplementary fig. ). we also integrated protein-protein interaction data from the corum database along with three additional distinct protein localization databases. candi by default will access the most recent release of data from depmap, although users can also specify both the release and data type that is accessed.
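the idea of unifying dataset indices through index tables can be illustrated with a minimal pandas sketch. this is not candi's actual code or api: the table layout, the column names (`depmap_id`, `ccle_name`, `tissue`), the identifiers and the `lookup` helper are all hypothetical, and the numbers are toy values. it only shows how one shared index table lets a query move between two datasets that key their cell lines differently.

```python
import pandas as pd

# Hypothetical cell line index table: one row per cell line, mapping the
# identifiers used by different sources onto one shared key.
cell_line_index = pd.DataFrame(
    {
        "depmap_id": ["ACH-1", "ACH-2", "ACH-3"],
        "ccle_name": ["LINE_A_LUNG", "LINE_B_LUNG", "LINE_C_OVARY"],
        "tissue": ["lung", "lung", "ovary"],
    }
).set_index("depmap_id")

# Toy essentiality scores keyed by the DepMap-style identifier.
essentiality = pd.DataFrame(
    {"KRAS": [12.0, -3.0, 1.5], "EGFR": [-1.0, 8.0, 0.5]},
    index=pd.Index(["ACH-1", "ACH-2", "ACH-3"], name="depmap_id"),
)

# Toy expression values keyed by the CCLE-style name instead.
expression = pd.DataFrame(
    {"KRAS": [5.1, 4.2, 3.9]},
    index=pd.Index(
        ["LINE_A_LUNG", "LINE_B_LUNG", "LINE_C_OVARY"], name="ccle_name"
    ),
)


def lookup(gene: str, tissue: str) -> pd.DataFrame:
    """Return essentiality and expression of one gene for one tissue."""
    lines = cell_line_index[cell_line_index["tissue"] == tissue]
    out = pd.DataFrame(index=lines.index)
    out["essentiality"] = essentiality.loc[lines.index, gene]
    # Translate identifiers through the index table before reading the
    # second dataset, which uses a different key.
    out["expression"] = expression.loc[lines["ccle_name"], gene].to_numpy()
    return out


result = lookup("KRAS", "lung")
print(result)
```

in this sketch the index table plays the role of the join table in a relational database, which is why no database server is needed: every dataset stays a plain table on disk and only the key translation is shared.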
the key advantage to this approach is that candi enables one to easily input user defined queries with multi-tiered conditional logic into this large integrated dataset to analyze gene function, gene expression, protein localization and protein-protein interactions.

candi identifies genes that are conditionally essential in brca-mutant ovarian cancer. the concept that loss-of-function tumor suppressor gene mutations can render cancer cells critically reliant on the function of a second gene is known as synthetic lethality. despite the promise of synthetic lethality, it has been challenging to predict or identify genes that are synthetic lethal with commonly mutated tumor suppressor genes. while there are many underlying reasons for this challenge, we reasoned that data integration through candi could identify synthetic lethal interactions missed by others. a paradigmatic example of synthetic lethality emerged from the study of dna damage repair (ddr). somatic mutations in the dna double-strand break (dsb) repair genes, brca / , create an increased dependence on dna single strand break (ssb) repair. this dependence can be exploited through small molecule inhibition of parp mediated ssb repair. inhibition of parp provides significant clinical responses in advanced breast and ovarian cancer patients, but they ultimately progress. thus, new synthetic lethal associations with brca / are a potential path towards therapeutic development for parp refractory patients.
to illustrate the flexibility of candi to mine context specific synthetic sick lethal (ssl) genetic relationships, we hypothesized that the genes that modulate response to a parp inhibitor might be enriched for genes that are selectively essential for proliferation or survival of brca / -mutant cancer cells. to test this hypothesis, we integrated the results of an existing crispr screen that identified genes that modulate response to the parp inhibitor olaparib. we then tested whether any of these genes are differentially essential for cell proliferation or survival in ovarian cancer and in breast cancer cell models that are either brca / proficient or deficient (fig. c,d). this query revealed that the fanconi anemia pathway is selectively essential in brca / -mutated ovarian cancer models but not in brca / -wildtype ovarian cancer, brca / -mutated breast cancer or brca / -wildtype breast cancer models (fig. e and supplementary table ). to our knowledge an ssl phenotype between fancm and brca / has never been reported, although a recent paper nominated a role for fancm and brca in telomere maintenance. importantly, fancm is a helicase/translocase and thus considered to be a druggable target for cancer therapy. clinical genomics data support this ssl hypothesis, although this remains to be tested in ovarian cancer patient samples. because the depmap currently only allows single genes to be queried and does not enable users to easily stratify cell lines by mutation, such analysis would normally take a user several days to complete manually. our approach enabled this analysis to be completed using a desktop computer in less than two hours, which includes the visualization of data presented here (fig. e).

figure .
(a) a schematic showing human cell models integrated by candi. (b) a schematic illustrating types of data integrated by candi. (c) a cartoon of a genome-scale crispri screen to identify genes that modulate response to parp inhibition by olaparib. (d) a schematic depicting data feature inputs parsed by candi. (e) essentiality of fanconi anemia genes in ovarian and breast cancer cell lines separated by brca mutation status. a bayes factor score of gene essentiality is displayed by a heat map. n= brca / -mutant ovarian cancer, n= brca-wildtype ovarian cancer, n= brca / -mutant breast cancer, n= brca / -wildtype breast cancer.

conditional genetic essentiality in kras- and egfr-mutant nsclc cells. beyond tsgs, many common driver oncogenes such as krasg d are currently undruggable, which motivates the search for oncogene specific conditional genetic dependencies. we reasoned that candi enables us to rapidly search functional genomics data for genes that are conditionally essential in lung cancer cells driven by kras and egfr mutations. we stratified non-small cell lung cancer (nsclc) cell models by egfr and kras mutations and then looked at the average gene essentiality for all genes within each of these subtypes of nsclc. we observed that kras is conditionally self-essential in kras-mutant cell models but that no other genes are conditionally essential in kras-mutant, egfr-mutant, kras-wildtype or egfr-wildtype cell models (fig. a,b and supplementary table ). this finding demonstrates that very few, if any, genes are synthetic lethal with kras or egfr mutations in kras- and egfr-mutant lung cancer cell lines.
it may be that these experiments are underpowered, or it may be that when the genetic dependencies of diverse cell lines representing a disease subtype are averaged across a single variable (e.g. a kras-mutation) very few common synthetic lethal phenotypes are observed. candi provides potential solutions for both of these hypotheses.

candi enables a global analysis of conditional essentiality in cancer. it is thought that data aggregation across vast landscapes of unknown covariates does not necessarily increase the statistical power to identify rare associations. thus, the value of global analyses of aggregated cancer data sometimes lies in systematically subsetting data based on key covariates post aggregation. this has been observed in driver gene identification. inspired by our analysis of tsg and oncogene conditional essentiality above, we next used candi to identify genes that are conditionally essential in the context of several hundred cancer driver mutations. we first grouped driver mutations (e.g. nonsense or missense) for each driver gene. for this analysis, we selected several thousand genes that are in the - th percentile of essentiality within the depmap data and therefore conditionally essential, meaning these genes are required for cell growth or survival in a subset of cell lines. importantly, it is not known why these several thousand genes are conditionally essential. we then tested whether each of these conditionally essential genes has a significant association with individual driver mutations. our analytic approach does not weight the number of cell models representing each driver mutation, nor does it give information on phenotype effect sizes.
our analysis nominates a large number of conditionally dependent genetic relationships with both tsgs and oncogenes (fig. c,d and supplementary table ). a number of the conditional genetic dependencies identified in our independent variable analysis above are represented by a limited number of cell models, so further investigation is needed to validate them. nevertheless, these data further suggest that averaging genetic dependencies across diverse cell lines with un-modeled covariates obscures conditional ssl relationships. to further investigate this hypothesis, we analyzed these same conditional genetic relationships with a second analytic approach that weights the number of cell models representing each driver mutation. we observed a limited number of conditional genetic dependencies that largely consist of oncogene self-essential dependencies, as previously highlighted for kras-mutant cell lines (fig. e-g and supplementary table ). thus, analysis that averages each conditional phenotype across diverse panels of cell lines with unknown covariates masks interesting conditional genetic dependencies.

figure . (a) average gene essentiality for kras and egfr in groups of nsclc cell lines stratified by kras mutation status or by both kras and egfr mutation status. n= for kras-wildtype shown in blue, n= for kras-mutant shown in blue.
n= for kras-wildtype egfr-wildtype shown in grey and n= for kras-mutant egfr-wildtype shown in grey. gene essentiality is an averaged bayes factor score for each group of cell lines. (b) average gene essentiality for kras and egfr in groups of nsclc cell lines stratified by egfr mutation status or by both egfr and kras mutation status. n= for egfr-wildtype shown in blue, n= for egfr-mutant shown in blue. n= for egfr-wildtype kras-wildtype shown in grey and n= for egfr-mutant kras-wildtype shown in grey. gene essentiality is an averaged bayes factor score for each group of cell lines. (c) p-values from chi tests of gene essentiality and nonsense mutations. (d) p-values from chi tests of gene essentiality and missense mutations. (e) a scatter plot showing effect size of the change in gene essentiality with select missense mutations and the -log (p-value) of each essentiality/mutation pair. (f) a scatter plot showing effect size of the change in gene essentiality with select nonsense mutations and the -log (p-value) of each essentiality/mutation pair. (g) a scatter plot showing effect size of the change in gene essentiality with all mutations and the -log (p-value) of each essentiality/mutation pair.

candi reveals female and male context specific essential genes in colon, lung and pancreatic cancer. cancer functional genomics data is often analyzed without consideration for fundamental biological properties such as the sex of the tumor from which each cell line is derived. it is well established that biological sex influences cancer predisposition, cancer progression and response to therapy. we hypothesized that individual genes may be differentially essential across male and female cell lines.
this hypothesis, to our knowledge, has never been tested in an unbiased large-scale manner. to maximize our statistical power to identify such differences, we chose to test this hypothesis in a disease setting with a large number of relatively homogeneous cell lines and fewer unknown covariates. using candi, we stratified all kras-mutant nsclc, pancreatic adenocarcinoma (pdac), and colorectal cancer (crc) cell lines by sex and then tested for conditional gene essentiality. this analysis identified a number of genes that are differentially essential in male or female kras-mutant nsclc, pdac and crc models (fig. a-f and supplementary table ). the genes that we identify are not common across all three disease types, suggesting, as one might expect, that the biology of the tumor in part also determines gene essentiality. to test whether any association between differentially essential genes could be identified from expression data (e.g. essential genes encoded on the y chromosome), we first used candi to identify genes that are differentially expressed between male and female cell lines within each disease. we then plotted the set of differentially essential genes against the differentially expressed genes in kras-mutant nsclc, pdac and crc models (fig. a,c,e and supplementary table ) and found little overlap between these gene lists. a number of genes that are more essential in male cells, such as ahcyl , eno , gpi and pkm, regulate cellular metabolism. this finding is consistent with previous literature on sex and metabolism. our analysis demonstrates that stratifying groups of heterogeneous cancer models by three variables, in this case tumor type, kras mutation status and sex, reveals differentially essential genes. candi enables biologically principled stratification of data in the ccle and depmap by any feature associated with a group of cell models. this stratification allows us to identify genes associated with sex, which is not possible with other covariates included.
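the three-variable stratification described above can be sketched as follows, assuming the annotations and bayes factor scores are available as pandas tables. this is an illustration rather than candi's implementation: the table layout, gene names, cell line names and all values are invented, and the mutual information pre-filter described in the methods is omitted. the test and the effect size follow the methods section (a mann-whitney u-test per gene and the difference in mean bayes factors between the two groups).

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Toy annotation table; every value is invented for illustration.
cell_lines = pd.DataFrame(
    {
        "disease": ["nsclc"] * 6 + ["crc"] * 2,
        "kras_mutant": [True] * 6 + [True, False],
        "sex": ["male"] * 3 + ["female"] * 3 + ["male", "female"],
    },
    index=[f"line_{i}" for i in range(8)],
)

# Toy Bayes factor scores (rows: cell lines, columns: genes).
bayes_factors = pd.DataFrame(
    {
        "gene_x": [20.0, 18.0, 22.0, 2.0, 1.0, 3.0, 5.0, 5.0],
        "gene_y": [4.0, 6.0, 5.0, 5.0, 4.0, 6.0, 1.0, 1.0],
    },
    index=cell_lines.index,
)


def sex_differential_essentiality(disease: str) -> pd.DataFrame:
    """Compare male and female KRAS-mutant lines of one disease, per gene."""
    # Stratify by tumor type and KRAS status first, then split by sex.
    subset = cell_lines[
        (cell_lines["disease"] == disease) & cell_lines["kras_mutant"]
    ]
    male = bayes_factors.loc[subset[subset["sex"] == "male"].index]
    female = bayes_factors.loc[subset[subset["sex"] == "female"].index]
    rows = []
    for gene in bayes_factors.columns:
        _, p_value = mannwhitneyu(
            male[gene], female[gene], alternative="two-sided"
        )
        rows.append(
            {
                "gene": gene,
                # Effect size: difference in mean Bayes factor (male - female).
                "mean_difference": male[gene].mean() - female[gene].mean(),
                "p_value": p_value,
            }
        )
    return pd.DataFrame(rows).set_index("gene")


table = sex_differential_essentiality("nsclc")
print(table)
```

with only three cell lines per group, as in this toy table, the test has very little power, which mirrors the point made in the text about choosing disease settings with many relatively homogeneous cell lines.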
figure . (a) differential gene expression and differential gene essentiality in male and female crc cell lines. n= male cell lines and n= female cell lines. (b) the distribution of bayes factor gene essentiality scores in male and female crc cell lines. the top seven and bottom three differentially essential genes are shown in violin plots split by the sex of the cell lines. (c) differential gene expression and differential gene essentiality in male and female nsclc cell lines. n= male cell lines and n= female cell lines. (d) the distribution of bayes factor gene essentiality scores in male and female nsclc cell lines. the top seven and bottom three differentially essential genes are shown in violin plots split by the sex of the cell lines. (e) differential gene expression and differential gene essentiality in male and female pdac cell lines. n= male cell lines and n= female cell lines. (f) the distribution of bayes factor gene essentiality scores in male and female pdac cell lines. the top seven and bottom three differentially essential genes are shown in violin plots split by the sex of the cell lines.

candi enables rapid integration of external datasets to reveal new immunotherapy targets. an emerging challenge in cancer biology is how to robustly integrate larger "resource" datasets like ccle with the vast amount of published data from individual laboratories.
for example, a big challenge in antibody discovery is identifying specific surface markers on cancer cells. to approach these big questions we utilized candi's ability to rapidly take new datasets, such as raw rna-seq counts data from a disparate study of interest, then normalize and integrate this data into the ccle, depmap and protein localization databases previously described. specifically, we rapidly integrated an rna-seq expression dataset that measured the set of transcribed genes in primary lung bronchial epithelial cells from donors. classes within candi enable rapid application of deseq to assess the differential expression between outside datasets and the ccle. we used this feature to identify genes that are differentially expressed between primary lung bronchial epithelial cells and kras-mutant nsclc, egfr-mutant nsclc or all nsclc models in ccle. we then used candi to identify genes that are upregulated in cancer cells over normal lung bronchial epithelial cells with protein products that are localized to the cell membrane. this analysis of kras-mutant, egfr-mutant and pan-nsclc generated highly similar lists of differentially expressed surface proteins (fig. a-f and supplementary table ). notably, overexpression of several of these genes, such as cd and cd , has been observed in lung cancer and is associated with poor prognosis. these proteins represent potential new immunotherapy targets in kras-driven nsclc.

figure . (a) a graph showing genes that are upregulated in kras-mutant nsclc cell lines relative to primary human bronchial epithelial cells. a cell membrane protein localization score is shown for each gene. higher protein localization scores indicate higher confidence annotations. (b) a scatter plot showing gene expression for genes that encode cell surface proteins in kras-mutant nsclc cell lines and primary human bronchial epithelial cells. n= for kras-mutant nsclc cell lines and n= for primary human bronchial epithelial cells. (c) a graph showing genes that are upregulated in egfr-mutant nsclc cell lines relative to primary human bronchial epithelial cells. a cell membrane protein localization score is shown for each gene. higher protein localization scores indicate higher confidence annotations. (d) a scatter plot showing gene expression for genes that encode cell surface proteins in egfr-mutant nsclc cell lines and primary human bronchial epithelial cells. n= for egfr-mutant nsclc cell lines and n= for primary human bronchial epithelial cells. (e) a graph showing genes that are upregulated in nsclc cell lines relative to primary human bronchial epithelial cells. a cell membrane protein localization score is shown for each gene. higher protein localization scores indicate higher confidence annotations. (f) a scatter plot showing gene expression for genes that encode cell surface proteins in nsclc cell lines and primary human bronchial epithelial cells. n= for nsclc cell lines and n= for primary human bronchial epithelial cells.

discussion

data integration is a critical requirement in biology research in the era of genomics and functional genomics. large scale efforts such as the ccle have revealed genomic features of more than cell line models. to our knowledge, this data has not previously been integrated with functional genomics data in a manner that lets individual users enter batched queries that are stratified by disease subtype or mutation status.
this is not just a small improvement in functionality, but rather it is an enabling format that makes possible the types of conditional genomics analyses that drive discovery. moreover, it fills a fundamental gap in the cancer research community by integrating large scale projects with investigator initiated studies. our data framework enables biologists without specialized expertise in bioinformatics to use the full spectrum of data in the ccle and depmap in a higher throughput and more precise manner. using candi, we identified genes that are selectively essential in male versus female kras-mutant nsclc, pdac and crc models. to our knowledge, such analysis has never been performed to begin to query the biologic basis of sex disparity in cancer or cancer therapy. we illustrate another feature of our framework by analyzing a list of hit genes nominated by a bespoke crispr drug screen for gene essentiality in brca / -wildtype and brca / -mutated breast and ovarian cancer. in a third application, we analyzed the principle of synthetic lethality for genes in kras-mutant and egfr-mutant nsclc models. we then used candi to globally identify genes that are conditionally essential in the context of common cancer driver mutations. finally, we nominated potential new immunotherapy targets in kras-mutant, egfr-mutant and pan-nsclc models by using candi to identify genes that are differentially expressed in normal bronchial epithelial cells versus nsclc models and whose products are localized at the plasma membrane. our data reveal a wealth of new hypotheses that can be rapidly generated from publicly available cancer data.
by sharing data flows and use cases with a candi community, we illustrate the ways in which individual research groups can interact with massive cancer genomics projects without reinventing tools or relying upon depmap tool releases. we anticipate that candi will be widely used in cell biology, immunology and cancer research.

methods

candi

the candi data integrator is available at https://github.com/yogiski/candi.

candi module structure

the candi data integrator is a python library built on top of pandas that specializes in integrating publicly available data from the cancer dependency map (depmap release: quarter ), the cancer cell line encyclopedia (ccle release: quarter ), the pooled in-vitro crispr knockout essentiality screens database (pickles library: avana quarter ), the comprehensive resource of mammalian protein complexes (corum), and protein localization data from the cell atlas, the map of the cell, and the in silico surfaceome. data from depmap and ccle used in the following analyses are from the q release. data from pickles are from the quarter release of depmap using the avana library. access to all datasets is controlled via a python class called data. upon import, the data class reads the config file established during installation, defines unique paths to each dataset, and automatically loads the cell line index table and the gene index table. installation of candi, configuration, and data retrieval are handled by a manager class that is accessed indirectly through installation scripts and the data class. interactions with these data are controlled through a parent entity class and several handlers.
the biologically relevant abstraction classes (gene, cellline, cancer, organelle, genecluster, celllinecluster) inherit their methods from entity. entity methods are wrappers for hidden data handler classes that perform specific transformations, such as data indexing and high-throughput filtering.

differential expression

in all cases where it is mentioned, differential expression was evaluated using the deseq r package (release . ). significance was considered to be an adjusted p-value of less than . .

differential essentiality

essentiality scores are taken from the pickles database (avana q ). to reduce the number of hypotheses posed during this analysis, the mutual information of gene essentiality was calculated using the mutual information metric from the python package scikit-learn (version . . ). genes with mutual information scores greater than one standard deviation above the median were removed from consideration. differential essentiality was evaluated by performing a mann-whitney u-test between two groups on every gene that passed the mutual information filter. significance was considered to be a p-value of less than . . the magnitude of differential essentiality of a given gene was shown as the difference in mean bayes factors between two groups of cell lines.

protein localization confidence

protein localization data were assembled from the cell atlas, the map of the cell, and the in silico surfaceome. confidence annotations were taken from the supplemental data of each paper, put on a number scale from to , and summed to give a total confidence score for each localization annotation for every gene across all three papers.
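the confidence-summing step described above can be sketched as follows. this is a minimal illustration, not the candi implementation: the label-to-number mapping (`CONFIDENCE_SCALE`), the function name `total_confidence`, and the example genes and sources are all hypothetical, since the actual scale is not recoverable from the text.

```python
from collections import defaultdict

# Hypothetical mapping from a source's confidence label to a number.
# The scale actually used by the authors is not stated in the text.
CONFIDENCE_SCALE = {"low": 1, "medium": 2, "high": 3}


def total_confidence(annotations):
    """Sum per-source confidence into one score per (gene, location).

    annotations: iterable of (source, gene, location, label) tuples,
    one tuple per localization annotation reported by a source.
    """
    totals = defaultdict(int)
    for _source, gene, location, label in annotations:
        totals[(gene, location)] += CONFIDENCE_SCALE[label]
    return dict(totals)


# Illustrative records only; gene and source names are placeholders.
example = [
    ("cell_atlas", "GENE_A", "plasma membrane", "high"),
    ("map_of_the_cell", "GENE_A", "plasma membrane", "medium"),
    ("surfaceome", "GENE_A", "plasma membrane", "high"),
    ("cell_atlas", "GENE_B", "nucleus", "low"),
]
scores = total_confidence(example)
```

a gene supported by several sources at the same location accumulates a higher total, which is what makes a downstream cutoff on the summed score meaningful.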
the analysis shown in figure represents a gene list that was further manually curated to remove the genes that are localized to the intracellular space at the cell membrane, revealing cell surface protein targets that are highly expressed in nsclc cancer models over normal lung bronchial epithelial cells.

depmap creative commons license

when an individual user runs candi they are downloading depmap data and thus are agreeing to a cc attribution . license (https://creativecommons.org/licenses/by/ . /).

synthetic lethality of fanconi anemia genes in ovarian and breast cancer models

we made a list of the top gene hits that confer sensitivity to parp inhibition in hela cells. using candi, the essentiality scores of these top hits were visualized across all ovarian cancer cell models in pickles (avana q ). fanca and fance showed selective essentiality in the brca / mutant ovarian cancer cell lines. following this observation, candi was used to gather the gene essentiality for all fanc genes in the fanconi anemia pathway. candi was then used to visualize these data across all ovarian and breast cancer cell lines, sorting by brca / mutation status.

synthetic lethality in kras and egfr mutant cell lines

candi was leveraged to bin nsclc cell lines present in both ccle (release: q ) and pickles (avana q ) into groups: kras mutant and kras wild type cell lines, with and without egfr mutants removed, as well as egfr mutant and egfr wild type cell lines, with and without kras mutants removed. the mean essentiality score for every gene in the genome was calculated for every group of cell lines.
synthetic lethality score per gene is defined as the change in mean essentiality from the mutant groups to the wild type groups.

pan cancer synthetic lethality analysis

a set of core oncogenes and tumor suppressor driver mutations was chosen for analysis. to test the effect of these genes' mutations on gene essentiality, candi was leveraged to split them into two groups: a nonsense mutation group containing genes annotated as tumor suppressors (n= ) and a missense mutation group containing genes annotated as oncogenes with specific driver protein changes (n= ). candi was then used to collect a core set of genes with highly variable essentiality. to do this, the bayes factors from the pickles database (avana q ) were converted to binary numeric variables. bayes factors over were assigned a =essential and bayes factors under were assigned a =non-essential. genes were then sorted by their variance across cell lines, and genes between the th and th percentile were used for this analysis (n= ). to determine a short list of genes to follow up on, chi-squared tests were applied to the gene pairs in the missense group and the gene pairs in the tumor suppressor group. three new groups were formed for further analysis: the first consisted of the significant gene/mutation pairs from the oncogenic group, the second consisted of the significant gene/mutation pairs from the tumor suppressor group, and the third was a combination of the significant pairs from both groups with no discrimination on the type of mutations considered. these groups were further analyzed for differential essentiality via the mann-whitney method described above, and the cohen's d effect size was calculated to measure the extent of the phenotype.
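the binarization, variance-based gene selection, and effect size steps above can be sketched in plain python. this is an illustration under stated assumptions rather than the authors' code: the function names, the bayes factor threshold of 5, the percentile band, and the example values are hypothetical, because the actual cutoffs are not recoverable from the text. cohen's d is computed here with the pooled sample standard deviation, which is one common convention.

```python
import math
from statistics import mean, pvariance, stdev


def binarize(bayes_factors, threshold):
    """Map Bayes factors to 1 (essential) if above threshold, else 0."""
    return [1 if bf > threshold else 0 for bf in bayes_factors]


def variance_band(bf_by_gene, threshold, lo_pct, hi_pct):
    """Return genes whose binarized variance ranks within [lo_pct, hi_pct)."""
    variances = {
        gene: pvariance(binarize(values, threshold))
        for gene, values in bf_by_gene.items()
    }
    ranked = sorted(variances, key=variances.get)  # ascending variance
    n = len(ranked)
    return ranked[int(n * lo_pct / 100):int(n * hi_pct / 100)]


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    va, vb = stdev(group_a) ** 2, stdev(group_b) ** 2
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (mean(group_a) - mean(group_b)) / pooled


# Hypothetical Bayes factors for four genes across four cell lines.
example = {
    "G1": [10, 12, 11, 9],   # always essential, zero variance
    "G2": [10, -5, 12, -8],  # essential in half of the lines
    "G3": [-3, -4, -6, 8],   # essential in one line
    "G4": [-1, -2, -3, -4],  # never essential, zero variance
}
variable_genes = variance_band(example, threshold=5, lo_pct=50, hi_pct=100)
```

filtering on the variance of the binarized calls removes genes that are essential everywhere or nowhere, which cannot show context-specific dependence on a mutation.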
differential expression and essentiality of male and female kras driven cancers

we used candi to gather all cell lines that are present in both pickles (avana q ) and ccle (release q ). candi was then leveraged to put these cell lines into the following tissue groups: kras mutant colon/colorectal, pdac, and nsclc. each tissue group was then split into male and female sub-groups. differential expression was analyzed by applying the methods described above to raw rna-seq counts data from ccle (release: q ). genes with adjusted p-values less than . were considered significantly differentially expressed. differential essentiality was analyzed using the methods described above on the previously described sex sub-groups for each tissue type. genes with p-values less than . were considered significantly differentially essential between male and female cell models. for each tissue type, the distributions of the top significantly differentially essential genes were highlighted in comparison with the bottom as a negative control.

differential expression of benign and malignant cancer cell lines

we downloaded human bronchial epithelial (hbe) rna-seq data from gillen et al. via the european nucleotide archive to use as a benign lung tissue model. this data set contains gene expression data for primary hbe cells cultured from three different donors and also nhbe cells (lonza cc- , a mixture of hbe and human tracheal epithelial cells). we then used candi to put nsclc models into three different groups: kras mutant, egfr mutant, and all cell lines. for our benign model, raw counts were quantified via kallisto. raw counts for our malignant cell lines were queried via candi.
deseq was then applied to evaluate the differential expression between our normal lung tissue model and our three malignant lung tissue groups. the results from deseq were then filtered by significance (adjusted p-value < . ). to filter for potential immunotherapy targets, we removed all genes not annotated as being localized to the plasma membrane, and genes with localization confidence scores lower than six. genes that were obviously mis-annotated as surface proteins were also manually removed.

supplementary figure/table legends

supplementary figure . an object-oriented schema diagram showing core structure of candi software.
supplementary table . a table containing raw pickles bayes factors displayed in the heat map of fig. e.
supplementary table . a table containing mean pickles bayes factors for each series displayed in fig. a,b.
supplementary table . a table containing the data for all chi-squared tests performed to generate fig. c,d.
supplementary table . a table containing the data for scatter plots shown in fig. e,f,g.
supplementary table . a table containing the data from the differential essentiality analysis for all three tissues in fig. a-f.
supplementary table . a table containing the data from the differential expression analysis for all three tissues in fig. a,c,e.
supplementary table .
a table containing the differential expression analysis data merged with the location data for all three tissues shown in fig. .

acknowledgements

we thank everyone in the gilbert lab for helpful comments and discussion. lag is supported by k /r ca and dp ca as well as the goldberg-benioff endowed professorship in prostate cancer translational biology.

conflicts of interest

none

bibliography

ghandi, m. et al. next-generation characterization of the cancer cell line encyclopedia. nature , – ( ).
li, h. et al. the landscape of cancer cell line metabolism. nat. med. , – ( ).
tsherniak, a. et al. defining a cancer dependency map. cell , - .e ( ).
thul, p. j. et al. a subcellular map of the human proteome. science , ( ).
cancer cell line encyclopedia consortium & genomics of drug sensitivity in cancer consortium. pharmacogenomic agreement between two cancer cell line data sets. nature , – ( ).
barretina, j. et al. the cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. nature , – ( ).
bausch-fluck, d. et al. the in silico human surfaceome. pnas , e –e ( ).
giurgiu, m. et al. corum: the comprehensive resource of mammalian protein complexes- . nucleic acids res. , d –d ( ).
nusinow, d. p. et al. quantitative proteomics of the cancer cell line encyclopedia. cell , - .e ( ).
szklarczyk, d. et al. the string database in : quality-controlled protein-protein association networks, made broadly accessible. nucleic acids res. , d –d ( ).
itzhak, d. n., tyanova, s., cox, j. & borner, g. h. global, quantitative and dynamic mapping of protein subcellular localization. elife , ( ).
meyers, r. m. et al. computational correction of copy number effect improves specificity of crispr-cas essentiality screens in cancer cells. nat. genet. , – ( ).
behan, f. m. et al. prioritization of cancer therapeutic targets using crispr–cas screens. nature , – ( ).
wang, t. et al. identification and characterization of essential genes in the human genome. science , – ( ).
hart, t. et al. high-resolution crispr screens reveal fitness genes and genotype-specific cancer liabilities. cell , – ( ).
wang, t. et al. gene essentiality profiling reveals gene networks and synthetic lethal interactions with oncogenic ras. cell , - .e ( ).
chan, e. m. et al. wrn helicase is a synthetic lethal target in microsatellite unstable cancers. nature , – ( ).
adamson, b. et al. a multiplexed single-cell crispr screening platform enables systematic dissection of the unfolded protein response. cell , - .e ( ).
wainberg, m. et al. a genome-wide almanac of co-essential modules assigns function to uncharacterized genes. http://biorxiv.org/lookup/doi/ . / ( ) doi: . / .
lenoir, w. f., lim, t. l. & hart, t. pickles: the database of pooled in-vitro crispr knockout library essentiality screens. nucleic acids res , d –d ( ).
bausch-fluck, d. et al. a mass spectrometric-derived cell surface protein atlas. plos one , ( ).
o’connor, m. j. targeting the dna damage response in cancer. mol. cell , – ( ).
zimmermann, m. et al. crispr screens identify genomic ribonucleotides as a source of parp-trapping lesions. nature , – ( ).
pan, x. et al. fancm, brca , and blm cooperatively resolve the replication stress at the alt telomeres. pnas , e –e ( ).
lou, k., gilbert, l. a. & shokat, k. m. a bounty of new challenging targets in oncology for chemical discovery. biochemistry , – ( ).
narayan, g. et al. promoter hypermethylation of fancf: disruption of fanconi anemia-brca pathway in cervical cancer. cancer res , – ( ).
ideker, t., dutkowski, j. & hood, l. boosting signal-to-noise in complex biology: prior knowledge is power. cell , – ( ).
chang, m. t. et al. identifying recurrent mutations in cancer reveals widespread lineage diversity and mutational specificity. nat. biotechnol. , – ( ).
lou, k. et al. krasg c inhibition produces a driver-limited state revealing collateral dependencies. sci signal , ( ).
cancer disparities - national cancer institute. https://www.cancer.gov/about-cancer/understanding/disparities ( ).
love, m. i., huber, w. & anders, s. moderated estimation of fold change and dispersion for rna-seq data with deseq . genome biology , ( ).
rubin, j. b. et al. sex differences in cancer mechanisms. biol sex differ , ( ).
gillen, a. e. et al. molecular characterization of gene regulatory networks in primary human tracheal and bronchial epithelial cells. j. cyst. fibros. , – ( ).
mj, k. et al. prognostic significance of cd overexpression in non-small cell lung cancer. lung cancer (amsterdam, netherlands) vol. https://pubmed.ncbi.nlm.nih.gov/ / ( ).
ko, y. h. et al. prognostic significance of cd s expression in resected non-small cell lung cancer. bmc cancer , ( ).
penno, m. b. et al. expression of cd in human lung tumors. cancer res , – ( ).
bailey, m. h. et al. comprehensive characterization of cancer driver genes and mutations. cell , - .e ( ).
bray, n. l., pimentel, h., melsted, p. & pachter, l. near-optimal probabilistic rna-seq quantification. nat biotechnol , – ( ).

[figure, panels a-e: schematic of a genome-scale crispr sgrna screen in hela cells (lentiviral transduction of the sgrna library, olaparib versus untreated, sgrna abundance counted by deep sequencing to measure gene/drug phenotypes) and of candi, the cancer data integrator, combining essentiality, mutation, cellular genomics, functional genomics, transcriptomics and proteomics data across breast, cervical and ovarian cancer cell lines.]

[figure, panels a-f: differential essentiality (Δ average bf) versus differential expression (log (fc)) between male and female cell lines for colon, pancreas and lung, with violin plots of bayes factors per gene for top hit and negative control genes in female and male cell lines; the essential gene threshold is marked.]

[figure, panels a-f: log (fold change) versus -log (q value), and gene expression (log (tpm+ )) by cell line type (benign bronchial, malignant), for kras mutant, egfr mutant and all lung cancer; legend: location confidence.]

[figure, panels a-g: gene essentiality (average bf) in kras mt versus kras wt cell lines (egfr mt included or removed) and in egfr mt versus egfr wt cell lines (kras mt included or removed), with the essential gene threshold marked; effect size and p-value of essentiality/mutation pairs for missense oncogenes, nonsense tumor suppressor genes and all mutations, distinguishing significant hits from non-hits.]
ancestralclust: clustering of divergent nucleotide sequences by ancestral sequence reconstruction using phylogenetic trees

lenore pipes ,∗ and rasmus nielsen , , ∗

department of integrative biology, university of california-berkeley, berkeley, , usa, department of statistics, university of california-berkeley, berkeley, ca , usa, and globe institute, university of copenhagen, københavn k, denmark

∗to whom correspondence should be addressed.

abstract

motivation: clustering is a fundamental task in the analysis of nucleotide sequences. despite the exponential increase in the size of sequence databases of homologous genes, few methods exist to cluster divergent sequences. traditional clustering methods have mostly focused on optimizing high speed clustering of highly similar sequences. we develop a phylogenetic clustering method which infers ancestral sequences for a set of initial clusters and then uses a greedy algorithm to cluster sequences.
results: we describe a clustering program ancestralclust, which is developed for clustering divergent sequences. we compare this method with other state-of-the-art clustering methods using datasets of homologous sequences from different species. we show that, in divergent datasets, ancestralclust has higher accuracy and more even cluster sizes than current popular methods.
availability and implementation: ancestralclust is an open source program available at https://github.com/lpipes/ancestralclust
contact: lpipes@berkeley.edu
supplementary information: supplementary figures and table are available online.

introduction

traditional clustering methods such as uclust (edgar, ), cd-hit (fu et al., ), and dnaclust (ghodsi et al., ) use hierarchical or greedy algorithms that rely on user input of a sequence identity threshold. these methods were developed for high speed clustering of large numbers of highly similar sequences (ghodsi et al., ; li et al., ; edgar, ) and are generally considered unreliable for identity thresholds < % because of either the poor quality of alignments at low identities (zou et al., ) or because the performance of the threshold used to count short words drops dramatically with low identities (huang et al., ). at low identities, these methods produce uneven clusters where the majority of sequences are contained in only a few clusters (chen et al., ), and the high variance in cluster sizes reduces the utility of the clustering step for many practical purposes. clustering of divergent sequences is a fundamental step in genomics analysis because it allows for an early divide-and-conquer strategy that will significantly increase the speed of downstream analyses (zheng et al., ), and clustering of divergent sequences is a frequent request of users of at least one clustering method (huang et al., ). currently, there are no clustering methods that can accurately cluster large taxonomically divergent metabarcoding reference databases such as the barcode of life database (ratnasingham and hebert, ) in relatively even clusters. only a few other methods, such as spclust (matar et al., ) and treecluster (balaban et al., ), exist for clustering potentially divergent sequences. spclust creates clusters based on the use of laplacian eigenmaps and the gaussian mixture model based on a similarity matrix calculated on all input sequences. while this approach is highly accurate, the calculation of an all-to-all similarity matrix is a computationally exhaustive step. treecluster uses user-specified constraints for splitting a phylogenetic tree into clusters. however, treecluster requires an input tree and thus can also be prohibitively slow for large numbers of sequences where a phylogenetic tree is difficult to estimate reliably. with the increasing size of reference databases (schoch et al., ), there is a need for new computationally efficient methods that can cluster divergent sequences. here we present ancestralclust, which was specifically developed for clustering of divergent metabarcoding reference sequences in clusters of relatively even size.

methods

to cluster divergent sequences, we developed ancestralclust, which is written in c (figure ). first, k random sequences are chosen and the sequences are aligned pairwise using the wavefront algorithm (marco-sola et al., ). a jukes-cantor distance matrix is constructed from the alignments and a neighbor-joining phylogenetic tree is constructed. the jukes-cantor model is chosen for computational speed; more complex models could in principle be used to potentially increase accuracy, at the cost of increased computational time. the c − 1 longest branches in the tree are then cut to yield c clusters. these subtrees comprise the initial starting clusters. the sequences in each starting cluster are aligned in a multiple sequence alignment using kalign (lassmann, ). the ancestral sequence at the root of the tree of each cluster is estimated using the maximum of the posterior probability of each nucleotide using standard programming algorithms from phylogenetics (see e.g., yang, ). the ancestral sequences are used as the representative sequence for each cluster. next, the rest of the sequences are assigned to each cluster based on the shortest nucleotide distance from the wavefront alignment between the sequence and the c ancestral sequences. if the shortest distance to any of the c ancestral sequences is larger than the average distance between clusters, the sequence is saved for the next iteration. we iterate this process until all sequences are assigned to a cluster. in each iteration after the first, a branch in the phylogenetic tree is cut if it is longer than the average length of the branches cut in the first iteration. in practice, only one or two iterations are needed for most data sets if k is defined to be sufficiently large.

we compared ancestralclust to five other state-of-the-art clustering methods: uclust (edgar, ), meshclust (james and girgis, ), dnaclust (ghodsi et al., ), cd-hit (fu et al., ), and spclust (matar et al., ). we used a variety of measurements to assess the accuracy and evenness of the clustering. we calculated two traditional measures of accuracy, purity and normalized mutual information (nmi), used in bonder et al. ( ). the purity of clusters is calculated as:

$$\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j|$$

where $\Omega = \{\omega_1, \omega_2, \ldots, \omega_K\}$ is the set of clusters, $C = \{c_1, c_2, \ldots, c_J\}$ is the set of taxonomic classes, and $N$ is the total number of sequences. nmi is calculated as:

$$\mathrm{NMI}(\Omega, C) = \frac{I(\Omega, C)}{[H(\Omega) + H(C)]/2}$$

where $I(\Omega, C)$ is the mutual information gain and $H$ is the entropy function.
to measure the evenness of the clusters, we used the coefficient of variation, which is calculated as:

$$\mathrm{CV} = \frac{\sqrt{\sum_{i=1}^{J} (n_i - m)^2 / J}}{m}$$

where $n_i$ is the number of sequences in cluster $i$, $J$ is the total number of clusters, and $m$ is the mean size of the clusters. we also used a taxonomic incompatibility measure to assess the accuracy of the clusters. let a,b be a pair of species found in cluster i. incompatibility at a given taxonomic rank is calculated by first identifying the number of times a and b exist in clusters other than cluster i. the total incompatibility is calculated by summing over all pairs of sequences (a,b) and all i. both nmi and taxonomic incompatibility are very sensitive to the number of clusters and also to unevenness of cluster sizes. to allow fair comparison when numbers of clusters and evenness of cluster sizes vary, we therefore calculate the relative nmi and relative incompatibility. these measures are calculated by scaling the raw values relative to their expected values under random assignments given the number of clusters and the cluster sizes. we estimated relative nmi by dividing the raw nmi score by the average nmi of clusterings in which sequences have been assigned at random with equal probability to clusters, such that the cluster sizes are the same as the cluster sizes produced in the original clustering. the same procedure was used to convert the taxonomic incompatibility measure into relative incompatibility.

results

to first assess performance of clustering methods on divergent nucleotide sequences, we used random samples of , sequences from three metabarcode reference databases ( , s, and cytochrome oxidase i (coi)) from the caledna project (meyer et al., ). we chose to compare our method on this dataset against uclust because it is the most widely used clustering program and it performs better than cd-hit on low identity thresholds (chen et al., ).
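the evenness and relative measures defined in the methods can be sketched as follows. this is an assumption-laden illustration rather than the authors' procedure: random assignment with fixed cluster sizes is implemented by shuffling the cluster label vector, and the function names, the number of random clusterings (`n_random`) and the seed are hypothetical choices.

```python
import math
import random
from collections import Counter


def coefficient_of_variation(cluster_sizes):
    """Population standard deviation of cluster sizes divided by their mean."""
    j = len(cluster_sizes)
    m = sum(cluster_sizes) / j
    return math.sqrt(sum((n - m) ** 2 for n in cluster_sizes) / j) / m


def nmi(clusters, classes):
    """Normalized mutual information with an arithmetic-mean denominator."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    n_w, n_c = Counter(clusters), Counter(classes)
    info = sum((k / n) * math.log(k * n / (n_w[w] * n_c[c]))
               for (w, c), k in joint.items())
    h_w = -sum((k / n) * math.log(k / n) for k in n_w.values())
    h_c = -sum((k / n) * math.log(k / n) for k in n_c.values())
    denom = (h_w + h_c) / 2
    return info / denom if denom > 0 else 0.0


def relative_nmi(clusters, classes, n_random=100, seed=0):
    """Observed NMI divided by the mean NMI of size-preserving random clusterings."""
    rng = random.Random(seed)
    shuffled = list(clusters)
    total = 0.0
    for _ in range(n_random):
        rng.shuffle(shuffled)  # same cluster sizes, random membership
        total += nmi(shuffled, classes)
    expected = total / n_random
    return nmi(clusters, classes) / expected if expected > 0 else float("inf")


# A perfect two-cluster split of 20 sequences from two classes.
labels = [0] * 10 + [1] * 10
truth = ["a"] * 10 + ["b"] * 10
ratio = relative_nmi(labels, truth)
```

a relative value above 1 indicates that the clustering agrees with the taxonomy more than a random clustering with the same cluster sizes would; the same scaling could be applied to any score that depends on cluster number and size.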
we first compared ancestralclust against uclust using relative nmi and coefficient of variation (figure ). we used k = random initial sequences, which is % of the total number of sequences in each sample, and c = cuts in the initial phylogenetic tree. notice that the relative nmi tends to be higher, with a lower coefficient of variation, for ancestralclust across all barcodes. this suggests that, for these divergent edna sequences, ancestralclust provides clusterings that are more even in size and more consistent with conventional taxonomic assignment.

as a second measure of accuracy, we measured relative incompatibility and coefficient of variation for ancestralclust and uclust on the same datasets under the same running conditions. notice in figure that ancestralclust tends to create balanced clusters with lower relative taxonomic incompatibilities compared to uclust at all taxonomic levels. similar results are seen for metabarcode s (fig s ). however, for metabarcode s (fig s ), ancestralclust performs noticeably better than uclust at the species, genus, and family levels, but at the order, class, and phylum levels it performs either the same or worse. also, at the species, genus, and family levels, it is apparent that as the uclust clusters approach a lower coefficient of variation, the relative incompatibility increases dramatically.

next, we analyzed two datasets with different properties: one dataset of diverse species from the same gene and another dataset of homologous genes from species of the same phyla.
in the first dataset, we expect the sequences to cluster according to species. in the second dataset, we expect the sequences to cluster according to different genes. we compared ancestralclust to four commonly used clustering programs (uclust, meshclust, cd-hit, and dnaclust) and one clustering program designed for divergent sequences, spclust. the first dataset contained , sequences from the coi caledna database from divergent species that were from different phyla and different classes, and the second dataset contained sequences from different genes from taxonomically similar species.

first, we compared all methods using , coi sequences from the different species (table ). we expect these sequences to form different clusters, each including all the sequences from one species. we chose identity thresholds to enforce the expected number of clusters for each method. we were unable to form clusters using cd-hit because the program does not allow clustering of sequences with identity thresholds < % at default parameters. for spclust, we used the three precision modes available for the method. in this analysis, ancestralclust achieved a perfect clustering (the purity was and relative incompatibility was ), although it was the second slowest and had the second lowest memory requirements. uclust was one of the fastest methods and used the least amount of memory, but had the second lowest purity and the third highest relative nmi value. meshclust had no incompatibilities and the second highest purity and relative nmi values, but was the third slowest method. dnaclust had the most uneven clusters and the second lowest relative nmi value, with the highest relative incompatibility. spclust only identified one cluster, with a computational time of ~ days. in comparison, ancestralclust took ~ minutes and uclust used < second.

next, we analyzed ’genomic set ’ from matar et al.
( ), which consists of sequences from homologous genes (fcer g, s a , s a , s a , s a , and sh bgrl in table ). we expect these sequences to form clusters. we varied the identity thresholds for uclust and meshclust using thresholds . , . , and . . for cd-hit, we used the lowest identity threshold available on default parameters, which is . . we were unable to use dnaclust for this analysis because it cannot handle sequences longer than bp (the average sequence length was , . bp and the longest sequence was , bp). since this dataset contained different genes, we calculated relative nmi using genes as the classes and did not use incompatibility as an accuracy measure. only ancestralclust, uclust, and meshclust produced the expected number of clusters, and among the methods that created the expected number of clusters, ancestralclust had the highest purity value. ancestralclust was the second slowest method and had the highest memory requirements. this is due to the wavefront alignment algorithm, which is $O(s^2)$ in memory requirements, where $s$ is the alignment score. since alignments were performed using different genes that were longer than . kb, this resulted in a high value of $s$. spclust had the highest relative nmi using all precision modes and the same purity as ancestralclust for its moderate and maximum precision modes; however, it failed to produce the expected number of clusters.

conclusions

we developed a phylogenetic-based clustering method, ancestralclust, specifically to cluster divergent metabarcode sequences. we performed a comparative study between ancestralclust and widely used clustering programs such as uclust, cd-hit, dnaclust, meshclust, and, for divergent sequences, spclust. uclust and dnaclust are substantially faster than ancestralclust and should be the preferred method if computational speed is the main concern.
however, ancestralclust tends to form clusters of more even size with lower taxonomic incompatibility and higher nmi than other methods, for the relatively divergent sequences analyzed here. we recommend the use of ancestralclust when sequences are divergent, especially if a relatively even clustering is also desirable, for example for various divide-and-conquer approaches where the computational cost of downstream analyses increases faster than linearly with cluster size.

acknowledgements

this work used the extreme science and engineering discovery environment (xsede) bridges system at the pittsburgh supercomputing center through allocation bio .

references

balaban, m., moshiri, n., mai, u., jia, x., and mirarab, s. ( ). treecluster: clustering biological sequences using phylogenetic trees. plos one, ( ), e .
bonder, m. j., abeln, s., zaura, e., and brandt, b. w. ( ). comparing clustering and pre-processing in taxonomy analysis. bioinformatics, ( ), – .
chen, q., wan, y., zhang, x., lei, y., zobel, j., and verspoor, k. ( ). comparative analysis of sequence clustering methods for deduplication of biological databases. j. data and information quality, ( ).
edgar, r. c. ( ). search and clustering orders of magnitude faster than blast. bioinformatics, ( ), – .
fu, l., niu, b., zhu, z., wu, s., and li, w. ( ). cd-hit: accelerated for clustering the next-generation sequencing data. bioinformatics, ( ), – .
ghodsi, m., liu, b., and pop, m. ( ). dnaclust: accurate and efficient clustering of phylogenetic marker genes. bmc bioinformatics, ( ), – .
huang, y., niu, b., gao, y., fu, l., and li, w. ( ). cd-hit suite: a web server for clustering and comparing biological sequences. bioinformatics, ( ), – .
james, b. t. and girgis, h. z. ( ). meshclust : application of alignment-free identity scores in clustering long dna sequences. biorxiv, page .
lassmann, t. ( ). kalign : multiple sequence alignment of large datasets.
li, w., jaroszewski, l., and godzik, a. ( ). clustering of highly homologous sequences to reduce the size of large protein databases. bioinformatics, ( ), – .
marco-sola, s., moure lópez, j. c., moreto planas, m., and espinosa morales, a. ( ). fast gap-affine pairwise alignment using the wavefront algorithm. bioinformatics, (btaa ), – .
matar, j., khoury, h. e., charr, j.-c., guyeux, c., and chrétien, s. ( ). spclust: towards a fast and reliable clustering for potentially divergent biological sequences. computers in biology and medicine, , .
meyer, r. s., curd, e. e., schweizer, t., gold, z., ramos, d. r., shirazi, s., kandlikar, g., kwan, w.-y., lin, m., freise, a., et al. ( ). the california environmental dna “caledna” program. biorxiv, page .
ratnasingham, s. and hebert, p. d. ( ). bold: the barcode of life data system (http://www.barcodinglife.org). molecular ecology notes, ( ), – .
schoch, c. l., ciufo, s., domrachev, m., hotton, c. l., kannan, s., khovanskaya, r., leipe, d., mcveigh, r., o’neill, k., robbertse, b., et al. ( ). ncbi taxonomy: a comprehensive update on curation, resources and tools. database, .
yang, z. ( ). molecular evolution: a statistical approach. oxford university press.
zheng, w., mao, q., genco, r. j., wactawski-wende, j., buck, m., cai, y., and sun, y. ( ). a parallel computational framework for ultra-large-scale sequence clustering analysis. bioinformatics, ( ), – .
zou, q., lin, g., jiang, x., liu, x., and zeng, x. ( ). sequence clustering in bioinformatics: an empirical study. briefings in bioinformatics, ( ), – .

figure . overview of ancestralclust. in ( ), k random sequences are chosen for the initial clusters. in ( ), a distance matrix is constructed using the k sequences. using the distance matrix, a neighbor-joining tree is constructed and c − 1 cuts are made to create c clusters. in ( ), a multiple sequence alignment is built for each cluster and the ancestral sequences are reconstructed at the root node of each tree. the rest of the unassigned sequences are then aligned to the ancestral sequences of each cluster and the shortest distance to each ancestral sequence is calculated. the process is iterated until all sequences are assigned to a cluster.

figure . relative nmi against coefficient of variation for ancestralclust and uclust for samples of , randomly chosen s, s, and coi reference sequences from the caledna project (meyer et al., ). the similarity threshold for uclust was . . for ancestralclust, we used initial random sequences with initial clusters. relative nmi was calculated by dividing nmi by the average of random samples of the same fixed cluster size.

figure .
relative incompatibility against coefficient of variation for ancestralclust and uclust for samples of , randomly chosen coi reference sequences. coi reference sequences are from the caledna project (meyer et al., ). the similarity threshold for uclust was . . for ancestralclust, we used initial random sequences with initial clusters.

table . comparisons of clustering methods using , coi sequences from different species. the list of species can be found in table s . incompatibility was calculated at the taxonomic rank of species. for uclust, meshclust, and dnaclust, the identity thresholds were chosen to force the expected number of clusters. for cd-hit, the lowest possible identity was chosen, which is . . in the case of spclust, coefficient of variation cannot be calculated for cluster. spclust clusters were created with version .
method | # of clusters | time (sec) | mem (mb) | purity | relative incompat. (species) | relative nmi | coeff. of var.
ancestralclust . . . .
uclust < . . . . .
meshclust . . . . .
cd-hit . . . . .
dnaclust < . . . . .
spclust (fast) . . -
spclust (moderate) . . -
spclust (maxprecision) . . -

table . comparisons of clustering methods using sequences from homologous genes from matar et al. ( ). ’id’ refers to the identity threshold used. we used identity thresholds of . , . , and . for uclust and meshclust. we used precision levels of fast, moderate, and maximum for spclust using version , since version only produced cluster for all modes. dnaclust has a maximum sequence length of bp and could not be used on this dataset.
method | # of clusters | time (sec) | memory (mb) | purity | relative nmi | coefficient of variation
ancestralclust . . . . .
uclust (id= . ) . . . .
uclust (id= . ) . . . .
uclust (id= . ) . . . . .
meshclust (id= . ) . . . . .
meshclust (id= . ) . . . . .
meshclust (id= . ) . . . . .
spclust (fast) . . . . .
spclust (moderate) . . . . .
spclust (max precision) . . . . .
cd-hit (id= . ) . . . . .
dnaclust - - - - - -

rdrugtrajectory: an r package for the analysis of drug prescriptions in electronic health care records

jss journal of statistical software mmmmmm yyyy, volume vv, issue ii. doi: . /jss.v .i

anthony nash university of oxford
tingyee e. chang university of oxford
benjamin wan kings college london
m. zameel cader university of oxford

abstract

primary care electronic health care records are rich with patient and clinical information. studying electronic health care records has resulted in marked improvements to national health care processes and patient-care decision making, and is a powerful supplementary source of data for drug discovery efforts. we present the r package rdrugtrajectory, designed to yield demographic and patient-level characteristics of drug prescriptions in the uk clinical practice research datalink dataset.
the package operates over clinical practice research datalink gold clinical, referral and therapy datasets and includes features such as first drug prescription analysis, cohort-wide prescription information, cumulative drug prescription events, the longitudinal trajectory of drug prescriptions, and a survival analysis timeline builder to identify risks related to drug prescription switching. the rdrugtrajectory package has been made freely available via the github repository.

keywords: ehr, electronic health care records, cprd, clinical practice research datalink, prescriptions, r, therapeutics, drug discovery, clinical epidemiology.

introduction

the uk clinical practice research datalink (cprd) service offers high quality longitudinal data on million patients with up to years of follow-up for % of those patients. the service provides drug treatment patterns, feasibility studies and health care resource use studies. patient electronic health care records (ehr) are stored as coded and anonymised data and sourced from over , primary care practices across england. cprd holds information on consultation events, medical diagnoses, symptoms, prescriptions, vaccination history, laboratory tests, and referrals. cprd can provide routine linkage to other health-related patient datasets, for example: small area level data, such as patient and/or practice postcode linked deprivation measures; data from nhs digital, which includes hospital episode statistics, outpatient and accident and emergency data; and cancer data from public health england.

.cc-by-nd . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint http://dx.doi.org/ . /jss.v .i https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nd/ . /

evidence from ehrs is making an impact on primary care decision-making and best practice (oyinlola et al., ). with nationwide longitudinal datasets more readily available, the evaluation of treatments over long timescales can contribute to clinical decision-making (hepp et al., ). for example, adverse events caused by prescription medication can be studied using retrospective data in situations where randomized clinical trials may prove impractical (ghosh et al., ; bally et al., ).

this publication serves as an introduction to the rdrugtrajectory r package. whilst it is by no means a complete tutorial, we will expand on some of the main package features, such as how to: isolate patients by first drug prescriptions at given clinical events; calculate time-invariant prescriptions; construct survival analysis timelines (compatible with cox proportional hazard regression and kaplan meier curves); and visualise patient prescription switching. for a comprehensive list of functions please visit the github repository https://github.com/acnash/rdrugtrajectory. almost all features can be controlled by covariates or stratified by some variable, for example, by gender, age, medical codes or treatment product codes.

the example code, figures and data structures presented here mimic a small fraction of our own research. in the interest of patient confidentiality, the clinical data used in the analysis have been fabricated. we present a brief tour of some of the functions available, starting with a discussion on the cprd data structure and how records must be formatted. a glossary of terms has been provided (table ) to assist the reader.

rdrugtrajectory package and data structures
rdrugtrajectory availability and installation

rdrugtrajectory is free to download from the github repository https://github.com/acnash/rdrugtrajectory and is released under an mit license. fabricated cprd clinical and cprd prescription records, in addition to age, gender and index of multiple deprivation scores, are included for test and tutorial purposes. before installing the package, the following r dependencies are required: plyr, dplyr, foreach, doparallel, data.table, parallel, splus r, rlist, reda, ggplot , ggalluvial, stats, utils and useful. the latest rdrugtrajectory binary is installed using:

install.packages("path/to/tar/file", source = TRUE, repos = NULL)

rdrugtrajectory was developed and tested on r version . . . please consult the github page for release notes, the latest version and up to date installation instructions.

cprd product description

several rdrugtrajectory functions use the cprd product.txt file for assigning a text description to a prescription prodcode. the product.txt file (and medical.txt for medcode descriptions) is available in the cprd data dictionary windows software. it is important that the file remains in plain text, with columns tab-delimited. the files can be simplified by removing all non-essential products. finally, all eleven columns that make up the product.txt file must be available, with the first column containing all prodcodes and the fourth column containing the product description. a simplified product.txt file, presented below, can be downloaded from the github page.

table : table of frequently used terms.
term | description
rdrugtrajectory | an r package designed for the management of cprd prescription data.
clinical | the clinicalnnn.txt dataset presented in a rdrugtrajectory dataframe.
referral | the referralnnn.txt dataset presented in a rdrugtrajectory dataframe.
therapy | the therapynnn.txt dataset presented in a rdrugtrajectory dataframe.
additionalnnn.txt | the cprd dataset of additional clinical information, for example, patient smoking status and alcohol consumption. data can be retrieved using cprdlookups.r.
medcode | a cprd identifier that denotes medical conditions, diagnoses and complaints made by a patient. medcodes are recorded in the clinicalnnn.txt and referralnnn.txt files.
prodcode | a cprd identifier that denotes treatment products, including drugs, foods, and medical apparatus. prodcodes are recorded in the therapynnn.txt files.
patid | a unique cprd patient identifier. used to link datasets.
event | any prodcode or medcode in a patient’s ehr.
eventdate | the date of an event recorded by a general practitioner. present in all three datasets and the corresponding rdrugtrajectory dataframes.
imd | index of multiple deprivation score - a uk government socioeconomic measurement based on the postcode of the clinic or a patient’s registered address.
prescription | a general term for any prodcode prescribed for treatment.
medical history | indicates a combination of one or more sets of cprd data, for example, the collection of all clinical and therapy ehr for patients with a medcode for migraine.
product.txt | a plain text file that contains all prodcodes with a description and comes bundled with the cprd data dictionary. the file is used to link a prodcode with a description.
> library(rdrugtrajectory)
> productdf <- read.csv("../rdrugtrajectory_data/product.txt",
+ sep="\t",
+ header=FALSE)
> head(productdf)
v v v v v
atenolol mg tablets atenolol
atenolol mg tablets atenolol
atenolol mg tablets atenolol
amitriptyline mg tablets amitriptyline hydrochloride
lisinopril mg tablets lisinopril
lisinopril mg tablets lisinopril
v v v v
mg tablet oral
mg tablet oral
mg tablet oral
mg tablet oral
/ / mg tablet oral
mg tablet oral
v
beta-adrenoceptor blocking drugs
beta-adrenoceptor blocking drugs
beta-adrenoceptor blocking drugs
tricyclic and related antidepressant drugs/neuropathic pain/prophylaxis of migraine
angiotensin-converting enzyme inhibitors
angiotensin-converting enzyme inhibitors
v v
feb-
feb-
feb-
feb-
feb-
feb-

rdrugtrajectory package structure

rdrugtrajectory contains three r files: ( ) all functions related to data curating and searching reside within cprddrugtrajectory.r; ( ) analysis tools and timeline construction reside within cprddrugtrajectorystats.r; and ( ) all utilities, including input/output operations, reside within cprddrugtrajectoryutils.r. the package contains several fabricated cprd datasets: testclinicaldf, testtherapydf, agegenderdf, imddf, and druglistdf. a description of each, along with information on data types and structures, is given below.

the cprd ehr data structure

the structure of cprd gold data may depend on whether the cprd license holder performs intermediate data management steps before releasing data to the user.
however, cprd gold data typically follows the cprd gold specification https://cprdcw.cprd.com/_docs/cprd_gold_full_data_specification_v . .pdf. currently, rdrugtrajectory supports ehr data from the flat files clinicalnnn.txt, referralnnn.txt, and therapynnn.txt. the additional clinical details files (additionalnnn.txt) are currently supported using our released r script cprdlookups.r https://github.com/acnash/cprd_additional_clinical. patients are assigned a unique numerical patid value. the operations performed by rdrugtrajectory require the patid to identify patients and subset patient groups. we recommend that patid, medcode and prodcode are kept as character data throughout any preliminary data curating steps. medical events are recorded as codes and stored in the clinicalnnn.txt and referralnnn.txt files under the column header medcode. prescription events, such as drug prescriptions, are also recorded as codes and stored in the therapynnn.txt file under the column header prodcode, and the sequences of repeat prescriptions are under the issueseq column header. dates associated with medical and prescription events, recorded by the general practitioner, are stored under the column header eventdate.
essential data types and data structures

rdrugtrajectory can operate over cprd gold ehr clinical, referral and prescription data provided each dataset is presented as a separate r dataframe or combined into a rdrugtrajectory medical history dataframe. the construction of clinical, referral and prescription dataframes requires, as a minimum, a patid and eventdate column, and either medcode or prodcode (for therapy data, issueseq is also necessary), presented in that order. every record of medcode or prodcode must be accompanied by an eventdate entry (encoded as a date class of the form yyyy-mm-dd). patients can have duplicate events within the same dataset and between datasets. medical and prescription codes can be retrieved from the corresponding medical.txt and product.txt files, which come bundled with the cprd data dictionary windows application. rdrugtrajectory comes packaged with fabricated ehr data in the structure of:

> library(rdrugtrajectory)
> #fabricated clinical data (referral data follows the same format)
> names(testclinicaldf)
[ ] "patid" "eventdate" "medcode" "consid"
> #fabricated prescription data
> names(testtherapydf)
[ ] "patid" "eventdate" "prodcode" "consid" "issueseq"

users can check if the structure of an ehr dataframe meets the requirements for this package by calling checkcprdrecord; additional columns such as the consultation identification number (consid) are not considered. in the following instance, a prescription dataset with the required columns and the optional consultation identification number is presented.

> library(rdrugtrajectory)
> #check the structure of testtherapy, specify that it is therapy data
> checkcprdrecord(df=testtherapydf, datatype="therapy")
[ ] "the data.frame is appropriately formatted. returning true."
[ ] true
> #display the rdrugtrajectory ehr therapy dataframe
> str(testtherapydf, strict.width="wrap")
'data.frame': obs. of variables:
$ patid : int ...
$ eventdate: date, format: " - - " " - - " ...
$ prodcode : int ...
$ consid : int ...
$ issueseq : int ...

users can combine any number of patient and ehr data with the rdrugtrajectory ehr dataframes to act as covariates and stratifying variables; typically this can be done using the r cbind operation. for example, bmi and smoking status, both of which can be retrieved from the additionalnnn.txt dataset files using cprdlookups.r, can be linked by searching for and binding with the record patid values.

the rdrugtrajectory package contains several utility functions to retrieve cprd data, including patient year of birth, gender (male or female) and either patient-level or clinical-level index of multiple deprivation score (imd). the patient age can be determined by adding to the value in the yob column in the patient cprd ehr dataset and then subtracting that value (birth year) from the year of the cprd database release. this data requires preliminary treatment before presenting to the rdrugtrajectory package. patient age, gender and imd score must be presented in a dataframe with the linked patient column patid, along with the columns age, gender, and score. provided the patid column is preserved, patient characteristics can be presented in separate dataframes, for example:

> library(rdrugtrajectory)
> #patient age and gender as one dataframe
> str(agegenderdf, strict.width="wrap")
'data.frame': obs. of variables:
$ patid : int ...
$ yob : num ...
$ gender: int ...
> #clinic-level imd score as one dataframe
> str(imddf, strict.width="wrap")
'data.frame': obs. of variables:
$ patid : int ...
$ pracid: int ...
$ score : int ...

the patid patient identifier is fundamental in every operation performed by rdrugtrajectory. the examples presented here and those in the reference manual rely on searching and subsetting ehr data using a list or vector of patient identifiers. the function getuniquepatidlist will retrieve an r list of patient identification numbers from any dataframe with a patid column.

the aforementioned rdrugtrajectory ehr dataframes, clinical, referral and therapy, can be combined into a single dataframe. we refer to this dataset instance as the patient’s medical history, and it can be constructed using constructmedicalhistory. this dataframe expects events to be in chronological order, and will introduce two new columns, code and codetype, to denote each of the combined events. the code (medcode and/or prodcode) can be distinguished by a codetype value of c (clinical events), r (referral events), and t (prescription events). events are returned in chronological order using the eventdate data. the following code demonstrates how to retrieve a list of patient identifiers from a prescription dataframe and from a medical history dataframe, followed by how to subset using base r operations and, finally, the medical history dataframe structure.

> library(rdrugtrajectory)
> #retrieve patids from therapy data.
> idlist <- getuniquepatidlist(testclinicaldf)
> medhistorydf <- constructmedicalhistory(testclinicaldf, NULL, testtherapydf)
[ ] "using clinical data."
[ ] "using therapy data."
[ ] "building with clinical and therapy data."
> #retrieve patid from medical history.
> medhistoryidlist <- getuniquepatidlist(medhistorydf)
> numofpatients <- length(medhistoryidlist)
> #subset using the first patients.
> smallmedhistorydf <- subset(medhistorydf,
+ medhistorydf$patid %in% medhistoryidlist[ : ])
> #separate out the clinical records of those patients.
> smallclinicalonlydf <- subset(smallmedhistorydf,
+ smallmedhistorydf$codetype == "c")
> #separate out the therapy records of those patients.
> smalltherapyonlydf <- subset(smallmedhistorydf,
+ smallmedhistorydf$codetype == "t")
> #subset only those patient records beyond st jan .
> latermedhistorydf <- subset(medhistorydf,
+ medhistorydf$eventdate > as.Date(" - - "))
> #medical history dataframe structure
> str(medhistorydf, strict.width="wrap")
'data.frame': obs. of variables:
$ patid : int ...
$ eventdate: Date, format: " - - " " - - " ...
$ code : int ...
$ codetype : chr "c" "c" "c" "t" ...

The patid data can also be used to retrieve patient characteristics, for example, the gender of the patient using getgenderofpatients:

> library(rdrugtrajectory)
> idlist <- getuniquepatidlist(testtherapydf)
> #only use half of the cohort.
> idlist <- idlist[ :(length(idlist)/ )]
> #get gender data by specific gender.
> malecode <-
> femalecode <-
> malepatientsdf <- getgenderofpatients(idlist, agegenderdf, malecode)
> femalepatientsdf <- getgenderofpatients(idlist, agegenderdf, femalecode)
> #get all gender data
> allpatientsdf <- getgenderofpatients(getuniquepatidlist(testtherapydf),
+ agegenderdf)
> #structure of the patient gender data.
> str(allpatientsdf, strict.width="wrap")
'data.frame': obs. of variables:
$ patid : int ...
$ gender: int ...

IMD data can be retrieved by combining the getuniquepatidlist and getimdofpatients functions:

> library(rdrugtrajectory)
> idlist <- getuniquepatidlist(testtherapydf)
> #get patients with an imd score of or
> onepatientsdf <- getimdofpatients(idlist, imddf, )
> twopatientsdf <- getimdofpatients(idlist, imddf, )
> #get all imd scores for all patients in testtherapydf
> allpatientsdf <- getimdofpatients(getuniquepatidlist(testtherapydf), imddf)
> #structure of the patient imd data.
> str(allpatientsdf, strict.width="wrap")
'data.frame': obs. of variables:
$ patid: int ...
$ score: int ...

The final example of EHR dataframe manipulation presented here demonstrates how to retrieve all prescription records for patients prescribed a specific treatment. For example, such an operation can be used to retrieve all prescription records for any patient prescribed amitriptyline. In addition, it is also possible to return only the prescription records matching specific treatments. Importantly, prescription prodcodes can be grouped into lists and used to collect those patients with at least one record that matches an element of that list. This approach is useful if the dose is not relevant to the study or the prescription is dispensed under multiple product names.
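The grouping of prodcodes described above can be illustrated in base R. This is a minimal sketch under stated assumptions, not the package's implementation: the toy dataframe, the code values and the variable names (`drug_codes`, `matched_ids`) are hypothetical, and only the column names patid and prodcode are taken from the text.

```r
# Toy therapy records; column names follow the text, values are invented.
therapy <- data.frame(
  patid    = c(1, 1, 2, 2, 3),
  prodcode = c(10, 99, 11, 11, 99)
)

# Hypothetical group of product codes that denote the same drug
# (for example, different doses or product names).
drug_codes <- c(10, 11)

# Patients with at least one record matching any code in the group.
matched_ids <- unique(therapy$patid[therapy$prodcode %in% drug_codes])

# Option 1: keep every record belonging to a matched patient.
all_records <- therapy[therapy$patid %in% matched_ids, ]

# Option 2: keep only the records that themselves match the group.
matching_records <- therapy[therapy$prodcode %in% drug_codes, ]
```

The two options correspond to the distinction drawn in the text between returning all prescription records of a matched patient and returning only the matching records.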
> library(rdrugtrajectory)
> #it is easy to retrieve a list of all unique prodcodes in the cohort.
> prodcodesvector <- unique(testtherapydf$prodcode)
> reducedprodcodesvector <- prodcodesvector[ : ]
> #all records are maintained for those patients with a matching prodcode.
> therapyofinterestdf <- getpatientswithprodcode(testtherapydf,
+ reducedprodcodesvector)
> #only those records that match are retained.
> reducedtherapyofinterestdf <- getpatientswithprodcode(testtherapydf,
+ reducedprodcodesvector,
+ removeexcessdrugs=TRUE)

. EHR drug prescription results and discussion

Having briefly demonstrated some basic operations for retrieving patient records by matching EHR dataframes against sets of patid values, we move on to showcase several operations available to the user. We begin by presenting examples of cohort prescription summary statistics, followed by methods of curating datasets and stratifying by patient groups. We then present examples of how to search for patients prescribed first-line treatments, followed by presenting some of these patient groups as sequences of prescriptions. Finally, we demonstrate several examples of building time-lines. For further examples, please see the GitHub page and reference manual.

. . Cohort summary statistics

geteventdatesummarybypatient

rdrugtrajectory can return summary statistics on patient-level and cohort-level prescription data with geteventdatesummarybypatient and getpopulationdrugsummary, respectively.
For example, the prescription history of a single patient (selected via getuniquepatidlist and [] dataframe subsetting) returns the patient patid, the number of prescription events, the median number of days between events, the fewest days between events, the most days between events (maxtime and longestduration are the same), and the record duration (number of days between the first and last prescription event on record):

> library(rdrugtrajectory)
> idlist <- getuniquepatidlist(testtherapydf)
> resultlist <- geteventdatesummarybypatient(
+ testtherapydf[testtherapydf$patid==idlist[[ ]],])
> str(resultlist, strict.width="wrap")
List of
$ timeserieslist: num [ : ]
$ summarydf :'data.frame': obs. of variables:
..$ patid : int
..$ numberofevents : int
..$ mediantime : num
..$ mintime : num
..$ maxtime : num
..$ longestduration: num
..$ recordduration : int
- attr(*, "class")= chr "eventdatesummaryobj"

getpopulationdrugsummary

This approach can be extended across the cohort of patients with getpopulationdrugsummary. The returned populationeventdatesummary S3 object is a list of three elements. The first element is the summarydf dataframe derived from calling geteventdatesummarybypatient per patient, with the set of statistics retrievable through the accompanying patid. The second element is the timeserieslist, which holds a vector per patient of the number of days between consecutive prescription events.
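The gap vectors and summary statistics described above can be approximated in base R. The following is a minimal sketch rather than the package's implementation: the toy data and the names `gaps_by_patient` and `summarise_gaps` are hypothetical, and only the column and statistic names are taken from the text.

```r
# Toy prescription records; the values are invented for illustration.
therapy <- data.frame(
  patid     = c(1, 1, 1, 1, 2, 2),
  eventdate = as.Date(c("2001-01-01", "2001-01-31", "2001-03-02",
                        "2001-03-12", "2002-05-01", "2002-05-08"))
)

# One vector per patient: days between consecutive prescription events.
# The list elements are named after the patid values.
gaps_by_patient <- lapply(
  split(therapy$eventdate, therapy$patid),
  function(d) as.numeric(diff(sort(d)))
)

# Per-patient summary in the spirit of the statistics listed above.
summarise_gaps <- function(dates) {
  gaps <- as.numeric(diff(sort(dates)))
  data.frame(
    numberofevents = length(dates),
    mediantime     = median(gaps),
    mintime        = min(gaps),
    maxtime        = max(gaps),
    recordduration = as.numeric(max(dates) - min(dates))
  )
}

patient_one <- summarise_gaps(therapy$eventdate[therapy$patid == 1])
```

Because `split` names each list element after the patient identifier, a single patient's gaps can be read with `gaps_by_patient[["1"]]`.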
The per-patient vectors in timeserieslist can be accessed using the patid element name:

> library(rdrugtrajectory)
> resultlist <- getpopulationdrugsummary(df = testtherapydf,
+ prodcodesvector = NULL)
> str(resultlist, strict.width="wrap", list.len = )
List of
$ summarydf :'data.frame': obs. of variables:
..$ patid : int [ : ] ...
..$ numberofevents : int [ : ] ...
..$ mediantime : num [ : ] . ...
..$ mintime : num [ : ] ...
..$ maxtime : num [ : ] ...
.. [list output truncated]
$ timeserieslist:List of
..$ : num [ : ]
..$ : num [ : ] ...
..$ : num
..$ : num
..$ : num [ : ] ...
.. [list output truncated]
- attr(*, "class")= chr "populationeventdatesummary"
> #get all patids for patients younger than .
> ageidlist <- getuniquepatidlist(agegenderdf[agegenderdf$yob < ,])
> timeserieslist <- resultlist[[ ]]
> #get all patids of available data.
> recordpatids <- names(timeserieslist)
> #get time data for the intersect of those patids of patients < and the patids
> #of available data.
> subtimelist <- timeserieslist[intersect(ageidlist, recordpatids)]
> str(subtimelist, strict.width="wrap", list.len = )
List of
$ : num
$ : num
$ : num
$ : num
$ : num
[list output truncated]

. . Curating drug prescription records

There is no direct link between a prescription event and a medcode in the CPRD data. The relationship between the two can be inferred from the event dates of the prescription and clinical events, in addition to information provided by the consultation ID and the prescription issue number.
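Inferring a link from event dates, as described above, amounts to joining the therapy and clinical records on patient and date. The base R sketch below is illustrative only: the data are invented, the variable names (`any_time_ids`, `same_day`) are hypothetical, and it does not use the consultation ID or issue number.

```r
# Toy clinical and therapy records; values are invented for illustration.
clinical <- data.frame(
  patid     = c(1, 1, 2, 3),
  eventdate = as.Date(c("2005-03-01", "2005-06-10",
                        "2005-03-01", "2006-01-15")),
  medcode   = c(500, 501, 500, 500)
)
therapy <- data.frame(
  patid     = c(1, 2, 3),
  eventdate = as.Date(c("2005-03-01", "2005-04-01", "2006-01-15")),
  prodcode  = c(10, 10, 10)
)

# Loosest link: the patient has both kinds of record at any point.
any_time_ids <- intersect(therapy$patid, clinical$patid)

# Stricter link: a prescription and a clinical event share the same date.
same_day <- merge(therapy, clinical, by = c("patid", "eventdate"))
same_day_ids <- unique(same_day$patid)
```

In this toy cohort three patients have both record types, but only two have a prescription on the same date as a clinical event.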
matchdrugwithdisease

rdrugtrajectory provides several methods for curating prescription datasets with the aim of establishing a relationship between prescription and clinical events. The matchdrugwithdisease function returns a subset of all prescription events with an established relationship between therapy and clinical event. The degree to which patients are included in the search is controlled with a function argument. There are three scenarios: all patients with a record of a specific prescription event and a specific clinical event, at any point; all patients with a record of a specific prescription event on the same date as a specific clinical event; and all patients with a record of a specific prescription event on the same date as a specific clinical event and free of additional clinical events on that day. One would expect fewer patients as the stringency of the search criteria is increased:

> library(rdrugtrajectory)
> prodcodes <- unique(testtherapydf$prodcode)
> amitriptylinecodes <- prodcodes[ : ]
> propranololcodes <- prodcodes[ : ]
> medcodelist <- unique(testclinicaldf$medcode)
> headachecodes <- medcodelist[ : ]
> amitriptylineresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+ therapydf = testtherapydf,
+ medcodelist = headachecodes,
+ drugcodelist = amitriptylinecodes,
+ severity = )
> amitriptylineresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+ therapydf = testtherapydf,
+ medcodelist = headachecodes,
+ drugcodelist = amitriptylinecodes,
+ severity = )
> amitriptylineresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+ therapydf = testtherapydf,
+ medcodelist = headachecodes,
+ drugcodelist = amitriptylinecodes,
+ severity = )
> propranololresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+ therapydf = testtherapydf,
+ medcodelist = headachecodes,
+ drugcodelist = propranololcodes,
+ severity = )
> propranololresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+ therapydf = testtherapydf,
+ medcodelist = headachecodes,
+ drugcodelist = propranololcodes,
+ severity = )
> propranololresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+ therapydf = testtherapydf,
+ medcodelist = headachecodes,
+ drugcodelist = propranololcodes,
+ severity = )

getgenderofpatients

The example presented demonstrates how to identify patients prescribed amitriptyline and patients prescribed propranolol (there is patient overlap, easily controlled for by subsetting) whilst controlling for clinical overlap with or without consideration for off-topic clinical events.
With the identified patients, we can, for example, stratify by gender:

> library(rdrugtrajectory)
> library(ggplot2)
> ami gender <- getgenderofpatients(amitriptylineresult , agegenderdf)
> ami gender <- getgenderofpatients(amitriptylineresult , agegenderdf)
> ami gender <- getgenderofpatients(amitriptylineresult , agegenderdf)
> prop gender <- getgenderofpatients(propranololresult , agegenderdf)
> prop gender <- getgenderofpatients(propranololresult , agegenderdf)
> prop gender <- getgenderofpatients(propranololresult , agegenderdf)
> amidf <- data.frame(freq=c(nrow(ami gender[ami gender$gender== , ]),
+ nrow(ami gender[ami gender$gender== , ]),
+ nrow(ami gender[ami gender$gender== , ]),
+ nrow(ami gender[ami gender$gender== , ]),
+ nrow(ami gender[ami gender$gender== , ]),
+ nrow(ami gender[ami gender$gender== , ])
+ ),
+ search=c("prescribed","with headache","no comorbidities",
+ "prescribed","with headache","no comorbidities"),
+ drug="amitriptyline",
+ gender=c("male","male","male",
+ "female","female","female")
+ )
> propdf <- data.frame(freq=c(nrow(prop gender[prop gender$gender== , ]),
+ nrow(prop gender[prop gender$gender== , ]),
+ nrow(prop gender[prop gender$gender== , ]),
+ nrow(prop gender[prop gender$gender== , ]),
+ nrow(prop gender[prop gender$gender== , ]),
+ nrow(prop gender[prop gender$gender== , ])
+ ),
+ search=c("at any time","with clinical","clinical & no comorbidities",
+ "at any time","with clinical","clinical & no comorbidities"),
+ drug="propranolol",
+ gender=c("male","male","male",
+ "female","female","female")
+ )
> drugprescriptiondf <- rbind(amidf, propdf)
> ggprescriptionami <- ggplot(drugprescriptiondf[
+ drugprescriptiondf$drug=="amitriptyline",],
+ aes(x=search, y=freq, fill=gender)) +
+ geom_bar(stat="identity", position=position_dodge()) +
+ theme_bw() + xlab("search criteria (severity)") + ylab("patient count") +
+ theme(axis.text.x = element_text(angle= ,hjust= )) +
+ ggtitle("amitriptyline")
> ggprescriptionprop <- ggplot(drugprescriptiondf[
+ drugprescriptiondf$drug=="propranolol",],
+ aes(x=search, y=freq, fill=gender)) +
+ geom_bar(stat="identity", position=position_dodge()) +
+ theme_bw() + xlab("search criteria (severity)") + ylab("patient count") +
+ theme(axis.text.x = element_text(angle= ,hjust= )) +
+ ggtitle("propranolol")

Filtering through prescription events can also be controlled by a date range. For example, if one were calculating the number of patients prescribed amitriptyline per year from to and matched to a headache event, one can apply a date range:

> library(rdrugtrajectory)
> library(ggplot2)
> prodcodes <- unique(testtherapydf$prodcode)
> amitriptylinecodes <- prodcodes[ : ]
> #clinical events of interest are headaches.
> medcodelist <- unique(testclinicaldf$medcode)
> #medcodes can be refined further.
> headachecodes <- medcodelist[ : ]
> #dataframes defined for binned dates are constructed by providing all the
> #patients to consider and the binned start and stop date.
> date df <- data.frame(patid=unlist(getuniquepatidlist(testtherapydf)),
+ start=as.Date(as.character(" - - ")),
+ stop=as.Date(as.character(" - - ")))

[Figure panels: (a) amitriptyline, (b) propranolol; x-axis: search criteria (severity); y-axis: patient count; fill: gender (female, male).]
Figure : The number of patients prescribed (a) amitriptyline or (b) propranolol. The criteria to match against clinical data are indicated: at any time, with a clinical record, and with a clinical record clear of off-topic clinical events.

> date df <- data.frame(patid=unlist(getuniquepatidlist(testtherapydf)),
+ start=as.Date(as.character(" - - ")),
+ stop=as.Date(as.character(" - - ")))
> date df <- data.frame(patid=unlist(getuniquepatidlist(testtherapydf)),
+ start=as.Date(as.character(" - - ")),
+ stop=as.Date(as.character(" - - ")))
> date df <- data.frame(patid=unlist(getuniquepatidlist(testtherapydf)),
+ start=as.Date(as.character(" - - ")),
+ stop=as.Date(as.character(" - - ")))
> date df <- data.frame(patid=unlist(getuniquepatidlist(testtherapydf)),
+ start=as.Date(as.character(" - - ")),
+ stop=as.Date(as.character(" - - ")))
> #retrieve prescription frequencies per binned range
> amitresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+ therapydf = testtherapydf,
+ medcodelist = headachecodes,
+ drugcodelist = amitriptylinecodes,
+ severity = ,
+ datedf = date df)
> amitresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+ therapydf = testtherapydf,
+ medcodelist = headachecodes,
+ drugcodelist = amitriptylinecodes,
+ severity = ,
+ datedf = date df)
> amitresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+ therapydf = testtherapydf,
+ medcodelist = headachecodes,
+ drugcodelist = amitriptylinecodes,
+ severity = ,
+ datedf = date df)
> amitresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+ therapydf = testtherapydf,
+ medcodelist = headachecodes,
+ drugcodelist = amitriptylinecodes,
+ severity = ,
+ datedf = date df)
> amitresult <- matchdrugwithdisease(clinicaldf = testclinicaldf,
+ therapydf = testtherapydf,
+ medcodelist = headachecodes,
+ drugcodelist = amitriptylinecodes,
+ severity = ,
+ datedf = date df)
> #the number of patids returned by matchdrugwithdisease is equal to the number
> #of patients with a drug - disease match per year
> datadf <- data.frame(year=c(" "," "," "," "," "),
+ count=c(length(amitresult ),length(amitresult ),
+ length(amitresult ),length(amitresult ),
+ length(amitresult )))
> ggprescriptionyear <- ggplot(datadf, aes(x=year, y=count)) +
+ geom_bar(stat = "identity") + theme_bw()

getpatientswithfirstdrugwithdisease

Unlike matchdrugwithdisease, which retrieves patients with a prescription event matching clinical criteria at any time within a CPRD EHR record, getpatientswithfirstdrugwithdisease identifies patients with a first prescription event that matches a
desired clinical event.

[Figure: bar chart; x-axis: year; y-axis: count.]
Figure : The number of patients prescribed amitriptyline from the start of the year to the end of , stratified in year intervals.

Please note that care must be taken when searching for medication with off-label uses. For example, beta-blockers are frequently prescribed to treat hypertension and arrhythmia; however, the beta-blocker propranolol is also prescribed to treat migraine. Without in-depth analysis of the patient history, patients prescribed propranolol with records for hypertension or arrhythmia in addition to migraine on an eventdate matching the first propranolol prescription could result in a misleading disease-drug association. In cases where a health care professional suggests a change in the patient’s lifestyle choices, that patient may have several clinical events free from prescriptions before the first prescription of interest is prescribed. Using basic subsetting one can calculate the number of clinical events before the patient’s first prescription intervention (figure a). Furthermore, we can stratify patients into subgroups (figure b):

> library(rdrugtrajectory)
> library(ggplot2)
> #a vector of prescriptions of interest.
> druglist <- unique(testtherapydf$prodcode)
> sampledrugs <- druglist[ : ]
> #a vector of clinical events to match prescriptions against.
> medcodes <- unique(testclinicaldf$medcode)
> samplemedcodes <- medcodes[ : ]
> #returns the subset of the first prescription event prescribed on the same
> #eventdate as those clinical events of interest
> firstdf <- getpatientswithfirstdrugwithdisease(clinicaldf = testclinicaldf,
+ therapydf = testtherapydf,
+ medcodesvector = samplemedcodes,
+ drugcodesvector = sampledrugs)
> #ensure the only clinical data are for those with an assumed first-drug-disease
> firstclinicaldf <- subset(testclinicaldf,
+ testclinicaldf$patid %in% getuniquepatidlist(firstdf))
> #only keep the diseases of interest
> firstclinicaldf <- subset(firstclinicaldf,
+ firstclinicaldf$medcode %in% samplemedcodes)
> #only keep the prescriptions of interest
> firstdf <- subset(firstdf, firstdf$prodcode %in% sampledrugs)
> idlist <- getuniquepatidlist(firstclinicaldf)
> beforeresultdf <- data.frame(patid=unlist(idlist), freq= )
> for(id in idlist) {
+ #retrieve the clinical/therapy data for each patient, one by one.
+ indclinicaldf <- subset(firstclinicaldf, firstclinicaldf$patid == id)
+ indtherapydf <- subset(firstdf, firstdf$patid == id)
+ #get the first event date on record; this will match a clinical date.
+ firsteventdate <- indtherapydf$eventdate[ ]
+ clinicalbeforetherapydf <- subset(indclinicaldf,
+ indclinicaldf$eventdate < firsteventdate)
+ #number of clinical complaints before first prescription.
+ ncomplaints <- nrow(clinicalbeforetherapydf)
+ beforeresultdf[beforeresultdf$patid==id,]$freq <- ncomplaints
+ }
> ggbefore <- ggplot(beforeresultdf, aes(x=freq)) +
+ geom_histogram(binwidth= , color="black", fill="white") +
+ ylab("patients") + xlab("clinical events before prescription") +
+ theme_bw()
> #note: not every patient will have a clinical imd score.
> imdidsdf <- getimdofpatients(idlist = idlist,
+ imddf = imddf)
> #only work with those with an imd score.
> imdresultsdf <- subset(beforeresultdf,
+ beforeresultdf$patid %in% getuniquepatidlist(imdidsdf))
> imdresultsdf <- imdresultsdf[order(imdresultsdf$patid),]
> imdidsdf <- imdidsdf[order(imdidsdf$patid),]
> imdresultsdf <- cbind(imdresultsdf, imd_score=as.factor(imdidsdf$score))
> ggbeforeimd <- ggplot(imdresultsdf,
+ aes(x=freq, fill=imd_score)) +
+ geom_histogram(binwidth= ) + theme_bw() +
+ ylab("patients") + xlab("clinical events before prescription")

[Figure panels: (a) whole cohort, (b) by imd_score; x-axis: clinical events before prescription; y-axis: patients.]
Figure : The number of clinical events before the first treatment across the whole cohort (a), and by IMD score (b).

getmultiprescriptionsamedaypatients

The function getmultiprescriptionsamedaypatients returns all prescription events for those patients prescribed more than two prescriptions on the same date. All events of those patients without a prescription prodcode event can be removed.
Combining getmultiprescriptionsamedaypatients with getpatientswithfirstdrugwithdisease or matchdrugwithdisease is useful for filtering patients for specific prescription patterns. For example, to retrieve all patient prescription records if specific prescriptions are (a) never recorded together on the same date and (b) used as a first-line treatment for a given complaint:

> library(rdrugtrajectory)
> prodcodesvector <- unique(testtherapydf$prodcode)[ : ]
> #ensure only patients with specific prescriptions are returned providing a
> #patient is prescribed those drugs on different dates, never on the same date.
> uniquetherapydf <- getmultiprescriptionsamedaypatients(df = testtherapydf,
+ prodcodesvector = prodcodesvector,
+ removepatientswithoutdrugs = TRUE)
> #ensure that the patients (patid) in the therapy and clinical dataframes
> #are the same. subsetting might not be enough.
> reducedclinicaldf <- subset(testclinicaldf,
+ testclinicaldf$patid %in% getuniquepatidlist(uniquetherapydf))
> #specific medcodes have not been provided. all medcodes in the clinical
> #dataframe are considered. this is possible if either one is not interested
> #in the nature of the clinical complaint or the clinical dataframe has been
> #adjusted to only include clinical complaints of interest.
> firstdf <- getpatientswithfirstdrugwithdisease(clinicaldf = reducedclinicaldf,
+ therapydf = uniquetherapydf,
+ drugcodesvector = sampledrugs)

In the above example, patients with more than one prescription on the same date or without a prescription at all (from the set of desired prescription prodcodes) were removed from the cohort. This reduced the number of patients from patients to . Next, only those patients with a first-line treatment (first prescription event on the same date as a clinical event) were kept, reducing the sample size to patients.

removepatientsbyduration

Longitudinal EHR cohort studies often require careful time-related consideration. Currently, rdrugtrajectory presents two functions that identify prescription records of patients that match two time constraints. The first, removepatientsbyduration, removes all patients with prescription events that are no more than n years between consecutive events, or removes patients if the duration between the first and last prescription event on record is less than n years.

> library(rdrugtrajectory)
> df <- removepatientsbyduration(minobsyr = ,
+ minbreakyr = ,
+ therapydf = testtherapydf)

getburninpatients

The second time-related function, getburninpatients, identifies all patient prescription records with at least n days free from prescription events before a specific prescription event. This is useful if one requires a period of time free from prescription intervention before a given prescription event:

> library(rdrugtrajectory)
> drugofinterestvector <- c( , , , , , )
> patientlist <- getburninpatients(df = testtherapydf,
+ startcodesvector = drugofinterestvector,
+ perioddaysbefore = )
> burnintherapydf <- subset(testtherapydf,
+ testtherapydf$patid %in% patientlist)

In the above example, from a cohort of patients, patients had a period of up to days free from prescription events before the first prescription prodcode specified via the startcodesvector argument.
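The burn-in idea can be sketched in base R as a check on the gap before a patient's first prescription of interest. This is a hedged illustration and not the logic of getburninpatients itself: the helper `has_burn_in`, the toy data and the code values are hypothetical, and the sketch assumes that a patient needs at least one earlier prescription event against which the gap is measured.

```r
# Toy prescription records; values are invented for illustration.
therapy <- data.frame(
  patid     = c(1, 1, 1, 2, 2, 3),
  eventdate = as.Date(c("2001-01-01", "2001-02-01", "2002-06-01",
                        "2003-01-01", "2003-01-20", "2004-05-05")),
  prodcode  = c(7, 7, 42, 7, 42, 42)
)
start_codes <- c(42)   # hypothetical prescriptions of interest
period_days <- 90      # hypothetical minimum prescription-free period

# TRUE if the first prescription of interest is preceded by an earlier
# event and by at least `period_days` days without any prescription.
has_burn_in <- function(records, start_codes, period_days) {
  records <- records[order(records$eventdate), ]
  first_start <- which(records$prodcode %in% start_codes)[1]
  # No index event, or no earlier event to measure the gap against.
  if (is.na(first_start) || first_start == 1) return(FALSE)
  gap <- as.numeric(records$eventdate[first_start] -
                    records$eventdate[first_start - 1])
  gap >= period_days
}

flags <- vapply(split(therapy, therapy$patid), has_burn_in, logical(1),
                start_codes = start_codes, period_days = period_days)
burn_in_ids <- names(flags)[flags]
```

Here only patient 1 qualifies: patient 2 has a gap shorter than the required period, and patient 3 has no earlier event on record.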
The functionality relies on the patient having prescription events before the burn-in period (required to define whether the patient had a CPRD record early enough before the burn-in period began). For example, this patient had over three years of prescription events before the prescription of interest (from - - to - - ) with over days free from exposure before the prescription event of interest, prodcode :

> head(burnintherapydf[burnintherapydf$patid == ,], n= )
[ ] patid eventdate prodcode consid issueseq
< rows> (or -length row.names)

. . First drug prescriptions

getfirstdrugprescription

A patient’s first prescription event on CPRD record can be identified by supplying getfirstdrugprescription with a list of prescription prodcodes. The function returns firstdrugobject, an R S3 object of type list. Only the first prescription event to match any one of the prescription prodcodes provided is identified. The first element of firstdrugobject contains a named list of patid vectors. Each vector contains the patids of all those patients that share the same first prescription prodcode. The list element is named after the corresponding prescription prodcode. The second element in firstdrugobject, like the first, is a list of date vectors, each named after the corresponding prescription prodcode. Each date vector contains the eventdate of the prescription event for the patient identified by the patid in the identical position of the preceding list.
The third list element contains a table of prescription frequencies for each first prescription prodcode on record. The prodcode is accompanied by a product description, provided that a file of CPRD prescription products has been supplied. Below we demonstrate how to retrieve information on first-line treatment:

> library(rdrugtrajectory)
> library(ggplot2)
> #an adjusted data dictionary file.
> filelocation <- "product.txt"
> #without supplying a vector of product files, all prodcodes in the therapy
> #dataset are considered.
> resultfdo <- getfirstdrugprescription(df = testtherapydf,
+ idlist = NULL,
+ prodcodesvector = NULL,
+ descriptionfile = filelocation)
> patidlist <- resultfdo[[ ]]
> eventdatelist <- resultfdo[[ ]]
> drugfrequencydf <- resultfdo[[ ]]
> drugfrequencydf <- drugfrequencydf[order(drugfrequencydf$frequency,
+ decreasing = TRUE), ]
> ggfreq <- ggplot(data=drugfrequencydf, aes(x=description, y=frequency)) +
+ geom_bar(stat="identity") + theme_bw() +
+ theme(axis.text.x = element_text(angle= , hjust= )) +
+ xlab("drug product description")
> #the structure of the firstdrugobject.
> str(resultfdo, strict.width="wrap", list.len = )

[Figure: bar chart; x-axis: drug product description (amitriptyline, atenolol, candesartan, lisinopril, propranolol, topiramate and venlafaxine products); y-axis: frequency.]
Figure : The frequency of first-line treatment prescription.

getagegroupbyevents

In the next example we explore stratifying first-line prescription events by patient characteristics, such as age, gender, IMD, and number of medcodes (for instance, by comorbidities) or prodcodes (for instance, to separate those patients by additional prescriptions), or by any additional clinical event retrieved using cprdlookups.r. rdrugtrajectory provides several utility functions to stratify patients (see the reference manual for further information). The function getagegroupbyevents calculates the number of first-line prescription events by patient age. By specifying a set of patids and eventdates from the firstdrugobject, we can calculate the number of first-line prescriptions by age-group for patients linked with a specified medical condition:

> library(rdrugtrajectory)
> filelocation <- "product.txt"
/ rdrugtrajectory: analysing drug prescriptions in electronic health care records > resultfdo <- getfirstdrugprescription(df = testtherapydf, + idlist = null, + prodcodesvector = null, + descriptionfile = filelocation) > patidlist <- resultfdo[[ ]] > eventdatelist <- resultfdo[[ ]] > names(agegenderdf) <- c("patid","age","gender") > #the age-groups: [ , ), [ , ), [ , ), ..., [ , +). > agegroupvector <- c( , , , , , , , , ) > #cprd database release year. > ageatyear <- " " > agegrouplist <- getagegroupbyevents(idlist = as.list(patidlist[ : ]), + eventdatelist = eventdatelist[ : ], + agedf = agegenderdf, + agegroupvector = agegroupvector, + ageatyear = ageatyear) > agegrouplist [[ ]] - - - - - - - - + [[ ]] - - - - - - - - + in the above example, the age of each patient (agedf) was provided using year-of-birth calcu- lated against the release year of the cprd gold database (explained above). by providing the database release year (in ageatyear) and the first prescription eventdate (in eventdatelist), the age of each patient is adjusted against the prescription eventdate year. finally, by using a list slice on idlist and eventdatelist, (individual prescriptions can be specified using their prodcode, for example, eventdatelist$‘ ‘), first prescription prescriptions frequencies by age-group are retrievable (figure ). > library(ggplot ) > agegroupdrugdf <- data.frame(age=names(agegrouplist[[ ]]), + count=unlist(agegrouplist[[ ]]), + drug="amitriptyline mg") > ggamitriptyline <- ggplot(agegroupdrugdf, aes(x=age, y=count)) + + geom_bar(stat="identity") + + theme_bw() + ggtitle("amitriptyline mg") + + theme(axis.text.x = element_text(angle= , hjust= )) + + xlab("age-group") + ylab("frequency") .cc-by-nd . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . 
[figure: bar chart of frequency (y-axis) against age-group (x-axis), titled "amitriptyline mg".]
figure : the distribution of amitriptyline mg as a first-line treatment by age-group.

prescription sequences

mapdrugtrajectory

identifying patient prescription trajectories in longitudinal ehrs remains our biggest motivator behind the development of rdrugtrajectory. therefore, we developed mapdrugtrajectory to identify the chronological order of patient prescription events. we restrict the calculation to only look for prescription prodcodes as supplied to groupinglist as a named list (named prodcode vectors). the required number of grouped-prescription events is defined by specifying mindepth, and the maximum number of those changes to display is controlled by maxdepth. by keeping mindepth and maxdepth the same, only patients with a valid number of prescription changes are displayed (figure (a) and (c)). patient records with fewer than mindepth changes to prescription sequences are ignored (figure (b)). for further information please refer to the reference manual.

in the code below, mapdrugtrajectory returns patients with at least five grouped prescriptions. prodcodes that have not been grouped are ignored. duplication of prodcodes (those from the same group) does not count as a change in treatment:
figure : the distribution of grouped prodcodes across three patients. (a) five groups of valid prescription prodcodes, (b) only three groups, (c) five valid groups, in addition to prodcodes and which are ignored.

> library(ggplot )
> library(ggalluvial)
> structurelist <- list(amitriptyline = c( , , ),
+     propranolol = c( , , ),
+     topiramate = c( ),
+     venlafaxine = c( , , ),
+     lisinopril = c( , , ),
+     atenolol = c( , , ),
+     candesartan = c( )
+ )
> resultlist <- mapdrugtrajectory(df = testtherapydf,
+     mindepth = ,
+     maxdepth = ,
+     groupinglist = structurelist,
+     removeundefinedcode = true)
> df <- resultlist[[ ]]
> ggswitch <- ggplot(df,
+     aes(y = freq, axis = firstdrug, axis = switch ,
+         axis = switch , axis = switch , axis = switch )) +
+     geom_alluvium(aes(fill = firstdrug), width = / ) +
+     geom_stratum(width = / , fill = "black", color = "grey") +
+     geom_label(stat = "stratum", infer.label = true) +
+     scale_fill_brewer(type = "qual", palette = "set ") +
+     theme_bw() + theme(legend.position = "none") +
+     scale_x_discrete(limits = c("first drug", " st switch", " nd switch",
+         " rd switch"," th switch"),
+         expand = c(. , . )) +
+     ggtitle("migraine preventative switching among patients")
[figure: alluvial plot titled "migraine preventative switching among patients". x-axis stages: first drug, st switch, nd switch, rd switch, th switch. y-axis: freq. strata: amitriptyline, atenolol, candesartan, lisinopril, propranolol, topiramate, venlafaxine.]
figure : prescription pattern switching of seven different migraine preventatives. a patient required a minimum of five changes in prescriptions (including the initial prescription) and, equally, the display was set to five changes in prescription.

prescription timeline construction

rdrugtrajectory contains several functions that transform patient data into a format compatible with mean cumulative function (mcf) semi-parametric estimates, prescription persistence, prescription incidence, and survival analysis.

generatemcfonegroup

prescription events are binned into weekly units to increase the statistical power at each time point. the user presents one group at a time, for example, all clinical events of male patients with a first-line prescription of amitriptyline for a migraine. the clinical data has already been refined using the steps for first-line prescription, as described above. the function generatemcfonegroup accepts a dataframe of events, the mcf start date (eventdates are adjusted so all patient records in the dataset begin at the same time), and the minimum number of events per patient (by default this is two events). the following example presents the calculation of first prescription events, the assignment of gender and the calculation of mcf of prescription (therapy dataframe) burden of amitriptyline and propranolol:

> library(rdrugtrajectory)
> filelocation <- "product.txt"
> resultlist <- getfirstdrugprescription(df = testtherapydf,
+     idlist = null,
+     prodcodesvector = null,
+     descriptionfile = filelocation)
> patidlist <- resultlist[[ ]]
> eventdatelist <- resultlist[[ ]]
> drugfrequencydf <- resultlist[[ ]]
> drugfrequencydf <- drugfrequencydf[order(drugfrequencydf$frequency,
+     decreasing = true), ]
> amitriptylinepatid <- patidlist$` `
> propranololpatid <- patidlist$` `
> malecode <-
> malepatidsdf <- getgenderofpatients(idlist = getuniquepatidlist(testtherapydf),
+     genderdf = agegenderdf,
+     gendercodevector = malecode)
> amitriptylinemalepatids <- subset(amitriptylinepatid,
+     amitriptylinepatid %in% malepatidsdf$patid)
> propranololmalepatids <- subset(propranololpatid,
+     propranololpatid %in% malepatidsdf$patid)
> amimaletherapydf <- subset(testtherapydf,
+     testtherapydf$patid %in% amitriptylinemalepatids)
> propmaletherapydf <- subset(testtherapydf,
+     testtherapydf$patid %in% propranololmalepatids)
> amimalemcfdf <- generatemcfonegroup(therapydf = amimaletherapydf,
+     startdatecharvector = " - - ",
+     minrecords = )
> propmalemcfdf <- generatemcfonegroup(therapydf = propmaletherapydf,
+     startdatecharvector = " - - ",
+     minrecords = )
> amimalemcfdf <- cbind(amimalemcfdf, drug = "amitriptyline")
> propmalemcfdf <- cbind(propmalemcfdf, drug = "propranolol")
> drugmcfdf <- rbind(amimalemcfdf, propmalemcfdf)
> resultmcf <- reda::mcf(reda::recur(week, id, no.) ~ drug, data = drugmcfdf)
> mcfplot <- reda::plot(resultmcf, conf.int=true) +
+     ggplot ::xlab("weeks") + ggplot ::theme_bw() + ggplot ::ggtitle("")

[figure: mcf estimates (y-axis) against weeks (x-axis), with one curve per drug: amitriptyline and propranolol.]
figure : mcf of drug prescriptions of patients with a first drug prescription for either amitriptyline or propranolol, stratified by gender. the dotted lines indicate a % confidence interval.

getfirstdrugincidencerate

prescription incidence can be calculated with getfirstdrugincidencerate. the following code demonstrates how to use a firstdrugobject to calculate incidence rates for a set of prodcodes. the study observation starts from the enrollmentdate and ends at the studyenddate:

> library(rdrugtrajectory)
> filelocation <- "product.txt"
> druglist <- unique(testtherapydf$prodcode)
> requiredprods <- druglist[ : ]
> firstdrugobject <- getfirstdrugprescription(df = testtherapydf,
+     idlist = null,
+     prodcodesvector = requiredprods,
+     descriptionfile = filelocation)
> medhistorydf <- constructmedicalhistory(testclinicaldf, null, testtherapydf)
> patidlist <- unlist(firstdrugobject$patidlist)
> resultmatrix <- getfirstdrugincidencerate(firstdrugobject = firstdrugobject,
+     medhistorydf = medhistorydf,
+     enrollmentdate = as.date(" - - "),
+     studyenddate = as.date(" - - "))
> incidencedf <- as.data.frame(t(resultmatrix), stringsasfactors = true)

the above example returns an incidence rate of . per person years over a cohort of patients. for a detailed description please see the details for getfirstdrugincidencerate in the reference manual.

getdrugpersistence

prescription persistence is calculated as the fraction of patients with a prescription for a specific treatment n-days after the first prescription event. for example, if we wanted to calculate the fraction of patients with a prescription -days after their first prescription, with a -day buffer either side, one specifies a duration of -days and a preceding buffer of -days (therefore, capturing the range to , -days either side of one calendar year):

> library(rdrugtrajectory)
> patientlist <- getdrugpersistence(therapydf = testtherapydf,
+     idlist = null,
+     prodcodelist = null,
+     duration = ,
+     buffer = ,
+     endofrecorddate = " - - ")

of patient therapy records, patients had a prescription (+/- ) days after the first prescription event on record, resulting in a crude fraction of only . patients. without a buffer, getdrugpersistence only observes events recorded precisely duration days after the first prescription. the buffer can be used to identify patients who received a prescription shortly after the end of the duration but, more importantly, to ensure that patients actively undergoing treatment (indicated by a prescription shortly before the desired duration days) are included. as the buffer is reduced, the fraction of prescription persistence is reduced until the algorithm attempts to only identify patients with a prescription exactly duration days after the first prescription.
future software updates will incorporate repeat prescription data to increase the accuracy of the calculation.

closing remarks and future work

rdrugtrajectory is an r package which has the potential for exciting applications such as improving clinical decision-making, identifying possible new treatments and analysing outcomes from existing treatments. we have demonstrated several functions, some of which perform sorting and matching of records whilst others perform fundamental statistical analysis. we used fabricated clinical and prescription dataframes, along with the age, gender and index of multiple deprivation score of each patient, and presented analyses of cohort-wide prescription patterns, first-line treatment distributions, stratification by patient characteristics, and some basic tools to assist longitudinal analysis of prescriptions.

the descriptions presented in this publication are not substitutes for the material in the reference manual. we recommend that the reader consults the r ? help command or the reference manual before running a function. in particular, functions related to the construction of timelines for survival analysis (time dependent/independent cox regression, kaplan meier survival curves and mean cumulative function) or a matrix for drug incidence rate require fine tuning of several parameters.

[figure: fraction of prescription persistence (y-axis) against buffer size (n days before ) (x-axis).]
figure : the fraction of prescription persistence adjusted by a buffer number of days before a calendar year.
as the buffer approaches the value of duration the fraction approaches .

the latest release of rdrugtrajectory along with source code and reference manual is available for download from https://github.com/acnash/rdrugtrajectory. as active members of the scientific research community, we will continue to add new features to rdrugtrajectory while making necessary improvements to existing features.

acknowledgements

oxford science innovation, nihr oxford biomedical research centre and nihr oxford health biomedical research centre (informatics and digital health theme, grant brc- - ). thanks to dr michelle hardy for assistance with the article.

references

bally m, dendukuri n, rich b, nadeau l, helin-salmivaara a, garbe e, brophy jm ( ). “risk of acute myocardial infarction with nsaids in real world use: bayesian meta-analysis of individual patient data.” british medical journal, , j . doi: . /bmj.j .

ghosh re, crellin e, beatty s, donegan k, myles p, williams r ( ). “how clinical practice research datalink data are used to support pharmacovigilance.” therapeutic advances in drug safety, , – . doi: . / .

hepp z, dodick dw, varon sf, chia j, matthew n, gillard p, hansen rn, devine eb ( ). “persistence and switching patterns of oral migraine prophylactic medications among patients with chronic migraine: a retrospective claims analysis.” cephalalgia, ( ), – . doi: . / .

oyinlola jo, campbell j, kousoulis aa ( ). “is real world evidence influencing practice? a systematic review of cprd research in nice guidance.” bmc health service research, ( ), – . doi: . /s - - - .

affiliation:
nuffield department of clinical neurosciences
medical sciences division
university of oxford
oxford uk ox du
e-mail: anthony.nash@ndcn.ox.ac.uk

journal of statistical software http://www.jstatsoft.org/ published by the foundation for open access statistics http://www.foastat.org/

partition quantitative assessment (pqa): a quantitative methodology to assess the embedded noise in clustered omics and systems biology data

camacho-hernández, diego a.†, nieto-caballero, victor e.†, león-burguete, josé e., and freyre-gonzález, julio a.*

regulatory systems biology research group, laboratory of systems and synthetic biology and undergraduate program in genomic sciences, center for genomic sciences, universidad nacional autónoma de méxico (unam), morelos, mexico.

† these authors contributed equally to this work.
* corresponding author: jfreyre@ccg.unam.mx

abstract: identifying groups that share common features among datasets through clustering analysis is a typical problem in many fields of science, particularly in post-omics and systems biology research. in this respect, quantifying how a measure can cluster or organize intrinsic groups is important, since currently there is no statistical evaluation of how ordered the resulting clustered vector is, or of how much noise is embedded in it. much of the literature focuses on how well the clustering algorithm orders the data, with several external and internal statistical measures available; but no measure has been developed to statistically quantify the noise in an arranged vector after a clustering algorithm has been applied, i.e., how much of the clustering is due to randomness. here, we present a quantitative methodology, based on autocorrelation, to assess this problem.

keywords: omics data; hierarchical clustering; noise quantification.

introduction

a common task in today’s research is the identification of specific markers as predictors of a classification yielded by clustering analysis of the data. for instance, this approach is particularly useful after high-throughput experiments to compare gene expression or methylation profiles among different cell lines [ ]. this task is proving handy in the nascent field of single-cell sequencing, where clustering cells is an important step towards further classification or serves as a qualifying metric of the sequencing process [ ]. in the widely used gene expression assays, the vector of profiles for each marker across different cell lines is recorded using hierarchical clustering algorithms. these algorithms yield a dendrogram and a heat map representing the vector of marker profiles, illustrating the arrangement of the clusters.
to assess how well the clustering is segregating different cell lines, a class stating the desired partitioning of each cell line is provided a posteriori. then, a simple visual inspection of the vector of classes is used to estimate whether the clustering is providing a good partition. such a partition vector is colored according to the classification that each item is associated with, and it is expected that similar items will be contiguous, so the formed groups are assessed qualitatively on the biological background of each item. this procedure should not be confused with “supervised clustering”, which provides a vector of classes stating the desired partitioning a priori. this is then used to guide the clustering algorithms by allowing the learning of the metric distances that optimize the partitioning [ ]. additionally, it may be confused with the metric assessment of the clustering algorithms, especially with the external cluster evaluation. for this, various metrics have been developed to qualify the clustering algorithm itself, such as intrinsic and extrinsic measures. these metrics are used for clustering algorithm validation. the extrinsic validation compares the clustering to a goal to say whether it is a good clustering or not. the internal validation compares the elements within the cluster and their differences [ ]. pqa involves characteristics of both kinds of validation, using both the crafted goal standard and the yielded signal itself (clustered vector).

biorxiv preprint, doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by . international license (http://creativecommons.org/licenses/by/ . /).
however, pqa gathers these elements not to qualify the clustering algorithm itself but to quantify the noise embedded in the cluster; this noise may be due to the intrinsic metric or marker used to order the data set.

a possible caveat of the qualitative assessment discussed above is that humans tend to perceive meaningful patterns within random data, leading to a cognitive bias known as apophenia [ ]. while interpreting the partitions obtained from unsupervised clustering analysis, researchers attempt to visually assess how close the classifications are to each other, finding patterns that are not well supported by the data. such an effect arises because the adjacency between items may give a notion of the dissimilarity distance in the dendrogram leaves. unfortunately, as far as we know, there is no method to quantitatively assess the quality of the groups of classifications from the clustering or, at least, there is no attempt to quantify whether a certain configuration or order of the items may be due to randomness. this is a serious caveat, since the insertion of noise can lead to false conclusions or misleading results. furthermore, the purging of this noise can lead to more efficient descriptions of markers and their phenomena, accelerating the advance in many fields.

in statistics, serial correlation (sc) is a term used to describe the relationship between observations of the same variable over specific periods. it was originally used in engineering to determine how a signal, for instance, a radio wave, varies with itself over time. later, sc was adapted to econometrics to analyze economic data over time, principally to predict stock prices and, in other fields, to model independent random variables [ ]. we applied sc to propose a way to quantify how well a posterior classification is grouped, using only the results of unsupervised clustering analysis.
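as a minimal sketch of the lag-one serial correlation referred to above, the python snippet below computes the pearson correlation between a sequence and a copy of itself shifted by one position. the function name serial_correlation and the two toy vectors are illustrative assumptions, not part of the authors' implementation.

```python
import math


def serial_correlation(values):
    """Lag-one serial correlation: Pearson r between values[:-1] and values[1:]."""
    head = values[:-1]   # sequence without its last item
    tail = values[1:]    # sequence without its first item
    n = len(head)
    if n < 2:
        raise ValueError("need at least three values")
    mean_head = sum(head) / n
    mean_tail = sum(tail) / n
    cov = sum((a - mean_head) * (b - mean_tail) for a, b in zip(head, tail))
    var_head = sum((a - mean_head) ** 2 for a in head)
    var_tail = sum((b - mean_tail) ** 2 for b in tail)
    if var_head == 0 or var_tail == 0:
        raise ValueError("serial correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_head * var_tail)


# Toy vectors (hypothetical): the same labels, grouped versus scattered.
grouped = [1, 1, 1, 2, 2, 2, 3, 3, 3]
interleaved = [1, 2, 3, 1, 2, 3, 1, 2, 3]

print(serial_correlation(grouped))
print(serial_correlation(interleaved))
```

in this toy case the grouped vector yields a clearly higher value than the interleaved one, which is the property a grouping score built on sc relies on.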
thus, we propose a novel relative score, pqa, to solve the subjectivity of the visual inspection and to statistically quantify how much noise is embedded in the results of clustering analysis.

methodology

assigning numeric labels to classifications

a vector denoting the putative similarities among the variables in a study is usually obtained after a clustering analysis. each variable is classified to generate a vector of profiles (vp). such a vector of classifications is usually translated into a vector of colors, in which each color represents a classification. it is common to inspect this vector to find groups that make sense according to the analyzed data. for the method presented in this work, the vp may be as simple as a vector of strings or numbers that represent the input.

figure . the pipeline of the pqa methodology.

whatever the representation of the classifications may be, it is necessary to transform the classifications into a vector of numeric labels, in which a number represents a classification, to be able to calculate sc. to accomplish this, we assign the first numeric label (number ) to the first item in the vector, which usually lies at one of the vector’s extremes. then, if the classification of the next item is different from the previous one, the next number in the sequence is assigned, and so on. this way of labeling guarantees that the changes in the sc values are due to the order of the numbers, that is to say, the grouping of the classifications resulting from the clustering, and are not an artifact of the labeling itself (figure ).

pqa score

because the order of the vp could be interpreted as the grouping of the classifications, we measure how well the same classifications are held together in the vp through an sc shifted by one position. this correlation is defined as the pearson product-moment correlation between the vp discarding the first item and the vp discarding the last. in the equation below, x_i is the i-th position of the order vector, n is the length of x, and \rho_x is the resulting sc:

\rho_x = \frac{\sum_{i=1}^{n-1} (x_i - \bar{x}_{1:n-1})(x_{i+1} - \bar{x}_{2:n})}{\sqrt{\sum_{i=1}^{n-1} (x_i - \bar{x}_{1:n-1})^2}\,\sqrt{\sum_{i=1}^{n-1} (x_{i+1} - \bar{x}_{2:n})^2}}, \qquad \bar{x}_{1:n-1} = \frac{1}{n-1}\sum_{j=1}^{n-1} x_j, \quad \bar{x}_{2:n} = \frac{1}{n-1}\sum_{j=2}^{n} x_j

we then define the pqa as the sc of the vp after removing background noise, normalized by the sc of the fully grouped partition (defined as the sorted vector in ascending order). thus, the more similar the vp is to its sorted vector, the higher the score. in the equation below, \rho_x is the sc of the vp, \overline{\rho_{Rand_x}} is the mean of the sc of one thousand randomizations, and \rho_{Perfect_x} is the sc of the sorted vector in ascending order:

PQA_x = \frac{\rho_x - \overline{\rho_{Rand_x}}}{\rho_{Perfect_x}}

background-noise correlation factor in the pqa score

to compute the background-noise correlation factor in the pqa score definition, we sample the indexes of the vp and swap the corresponding items. this background correction is aimed at removing inherent noise in the data, even though the score may still be subject to noise from the chosen clustering algorithm or to discrepancies in the posterior classification.

statistical significance of the pqa score

to quantify the statistical significance of the pqa score, we calculate a z-score:

z_x = \frac{PQA_x - \overline{PQA_{Rand}}}{SD_{PQA_{Rand}}}

where PQA_x is the pqa score of the vp, \overline{PQA_{Rand}} is the mean of the pqa scores of one thousand randomizations of the vp, and SD_{PQA_{Rand}} is their standard deviation. these randomizations have the purpose of generating a solid random background to compare with the real signal. the number of randomizations does not depend on the size of the vp. it is worth noting that there are two randomization processes: one generates the input population of random vectors whose pqa scores are used to calculate the z-score, and the other represents the noise term in the pqa score equation.

defining noise proportions

to provide a quantification of the embedded noise in the vp, we calculate the z-scores from the distribution of pqa values of the randomized vectors. this shuffling is obtained by scrambling the vector. then this z-score is interpolated to retrieve the estimated noise in the vp cluster.

effect of the length and number of partitions of the vector in the z-score distributions

since we want to compare the pqa with the noise, we randomized the vp times. we opted to describe the dynamics of the z-score given the different percentages of noise and the number of partitions. for this, we synthetically crafted vectors varying both in length, from to elements, and in the number of classifications. the z-scores were retrieved from the crafted vectors using the formulas described above.

results and discussion

effects of permuted numeric labels on the partition

we wondered whether the assignment of numeric labels alters the sc calculations as little as possible, so we analyzed how the sc changes over the synthetic partitions with permuted labels. we began by generating synthetic partitions in ascending and descending order, increasing both the number of classifications and the number of items, up to .
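the scoring steps of the methodology (sequential labeling, shifted serial correlation, pqa score and z-score) can be sketched in python as follows. this is a hedged illustration rather than the authors' code: all function names are hypothetical, labels are assumed to start at 1 and to be assigned in order of first appearance, the sample standard deviation is assumed for the z-score, and the background term is computed once and reused for every shuffled vector, since it depends only on the set of labels.

```python
import math
import random
import statistics


def numeric_labels(classes, first_label=1):
    """Number each classification in order of first appearance along the vector."""
    mapping = {}
    labels = []
    for item in classes:
        if item not in mapping:
            mapping[item] = first_label + len(mapping)
        labels.append(mapping[item])
    return labels


def serial_correlation(x):
    """Pearson correlation between x without its last item and x without its first."""
    a, b = x[:-1], x[1:]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: treat an undefined correlation as zero
    return cov / math.sqrt(var_a * var_b)


def mean_random_sc(labels, rng, n_rand):
    """Mean serial correlation over n_rand shuffles of the labels (background noise)."""
    total = 0.0
    for _ in range(n_rand):
        total += serial_correlation(rng.sample(labels, len(labels)))
    return total / n_rand


def pqa_with_z(labels, rng, n_rand=1000):
    """Return (PQA score, z-score) for a vector of numeric labels."""
    perfect = serial_correlation(sorted(labels))
    if perfect == 0:
        raise ValueError("sorted vector has no usable serial correlation")
    background = mean_random_sc(labels, rng, n_rand)
    pqa = (serial_correlation(labels) - background) / perfect
    # Null distribution: PQA scores of shuffled copies of the same vector.
    null = [
        (serial_correlation(rng.sample(labels, len(labels))) - background) / perfect
        for _ in range(n_rand)
    ]
    z = (pqa - statistics.mean(null)) / statistics.stdev(null)
    return pqa, z


rng = random.Random(0)  # fixed seed so the sketch is repeatable
well_grouped = numeric_labels(["a"] * 6 + ["b"] * 6 + ["c"] * 6)
scattered = numeric_labels(["a", "b", "c"] * 6)

print(pqa_with_z(well_grouped, rng))
print(pqa_with_z(scattered, rng))
```

because the pearson correlation is unchanged by adding a constant to every label, the choice of the first label does not affect the score; only the order in which labels are assigned does.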
it is important to highlight that the number of items belonging to each classification was kept constant. because trying all possible permutations of each vector would be infeasible, we created a subset of permutations of each vector and then calculated the mean sc (figure , see methodology). we observed that the mean sc was high when the number of items in the vp was greater than or equal to times the number of classifications; nevertheless, we obtained the highest sc when the numeric labels were assigned in sequential order, either ascending or descending (figure ).

figure . z-scores of the pqa scores from partitions varying in the number of classifications and the length of the partition.

length of partitions as a proxy of the number of classifications

we wondered whether the number of classifications and the length of the vp may change the statistical significance of the pqa score, because the fewer the items in the vp, the greater the chance of grouping the items in any order. we tested this effect by calculating z-scores from ordered synthetic partitions, increasing both the number of classifications and the number of items up to . we also kept the number of classifications constant for the sake of this analysis. we noticed that only the length of the partition has a true effect on the z-score, whereas the number of classifications does not. we observed that every partition shorter than could be considered as pure noise; however, we consider a z-score cutoff of greater than (p-value of . ).
we also observed z-score values still greater than with a length of , , and , but lower than with lengths between and (figure ). if we were more flexible, we could have set a length cutoff at those values without losing statistical significance, since a z-score of corresponds roughly to a p-value of . . the results of this analysis were intuitively expected, because the probability of an item occupying a position in the vp increases as the number of items does.

proof of concept: quantifying real noise

after a literature review, we noticed that some datasets were subject to visual inspection in their respective papers, so we applied our method to quantify the proportion of noise embedded in those datasets and to test whether they may lead to apophenia. we chose two datasets from the literature for two main reasons: first, the data should have a number of items well above our z-score significance threshold (> ) and, second, we wanted contrasting orderings of the partitions, so as to have one dataset that looks very disordered and another that looks somewhat ordered in order to compare the noise proportions. lastly, we assessed the behavior of the metric in highly ordered data, which also meets the threshold mentioned above.

cancer methylation signatures

the first dataset consists of methylation profiles of different cancerous and non-cancerous samples [ ] (figure ). the classifications look very sparse, and the groups are split into many subgroups distributed along the data's vp. we detected . % of noise and a pqa score of .
(figure , with a z-score of . and a p-value of . x - ). both numbers imply that even though there may be disorder in the vp, there is neither a very high noise proportion nor a high pqa score. these results suggest that, as with any other statistical test, the larger the number of items in the partition, the more diluted the effect of disorder in the vp; a larger number of items also leads to greater statistical significance, as shown in the analysis of the number of items and classifications. although the authors concluded that their clustering results made sense given their molecular and biological background, as well as the perspectives on the analyzed profiles, they assessed the grouping only by visual inspection and concluded that the grouping was well done. however, understanding the noise in the clustering can help in pursuing better markers, since it could help to narrow the search space in these kinds of studies.

figure . visual representation of clustered data used to assess the method. (a) dataset from jie shen et al. (b) dataset from toyooka et al.

distribution of micrornas in cancer

the second dataset consists of expression profiles of micrornas from three classes of samples: invasive breast cancer, ductal carcinoma in situ (dcis), and healthy controls (figure ) [ ]. the authors visually identified three clusters, though selecting the right cutting-height threshold is difficult. besides, one of the clusters is a mix of classes in different proportions, leading the authors to conclude, arguably, that the dcis and control sample profiles are not different. on this matter, the pqa score and the proportion of noise are . and . %, respectively (figure , with a z-score of . and a p-value of . x - ), providing a quantitative assessment to support the grouping that the authors claimed.
furthermore, in comparison with the methylation profiles discussed above, we can appreciate that a partition that appears less fuzzy has an even higher noise ratio, supporting the idea that visual inspection can lead to misleading results.

figure . z-score distribution by percentage of randomized items. (a) dataset from jie shen et al. (b) dataset from toyooka et al. the red dots represent the z-score interpolation of the corresponding datasets.

comparison of genetic regulatory networks with theoretical models

finally, to assess the pqa methodology using systems biology data, we clustered networks according to their pairwise dissimilarity [ ]. first, curated biological networks were retrieved from abasy atlas (v . ) [ ]. for each biological network, we then constructed four networks, each according to a theoretical model (barabasi-albert, erdos-renyi, scale-free, and hierarchical-modular). we estimated the parameters of each theoretical model from the properties of the corresponding biological network. the models used reproduce one or more intrinsic characteristics of the biological networks, such as power-law distribution, hubs, scale-free degrees, and hierarchical modular structure [ ]. visual inspection suggested that the classification yielded a highly ordered vp, distinguishing the networks according to their nature (figure ). the pqa score for this vp is . (p-value = . x - , z-score = . ) and the proportion of noise was . % (figure ).
in contrast to the previous examples, here we obtained a highly ordered clustering and a very low proportion of noise, which suggests that although the models recapitulate some of the properties of genetic regulatory networks, none of them alone is sufficient to capture their structural properties.

figure . cluster analysis of distance among gene regulatory networks and theoretical network models. the abbreviations and colors used in the posterior classification are as follows: barabasi-albert (ba, red), erdos-renyi (er, blue), scale-free (sf, green), hierarchical modularity (hm, purple), and biological networks (bi, orange).

figure . z-score distribution by percentage of randomized items of the vp from genetic regulatory networks. the red dot represents the z-score interpolation of the actual dataset.

conclusions

in this work, we presented a novel method to quantify the proportion of noise embedded in the grouping of associated classes of the elements in hierarchical clustering. we proposed a relative score derived from the sc of the vp from the dendrogram of any clustering analysis, and calculated z-statistics as well as an extrapolation to deliver an estimation of noise in the vp. we explained how the method is formulated and showed the tests we made to systematically refine it. we additionally provided a proof of concept using clustering data from two works that we think clearly represent overfitting by apophenia. finally, we added an example from network biology where clustered networks are separated by intrinsic characteristics.
although in this work we focused on examples where hierarchical clustering is performed, this framework can be applied to any partition algorithm in which the elements are identified and a vector of the order can be obtained. we concluded that the clustered sets of biological data have a high measure of noise, despite looking well grouped. we showed what minimum number of classifications should be considered in this sort of clustering analysis to achieve a significant reduction of noise. on the other hand, we permuted the labels of the associated classes and concluded that the effect is negligible. we showed that randomness still plays an important role by biasing the results, though it may not be evident through visual inspection. the pqa could be used as a benchmark to test which clustering algorithm is appropriate for the analyzed dataset by minimizing the noise proportion, and to guide omics experimental designs. nevertheless, a word of caution: the pqa score alone can be subject to subjectivity if not used properly, since it depends on the characteristics of the analyzed data. thus, the pqa score should be regarded as a quantification of noise in clustered data and should be used with discretion.

author contributions: conceptualization, j.a.f.g.; methodology, j.a.f.g.; software, d.a.c.h., v.e.n.c., and j.a.f.g.; validation, d.a.c.h., v.e.n.c., and j.a.f.g.; formal analysis, d.a.c.h., v.e.n.c., and j.a.f.g.; investigation, d.a.c.h., v.e.n.c., j.r.l.b., and j.a.f.g.; resources, j.a.f.g.; data curation, d.a.c.h., v.e.n.c., and j.e.l.b.; writing - original draft preparation, d.a.c.h., v.e.n.c., j.e.l.b., and j.a.f.g.; writing - review and editing, d.a.c.h., v.e.n.c., and j.a.f.g.; visualization, d.a.c.h., v.e.n.c., j.e.l.b., and j.a.f.g.; supervision, j.a.f.g.; project administration, j.a.f.g.; funding acquisition, j.a.f.g. all authors have read and agreed to the published version of the manuscript.
funding: this work was supported by the programa de apoyo a proyectos de investigación e innovación tecnológica (papiit-unam) [in to j.a.f.g.].

conflicts of interest: the authors declare no conflict of interest.

references

kang, s., kim, b., park, s.-b., et al. . stage-specific methylome screen identifies that nefl is downregulated by promoter hypermethylation in breast cancer. international journal of oncology ( ), pp. – , doi: . /ijo. . .
kiselev, v. y., andrews, t. s., & hemberg, m. ( ). challenges in unsupervised clustering of single-cell rna-seq data. nature reviews genetics, ( ), - , doi: . /s - - - .
al-harbi, s.h. and rayward-smith, v.j. . adapting k-means for supervised clustering. applied intelligence ( ), pp. – , doi: . /s - - - .
hassani, m., & seidl, t. ( ). using internal evaluation measures to validate the quality of diverse stream clustering algorithms. vietnam journal of computer science, ( ), - , doi: . /s - - - .
fyfe, s., williams, c., mason, o.j. and pickup, g.j. . apophenia, theory of mind and schizotypy: perceiving meaning and intentionality in randomness. cortex ( ), pp. – , doi: . /j.cortex. . . .
getmansky, m., lo, a.w. and makarov, i. . an econometric model of serial correlation and illiquidity in hedge fund returns. journal of financial economics ( ), pp. – , doi: . /j.jfineco. . . .
shen, j., hu, q., schrauder, m., et al. . circulating mir- b and mir- a as biomarkers for breast cancer detection. oncotarget ( ), pp. – , doi: . /oncotarget. .
toyooka, s., toyooka, k. o., maruyama, r., virmani, a. k., girard, l., miyajima, k., ... & brambilla, e. ( ).
dna methylation profiles of lung tumors. molecular cancer therapeutics, ( ), - .
schieber, t. a., carpi, l., díaz-guilera, a., pardalos, p. m., masoller, c., & ravetti, m. g. ( ). quantification of network structural dissimilarities. nature communications, ( ), - .
escorcia-rodríguez, j. m., tauch, a., & freyre-gonzález, j. a. ( ). abasy atlas v . : the most comprehensive and up-to-date inventory of meta-curated, historical, bacterial regulatory networks, their completeness and system-level characterization. computational and structural biotechnology journal, doi: . /j.csbj. . . .
barabasi, a. l., & oltvai, z. n. ( ). network biology: understanding the cell's functional organization. nature reviews genetics, ( ), - , doi: . /nrg .

linus: conveniently explore, share, and present large-scale biological trajectory data from a web browser.
authors: johannes waschke, mario hlawitschka, kerim anlas, vikas trivedi, ingo roeder, jan huisken, and nico scherf*

max planck institute for human cognitive and brain sciences, stephanstr. a, leipzig, germany
faculty of computer science and media, leipzig university of applied sciences, leipzig, germany
embl barcelona, c/ dr. aiguader , barcelona, spain
embl heidelberg, developmental biology unit, heidelberg, germany
national center of tumor diseases (nct), partner site dresden, dresden, germany
institute for medical informatics and biometry, carl gustav carus faculty of medicine, school of medicine, tu dresden, dresden, germany
morgridge institute for research, madison, wisconsin , usa

* correspondence to nscherf@cbs.mpg.de

abstract

in biology, we are often confronted with information-rich, large-scale trajectory data, but exploring and communicating patterns in such data is often a cumbersome task. ideally, the data should be wrapped with an interactive visualisation in one concise package that makes it straightforward to create and test hypotheses collaboratively. to address these challenges, we have developed a tool, linus, which makes the process of exploring and sharing d trajectories as easy as browsing a website. we provide a python script that reads trajectory data, enriches them with additional features such as edge bundling or custom axes, and generates an interactive web-based visualisation that can be shared offline and online. the goal of linus is to facilitate the collaborative discovery of patterns in complex trajectory data.

biorxiv preprint (not certified by peer review), this version posted january. the copyright holder for this preprint is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by international license.
introduction

in biology, we often face large-scale trajectory data from dense spatial pathways, such as the brain connectivity obtained from diffusion mri (liu et al., ), or tracking data such as cell trajectories or animal trails (romero-ferrero et al., ). although this type of data is becoming increasingly prominent in biomedical research (kwok, ; mcdole et al., ; wallingford, ), exploring, sharing, and communicating patterns in such data are often cumbersome tasks requiring a set of different software tools that are often complex to install, learn and use. recently, new tools have become available for efficiently visualising d volumetric data (pietzsch et al., ; royer et al., ; schmid et al., ), and some of those allow the user to overlay tracking data to cross-check the quality of the results or to visualise simple predefined features (such as speed or time). however, given the more general-purpose design of such software, these are not ideal solutions for efficiently and collaboratively exploring and sharing visualisations. an interactive, scriptable, and easily shareable visualisation (shneiderman ) would open up novel ways of communicating and discussing experimental results and findings (callaway ). the analysis of complex and large-scale trajectory data and the creation and testing of hypotheses could then be done collaboratively. importantly, since such bioinformatics tools would be right at the interface of computational and life sciences, they need to be accessible and usable for scientists with little or no background in programming. ideally, the data should be bundled with a guided, interactive presentation in one concise visualisation packet that can be passed to a collaborator.
to address these challenges, we have developed our visualisation tool linus, making it easier to explore d trajectory data from any device without a local installation of specialised software. linus creates interactive visualisation packets that can be explored in a web browser, while keeping data presentation straightforward and shareable, both offline and online (fig a). we began to develop this tool when we struggled to find adequate software to explore cell trajectories during zebrafish gastrulation from large-scale fluorescence microscopy datasets (shah et al., ). linus allowed us to interactively visualise and analyse the tracks of around . cells (starting number) as they moved across the zebrafish embryo throughout hrs. more importantly, it enabled us to share and discuss visualisations with collaborators across disciplines.

results and discussion

linus is a python-based tool that is easy to install and use for scientists at the interface between disciplines.

our overall goal when developing linus was to create a versatile and lightweight visualisation tool that runs on a wide range of devices. to this end, we based the visualisation part on web technologies. specifically, we used typescript, which compiles to javascript and webgl. however, a core component of the visualisation process, the data preparation, requires local file access and fast computations, both of which are limited in javascript. for that reason, we also created a python (> v . ) script that handles the computationally demanding parts of data processing and automatically generates the web-based visualisation packages.

creating a visualisation package with linus is done in a few simple steps (fig.
a): the user imports trajectory data from a generic, plain csv format (see methods) or from a variety of established trajectory formats such as svf (mcdole et al., ), tgmm xml (amat et al., ), or the community standard biotracks (gonzalez-beltran et al., ), which itself supports import from a wide variety of cell tracking tools such as cellprofiler (mcquin et al., ) or trackmate (tinevez et al., ). during the data conversion, linus can enrich the trajectory data with additional attributes or spatial context. for example, we declutter dense trajectories by highlighting the major "highways" through edge-bundling (fig. b). linus can automatically add generic attributes that are useful in a range of applications, such as the local angle of the trajectories or a timestamp. the user can simply add custom numerical attributes for specific applications by providing these measurements as extra columns in csv files (see methods). the data attributes form the basis for advanced rendering effects. if users want to give a spatial context, linus can generate axes automatically, or users can define custom axes.

for more efficient computing, the preprocessing script uses established and optimised packages from python's rich ecosystem, like numpy and (py)opencl. in particular, the edge bundling algorithm runs highly parallel on the graphics card and thus about - times faster than a cpu-based calculation (with opencl-enabled hardware). however, only the creator of a linus-based visualisation package needs to run this preprocessor script. the target audience requires only a web browser to view and explore the data. the result of the preprocessing is a ready-to-use visualisation package that can be opened in a web browser on any device with webgl support. the package is a folder containing html, javascript, and related files.
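as an illustration of the kind of generic attribute mentioned above, the following sketch computes a per-point turning angle for one trajectory with numpy. the function name `local_angles` and the definition used here (the angle between consecutive segments) are assumptions made for this example; the text does not specify how linus defines the local angle.

```python
import numpy as np

def local_angles(points):
    """Turning angle (radians) at each point of one trajectory.

    points: array-like of shape (m, 3). The two endpoints, which have only
    one adjacent segment, are assigned an angle of 0 (an assumption).
    """
    points = np.asarray(points, dtype=float)
    angles = np.zeros(len(points))
    if len(points) < 3:
        return angles
    segments = np.diff(points, axis=0)            # (m - 1, 3)
    norms = np.linalg.norm(segments, axis=1)
    before, after = segments[:-1], segments[1:]   # segments around each inner point
    denom = norms[:-1] * norms[1:]
    cos = np.ones(len(before))                    # zero-length segments -> angle 0
    valid = denom > 0
    cos[valid] = np.einsum("ij,ij->i", before[valid], after[valid]) / denom[valid]
    angles[1:-1] = np.arccos(np.clip(cos, -1.0, 1.0))
    return angles

corner = local_angles([[0, 0, 0], [1, 0, 0], [1, 1, 0]])
straight = local_angles([[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]])
print(corner, straight)
```

an attribute computed this way could be stored as one extra numerical column per trajectory point, which is the same mechanism the text describes for custom attributes.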
interactive visualisation with configurable filters allows in-depth data exploration for a variety of applications across sciences.

after configuring and creating the visualisation package with the python toolkit, further adjustments are possible within the web browser. opening the index.html file starts the visualisation and shows the trajectories with baseline render settings (semi-transparent, single-coloured rendering on a grey background). the browser renders an interactive visualisation of the trajectories and an interface for the user to update and adapt the visualisation to their needs (e.g. colour scales, projections, clipping planes) (fig. b). the user interface itself is adapted to each dataset: the preprocessing script generates a separate property and the corresponding slider (filters and colour mapping) for each given data attribute in the user interface. if more than one state is available for the dataset (e.g.
an edge bundled copy of  the data, or custom projections), the interface automatically offers the functionality to fade between the states (see  methods).    the user can carve out patterns from the original “hairball” of lines by setting general visualisation parameters like shading  and colour maps (fig. a). to focus on particular parts of the dataset, the user filters the data for the various attributes  such as specific time intervals or user-specified numerical properties such as marker expression in cell tracking (fig. b).  alternatively, the user can select spatial regions of interest (rois) either with cutting planes or with progressively refinable  selections (fig. c). the visual attributes can then be separately defined for the selected in-focus areas and the  (non-selected) context regions (fig. c) to create a focused visualization. apart from the purpose of qualitative  visualization, the selected trajectories can also be downloaded as csv files for subsequent quantitative analysis (see  methods).    one important problem with large-scale trajectory data is the sheer density of tracks that often leads to extreme visual  clutter. to tackle this problem, one prominent feature of ​linus ​is the ability to blend between different data  transformations seamlessly. we provide two main sorts of transformations out-of-the-box: the user can smoothly  transition between original and bundled state to focus on major “highways” (fig. d, fig. b), or between original ( d  cartesian) view and different d projections (e.g. a mercator map) to provide a global, less cluttered perspective on the  trajectories (fig. e,f). if other, application-specific transformations are needed, such as a spatial transformation or any  form of trajectory clustering, the user can provide such an alternative state during preprocessing and then interactively  blend between those states.    however, the choice of a web-based visualisation solution brings some drawbacks. 
the amount of data that can be fluently visualised depends on the underlying hardware (smartphones: > , trajectories, notebooks, and desktop computers: > , trajectories). another limitation is the reduced feature set which common web browsers offer regarding graphics card access: compared to the api of opengl, the browser-based webgl api offers fewer shader features. these restrictions lead to some limitations for the rendering process. a drawback of our rendering approach is that it creates artifacts related to the rendering order when we rotate the camera. thus, we have to order the line fragments offline (i.e. not on the graphics card, but in javascript), which is a time-consuming process. to maintain high framerates, we only sort line fragments within a second after a user interaction has finished, leading to artifacts during camera motions (see methods). furthermore, we cannot provide correct render order when rendering two datasets in the same view, and thus linus works best when only rendering one dataset at once.

data and visualisations are easily shareable with collaborators via interactive visualisation packets.

as a straightforward solution to share the results, the user directly exports the visualisations from the webview as static images and videos (e.g. supplementary video ). but sharing the visualisation of the data can go a step beyond image or video data. the user can conveniently record all these visualisation properties directly in the web-interface of linus to create information-rich, interactive tours. the user adjusts these tours on a detailed level using a timeline-based editor (supplementary fig. ). an icon represents each action that can be moved along the time axis to develop a visual storyline. smooth transitions and textual markers that can be precisely timed facilitate understanding and storytelling.
to communicate and distribute new findings, these tours can easily be shared online or offline with the community (colleagues, readers of a manuscript, audience of a real or virtual presentation). the tours are copied into the source code of the visualisation package or, if they consist of a limited number of actions (see methods for details), they can be shared by a dynamically created url or a qr code. fig. shows examples of visualisations that have been created with linus ranging from dynamic trajectories in d (fig. a) or on surfaces (fig. b) to static (fig. c) or dynamic d (fig. d) tracks across applications from ethology, neuroscience, and developmental biology. an interactive version of each example can be found online by simply scanning the respective qr codes in the figure.

we tested linus visualisation packages across various devices and found that performance is the most important aspect of the user experience that varies between different devices. desktop computers with mid-range graphics cards (e.g. the graphics processors that are built into current cpus) can easily handle more than , trajectories at smooth framerates. mid-range smartphones handle the same data with low framerates (ca. fps), which is still usable but does not feel as smooth. for virtual reality applications, we also tested linus on the oculus go vr goggles. here, a high frame rate is essential, as the user experience would otherwise be quite uncomfortable, and we recommend reducing the number of trajectories further to about , in this use-case.
due to the differences in performance and user experience, we recommend creating dedicated visualisation packages (or tours) for the intended type of output device.

in the future, we would like to support further advanced preprocessing options such as trajectory clustering, more generic transforms or feature extraction. we also would like to extend the visualization part of linus so the user can interactively annotate the data. here, we envision that the user can easily label subsets of trajectories and then use this information for downstream analysis (such as building a trajectory classifier).

our experience with linus shows that sharing relatively complex data visualisations in this interactive way makes it much more efficient to collaboratively find patterns in data and to create and discuss figures or videos for presentations and manuscripts. more generally, interactive data sharing is helpful when collaborations, presentations, or teaching occur remotely, as has been common during the current pandemic. at the same time, during an in-person event such as a talk or poster session at a conference, the target audience can explore the data instantly on their computers, tablets, or smartphones. in any case, touch screens or even virtual reality goggles increase the immersion with more natural controls and true d-rendering, helping to grasp the trajectories' spatial relation. with these features, we are convinced that approaches like linus will improve considerably how we collectively explore, communicate, and teach the spatio-temporal patterns from information-rich, multi-dimensional, experimental data.

methods

our software consists conceptually of two parts: a python-based preprocessing tool and a web-based visualisation tool. we aimed to move all static and computationally expensive adjustments to the preprocessor, whereas dynamic adjustments to tweak the visualisations are all performed directly in the web browser later.
after running the preprocessor, a folder containing html, css, and javascript files is created (called a visualization packet). these files can be opened directly or uploaded to a web server.

types of input data

we currently support the following trajectory file types directly: tgmm (amat et al., ), biotracks (gonzalez-beltran et al., ), svf (mcdole et al., ), and custom csv. most formats are designed primarily to store 3d coordinates plus a timestamp, but no other custom data. however, linus supports additional numerical attributes that can then be used to filter or colour the trajectories accordingly. we therefore offer a generic csv format that can be supplemented with custom numerical data: each csv file contains the data for a single trajectory, the first three columns represent the coordinates (x, y, z), and any further column is interpreted as another attribute. the columns are delimited by semicolons, and the number of columns must be identical for all csv files. by default, linus reads the first line of a csv file as the header and uses this information to automatically name the respective properties in the user interface. the data converter script expects a folder that exclusively contains csv files as input.

implementation of data preprocessing

the trajectory data are then converted to a custom json format by our python-based preprocessor. python has the advantage of being executable on a wide range of operating systems and hardware. the preprocessor is used via a command-line interface or by calling the respective commands directly. the command-line interface is easier to use and covers the most common cases (e.g. visualising a dataset with custom attributes, and automatically adding an edge-bundled version). for more complex cases, e.g. visualising two datasets at once or using multiple custom states of the data (e.g. custom projections), users can write their own python script.
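as an illustration of the generic csv layout described under "types of input data", the following python sketch reads a folder of such files. it is not part of linus: the function names read_trajectory_csv and read_trajectory_folder are hypothetical, and the checks shown (a header line, at least three columns, identical column counts across files) only mirror the rules stated above.

```python
import csv
from pathlib import Path


def read_trajectory_csv(path):
    """Read one semicolon-delimited trajectory file (hypothetical helper).

    Returns (header, rows): header holds the column names from the first
    line, rows holds one list of floats per point (x, y, z, attributes...).
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=";")
        header = next(reader)  # first line is treated as the header
        rows = [[float(value) for value in row] for row in reader if row]
    if len(header) < 3:
        raise ValueError(f"{path}: expected at least the x, y, z columns")
    return header, rows


def read_trajectory_folder(folder):
    """Read every csv file in a folder; all files must share a column count."""
    expected_header = None
    trajectories = []
    for path in sorted(Path(folder).glob("*.csv")):
        header, rows = read_trajectory_csv(path)
        if expected_header is None:
            expected_header = header
        elif len(header) != len(expected_header):
            raise ValueError(f"{path}: column count differs from other files")
        trajectories.append(rows)
    return expected_header, trajectories
```

each returned trajectory is a list of points, so attribute columns stay aligned with the coordinates of the point they belong to.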
we provide detailed and up-to-date documentation in our repository at https://gitlab.com/imb-dev/linus.

time-consuming operations are implemented using numpy, and the most demanding process (edge bundling) is handled by an opencl script, which increases calculation speed by - fold. all trajectories are resampled to equal length during the preprocessing step, enabling us to use numpy's fast matrix-based algorithms (we use n × m matrices, storing n trajectories with m points in each trajectory). the resulting json file then contains a list of datasets. each dataset holds a set of trajectories that can optionally be further organised into several states, for example, the original data and a projected version. at this point, all data are organised in the same structure as required by webgl (supplementary fig. ), which allows faster loading of the data in the next step.

implementation of the web-based tool

the visualisation part runs in web languages (html, javascript, css, webgl). the json file containing the preprocessed data is loaded directly as an object by javascript. this part of the software copies the numeric arrays from the json file into webgl's data buffers, such as the position buffer, index buffers, and attribute buffers. if a dataset contains more than one state (e.g. an original state and a projected state), these states are stored in additional attribute buffers.
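the resampling step and the buffer layout described above can be sketched with numpy as follows. this is a minimal illustration under stated assumptions, not the preprocessor's actual code: the function names are hypothetical, the resampling is a plain linear interpolation over the sample index (the text does not specify the scheme), and the flat position list with pairs of segment indices is only our reading of the data structure described in the supplementary figure caption.

```python
import numpy as np


def resample_trajectory(points, m):
    """Linearly resample one trajectory (k x 3 array, k >= 2) to m points.

    Assumption: points are spaced evenly along the sample index, so arc
    length and timestamps are ignored.
    """
    points = np.asarray(points, dtype=float)
    old = np.linspace(0.0, 1.0, len(points))
    new = np.linspace(0.0, 1.0, m)
    columns = [np.interp(new, old, points[:, d]) for d in range(points.shape[1])]
    return np.column_stack(columns)


def build_buffers(trajectories, m):
    """Stack n resampled trajectories and derive flat position/index arrays."""
    stacked = np.stack([resample_trajectory(t, m) for t in trajectories])  # (n, m, 3)
    n = stacked.shape[0]
    positions = stacked.reshape(n * m, -1)  # one row per supporting point
    # Index of the first point of every segment; the last point of each
    # trajectory starts no segment, so trajectories are never connected.
    starts = np.arange(n * m).reshape(n, m)[:, :-1]
    indices = np.stack([starts, starts + 1], axis=-1).reshape(-1, 2)
    return stacked, positions, indices
```

because every trajectory has the same number of points after resampling, the index pairs can be computed with array arithmetic alone and no per-trajectory loop is needed.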
depending on the provided data, we also adjust the shader source code dynamically. for example, we inject variables and specific statements into the shader source code before it is compiled by webgl. with the dynamic creation of buffers as well as code statements and variables, we pre-build a shader program that is directly tailored to the properties of the respective data. as a result, the visualisation (e.g. colour mapping or projections) can be changed quickly without updating the datasets on the graphics card, which results in higher frame rates and smoother transitions compared to approaches where data is transformed offline.

in principle, linus supports an arbitrary number of attributes and states. in practice, however, this number is limited by the particular device's abilities (i.e. its graphics card) and by webgl in general. typically, eight attribute arrays are available on smartphones and sixteen or more on desktop computers. our software requires four such attribute arrays for internal purposes, plus one more array for each state or attribute. thus, for a dataset containing original data, bundled data, and two custom attributes (that are shared between the states), we would need eight attribute buffers in total (four internal, two for the states, and two for the attributes), which can still be managed by a smartphone. visualising additional states or attributes requires devices with more capabilities, such as a desktop computer.

the graphical user interface (gui)

the user interface (see fig. and supplementary fig. ) consists of a general part that includes options to change the size of the gui, the background colour, and camera controls. furthermore, the user can choose how often the render order should be restored (see section "additional technical limitations").
additionally, several data-specific settings are shown, and this section is further divided into:

● filters for each attribute to only show data within a defined range; if window is a positive value, it will be used to automatically display a range [min, min+window] (while max is ignored).
● render settings, including colour mapping, shading, and transparency, which can be set independently for selected and unselected trajectories.
● mercator projection plus rotations that are applied to the 3d positions before the 2d transformation, and mapping the "free" z component to attributes for 2d + feature plots (e.g. space-time trajectories).
● cutting planes can be used to generate a generic 2d projection. here, the projection plane can be defined by selecting a centre point and a normal direction. everything above the projection plane is then mapped onto the plane.
● the last part of the gui offers options to export selected trajectories and also shows a list of available tours. this list is used to start a tour or to load it into the tour editor.

sharing visualisations and tours

as explained above, the user receives a self-contained package. this package can be opened with any web browser that supports webgl and can be distributed in multiple ways: it can be shared locally (e.g. sent by email or copied using a usb stick) or made easily accessible to a broad audience by uploading it to a web server (as done, e.g., on our companion website for this manuscript, https://imb-dev.gitlab.io/linus-manuscript/).

the method of sharing the actual visualisation package also influences how an interactive tour can be distributed. to make tours reproducible, they are internally represented by a textual list of actions. this script can be copied directly into the source code of the file main.html of the visualisation package. this method works both for server-based and for file-based distribution of the package.
if the visualisation package is hosted on a web server, the tours can also be shared simply with a custom url or a qr code that encodes a tour's actions. however, the length of such tours is restricted: qr codes are limited in the amount of information they can store, and urls are usually limited as well (although this limit can typically be configured in the web server's settings). the commands for camera motion and parameter adjustment (e.g. changing the colour) are concise and only require a few bytes of the url or qr code. in contrast, textual annotations and especially spatial selections require considerably more space. thus, sharing a tour by qr code or url usually works for tours without selections and without extensive text annotations.

specific considerations for virtual reality devices

the virtual reality mode works only when the visualisation package is hosted on a web server. further, the way of navigating changes slightly because the head position takes over the task of the camera. for convenience, we introduce the possibility to adjust the height of the dataset and to rotate the data horizontally. inside the vr environment, no gui is rendered. to allow controlling the gui, the user can switch between " d mode" and "vr mode" instantly.

export of trajectories

the user can select trajectories and download this selection. the download may take several minutes as the data must internally be converted into csv format.
the result is a zip folder containing one folder for each data set (usually a single folder), each containing a separate folder for each state of the data (e.g. "original" and "bundled"). each trajectory is saved as a separate csv file. it should be noted, however, that the user can only download the resampled trajectories and not the trajectories in the raw (temporal or spatial) resolution before the data preprocessing.

screenshots and videos

at any time, the user can take screenshots and record videos with the respective buttons in the bottom left corner. video recording requires an up-to-date chrome-based browser (chrome version or later; other browsers might support it as well, but only with experimental features enabled). the output format is webm, which is currently the only file type that can be saved directly from webgl.

additional technical limitations

in order to offer the tool for a broader range of platforms, we decided to utilise webgl . . this web standard provides the feature set of opengl es . (https://www.khronos.org/webgl/), which is limited compared to regular opengl versions. webgl . is implemented by a wide range of browsers, such as chrome version , firefox . , safari . , ios , chrome mobile (or newer, respectively).

when rendering a scene containing both trajectories and context, our application must render two different types of geometric primitives (lines and triangles) simultaneously. this can only be performed by two consecutive draw calls: the program first renders all triangles and subsequently renders the line segments. since we need to support transparent rendering, we cannot rely on the z-buffer for determining the spatial order of the segments, as this works only for non-transparent geometries (the z-buffer usually tells us whether a segment should be drawn by checking whether another, closer segment that would cover the new segment has already been drawn).
thus, we use an alternative to the z-buffer: we sort the geometry first and render it starting with the most distant element. step by step, we draw elements that are closer to the observer over more distant ones, ensuring the correct depth ordering of elements. however, we cannot use this idea to compute the overlap between the set of triangles and the set of line segments since they are different types of primitives and, as such, require separate draw calls. as webgl currently does not have a geometry shader, we cannot mix triangles and lines in one draw call. a consequence is that context can only be rendered as a background silhouette.

our internal resorting procedure can require a noticeable amount of time (e.g. around . s for . trajectories). to ensure a fluent user experience, we use an adaptive strategy and only sort the data when the user stops moving the camera. this can lead to some visual artifacts during the rotation of the camera, but after stopping the motion, the correct rendering order is established quickly. for huge amounts of data, or for devices with low cpu performance (the sorting happens on the cpu, not on the gpu), it is also possible to completely disable the sorting. in that case, we shuffle the rendering order, which at least avoids distracting global patterns introduced by these artifacts.

data availability

exemplary visualizations are available by scanning the qr codes in fig. directly or by visiting https://imb-dev.gitlab.io/linus-manuscript/

code availability

the linus software, including source code and documentation, is freely available at our repository at https://gitlab.com/imb-dev/linus.

acknowledgments

the authors are grateful to gopi shah and konstantin thierbach for sharing data and contributing useful feedback. j.w. received funding from the international max planck research school on neuroscience of communication: function, structure, and plasticity (leipzig, germany; https://imprs-neurocom.mpg.de).
k.a. and v.t. acknowledge funding from european molecular biology laboratory (embl) barcelona and thank the mesoscopic imaging facility, embl barcelona, for help with imaging.

author contributions

n.s., j.h., and i.r. conceived the project. j.w. wrote the software code. m.h. and n.s. supervised the project. n.s. and j.w. wrote the manuscript. k.a. and v.t. generated the dataset on zebrafish blastoderm explants. all authors read, edited, and approved the manuscript.

references

amat f, lemon w, mossing dp, mcdole k, wan y, branson k, myers ew, keller pj. fast, accurate reconstruction of cell lineages from large-scale fluorescence microscopy data. nat methods.
bailey h, mate br, palacios dm, irvine l. behavioural estimation of blue whale movements in the northeast pacific from state-space model analysis of satellite tracks. endanger species res.
callaway e. the visualizations transforming biology. nature news.
egevang c, stenhouse ij, phillips ra, petersen a, fox jw, silk jrd. tracking of arctic terns sterna paradisaea reveals longest animal migration. proc natl acad sci u s a.
gonzalez-beltran an, masuzzo p, ampe c, bakker g-j, besson s, eibl rh, friedl p, gunzer m, kittisopikul m, le dévédec se, leo s, moore j, paran y, prilusky j, rocca-serra p, roudot p, schuster m, sergeant g, strömblad s, swedlow jr, van erp m, van troys m, zaritsky a, sansone s-a, martens l. community standards for open cell migration data. biorxiv.
imirzian n, zhang y, kurze c, loreto rg, chen dz, hughes dp. automated tracking and analysis of ant trajectories shows variation in forager exploration. sci rep.
kwok r. deep learning powers a motion-tracking revolution. nature.
liu c, ye fq, newman jd, szczupak d, tian x, yen cc-c, majka p, glen d, rosa mgp, leopold da, silva ac. a resource for the detailed 3d mapping of white matter pathways in the marmoset brain. nat neurosci.
mcdole k, guignard l, amat f, berger a, malandain g, royer la, turaga sc, branson k, keller pj. in toto imaging and reconstruction of post-implantation mouse development at the single-cell level. cell.
mcquin c, goodman a, chernyshev v, kamentsky l, cimini ba, karhohs kw, doan m, ding l, rafelski sm, thirstrup d, wiegraebe w, singh s, becker t, caicedo jc, carpenter ae. cellprofiler 3.0: next-generation image processing for biology. plos biol.
pietzsch t, saalfeld s, preibisch s, tomancak p. bigdataviewer: visualization and processing for large image data sets. nat methods.
romero-ferrero f, bergomi mg, hinz r, heras fjh, de polavieja gg. idtracker.ai: tracking all individuals in large collectives of unmarked animals. arxiv [cs.cv].
royer la, weigert m, günther u, maghelli n, jug f, sbalzarini if, myers ew. clearvolume: open-source live 3d visualization for light-sheet microscopy. nat methods.
schmid b, tripal p, fraaß t, kersten c, ruder b, grüneboom a, huisken j, palmisano r. 3dscript: animating 3d/4d microscopy data using a natural-language-based syntax. nat methods.
shah g, thierbach k, schmid b, waschke j, reade a, hlawitschka m, roeder i, scherf n, huisken j. multi-scale imaging and analysis identify pan-embryo cell dynamics of germlayer formation in zebrafish. nat commun.
shneiderman b. the eyes have it: a task by data type taxonomy for information visualizations. proceedings ieee symposium on visual languages.
tinevez j-y, perry n, schindelin j, hoopes gm, reynolds gd, laplantine e, bednarek sy, shorte sl, eliceiri kw. trackmate: an open and extensible platform for single-particle tracking. methods.
trivedi v, fulton t, attardi a, anlas k, dingare c, martinez-arias a, steventon b. self-organised symmetry breaking in zebrafish reveals feedback from morphogenesis to pattern formation. biorxiv.
wallingford jb. the 200-year effort to see the embryo. science.
figures

figure 1: browser-based exploration and sharing of trajectory visualizations with linus. (a) control workflow of linus. starting with the data, a python converter is used to enrich the data with further features (e.g. numeric metrics, an edge-bundled version of the data, visual context) and to prepare the visualisation package. (b) within minutes, the data can be visualised and explored in the browser, and different aspects of the data can be interactively highlighted (example shows the effect of changing the degree of trajectory bundling).

figure 2: configurable filters allow deep data exploration. the user can choose from a range of visualisation methods directly in the browser interface to highlight aspects of interest in the data (zebrafish tracking results from (shah et al., ) as an example).
(a) the line data is visualized using a range of options for shading and colour mapping. (b-d) the user can filter parts of the data with respect to specific attributes, such as (b) time intervals or (c) a specific range of signals (marker expression in cells in this case). (d) the user can further create subselections of the tracks in space using cutting planes or refinable spatial selection. the visual attributes can be defined separately for the selected focus region and the non-selected context region. (e-g) the web interface can blend seamlessly between different states of the data. this feature can be used to map between (e) original tracks and their edge-bundled version, or to visualize planar projections of the 3d data (f) locally on a definable (oblique) plane or (g) globally using a mercator projection (with definable parameters).

figure 3: sharable interactive visualization packets for a multitude of applications ranging across a variety of sciences. the user can combine the visualization methods, annotations, and camera motion paths in a scheduled tour that can be shared by a custom url or qr code generated directly in the browser interface. panels (a)-(d) demonstrate use cases for real-world datasets with different characteristics and dimensionality. (a) ant trails ( d+t) from (imirzian et al., ). bundling and colour-coding (spatial orientation by mapping (x,y,z) to (r,g,b) values) indicate the major trails running in opposing directions.
(b) gps animal tracking data for two species (blue whales (bailey et al., ) in blue and arctic tern (egevang et al., ) in red) shown on a mercator projection of the earth's surface. for better orientation, the outline of the continents is included as axes in the visualization that dynamically adapt to the projections and viewpoint changes ( d surface data + t). (e) cell movements during the elongation process of zebrafish blastoderm explants ( d+t) (trivedi et al., ). bundling, colour coding, and spatial selection highlight collective cell movements as the explant starts elongating, focusing on a subpopulation of cells driving this process. colour code shows time from early (yellow) to late (red) for selected tracks. (f) brain tractography data showing major white matter connectivity from diffusion mri ( d). the spatial selection highlights the left hemisphere while anatomical context is provided by the outline of the entire brain (from mesh data) and the defocused tracts of the right hemisphere.

supplementary figures

supplementary figure 1: overview of data structure. the coordinate list holds the x/y/z values for each supporting point of the trajectories. for each such point, an arbitrary number (only limited by the graphics card's capabilities) of attributes can be stored. the attributes must be provided in the same order as the points.
to create trajectories from the point set, an index list is provided as well. each pair of indices describes one segment of a trajectory. the number of such segments is not restricted, as any point (and its respective attributes) can be used multiple times.

supplementary figure 2: overview of settings. an overview of the different visualisation settings available to the user from the gui (two screenshots merged). for explanations regarding different settings, see text or documentation at https://gitlab.com/imb-dev/linus.

supplementary figure 3: tour editor. the tour actions can be organised by drag and drop (reading order: from left to right, top to bottom). every action can be scheduled with a time delay with respect to the end of the previous action. some actions use transitions (e.g. camera motions or the adjustment of numeric values) whose duration can be configured as well. eventually, a url or a qr code can be created.

review and performance evaluation of trait-based between-community dissimilarity measures

author details

attila lengyel * & zoltán botta-dukát *

*centre for ecological research, institute of ecology and botany, alkotmány u. - ., h- vácrátót, hungary

corresponding author: lengyel.attila@ecolres.hu

botta-dukat.zoltan@ecolres.hu

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission.

abstract

in recent years, a variety of indices have been proposed with the aim of quantifying functional dissimilarity between communities. these indices follow different approaches to account for between-species similarities in the calculation of community dissimilarity, yet they all have been proposed as straightforward tools.

in this paper, we reviewed the trait-based dissimilarity indices available in the literature, contrasted the approaches they follow, and evaluated their performance in terms of correlation with an underlying environmental gradient using individual-based community simulations with different gradient lengths. we tested how strongly dissimilarities calculated by different indices correlate with environmental distances. using random forest models, we tested the importance of gradient length, the choice of data type (abundance vs. presence/absence), the transformation of between-species similarities (linear vs. exponential), and the dissimilarity index in predicting the correlation value.
we found that many indices behave very similarly and reach high correlation with environmental distances. there were only a few indices (e.g. rao's dq, and representatives of the nearest neighbour approach) which regularly performed worse than the others. by far the strongest determinant of correlation with environmental distance was the gradient length, followed by the data type. the choice of dissimilarity index and transformation method did not seem to be crucial when correlation with an underlying gradient is to be maximized.

synthesis: we provide a framework of functional dissimilarity indices and discuss the approaches they follow. although these indices are formulated in different ways and follow different approaches, most of them perform similarly well. at the same time, sample properties (e.g. gradient length) determine the correlation between trait-based dissimilarity and environmental distance more fundamentally.

keywords

beta diversity, dissimilarity index, distance metric, community ecology, functional traits

abbreviations

cdf = cumulative distribution function, cwm = community-weighted mean, fdissim = functional dissimilarity, vis = variable importance score

introduction

understanding and explaining the variation of living communities along dimensions of space and time have long been in the focus of ecological research. the widely applied scheme by whittaker ( , ) to tackle questions of different aspects of community variation divides community diversity into alpha (within-community), beta (between-community) and gamma (across-community) components.
it is no exaggeration to say that among these three, beta diversity sparked the most controversy due to the multitude of ways in which it can be formulated (tuomisto a,b, anderson et al. , podani & schmera , baselga & leprieur ). one of the most popular approaches to beta diversity builds upon quantification of variation between pairs of communities using dissimilarity indices (anderson et al. , legendre & de cáceres , ricotta ). a broad spectrum of such dissimilarity indices is available for many specific purposes, providing elementary tools for different fields of ecology and beyond (see reviews by legendre & legendre , podani ). nevertheless, choosing from so many options requires a more or less subjective decision from the researcher, which may affect the final result of the analysis. comparative reviews of dissimilarity indices (faith et al. , koleff et al. ) and evaluations of effects of methodological decisions (lengyel & podani ) are undoubtedly helpful in making these decisions.

the most popular, yet not exclusive, interpretations of diversity for a long time considered species as variables which are unrelated to each other. in the last two decades, however, the functional approach to ecological questions gained unprecedented attention (díaz & cabido , mcgill et al. ). this approach relies on the fact that species are not all maximally different from each other; rather, they can be considered related with respect to similarities in their traits thought to represent their roles in ecosystems (violle et al. ). the need for explicitly accounting for between-species relatedness generated a wave of methodological improvements that introduced new methods in the calculation of diversity.
next to a lively scientific discussion on how functional alpha diversity can be appropriately quantified (mason et al. , petchey & gaston , villéger et al. , mouchet et al. ), suggestions were made also for the expression of functional beta diversity (swenson , botta-dukát , chao et al. ). among them, a large variety of indices for calculating dissimilarity between pairs of communities on the basis of the traits of their species have been proposed (e.g. ricotta & burrascano , cardoso et al. , ricotta & pavoine ). although these indices have been introduced as straightforward measures for revealing between-community dissimilarity on the basis of traits, they rest on very different concepts, and we still lack a comparative review of them.

in this paper we aim to provide an overview and a conceptual framework for the pairwise functional dissimilarity (hereafter called fdissim) measures available in the literature to the best of our knowledge. we start with (1) a short overview of the concept and indices of ecological (dis-)similarity without accounting for relatedness of species, then (2) we review and classify fdissim indices according to their conceptual basis, and (3) we test the performance of fdissim indices.

short overview of taxon-based (dis-)similarity methods
most fdissim measures are generalizations of simple indices which were originally designed for expressing dissimilarity based on species composition (that is, omitting similarities between species). we start the review of trait-based (dis-)similarity measures with a brief summary of these species-based indices. then, we present a framework of approaches including several families of trait-based dissimilarity indices.
species-based indices
most indices can be written in either similarity (s) or dissimilarity (d = 1 - s) form, but when it is not necessary to specify the form, we call them 'resemblances'. in the case of presence/absence data, these indices are based on the well-known 2 × 2 contingency table whose cells represent the number of species shared (denoted by a), as well as the number of species occurring only in one of the communities (b and c). the fourth cell of the contingency table, quantifying the number of shared absences, is disregarded by these indices and rarely used in ecological analyses (but see tamás et al. ). all these indices express similarity as the proportion of shared diversity to total diversity. hence, all of them range between 0 and 1. in the case of presence/absence data the number of shared species, a, in the numerator stands for shared diversity for all indices, while the denominators are different. in the sørensen index (ss) the denominator is the arithmetic mean of the species numbers of the two communities, in the ochiai index (so) it is their geometric mean, in the kulczynski index (sk) it is their harmonic mean, while in the simpson index (ssi) it is the richness of the species-poorer community. if the two communities are equally species-rich, then these indices are equal, otherwise ss < so < sk < ssi. in the jaccard index (sj), the denominator is the total number of species in the two communities, while in the sokal & sneath index (sss) species occurring in a single community are taken into account with double weight. there is a direct and monotonic relationship between jaccard, sørensen, and sokal & sneath indices (see appendix s ).
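the verbal definitions above can be made concrete with a small sketch. the function name and the example counts (a = 6, b = 2, c = 10) are hypothetical and chosen only for illustration; the formulas follow the description of the denominators given in the text.

```python
from math import sqrt

def presence_absence_similarities(a, b, c):
    """similarity indices from the 2 x 2 contingency table.

    a = number of shared species, b and c = species found in only
    one of the two communities (illustrative helper, not from the paper).
    """
    n1, n2 = a + b, a + c  # species richness of the two communities
    return {
        "sorensen": a / ((n1 + n2) / 2),              # arithmetic mean of richness
        "ochiai": a / sqrt(n1 * n2),                  # geometric mean of richness
        "kulczynski": a / (2 * n1 * n2 / (n1 + n2)),  # harmonic mean of richness
        "simpson": a / min(n1, n2),                   # richness of the poorer community
        "jaccard": a / (a + b + c),                   # total number of species
        "sokal_sneath": a / (a + 2 * (b + c)),        # unshared species weighted double
    }

sims = presence_absence_similarities(a=6, b=2, c=10)
for name, value in sims.items():
    print(f"{name}: {value:.4f}")
```

with unequal richness (8 vs. 16 species here) the sketch reproduces the ordering ss < so < sk < ssi, and the sørensen and sokal & sneath values can be recovered from jaccard as 2j/(1+j) and j/(2-j), which is the monotonic relationship mentioned above.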
table summarizes the similarity and dissimilarity forms of the above indices.

for abundance data, the resemblance of two communities is derived from the summation of species-wise differences, with the simplest interpretations being the euclidean and the manhattan distances, respectively:

\[ D_{Euc} = \sqrt{\sum_{i=1}^{S_{jk}} (x_{ij} - x_{ik})^2} \]

\[ D_{Man} = \sum_{i=1}^{S_{jk}} |x_{ij} - x_{ik}| \]

where $x_{ij}$ and $x_{ik}$ are the abundances of species i in communities j and k, and $S_{jk}$ is the total number of species in j and k. for both indices, the minimum is 0, but the maximum of the euclidean distance is the square root of the sum of squared abundances, while for the manhattan distance the maximum is the sum of abundances. their dependence on total abundance makes these index values difficult to compare across samples; therefore, indices including a standardization have become more popular in ecological studies.

the standardization is possible in several ways. the first option is to standardize raw species contributions to between-community dissimilarity, $|x_{ij} - x_{ik}|$, and then to sum them. each species-level difference in abundance should therefore be divided by a scaling factor such that the maximal species-level difference is 1, and this difference is maximal if the species is present in only one of the compared communities. summing $x_{ij}$ and $x_{ik}$ in the denominator satisfies this requirement and gives a well-known distance measure, the canberra index:

\[ D_{Can} = \sum_{i=1}^{S_{jk}} \frac{|x_{ij} - x_{ik}|}{x_{ij} + x_{ik}} \]

however, the canberra index still ranges between 0 and $S_{jk}$. according to ricotta & podani ( ), the normalized canberra index can be derived by unweighted averaging of species contributions:
\[ D_{NCan} = \frac{1}{S_{jk}} \sum_{i=1}^{S_{jk}} \frac{|x_{ij} - x_{ik}|}{x_{ij} + x_{ik}} \]

alternatively, species-level differences can be divided by $\max(x_{ij}, x_{ik})$. this also yields unity if a species occurs in only one of the plots. ricotta & podani ( ) called this the modified canberra index, whose normalized version follows:

\[ D_{NMCan} = \frac{1}{S_{jk}} \sum_{i=1}^{S_{jk}} \frac{|x_{ij} - x_{ik}|}{\max(x_{ij}, x_{ik})} \]

calculated from binary data, both normalized canberra and normalized modified canberra result in jaccard dissimilarity.

a different way of standardization is possible if raw species-level differences are summed and divided by the sum of their theoretical maxima. in this case, the denominator can follow the logic of the canberra index, thus leading to the bray-curtis index:

\[ D_{BC} = \frac{\sum_{i=1}^{S_{jk}} |x_{ij} - x_{ik}|}{\sum_{i=1}^{S_{jk}} (x_{ij} + x_{ik})} \]

analogously with the normalized modified canberra index, instead of the sum, the denominator may contain the maximum of abundance, resulting in the formula known as the marczewski-steinhaus index:

\[ D_{MS} = \frac{\sum_{i=1}^{S_{jk}} |x_{ij} - x_{ik}|}{\sum_{i=1}^{S_{jk}} \max(x_{ij}, x_{ik})} \]

it is worth noting that bray-curtis and marczewski-steinhaus indices calculated on presence/absence data return the values of the sørensen index and the jaccard index in dissimilarity form, respectively. moreover, several abundance-based indices can be expressed if we generalize the a, b, and c quantities used in the definition of indices for presence/absence data (tamás et al. ):

\[ a' = \sum_{i=1}^{S_{jk}} \min(x_{ij}, x_{ik}) \]

\[ b' = \sum_{i=1}^{S_{jk}} \left[ \max(x_{ij}, x_{ik}) - x_{ik} \right] \]

\[ c' = \sum_{i=1}^{S_{jk}} \left[ \max(x_{ij}, x_{ik}) - x_{ij} \right] \]

substituting a, b and c with a', b' and c' in the formula of the sørensen index gives bray-curtis, and doing so with the jaccard index results in marczewski-steinhaus.
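a short numerical check may help to see how the generalized quantities reproduce the abundance-based indices. the sketch below is illustrative only: the function names and the two abundance vectors are hypothetical, and the assignment of b' to the first community and c' to the second is an assumption about the convention (the resulting indices are symmetric, so they do not depend on it).

```python
def bray_curtis(x, y):
    """sum of absolute differences divided by the sum of all abundances."""
    num = sum(abs(xi - yi) for xi, yi in zip(x, y))
    return num / sum(xi + yi for xi, yi in zip(x, y))

def marczewski_steinhaus(x, y):
    """sum of absolute differences divided by the sum of species-wise maxima."""
    num = sum(abs(xi - yi) for xi, yi in zip(x, y))
    return num / sum(max(xi, yi) for xi, yi in zip(x, y))

def generalized_abc(x, y):
    """abundance analogues of the a, b, c cells of the contingency table."""
    a = sum(min(xi, yi) for xi, yi in zip(x, y))
    b = sum(max(xi, yi) - yi for xi, yi in zip(x, y))  # excess in the first community
    c = sum(max(xi, yi) - xi for xi, yi in zip(x, y))  # excess in the second community
    return a, b, c

x = [5, 0, 3, 2]  # hypothetical abundances in community j
y = [1, 4, 3, 0]  # hypothetical abundances in community k
a, b, c = generalized_abc(x, y)

sorensen_dissimilarity = (b + c) / (2 * a + b + c)
jaccard_dissimilarity = (b + c) / (a + b + c)
print(bray_curtis(x, y), sorensen_dissimilarity)
print(marczewski_steinhaus(x, y), jaccard_dissimilarity)
```

each printed pair should agree, which is the substitution property stated above; feeding 0/1 vectors into the same functions gives the ordinary sørensen and jaccard dissimilarities.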
abundance versions of all other presence/absence indices can be created in the same manner.

a classification of fdissim indices
fdissim indices incorporate trait information into the calculation of dissimilarity in different ways. the simplest solution is when summary statistics or distributions are calculated for the two communities and a measure of distance or segregation is calculated between them. we call this the summary-based class, and in our review, we include two approaches within this, the typical value approach and the distribution-based approach. in the second class we include indices which utilize a symmetrical species-by-species (dis-)similarity matrix and link it directly through matrix operations with the compositional matrix. we call this the dissimilarity-based class, which includes the probabilistic, the ordinariness-based, the diversity partitioning, and the nearest neighbour approaches. the third class includes methods which make use of between-species (dis-)similarities for classification of species; therefore, we call it the classification-based class. the classification either transforms the original structure of the dissimilarity matrix into discrete groups of species which can be used as functional types, or expresses dissimilarities in the form of a tree-graph where between-species dissimilarities are organized in an inclusive hierarchy. this is a widespread approach for accounting for phylogenetic relatedness, since phylogenies are commonly summarized in the form of cladograms. such methods rely heavily on the algorithm chosen for the classification, including the decisions about the number of clusters and the method for breaking tied values.
examples are provided by hérault & honnay ( ), nipperess et al. ( ), and cardoso et al. ( ), while a review is available by pavoine ( ). as there is no general recommendation for the classification method, we omit this class from the framework detailed below and the comparative test. the classification of trait-based dissimilarity indices and their main properties are summarized in table .

typical value approach
indices following this approach represent each community with a typical trait value, and calculate a distance metric between them. the most commonly applied typical trait value is the community-weighted mean (cwm; garnier et al. ). the rationale behind the cwm can be linked with the mass ratio hypothesis (grime ), stating that the effect of species on ecosystem functioning is proportional to their relative abundances. although several issues emerged regarding its limited applicability in statistical inference (hawkins et al. , peres-neto et al. , zeleny ) and its neglect of within-community variation (muscarella & uriarte ), difference in cwm is still considered a reliable indicator of robust changes in trait composition induced by selective forces like environmental matching or succession (de bello et al. , , kleyer et al. ). ricotta et al. ( ) investigated the relatedness of the distance between cwms with the probabilistic approach (see therein) and showed its applicability on phylogenetic data. due to its modest requirements for computational capacity, lengyel et al. ( ) used the euclidean distance between trait cwms of phytosociological relevés for the trait-based numerical classification of grasslands of poland with a sample size of sites and species.
another advantage of this method is its euclidean property. besides the community-weighted mean, other typical values, e.g. the median or the mode, might be considered depending on the scaling of the trait variable and on specific research aims.

distribution-based approach
instead of typical values, the distribution of trait values is considered a more reliable representative of the trait composition and variability of a community. continuous distributions can be defined by a density function and discrete distributions by the probabilities of the possible values, while both types can be characterized by a cumulative distribution function (cdf). a useful analogue of the distance between typical values might be the distance between discrete distributions, density functions or cdfs.

if data are available on intraspecific trait variation, trait values form a continuous distribution. first, separate density functions have to be fitted within each species. then, the density function of the community-level distribution can be calculated as the weighted sum of species-level density functions (carmona, de bello, mason, & lepš, ). if such data are not available, we can use relative abundances as estimates of the probabilities of the corresponding trait values. pairs of trait values and their probabilities form a discrete distribution.

similarity of density functions can be measured by their overlap (see appendix s for an overview of overlap measures). overlap functions between within-species trait distributions have already proved useful in the quantification of between-species niche segregation (macarthur & levins , mouillot et al. ) or trait-based dissimilarity of species (lepš et al. , de bello et al. ).
nevertheless, they are perfectly applicable to the community level as well.

gregorius et al. ( ) proposed an index called delta for the quantification of differences between discrete trait distributions. delta is the minimal sum of frequencies shifted from one trait state to another trait state, weighted by the differences between the respective states. minimizing the sum of shifted frequencies is known in linear programming as the transportation problem (hitchcock ). due to its relatively high computational demand, it is unfeasible for the large compositional and trait data matrices typically used in ecological research; therefore, we exclude this index from our comparison.

the difference between two cdfs can be calculated at each possible trait value (i.e. not only the observed ones), then the sum of these differences can be used as a trait-based dissimilarity measure. in appendix s we introduce the distance between cdfs in more detail.

maximally distinct communities
species-based dissimilarities, except euclidean, manhattan and (non-normalized) canberra distances, equal unity, which is their maximum, when the two compared communities do not share any species. in this context, we could call such communities maximally distinct. however, when traits are considered, two communities can be similar even if they do not share any species. for example, if each species of community a is replaced by a similar species in community b, the two communities have no shared species, but from a functional point of view they are similar. in this context, two communities are maximally distinct when the similarity of any species from the first community to any species in the other community is zero. it is a desirable property for a functional similarity index to take the value 0 if and only if the two compared communities are maximally distinct.
probabilistic approach
this approach can be traced back to the diversity framework proposed by rao ( ), and recently extended by pavoine & ricotta ( ). rao's within-community diversity is defined as the expected dissimilarity between two randomly drawn individuals from a single community:

\[ Q(p) = \sum_{i}\sum_{j} p_i p_j \delta_{ij} \]

where $p_i$ is the relative abundance of the ith species in the community and $\delta_{ij}$ is the dissimilarity between species i and j. this has become a widely used index of functional alpha diversity (botta-dukát ). likewise, a between-community component of diversity, $Q(p,q)$, can be defined as the expected dissimilarity between two random individuals, each selected from a different community:

\[ Q(p,q) = \sum_{i}\sum_{j} p_i q_j \delta_{ij} \]

between-community diversity can be expressed using the within-community diversity of the two original communities ($Q(p)$ and $Q(q)$) and that of the community with mean relative abundances, $Q\left(\frac{p+q}{2}\right)$. because $\delta_{ij}$ is symmetric,

\[ Q\left(\frac{p+q}{2}\right) = \sum_{i}\sum_{j} \frac{p_i+q_i}{2}\,\frac{p_j+q_j}{2}\,\delta_{ij} = \frac{1}{4}\sum_{i}\sum_{j} \left(p_i p_j + q_i q_j + 2 p_i q_j\right)\delta_{ij} = \frac{Q(p)+Q(q)}{4} + \frac{Q(p,q)}{2} \]

subtracting the mean within-community diversity from the between-community diversity leads to rao's dissimilarity (also called disc):

\[ D_Q = Q(p,q) - \frac{Q(p)+Q(q)}{2} = \sum_{i}\sum_{j} p_i q_j \delta_{ij} - \frac{1}{2}\left(\sum_{i}\sum_{j} p_i p_j \delta_{ij} + \sum_{i}\sum_{j} q_i q_j \delta_{ij}\right) = 2\left[Q\left(\frac{p+q}{2}\right) - \frac{Q(p)+Q(q)}{2}\right] \]

where $p_i$ and $q_i$ are the relative abundances of species i in the two communities. champely and chessel ( ) proved that if δ has the squared euclidean property, rao's quadratic entropy is a concave function, i.e. $Q\left(\frac{p+q}{2}\right)$ is higher than or equal to the mean of $Q(p)$ and $Q(q)$. thus, under this condition, $D_Q \geq 0$.
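as a minimal numerical sketch of the quantities defined above, the following code computes q(p), q(p,q) and dq for two small communities. the three trait values, the abundance vectors and the function names are hypothetical; squared trait differences are used as δ so that the squared euclidean condition holds.

```python
def rao_q(p, q, delta):
    """expected dissimilarity between an individual drawn from p and one from q."""
    n = len(p)
    return sum(p[i] * q[j] * delta[i][j] for i in range(n) for j in range(n))

def rao_dissimilarity(p, q, delta):
    """between-community diversity minus mean within-community diversity."""
    return rao_q(p, q, delta) - 0.5 * (rao_q(p, p, delta) + rao_q(q, q, delta))

traits = [0.0, 0.5, 1.0]  # one hypothetical trait value per species
delta = [[(ti - tj) ** 2 for tj in traits] for ti in traits]  # squared euclidean

p = [0.5, 0.5, 0.0]  # relative abundances in the first community
q = [0.0, 0.5, 0.5]  # relative abundances in the second community
mean_pq = [(pi + qi) / 2 for pi, qi in zip(p, q)]

d_q = rao_dissimilarity(p, q, delta)
via_mean = 2 * (rao_q(mean_pq, mean_pq, delta)
                - 0.5 * (rao_q(p, p, delta) + rao_q(q, q, delta)))
print(d_q, via_mean)
```

both expressions give the same value, and with this choice of δ the result also equals the squared difference between the two community-weighted mean trait values (0.25 and 0.75 in this example).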
if $0 \leq \delta_{ij} \leq 1$, then $\sum_{i}\sum_{j} p_i q_j \delta_{ij}$, which is the weighted average of between-species distances, also has to be within this range. therefore, $0 \leq D_Q \leq 1$. however, dq may be much less than 1, even if the two communities are completely distinct, when $Q(p)$ and $Q(q)$ are high. therefore, pavoine & ricotta ( ) suggested dividing dq by its theoretical maximum (see equations and in pavoine & ricotta ). they recognized that the resulting indices are representatives of a broader family of indices, hereafter called dsimcom, which are implementations of rao's between-community and within-community components of diversity in the similarity formulae designed for presence/absence data. for this index, it is necessary to introduce the similarity between species, $\varepsilon_{ij} = 1 - \delta_{ij}$. the expected similarity between individuals of different communities, $A = \sum_{i}\sum_{j} p_i q_j \varepsilon_{ij}$, is taken as analogous with the shared diversity, a, of the similarity indices for presence/absence data disregarding species properties, while the expected similarities within communities ($A + B = \sum_{i}\sum_{j} p_i p_j \varepsilon_{ij}$ and $A + C = \sum_{i}\sum_{j} q_i q_j \varepsilon_{ij}$) are analogous with the species numbers (a+b, a+c). in this way, pavoine & ricotta ( ) presented formulae following the sokal & sneath, jaccard, sørensen, and ochiai indices. additionally, a formula analogous with whittaker's effective species turnover ($\beta = \gamma/\alpha - 1$; whittaker , tuomisto a) is suggested for two communities, which in similarity form is shown to be identical with the overlap index of chiu et al. ( ). in this formulation $\gamma = A + B + C$ and $\alpha = (2A + B + C)/2$. pavoine & ricotta ( ) showed that members of the
dsimcom family provide meaningful values also if absolute abundances, percentage values or binary occurrences are used instead of relative abundances. when $\varepsilon_{ij}$ contains taxonomical similarities, its off-diagonal elements are 0, and $A = a$, $B = b$, and $C = c$.

it is worth noting the inherent link between dq and cwmdis on the basis of the geometric interpretation by pavoine ( ) and ricotta et al. ( ). pavoine ( ) showed that if between-species dissimilarities are in the form $\delta_{ij} = d_{ij}^2/2$ and $d_{ij}$ is euclidean embeddable, dq is half the squared euclidean distance between the centroids of two communities, a function monotonically related with cwmdis, the simple euclidean distance between centroids of communities. as ricotta et al. ( ) argue, if species relatedness is only described by a dissimilarity matrix, which is the common case in phylogenetic analyses, species can be mapped into a principal coordinate analysis ordination using $d_{ij}$. given the euclidean embeddable property of $d_{ij}$, this ordination should produce s-1 or fewer ordination axes, all with positive eigenvalues. ordination scores for species can be used as traits, and therefore centroids of communities and (squared) euclidean distances between communities can be calculated. in the special case when between-species dissimilarities are euclidean distances, dq must be equal to the euclidean distance between the weighted averages of traits, that is, cwmdis.

it is also notable that swenson et al. ( ) and swenson ( ) use the quantity $Q(p,q)$ as a standalone index of pairwise beta diversity and call it dpw or "rao's d". the latter name is misleading since rao ( ) himself denoted the disc (or dq) index by dij. $Q(p,q)$ measures dissimilarity between two communities, but the dissimilarity of a community from itself is not zero. swenson ( ) also presents a standardized version of $Q(p,q)$ under the name rao's h.
with this formula the dissimilarity of a community to itself is rescaled; however, its transformation to a meaningful scale where each community has dissimilarity value zero towards itself is not elaborated. due to this drawback, we do not consider these indices in our review of functional dissimilarity measures.

schmidt et al. ( ) proposed probabilistic indices with weighted and unweighted versions for expressing community similarity on the basis of taxa interaction networks (called tina, taxa interaction-adjusted) and phylogenetic relatedness (pina, phylogenetic interaction-adjusted). tina and pina differ only in what type of data the interaction matrix contains. notably, the functional formula of weighted tina is identical with the ochiai version of dsimcom. however, the unweighted tina, abbreviated tu, is not a special case of tina, which we consider an inconsistency. therefore, we did not include tu as a separate index.

ordinariness-based approach
with respect to functional alpha diversity, leinster & cobbold ( ) introduced the concept of species ordinariness, defined as the weighted sum of relative abundances of species similar to a focal species within the same community, or in other words, the expected similarity of an individual of the focal species and an individual chosen randomly from the same community. according to ricotta & pavoine ( ) it is straightforward to replace abundances with ordinariness values in the species-based (dis-)similarity indices. following this concept, ricotta & pavoine ( ) introduced a new family of trait-based similarity measures called dissabc. dissabc applies the schemes of jaccard, sørensen, ochiai, kulczynski, sokal & sneath, and simpson indices.
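the notion of ordinariness can be illustrated with a few lines of code. this is only a sketch of the verbal definition above (similarity-weighted sum of relative abundances); the function name, the similarity matrix and the abundance vector are hypothetical, and it does not reproduce the dissabc formulae themselves.

```python
def ordinariness(p, eps):
    """expected similarity between each focal species and a random individual.

    p   : relative abundances of the species in one community
    eps : symmetric between-species similarity matrix with ones on the diagonal
    """
    n = len(p)
    return [sum(eps[i][j] * p[j] for j in range(n)) for i in range(n)]

# hypothetical similarities among three species
eps = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]
p = [0.5, 0.5, 0.0]

ord_trait = ordinariness(p, eps)
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
ord_taxon = ordinariness(p, identity)
print(ord_trait, ord_taxon)
```

when species are treated as maximally different (identity matrix), ordinariness reduces to the relative abundances themselves, which is why substituting ordinariness for abundance generalizes the species-based indices. note that the third species, although absent, receives a non-zero ordinariness because a similar species is present.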
either relative or absolute abundances can be chosen as input values. species ordinariness values can be calculated either with respect to the pooled species list of the two communities under comparison, or to the total species list of the data matrix.

for species-based analyses, ricotta & podani ( ) suggested a general formula of distance measures in which community dissimilarity is calculated by the weighted averaging of species-level differences in abundance. from this formula, a normalized canberra distance, bray-curtis distance, marczewski-steinhaus index, and an evenness-based dissimilarity index (ricotta ) can be derived. according to pavoine & ricotta ( ), by replacing species abundances with species ordinariness values, a meaningful dissimilarity index can be designed, which is called generalized_tradidiss. additionally, this index contains a factor which weights the contribution of each species to the overall dissimilarity between the two communities. this weight can be set to give equal weight to all species or to weight them proportionally to their relative abundance in the pooled communities.

diversity partitioning approach
following the work of hill ( ), a community with diversity of order q, ${}^{q}D$, is as diverse as a theoretical community containing ${}^{q}D$ equally abundant species. the order of diversity, q, expresses the weight given to differences in species abundance, q = 0 representing the presence/absence case, q = ∞ considering only the relative abundance of the most abundant species in the community.
without accounting for interspecific similarities, there is an emerging consensus that using effective numbers (also called numbers of equivalents) is a straightforward way of partitioning diversity into within-community (alpha), between-community (beta) and across-community (gamma) components (jost ). of these three, the between-community component, beta diversity, can be interpreted as a form of dissimilarity when applied to two communities (ricotta ). beta diversity can be derived from alpha and gamma diversity in a multiplicative (beta = gamma/alpha) or an additive way (beta = gamma - alpha). jost ( ) and chao et al. ( ) argued that multiplicative beta diversity is a useful way of quantifying community differentiation; however, due to its scaling between 1 and n (n being the number of communities) it is not comparable across samples containing different numbers of communities. to remove this dependence, they offer three solutions with which the value of multiplicative beta can be normed. although for pairwise comparisons n is always 2, it seems straightforward to follow these recommendations, since the scaling between 0 and 1 has several advantages, and most other indices also share this property. the rescaling formulae of chao et al. ( ) embody different concepts of community (dis-)similarity, which together we call the family of multiplicative beta indices.

the first formula is the relative turnover rate per community, which is a linear transformation of beta to the normed scale:

\[ \beta_{turnover}\langle q \rangle = \frac{{}^{q}\beta - 1}{N - 1} \]

here 0 means identical species composition, while 1 indicates totally distinct communities. in the pairwise comparison (n = 2), $\beta_{turnover}\langle q \rangle = {}^{q}\beta - 1$.
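the following sketch shows how a multiplicative beta and its turnover rescaling could be computed for two communities. it is not taken from the paper: the function names are hypothetical, the two communities are given equal weight, alpha is computed as the effective number corresponding to the mean of the within-community sums of p^q (an assumed definition, following the equal-weight case of jost's partitioning), and the case q = 1 is deliberately left out because it requires a separate limit formula.

```python
def power_sum(p, q):
    """sum of p_i^q over the species that are present."""
    return sum(x ** q for x in p if x > 0)

def hill_number(p, q):
    """effective number of species of order q (q != 1)."""
    if q == 1:
        raise ValueError("q = 1 needs the limit formula, not covered in this sketch")
    return power_sum(p, q) ** (1.0 / (1.0 - q))

def multiplicative_beta(p1, p2, q):
    """gamma / alpha for two equally weighted communities (assumed alpha definition)."""
    pooled = [(x + y) / 2 for x, y in zip(p1, p2)]
    gamma = hill_number(pooled, q)
    alpha = ((power_sum(p1, q) + power_sum(p2, q)) / 2) ** (1.0 / (1.0 - q))
    return gamma / alpha

def beta_turnover(beta, n=2):
    """relative turnover rate per community: (beta - 1) / (n - 1)."""
    return (beta - 1) / (n - 1)

distinct = beta_turnover(multiplicative_beta([0.5, 0.5, 0, 0], [0, 0, 0.5, 0.5], q=2))
partial = beta_turnover(multiplicative_beta([0.5, 0.5, 0], [0, 0.5, 0.5], q=0))
print(distinct, partial)
```

for two communities with no shared species the turnover reaches its maximum, and for q = 0 the partially overlapping pair gives (b + c)/(2a + b + c), the sørensen dissimilarity of the presence/absence data.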
the second index measures homogeneity, and is a linear transformation of the inverse of beta. since the complement of homogeneity is heterogeneity, we call its dissimilarity form βheterogeneity:

\[ \beta_{heterogeneity}\langle q \rangle = 1 - \frac{1/{}^{q}\beta - 1/N}{1 - 1/N} \]

when n = 2, $\beta_{het}\langle q \rangle = 2 - 2/{}^{q}\beta$. with q = 0 (presence/absence case) the index is identical with the jaccard index, while with q = 2 (abundance case) it is the morisita & horn index.

the third index measures overlap between communities, whose counterpart is segregation, thus we call it βsegr:

\[ \beta_{segregation}\langle q \rangle = 1 - \frac{(1/{}^{q}\beta)^{q-1} - (1/N)^{q-1}}{1 - (1/N)^{q-1}} \]

with q = 0, $\beta_{segregation}\langle q \rangle = \beta_{turnover}\langle q \rangle$, and both give the sørensen index.

according to leinster & cobbold ( ), it is possible to implement species similarities in the calculation of effective numbers. this way, the meaning of ${}^{q}D^{Z}$ is the diversity of a theoretical community with ${}^{q}D^{Z}$ equally abundant and maximally different species. hence, both unevenness in the abundance structure and the between-species similarities decrease the value of the effective species number. since diversity is measured in effective numbers, it is possible to partition diversity into alpha, beta, and gamma fractions (leinster & cobbold ; botta-dukát ) in the multiplicative way. then, this multiplicative beta can be rescaled using the formulae proposed by chao et al. ( ). these indices behave consistently only if abundances are taken into account as relative abundances.

nearest neighbour approach
the earliest representatives of this family were shown by clarke & warwick ( ) and izsák & prince ( ), then ricotta & burrascano ( ), and ricotta & bacaro ( ; see dcw and dip indices). later ricotta et al.
( ) introduced a new, general family called paddis. all these indices were primarily defined for presence/absence data. the approach is based on a re-definition of the b and c quantities of the 2 × 2 contingency table. looking at species as maximally different, and taking x and y as the two communities under comparison, b can be viewed as the total uniqueness of community x. the uniqueness of a single species in x is 1 if it is absent in y, otherwise it is 0. therefore, b is the sum of species uniqueness values. however, from a functional perspective, the uniqueness of a species present only in x should be between 0 and 1 if it is absent in y but a similar species is present there. therefore, it is possible to define the analogue of b which accounts for similarities between species:

\[ B = \sum_{i \in X} \left[ 1 - \max_{j \in Y} \varepsilon_{ij} \right] \]

the same logic applies for c, the uniqueness of community y, whose analogue C expresses the degree of uniqueness:

\[ C = \sum_{j \in Y} \left[ 1 - \max_{i \in X} \varepsilon_{ij} \right] \]

ricotta et al. ( ) define the A quantity as follows:

\[ A = a + (b - B) + (c - C) \]

having A, B, and C defined as analogues of a, b, and c, it is now possible to design trait-based similarity measures following the logic of the jaccard, sørensen, sokal & sneath, kulczynski, ochiai and simpson indices. it is notable that ricotta et al. ( ) define A as a quantity that ensures that the components A, B and C add up to a + b + c, but with no explicit biological interpretation. notably, dip and dcw are identical with the sørensen and kulczynski forms of paddis. the generalization of dip and dcw to relative abundances, dcw(q), was also derived by ricotta & bacaro ( ).
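the nearest neighbour quantities can be sketched as follows. the species names, similarity values and function names are hypothetical, and the formula for the A component follows the reading that A, B and C must add up to a + b + c; the sketch is meant as an illustration of the idea rather than as the reference implementation of paddis.

```python
def nearest_neighbour_components(x, y, eps):
    """A, B, C analogues of the a, b, c cells for presence/absence data.

    x, y : sets of species names present in the two communities
    eps  : nested dict of between-species similarities in [0, 1]
    """
    # uniqueness of each species = 1 - similarity to its nearest neighbour
    B = sum(1 - max(eps[i][j] for j in y) for i in x)
    C = sum(1 - max(eps[i][j] for i in x) for j in y)
    a, b, c = len(x & y), len(x - y), len(y - x)
    A = a + (b - B) + (c - C)  # keeps A + B + C equal to a + b + c
    return A, B, C

species = ["s1", "s2", "s3"]
pairs = {("s1", "s2"): 0.2, ("s1", "s3"): 0.8, ("s2", "s3"): 0.2}
eps = {s: {t: 1.0 if s == t else pairs.get((s, t), pairs.get((t, s)))
           for t in species} for s in species}

x = {"s1", "s2"}
y = {"s2", "s3"}
A, B, C = nearest_neighbour_components(x, y, eps)

trait_based_jaccard = (B + C) / (A + B + C)
taxon_based_jaccard = len(x ^ y) / len(x | y)
print(trait_based_jaccard, taxon_based_jaccard)
```

because the unshared species s1 and s3 are functionally similar (similarity 0.8), the trait-based jaccard dissimilarity is much lower than its taxon-based counterpart.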
for these two versions, it is not necessary to explicitly define the A component. using the relationships between the jaccard, sørensen, kulczynski, ochiai and sokal & sneath indices, it is theoretically possible to derive the extension of paddis to relative abundances from dcw(q); however, the biological interpretation of A remains dubious in this framework.

methods

the performance of fdissim indices can be reliably tested only on data sets with known background processes driving community assembly, a condition that real data can hardly satisfy. therefore, we compared the performance of fdissim indices using simulated data sets. the data sets were generated using the comm.simul function of the comsimitv r package (botta-dukát & czúcz , botta-dukát ). this function follows an individual-based model for a meta-community comprising n communities and a regional pool of s species. local communities include j individuals, and are distributed equidistantly along a continuous environmental gradient (with gradient values between and ). each individual possesses three traits: an ‘environmental’ trait, a ‘competitive’ trait, and a neutral trait, all ranging on [ ; ]. intraspecific variation in trait values is neglected in the simulation, that is, individuals belonging to the same species are identical. the environmental trait defines the optimum of the species along the environmental gradient. the closer the position of a community along the environmental gradient is to the environmental trait value of a species, the more suitable it is for that species:

[equation: suitability of a community for a species as a function of the difference between the community's gradient position and the species' environmental trait value, with width parameter σ]
where σ (sigma) is adjustable so as to change the niche width of the species, and hence the length of the gradient (see later). the competitive trait represents the resource acquisition strategy of the individual. the more similar this trait value is between two individuals, the stronger the competition between them, which means that intraspecific competition is the strongest. the neutral trait has no effect on community assembly, thus it is not considered in our study. the simulation starts with the random assignment of all individuals of all communities to species. the second step is a ‘disturbance’ event, when one individual ‘dies’ in each community. this individual is to be replaced by an offspring of other individuals within the same community or those of other communities. each individual produces one offspring or does not reproduce. the probability of reproduction depends on the strength of competition. the offspring remains in the same community or randomly disperses into any of the other communities. finally, the dead individual is replaced by one new individual from the seeds produced and dispersed. the probability that an individual of a certain species replaces the dead individual is defined by the number of seeds of that species and the suitability of the habitat. the steps between the disturbance event and the establishment of a new individual constitute a single ‘generation’. community composition is evaluated after a large number of generations. the strength of the environmental filtering can be adjusted by the sigma parameter.
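to make the role of sigma concrete, the following python sketch shows how a niche-width parameter can control habitat suitability. the gaussian shape of the kernel is an assumption made here for illustration only; it is not taken from the text or from the comsimitv package, and the function name is hypothetical.

```python
import math


def suitability(gradient_position, env_trait, sigma):
    """Assumed Gaussian-shaped suitability kernel, scaled to 1 at the optimum.

    gradient_position: position of the community on the environmental gradient
    env_trait: environmental trait value (optimum) of the species
    sigma: niche width; small values mean strong environmental filtering
    """
    if sigma == 0:
        # Maximal specialists: suitable only exactly at the optimum.
        return 1.0 if gradient_position == env_trait else 0.0
    if math.isinf(sigma):
        # Maximal generalists: every point of the gradient is equally suitable.
        return 1.0
    z = (gradient_position - env_trait) / sigma
    return math.exp(-0.5 * z * z)


# The same distance from the optimum is penalised more under a narrow niche.
narrow = suitability(0.2, 0.5, 0.1)
wide = suitability(0.2, 0.5, 1.0)
print(narrow, wide)
```

under this assumed kernel, lowering sigma shrinks the part of the gradient where a species can establish, which corresponds to stronger environmental filtering and a longer gradient in terms of species turnover.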
when sigma is 0, all species are maximally specialist, which means that they can occur only at the optimum point of the gradient (that is, at the exact value of the environmental trait). if sigma is infinity, species are maximally generalist and all points along the environmental gradient are equally suitable for them. therefore, sigma is the parameter which defines the suitability of each point of the gradient for each species based on its distance from the respective optimum. we generated data sets with sigma values . , . , . , . , and in order to simulate situations with different strength of environmental filtering. the number of communities was , each community comprised individuals, the number of species in the species pool was , the simulation iterated for generations, and we allowed no intraspecific trait variation. for all the other parameters, we used the default options. however, it requires further explanation which real situations the six simulated levels of environmental filtering represent. to provide a reference and assist interpretation, we calculated two species-based beta-diversity measures, the multiplicative beta (whittaker ) and the gradient length of the first axis of a detrended correspondence analysis (dca) ordination (hill & gauch ; appendix s , fig. s . ). the former gives the number of distinct communities present in the total species pool of the gradient, while the latter is the minimal number of average niche breadths (also called turnover units) necessary for covering the total gradient length. moreover, we plotted the abundance of species in the sample units along the gradient as a visual tool for assessing gradient length (appendix s , fig. s . ). all these methods indicated that with sigma = . the gradient is extremely long: there are more than distinct communities and nearly turnover units along the gradient. samples with such
high beta diversity are very rare and special in real ecological research; therefore, findings from simulations with sigma = . are mostly of theoretical importance. beta diversity values from sigma = . to sigma = are more similar to real study situations, hence they should be more relevant for practice. at sigma = , environmental filtering is practically not operating; between-community variation is driven by interspecific relations and chance. we calculated between-species dissimilarities as the gower distance between their environmental trait values, which in this case equals the euclidean distance scaled to [0; 1]. these distances had to be transformed to similarities according to the requirements of the fdissim indices. several formulae are available for this; however, they may assume different functional relationships between similarity and distance. one formula we used is the linear transformation similarity = 1 - distance. besides this, we also used similarity = e^(-u × distance), which assumes a curvilinear relationship between similarity and distance (leinster & cobbold ). with this exponential formula, it is possible to weight the importance of small gower distances between species relative to large distances. by changing the parameter u it is possible to adjust how steeply similarity decreases with increasing distance. we set u = , which leads to a relatively steep decline. although after this transformation the minimal value for similarity is higher than zero, we considered it negligibly low (e^(- ) ≈ . ), so we did not apply the transformation proposed by botta-dukát ( ). for all fdissim indices where it was necessary, we used the similarity matrix or a dissimilarity matrix calculated as dissimilarity = 1 - similarity as input.
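the two transformations described above can be sketched in python as follows. the trait values and the value of u are hypothetical and chosen only for illustration (the value of u used in the study is not reproduced here); for a single numeric trait, the gower distance is taken as the absolute trait difference divided by the trait range.

```python
import math


def gower_distance_single_trait(values):
    """Pairwise Gower distances for one numeric trait: |xi - xj| / (max - min)."""
    span = max(values) - min(values)
    n = len(values)
    if span == 0:
        return [[0.0] * n for _ in range(n)]
    return [[abs(vi - vj) / span for vj in values] for vi in values]


def linear_similarity(distance):
    """similarity = 1 - distance"""
    return 1.0 - distance


def exponential_similarity(distance, u):
    """similarity = exp(-u * distance); larger u gives a steeper decline."""
    return math.exp(-u * distance)


traits = [0.1, 0.15, 0.6, 0.9]  # hypothetical environmental trait values
u = 5.0                         # illustrative value only

dist = gower_distance_single_trait(traits)
lin = [[linear_similarity(d) for d in row] for row in dist]
expo = [[exponential_similarity(d, u) for d in row] for row in dist]

# Dissimilarity matrices used as input where an index requires them.
lin_dissim = [[1.0 - s for s in row] for row in lin]
expo_dissim = [[1.0 - s for s in row] for row in expo]
print(lin[0][1], expo[0][1])
```

at small distances the exponential form drops faster than the linear one, so closely related species are separated more strongly; at the maximal distance the exponential similarity stays slightly above zero.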
the dissimilarity matrix is identical with the gower distance matrix if the similarities were calculated in a linear way, but in the other case, it keeps the exponential relationship between distance and (dis-)similarity. dissimilarity matrices were calculated for the four community data sets with different sigma values, with the two functions transforming gower distances, and across a broad range of available fdissim indices. for indices where either absolute or relative abundances could be taken into account, we opted for relative abundance for the sake of better comparability. with generalized_tradidiss, we calculated the ‘even’ and the ‘uneven’ weighting versions. the entire analysis was run with abundance and presence/absence data. some fdissim indices are only suitable for binary data, thus the number of indices applied for relative abundance and binary data was and , respectively. in cases of indices handling both data types, we used exactly the same version of the index as with abundance data, hence communities with different numbers of species were given equal weight due to division by community totals. additionally, dissimilarity matrices were also calculated using the bray-curtis index (for binary data: sørensen index in dissimilarity form) to provide a contrast against the case disregarding between-species dissimilarities. then, for each dissimilarity matrix, we conducted two types of analyses. first, we compared how strongly the dissimilarity indices correlate with the environmental distance using kendall tau rank correlation.
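a minimal python sketch of this comparison is given below. it implements kendall's tau-b directly for transparency (the analysis itself was done in r) and applies it to the lower triangles of a dissimilarity matrix and an environmental distance matrix. the gradient positions and the saturating dissimilarity function are hypothetical.

```python
import math


def lower_triangle(matrix):
    """Flatten the strictly lower triangle of a square matrix into a list."""
    return [matrix[i][j] for i in range(len(matrix)) for j in range(i)]


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation (O(n^2) implementation with tie correction)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    concordant = discordant = tied_x = tied_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    denom = math.sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    return (concordant - discordant) / denom if denom > 0 else float("nan")


# Hypothetical positions of four communities along the gradient.
positions = [0.0, 0.25, 0.5, 1.0]
env_dist = [[abs(p - q) for q in positions] for p in positions]

# Hypothetical dissimilarity that saturates with environmental distance.
dissim = [[1.0 - math.exp(-3.0 * d) for d in row] for row in env_dist]

tau = kendall_tau_b(lower_triangle(dissim), lower_triangle(env_dist))
print(tau)
```

because the rank correlation depends only on the ordering of the values, a dissimilarity that increases monotonically but non-linearly with environmental distance still reaches the maximal tau.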
this gives an estimate of how well a dissimilarity index reveals the monotonic relationship between the trait composition of local communities and the environmental gradient. we visually assessed the shape of the relationship between dissimilarity and environmental distance in the case of the lowest sigma (i.e., the longest gradient), when the distortion of the linear relationship between the two is supposed to be the strongest. then, to disentangle the effects of different methodological decisions and the sigma parameter on the correlation between fdissim indices and environmental distance, we fitted a random forest model. in this model the dependent variable was the kendall tau correlation coefficient, while the independent variables were sigma, the data type (abundance vs. presence/absence), the transformation method for gower distances (linear vs. exponential), and the fdissim method. within approaches, fdissim methods were often strongly correlated, which resulted in very similar kendall’s tau values. therefore, only the sørensen/bray-curtis versions of dsimcom, dissabc, paddis/dcw, generalized_tradidiss with uneven weights, as well as βturnover, cwmdis, and cdfdis were included in this analysis. variable importance scores (vis) in the random forest were estimated by the permutation approach based on mean decrease in log-likelihood using the varimp function of the partykit package. the effects of the model terms were also illustrated by heat maps. all statistical analyses were done in r (r core team ) using the fd (laliberté & legendre , laliberté et al. ), adiv (pavoine a,b), comsimitv (botta-dukát ), vegan (oksanen et al. ), desctools (signorell et al. ), partykit (hothorn et al. , strobl et al.
, strobl et al. , hothorn & zeileis ) packages.

results

kendall tau correlation coefficients decreased as the strength of environmental filtering decreased (that is, with increasing sigma) in all examined cases. for fdissim indices which handled both data types, presence/absence data resulted in lower correlations than abundance data for all indices. for most indices, this difference was highest at intermediate values of sigma. these trends were consistent between the linear and the exponential transformations. correlations for all indices at all sigma values with linear transformation are shown in table for abundance data and in table for presence/absence data. in most simulation scenarios, the fdissim indices correlated more strongly with the environmental gradient than the species-based bray-curtis index. however, on several occasions, indices belonging to the nearest neighbour family performed worse than the species-based dissimilarity. notably, at the highest sigma and with presence/absence data, all indices showed correlations near zero, but among them the bray-curtis index had the highest correlation with environmental distance. as expected, we found perfect rank correlations among the jaccard, sørensen, sokal-sneath and whittaker’s beta versions of dsimcom, among the jaccard, sørensen and sokal-sneath forms of dissabc, between dip and the sørensen form of paddis (only for presence-absence data), between dcw and the kulczynski form of paddis (only for presence-absence data), and between dip and dcw (for abundance data). dissimilarity indices showed various shapes of relationship with environmental distance (appendix s ).
at the strongest environmental filtering, all fdissim indices had dissimilarity values near zero at minimal environmental distance; only the species-based bray-curtis index had a dissimilarity near . at the smallest environmental distances. in the case of linear transformation of gower distances and presence/absence data, an approximately linear relationship was found for cwmdis, cdfdis, dq, the sørensen and ochiai forms of dsimcom, the jaccard form of dissabc, the marczewski-steinhaus form of generalized_tradidiss with both weighting versions, βheterogeneity and βsegregation, although most other indices showed only a small degree of distortion of the linear function (figure s . ). an exponential relationship was found for the evenness-based (pe) form of generalized_tradidiss. notably, the taxon-based bray-curtis index had the steepest asymptotic function among all. in the case of exponential transformation, all other indices relying on between-species dissimilarities showed an asymptotic curve (figure s . ). in the random forest, niche width (that is, sigma) acquired by far the highest variable importance score (vis= . ). the less important variables were the data type (vis= . ), the dissimilarity method (vis= . ) and the transformation (vis=- . ). the heat map (figure ) also revealed a strong decrease in correlation along increasing sigma. it also clearly shows that in most cases abundance data resulted in significantly higher correlation than presence/absence data. the difference between linear and exponential transformation methods was not always visible. regarding variation between dissimilarity indices, the most striking pattern was the relatively poor performance of the paddis/dcw indices.
all but the latter index combined with abundance data and linear transformation of dissimilarities led to the highest correlation with environmental distance.

discussion

general patterns in the correlation with environmental distance

we ran different simulation scenarios with varying strength of environmental filtering. we expected the correlation between fdissim indices and environmental distance to be highest when the environmental filtering is the strongest, and to become neutral when environmental filtering is not effective. when environmental filtering was strongest (that is, minimal overlap of species niches along the environmental gradient), all fdissim indices correlated highly with the environmental gradient. as expected, the correlation between trait dissimilarity and environmental distance decreased as filtering weakened; moreover, differences between families of indices became more apparent. this result suggests that all tested methods are able to reveal strong environmental filtering processes. as the contribution of competitive exclusion and stochastic processes approaches or overrides environmental filtering, the correlation between fdissim indices and the background gradient becomes weaker. this decrease itself is not a drawback of the fdissim methods; rather, it is a consequence of our study design, since we applied a series of scenarios where the effect of niche filtering was decaying. however, we think that the degree of the decrease reflects the sensitivity of the fdissim indices to the underlying trait-environment relationship. indices which showed high correlation with environmental distance could be capable of revealing the environmental signal even when it is weak. in our tests, most indices reached similarly high correlation, and there were only a few combinations of simulation parameters
which resulted in a decreased correlation with environmental distance for some dissimilarity indices.

determinants of the correlation based on the random forest model

the random forest model revealed that gradient length is the most important determinant of the correlation between dissimilarity and environmental distance, while methodological decisions had much lower variable importance. these observations suggest that the absolute value of the correlation between dissimilarity and environmental distance depends primarily on the sample in hand, and can be influenced by methodological decisions only to a limited extent. correlations were stronger with abundance than with presence/absence data. this finding is at least partly attributable to our simulation design, where community composition was driven by individual-based processes: birth, fitness difference, reproduction, and death. as a result, species relative abundances had to be proportional to their environmental suitability in the local community. transforming such data to binary scale loses meaningful information and weakens the correlation between dissimilarity in trait composition and environmental background. in cases when presences and absences of species respond more robustly to the main environmental gradient, while relative abundances change stochastically or abundance estimations are inaccurate, the binary data type might be more straightforward. transforming between-species dissimilarities has the potential to meet distributional requirements, to approximate expert intuitions about relatedness of species, or to customize sensitivity to functional difference with respect to specific research aims.
for most indices across the tested range of gradient length and data type, the exponential transformation resulted in a somewhat lower correlation than the linear transformation. more insight is provided by examining the shape of the relationships besides the pure correlation value. after linear transformation of gower distances, most dissimilarity indices showed a linear or slightly curved function along environmental distance, although the scatter of the evenness-based generalized_tradidiss differed considerably from the straight line towards an exponentially increasing one. after exponential transformation of between-species trait dissimilarities, all indices in the direct dissimilarity-based class showed a rather steeply increasing asymptotic function. this result suggests that with the exponential transformation of between-species dissimilarities, it is possible to make fdissim indices more sensitive to smaller differences in functional composition. summary-based indices (cwmdis, cdfdis) are not affected by this transformation, since they are not based on between-species dissimilarities.

comparison of taxon-based vs. trait-based dissimilarity

the basic assumption of functional ecology is that the traits of individuals should be more closely related to ecological properties than their taxonomic status. following this argument, we expected trait-based dissimilarity measures to correlate more strongly with the environmental background than species-based indices.
in contrast, a higher correlation of species-based dissimilarity than of trait-based dissimilarity indicates a loss of information with the introduction of between-species similarity, which is nonsensical since our data were simulated so as to possess a strong pattern in the trait-environment relationship. we used the sørensen/bray-curtis index in a dissimilarity form as a reference method representing species-based dissimilarity calculations disregarding traits. our expectation was fulfilled by all indices with the exception of the members of the nearest neighbour family (dip, dcw and paddis). we suspect two potential reasons behind the low performance of this latter group of indices. the first one is the improper scaling factor used for standardizing the ‘operational part’ of the indices (see the description of the paddis family and the discussion about it under the paragraph “within-family variation of indices”). second, these indices rely on the quantities of minimally different species in the two communities under comparison. however, the minimum is a less robust descriptor of any sample distribution because of its dependency on sampling error; therefore, it might provide a poor representation of total community dissimilarity. although we did not include dissimilarity values at exactly zero distance, the y-intercept (also called ‘nugget’) of the dissimilarity vs. environmental distance functions can be extrapolated with negligible error (fortin & dale ). brownstein et al. ( ) argued that the nugget of the distance decay relationship is a direct estimate of the amount of chance in the variation between local communities.
in this respect it is worth noting that the nugget with the species-based bray-curtis index was near . , while with all trait-based indices the nugget was near zero. this suggests that without accounting for species similarities, environmental distance between communities can be overestimated due to similar species replacing each other.

within-family variation of indices

the perfect correlation between the jaccard, sørensen and sokal-sneath forms of the dsimcom and dissabc families was expected, since the original, taxon-based jaccard, sørensen and sokal-sneath indices are algebraically related, too (janson & vegelius ). however, for paddis, the jaccard, sørensen and sokal-sneath forms showed correlation below . in this family, the B and C components of the 2 × 2 contingency table are defined as measurable quantities with a clear interpretation: the sum of species uniqueness values within each community. the total diversity (A+B+C) is defined to be equal to the species richness of the pooled pair of communities (a+b+c), and the quantity A is derived by subtracting (B+C) from it. with this definition, A remains a virtual quantity with no biological interpretation. in paddis indices, the trait-based quantities B and C appear in the numerator (the ‘operational part’ sensu ricotta et al. ) of the indices, while in the denominators (i.e., in the ‘scaling factor’) the taxon-based quantities a, b and c are used. we argue that the inconsistent behaviour of paddis is due to the application of taxon-based quantities for scaling factors of trait-based operational parts.
at the same time, we acknowledge that we do not see an obvious way either to define total diversity or shared diversity more realistically according to the uniqueness-based idea behind paddis. in the generalized_tradidiss family, the trait-based analogue of the bray-curtis index can be achieved by calculating the generalized canberra distance with uneven weighting of species. we expected this to be perfectly correlated with the marczewski-steinhaus form of the generalized_tradidiss index with uneven weighting, since the bray-curtis and marczewski-steinhaus indices are the abundance forms of the sørensen and jaccard indices, respectively. however, the correlation between them was lower. in the generalized_tradidiss family, between-community dissimilarity is calculated as a weighted sum of standardized differences in species ordinariness values. species ordinariness is calculated on the basis of species abundance and trait values; however, the weights used for adjusting species-level contributions are derived solely from abundances. therefore, generalized_tradidiss also follows a ‘hybrid’ approach in accounting for taxon-based vs. trait-based information. we argue that this is the reason why the algebraic relationships between the original sørensen and jaccard indices do not apply to its sørensen/bray-curtis-type and jaccard/marczewski-steinhaus-type forms. to sum up, we point to our observation that the jaccard, sørensen and sokal-sneath forms of certain families of indices do not satisfy the algebraic relationships they are supposed to, opening space for potential confusion. these algebraic relations hold only if the a, b and c quantities are explicitly and consistently defined.
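the algebraic argument can be checked with a short python sketch. when both indices are built from the same a, b and c, the sørensen dissimilarity is a strictly increasing function of the jaccard dissimilarity, D_S = D_J / (2 - D_J), so their rank correlation is perfect. the ‘hybrid’ functions below are a simplified, hypothetical stand-in for indices that put a trait-based quantity in the numerator and taxon-based counts in the denominator; the numbers are invented for illustration.

```python
def jaccard_dissimilarity(a, b, c):
    return (b + c) / (a + b + c)


def sorensen_dissimilarity(a, b, c):
    return (b + c) / (2 * a + b + c)


def sorensen_from_jaccard(d_j):
    """Monotone map linking the two indices when a, b, c are defined consistently."""
    return d_j / (2 - d_j)


# Consistent definition: the relationship holds exactly.
d_j = jaccard_dissimilarity(3, 2, 1)
d_s = sorensen_dissimilarity(3, 2, 1)
print(d_j, d_s, sorensen_from_jaccard(d_j))


def hybrid_jaccard(trait_numerator, a, b, c):
    """Trait-based numerator (B + C), taxon-based denominator (hypothetical form)."""
    return trait_numerator / (a + b + c)


def hybrid_sorensen(trait_numerator, a, b, c):
    return trait_numerator / (2 * a + b + c)


# Two hypothetical community pairs: (B + C, a, b, c).
pair1 = (0.4, 1, 1, 1)
pair2 = (0.5, 0, 2, 2)

# The jaccard-type form ranks pair1 as more dissimilar, the sørensen-type form pair2.
print(hybrid_jaccard(*pair1), hybrid_jaccard(*pair2))
print(hybrid_sorensen(*pair1), hybrid_sorensen(*pair2))
```

the rank reversal in the hybrid case arises because the numerator and the denominator no longer refer to the same partition of diversity, so the one-to-one mapping between the two forms is lost.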
families of fdissim indices combine abundance differences of species between plots and interspecific trait differences in a unique way, while indices belonging to the same family differ in how they relate this amount of ‘unshared’ variation (summarized as the b and c portions of the contingency table) to the shared (a) variation. some indices are able to handle abundances either as absolute or relative abundance (e.g. dsimcom, generalized_tradidiss, dissabc), while others divide absolute abundances by their sum over the respective community, thus they work only with relative abundances. when indices in the former group are set to consider absolute abundances, they become sensitive to variation in the summed abundances of the communities under comparison. to place our tests on a common ground, we simulated communities with an equal total number of individuals, and set all indices, where relevant, to work with relative abundances. hence, we removed the effect of differences in total abundance. the constant number of individuals might have increased the similarity between fdissim indices belonging to the same family and the correlation with the environmental gradient. the sum of abundances, whatever quantitative scale it is measured on, may vary considerably in real study situations due to aggregated distribution of individuals or uneven sampling effort. therefore, our findings are more likely valid for settings where the sum of abundances is relatively stable, e.g. when sampling effort is controlled and individuals are dispersed evenly, or when abundances are recorded on a percentage scale.

limitations of our study

in our study, we simulated a research situation in a simplistic way.
we applied only one environmental gradient, which operated as an environmental filter driving convergence on a single trait. besides this, we applied another trait which was constantly affected by a low level of competitive exclusion. these two traits were uncorrelated. nevertheless, there was some effect of random drift on community composition due to the probabilistic components of the simulation algorithm. we varied the strength of environmental filtering, thus it had a different relative contribution compared with competitive exclusion and stochasticity. in real research situations local trait composition is influenced by a wide range of processes, including several abiotic and biotic filters acting simultaneously. unless they are manipulated as parts of an experimental system, the full set of such filters is usually unknown to the researchers. the multiplicity of filters may reduce the ability of fdissim indices to recover trait-environment relationships. further research should clarify how increasing complexity of the sample affects the behaviour of fdissim indices.

conclusions

considering the diversity of concepts they are built upon, fdissim indices showed unexpectedly low variation in performance. cwmdis, dsimcom and generalized_tradidiss acquired the highest correlation with environmental distance in all simulation scenarios, therefore they seem to be equally suitable for quantifying pairwise beta diversity based on traits. nevertheless, the most important determinant of the match between trait-based dissimilarity and environmental distance is the length of the trait gradient. besides this, the data type (presence/absence vs.
abundance) also affected the correlation more strongly than the choice of fdissim method. extending the comparative tests of fdissim measures to more complex gradients and real data sets could offer further insight into their behaviour.

data availability

simulated data were generated using the comsimitv r package. our own functions for functional dissimilarity indices are made available through the zenodo public repository: . /zenodo. .

author contributions

a.l. designed and carried out the analysis and led the writing; z.b.d. discussed the concept and the results, wrote parts of and commented on the manuscript.

references

anderson, m. j., crist, t. o., chase, j. m., vellend, m., inouye, b. d., freestone, a. l., sanders, n. j., cornell, h. v., comita, l. s., davies, k. f., harrison, s. p., kraft, n. j. b., stegen, j. c. & swenson, n. g. ( ). navigating the multiple meanings of β diversity: a roadmap for the practicing ecologist. ecology letters, ( ), - . doi: . /j. - . . .x

anderson, m. j., ellingsen, k. e. & mcardle, b. h. ( ). multivariate dispersion as a measure of beta diversity. ecology letters, ( ), - . doi: . /j. - . . .x

baselga, a. & leprieur, f. ( ). comparing methods to separate components of beta diversity. methods in ecology and evolution, : - . doi: . / - x.

botta-dukát, z. & czúcz, b. ( ). testing the ability of functional diversity indices to detect trait convergence and divergence using individual-based simulation. methods in ecology and evolution, , - . https://doi.org/ . / - x.

botta-dukát, z. ( ). rao's quadratic entropy as a measure of functional diversity based on multiple traits. journal of vegetation science, , - . https://doi.org/ . /j. - . .tb .x

botta-dukát, z. ( ).
the generalized replication principle and the partitioning of functional diversity into independent alpha and beta components. ecography, : - . doi: . /ecog.
botta-dukát, z. ( ). comsimitv: flexible framework for simulating community assembly. r package version . . . https://cran.r-project.org/package=comsimitv
brownstein, g., steel, j.b., porter, s., gray, a., wilson, c., wilson, p.g. & wilson, j. b. ( ). chance in plant communities: a new approach to its measurement using the nugget from spatial autocorrelation. journal of ecology, , - . https://doi.org/ . /j. - . . .x
cardoso, p., rigal, f., carvalho, j.c., fortelius, m., borges, p.a.v., podani, j. & schmera, d. ( ). partitioning taxon, phylogenetic and functional beta diversity into replacement and richness difference components. journal of biogeography, , - . doi: . /jbi.
carmona, c. p., de bello, f., mason, n. w. h., lepš, j. ( ). traits without borders: integrating functional diversity across scales. trends in ecology and evolution ( ), - . doi: . /j.tree. . .
champely, s., chessel, d. ( ). measuring biological diversity using euclidean metrics. environmental and ecological statistics , – . https://doi.org/ . /a:
chao, a., chiu, c. and hsieh, t.c. ( ). proposing a resolution to debates on diversity partitioning. ecology, , - . https://doi.org/ . / - .
chao, a., chiu, c.-h., villéger, s., sun, i-f., thorn, s., lin, y.-c., chiang, j.-m., & sherwin, w. b. ( ). an attribute-diversity approach to functional diversity, functional beta diversity, and related (dis)similarity measures. ecological monographs, ( ), e . . /ecm.
chiu, c.-h., jost, l. & chao, a. ( ). phylogenetic beta diversity, similarity, and differentiation measures based on hill numbers.
ecological monographs, ( ), - .
clarke, k.r. & warwick, r.m. ( ). quantifying structural redundancy in ecological communities. oecologia, ( ), - .
de bello, f., carmona, c.p., mason, n.w.h., sebastià, m.-t. and lepš, j. ( ). which trait dissimilarity for functional diversity: trait means or trait overlap? journal of vegetation science, , - . doi: . /jvs.
de bello, f., lepš, j., lavorel, s., & moretti, m. ( ). importance of species abundance for assessment of trait composition: an example based on pollinator communities. community ecology, ( ), – . https://doi.org/ . /comec. . . .
díaz, s., & cabido, m. ( ). vive la différence: plant functional diversity matters to ecosystem processes. trends in ecology and evolution, ( ), – . https://doi.org/ . /s - ( ) -
faith, d. p., minchin, p. r. & belbin, l. ( ). compositional dissimilarity as a robust measure of ecological distance. vegetatio , - .
fortin, m.-j. & dale, m.r.t. ( ). spatial data analysis: a guide for ecologists. cambridge university press, cambridge.
garnier, e., cortez, j., billès, g., navas, m., roumet, c., debussche, m., laurent, g., blanchard, a., aubry, d., bellmann, a., neill, c. & toussaint, j. ( ). plant functional markers capture ecosystem properties during secondary succession. ecology, , - . doi: . / -
gregorius, h.-r., gillet, e.m. & ziehe, m. ( ). measuring differences of trait distributions between populations. biometrical journal, , - . https://doi.org/ . /bimj.
grime, j. p. ( ). benefits of plant diversity to ecosystems: immediate, filter and founder effects. journal of ecology, , – .
hawkins, b.a., leroy, b., rodríguez, m.á., singer, a., vilela, b., villalobos, f., wang, x. & zelený, d. ( ). structural bias in aggregated species-level variables driven by repeated species co-occurrences: a pervasive problem in community and assemblage data. journal of biogeography, , - .
hérault, b., & honnay, o. ( ). using life-history traits to achieve a functional classification of habitats. applied vegetation science, ( ), – . https://doi.org/ . /j. - x. .tb .x
hill, m. o. & gauch, h. g. ( ). detrended correspondence analysis: an improved ordination technique. vegetatio, , – .
hill, m. o. ( ). diversity and evenness: a unifying notation and its consequences. ecology, ( ), – .
hitchcock, f.l. ( ). distribution of a product from several sources to numerous localities. journal of mathematical physics, : - .
hothorn, t., hornik, k., van de wiel, m. a. & zeileis, a. ( ). a lego system for conditional inference. the american statistician, ( ), – .
hothorn, t., zeileis, a. ( ). partykit: a modular toolkit for recursive partytioning in r. journal of machine learning research, , - . url http://jmlr.org/papers/v /hothorn a.html
izsák, c., & price, r. g. ( ). measuring β-diversity using a taxonomic similarity index, and its relation to spatial scale. marine ecology progress series , – .
janson, s. & j. vegelius ( ). measures of ecological association. oecologia, ( ), - .
jost, l. ( ). partitioning diversity into independent alpha and beta components. ecology, , – .
kleyer, m., dray, s., bello, f., lepš, j., pakeman, r.j., strauss, b., thuiller, w. & lavorel, s. ( ).
assessing species and community functional responses to environmental gradients: which multivariate methods? journal of vegetation science, , - . doi: . /j. - . . .x
koleff, p., gaston, k. j. & lennon, j. j. ( ). measuring beta diversity for presence–absence data. journal of animal ecology, , - . doi: . /j. - . . .x
laliberté, e. & p. legendre ( ). a distance-based framework for measuring functional diversity from multiple traits. ecology, , - .
laliberté, e., legendre, p., & shipley, b. ( ). fd: measuring functional diversity from multiple traits, and other tools for functional ecology. r package version . - .
legendre, p. & legendre, l. ( ). numerical ecology. elsevier, amsterdam, nl.
legendre, p., de cáceres, m. ( ). beta diversity as the variance of community data: dissimilarity coefficients and partitioning. ecology letters , – .
leinster, t. & cobbold, c.a. ( ). measuring diversity: the importance of species similarity. ecology, , - . doi: . / - .
lengyel, a. & podani, j. ( ). assessing the relative importance of methodological decisions in classifications of vegetation data. journal of vegetation science, , - . doi: . /jvs.
lengyel, a., swacha, g., botta-dukát, z. & kacki, z. ( ). trait-based numerical classification of mesic and wet grasslands in poland. journal of vegetation science, , – . https://doi.org/ . /jvs.
lepš, j., de bello, f., lavorel, s. & berman, s. ( ). quantifying and interpreting functional diversity of natural communities: practical considerations matter. preslia, , – .
macarthur, r., levins, r. ( ). limiting similarity convergence and divergence of coexisting species. american naturalist, , – .
mason, n. w. h., mouillot, d., lee, w. g. & wilson, j. b. ( ).
functional richness, functional evenness and functional divergence: the primary components of functional diversity. oikos, , - . doi: . /j. - . . .x
mcgill, b., enquist, b. j., weiher, e., westoby, m. ( ). rebuilding community ecology from functional traits. trends in ecology and evolution ( ), - .
mouchet, m.a., villéger, s., mason, n.w.h. and mouillot, d. ( ). functional diversity measures: an overview of their redundancy and their ability to discriminate community assembly rules. functional ecology, , - . doi: . /j. - . . .x
mouillot, d., stubbs, w., faure, m., dumay, o., tomasini, j.a., wilson, j.b. & chi, t.d. ( ). niche overlap estimates based on quantitative functional traits: a new family of non-parametric indices. oecologia, , – .
muscarella, r. & uriarte, m. ( ). do community-weighted mean functional traits reflect optimal strategies? proceedings of the royal society b, , . https://doi.org/ . /rspb. .
nipperess, d.a., faith, d.p. & barton, k. ( ). resemblance in phylogenetic diversity among ecological assemblages. journal of vegetation science, , - . doi: . /j. - . . .x
oksanen, j., blanchet, f.g., friendly, m., kindt, r., legendre, p., mcglinn, d., minchin, p. r., o'hara, r. b., simpson, g. l., solymos, p., stevens, m. h. m., szoecs, e. & wagner, h. ( ). vegan: community ecology package. r package version . - . https://cran.r-project.org/package=vegan
pavoine, s. & ricotta, c. ( ). functional and phylogenetic similarity among communities. methods in ecology and evolution, , - .
pavoine, s. & ricotta, c. ( ). measuring functional dissimilarity among plots: adapting old methods to new questions. ecological indicators, , - .
pavoine, s. ( ).
clarifying and developing analyses of biodiversity: towards a generalisation of current approaches. methods in ecology and evolution, , - . doi: . /j. - x. . .x
pavoine, s. ( ). a guide through a family of phylogenetic dissimilarity measures among sites. oikos, , - . doi: . /oik.
pavoine, s. ( ). adiv: an r package to analyse biodiversity in ecology. methods in ecology and evolution, , – . https://doi.org/ . / - x.
peres-neto, p.r., dray, s. & ter braak, c.j.f. ( ). linking trait variation to the environment: critical issues with community-weighted mean correlation resolved by the fourth-corner approach. ecography, , - .
petchey, o. l. & gaston, k. j. ( ). functional diversity: back to basics and looking forward. ecology letters, , - . doi: . /j. - . . .x
podani, j. & schmera, d. ( ). a new conceptual and methodological framework for exploring and explaining pattern in presence–absence data. oikos, , - . doi: . /j. - . . .x
podani, j. ( ). introduction to the exploration of multivariate biological data. backhuys, leiden, nl.
r core team ( ). r: a language and environment for statistical computing. r foundation for statistical computing, vienna, austria. https://www.r-project.org/.
rao, c. r. ( ). diversity and dissimilarity coefficients: a unified approach. theoretical population biology, , - .
ricotta, c. & burrascano, s. ( ). beta diversity for functional ecology. preslia, , – .
ricotta, c. & g. bacaro. ( ). on plot-to-plot dissimilarity measures based on species functional traits. community ecology, , – .
ricotta, c. & j. podani. ( ). on some properties of the bray-curtis dissimilarity and their ecological meaning. ecological complexity, , – .
ricotta, c. & pavoine, s. ( ).
measuring similarity among plots including similarity among species: an extension of traditional approaches. journal of vegetation science, , - . doi: . /jvs.
ricotta, c. ( ). of beta diversity, variance, evenness, and dissimilarity. ecology and evolution , – . https://doi.org/ . /ece .
ricotta, c. ( ). a family of (dis)similarity measures based on evenness and its relationship with beta diversity. ecological complexity, , - . doi: . /j.ecocom. . .
ricotta, c., bacaro, g., caccianiga, m., cerabolini, b.e.l. & moretti, m. ( ). a classical measure of phylogenetic dissimilarity and its relationship with beta diversity. basic and applied ecology ( ), - . https://doi.org/ . /j.baae. . .
ricotta, c., podani, j., pavoine, s. ( ). a family of functional dissimilarity measures for presence and absence data. ecology and evolution, , – . doi: . /ece .
schmidt, t., matias rodrigues, j. & von mering, c. ( ). a family of interaction-adjusted indices of community similarity. isme journal , – . https://doi.org/ . /ismej. .
signorell, a. et mult. al. ( ). desctools: tools for descriptive statistics. r package version . . .
strobl, c., boulesteix, a.l., kneib, t., augustin, t. & zeileis, a. ( ). conditional variable importance for random forests. bmc bioinformatics, ( ). http://www.biomedcentral.com/ - / /
strobl, c., boulesteix, a.l., zeileis, a. & hothorn, t. ( ). bias in random forest variable importance measures: illustrations, sources and a solution. bmc bioinformatics, , . http://www.biomedcentral.com/ - / /
swenson, n. g., anglada-cordero, p. & barone, j. a. ( ). deterministic tropical tree community turnover: evidence from patterns of functional beta diversity along an elevational gradient.
proceedings of the royal society b, , – .
swenson, n. g. ( ). phylogenetic beta diversity metrics, trait evolution and inferring the functional beta diversity of communities. plos one ( ), e . https://doi.org/ . /journal.pone.
tamás, j., podani, j. & csontos, p. ( ). an extension of presence/absence coefficients to abundance data: a new look at absence. journal of vegetation science, , - . doi: . /
tuomisto, h. ( a). a diversity of beta diversities: straightening up a concept gone awry. part . defining beta diversity as a function of alpha and gamma diversity. ecography, , - . doi: . /j. - . . .x
tuomisto, h. ( b). a diversity of beta diversities: straightening up a concept gone awry. part . quantifying beta diversity and related phenomena. ecography, , - . doi: . /j. - . . .x
villéger, s., mason, n.w.h. & mouillot, d. ( ). new multidimensional functional diversity indices for a multifaceted framework in functional ecology. ecology, , - . doi: . / - .
violle, c., navas, m.-l., vile, d., kazakou, e., fortunel, c., hummel, i. & garnier, e. ( ). let the concept of trait be functional! oikos, , - . doi: . /j. - . . .x
whittaker, r. h. ( ). vegetation of the siskiyou mountains, oregon and california. ecological monographs, , – .
whittaker, r. h. ( ). evolution and measurement of species diversity. taxon, , - . doi: . /
zelený, d. ( ). which results of the standard test for community weighted mean approach are too optimistic? journal of vegetation science , - .
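the presence-absence resemblance indices compared in this study (sørensen, ochiai, kulczynski, simpson, jaccard, sokal & sneath) can be illustrated with a minimal python sketch. it assumes the standard textbook species-based definitions, where a is the number of species shared by two plots and b and c are the numbers of species found in only one of them, and it takes each dissimilarity as one minus the similarity. the function names are hypothetical; this is not the authors' r code and it does not cover the trait-based generalizations of these indices.

```python
from math import sqrt


def abc_counts(plot_x, plot_y):
    """Return (a, b, c): shared species, species only in x, species only in y."""
    x, y = set(plot_x), set(plot_y)
    if not x or not y:
        # With an empty plot some denominators below would be zero.
        raise ValueError("both plots must contain at least one species")
    return len(x & y), len(x - y), len(y - x)


def similarities(a, b, c):
    """Similarity forms of six presence-absence indices (textbook definitions)."""
    return {
        "sorensen": 2 * a / (2 * a + b + c),
        "ochiai": a / sqrt((a + b) * (a + c)),
        "kulczynski": 0.5 * (a / (a + b) + a / (a + c)),
        "simpson": a / (a + min(b, c)),
        "jaccard": a / (a + b + c),
        "sokal_sneath": a / (a + 2 * b + 2 * c),
    }


def dissimilarities(a, b, c):
    """Dissimilarity forms, taken here as 1 - similarity for every index."""
    return {name: 1.0 - s for name, s in similarities(a, b, c).items()}


# Hypothetical example plots: two shared species, one and two unique species.
a, b, c = abc_counts(["sp1", "sp2", "sp3"], ["sp2", "sp3", "sp4", "sp5"])
for name, value in dissimilarities(a, b, c).items():
    print(f"{name}: {value:.3f}")
```

all six indices are functions of the same three counts, so they differ only in how strongly shared and unshared species are weighted.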
tables and figures

table . similarity and dissimilarity forms of resemblance indices for presence-absence data

name of the index; similarity version; dissimilarity version
sørensen; $\frac{2a}{2a+b+c}$; $\frac{b+c}{2a+b+c}$
ochiai; $\frac{a}{\sqrt{(a+b)(a+c)}}$; $1-\frac{a}{\sqrt{(a+b)(a+c)}}$
kulczynski; $\frac{1}{2}\left(\frac{a}{a+b}+\frac{a}{a+c}\right)$; $\frac{1}{2}\left(\frac{b}{a+b}+\frac{c}{a+c}\right)$
simpson; $\frac{a}{a+\min(b,c)}$; $\frac{\min(b,c)}{a+\min(b,c)}$
jaccard; $\frac{a}{a+b+c}$; $\frac{b+c}{a+b+c}$
sokal & sneath; $\frac{a}{a+2b+2c}$; $\frac{2b+2c}{a+2b+2c}$

table . classification of trait-based dissimilarity indices. in the input data type columns, x-es indicate whether abundance (a), relative abundance (r), and presence-absence (p/a) data can be used as input.

class; approach; family; references; input data type (a, r, p/a); r function
summary-based; typical value; cwm-based; ricotta et al. ( ); x x x; fd:::functcomp
summary-based; distribution-based; cdf-based; appendix s; x x x; our new functions, see data availability
direct dissimilarity; probabilistic; disc/dq; rao , pavoine & ricotta ( ); x x x; adiv::sq
direct dissimilarity; probabilistic; dsimcom; pavoine & ricotta ( ); x x x; adiv:::dsimcom
direct dissimilarity; ordinariness-based; dissabc; pavoine & ricotta ( ); x x x; adiv:::dissabc
direct dissimilarity; ordinariness-based; generalized_tradidiss; pavoine & ricotta ( ); x x; adiv:::generalized_tradidiss
direct dissimilarity; diversity partitioning; multiplicative beta; chao et al. ( ); x; our new functions, see data availability
direct dissimilarity; nearest neighbour; dcw, dcw(q); clarke & warwick ( ), ricotta & bacaro ( ); x x; our new functions, see data availability
direct dissimilarity; nearest neighbour; dip; izsák & price ( ), ricotta & bacaro ( ); x x; our new functions, see data availability
direct dissimilarity; nearest neighbour; paddis; ricotta et al. ( ); x; adiv:::paddis
classification-based; not discussed; not discussed; hérault & honnay ( ), nipperess et al. ( ), cardoso et al. ( ), pavoine ( )

table . kendall tau correlations between environmental distance and the functional dissimilarity measures at different values of sigma and with abundance data

type; sigma= . sigma= . sigma= . sigma= . sigma= sigma=
cwmdis; . . . . . .
cdfdis; . . . . . .
d(q); . . . . . .
dsimcom.ss; . . . . . .
dsimcom.jac; . . . . . .
dsimcom.sor; . . . . . .
dsimcom.och; . . . . . .
dsimcom.beta; . . . . . .
dissabc.jac; . . . . . .
dissabc.sor; . . . . . .
dissabc.ss; . . . . . .
dissabc.och; . . . . . .
dissabc.kul; . . . . . .
dissabc.si; . . . . . .
tradidiss.gc.even; . . . . . .
tradidiss.ms.even; . . . . . .
tradidiss.pe.even; . . . . . .
tradidiss.gc.uneven; . . . . . .
tradidiss.ms.uneven; . . . . . .
tradidiss.pe.uneven; . . . . . .
βturnover; . . . . . .
βheterogeneity; . . . . . .
βsegregation; . . . . . .
dip; . . . . . .
dcw; . . . . . .
bray-curtis (species-based); . . . . . .

table . kendall tau correlations between environmental distance and the functional dissimilarity measures at different values of sigma and with presence/absence data

type; sigma= . sigma= . sigma= . sigma= . sigma= sigma=
cwmdis; . . . . . .
cdfdis; . . . . . - .
d(q); . . . . . - .
dsimcom.ss; . . . . . - .
dsimcom.jac; . . . . . - .
dsimcom.sor; . . . . . - .
dsimcom.och; . . . . . .
dsimcom.beta; . . . . . - .
dissabc.jac; . . . . . - .
dissabc.sor; . . . . . - .
dissabc.ss; . . . . . - .
dissabc.och; . . . . . - .
dissabc.kul; . . . . . - .
dissabc.si; . . . . . .
tradidiss.gc.even; . . . . . - .
tradidiss.ms.even; . . . . . - .
tradidiss.pe.even; . . . . . - .
tradidiss.gc.uneven; . . . . . - .
tradidiss.ms.uneven; . . . . . - .
tradidiss.pe.uneven; . . . . . - .
βturnover; . . . . . - .
βheterogeneity; . . . . . - .
βsegregation; . . . . . - .
dip; . . . . . - .
dcw; . . . . . - .
paddis.jac; . . . . . - .
paddis.sor; . . . . . - .
paddis.ss; . . . . . - .
paddis.och; . . . . . - .
paddis.simp; . . . . . .
paddis.kul; . . . . . - .
sørensen (species-based); . . . . . .

figure . heat maps showing the interactive effects of niche width (sigma), transformation of between-species dissimilarities (lin = linear, exp = exponential), data type (abund = abundance, p/a = presence/absence), and dissimilarity index ( – cwmdis, – cdfdis, – dq, – dsimcom/sørensen, – dissabc/sørensen, – generalized_tradidiss/generalized canberra, uneven weighting, – βturnover, – dcw) on the correlation with environmental distance

competitive binding of stats to receptor phospho-tyr motifs accounts for altered cytokine responses in autoimmune disorders

stephan wilmes *, polly-anne jeffrey *, jonathan martinez-fabregas , maximillian hafer , paul fyfe , elizabeth pohler , silvia gaggero , martín lópez-garcía , grant lythe , thomas guerrier , david launay , mitra suman , jacob piehler , carmen molina-parís # and ignacio moraga #

division of cell signalling and immunology, school of life sciences, university of dundee, dundee, uk.
department of applied mathematics, school of mathematics, university of leeds, leeds, uk.
department of biology and centre of cellular nanoanalytics, university of osnabrück, osnabrück, germany.
université de lille, inserm umr cnrs umr –canther and institut pour la recherche sur le cancer de lille (ircl), lille, france.
univ. lille, inserm, chu lille, u - infinite - institute for translational research in inflammation, f- lille, france.
* these authors contributed equally to this work
# these authors share senior authorship

abstract

cytokines elicit pleiotropic and non-redundant activities despite strong overlap in their usage of receptors, jak and stat molecules. we use il- and il- to ask how two cytokines activating the same signaling pathway can have different biological roles. we found that il- induces more sustained stat phosphorylation than il- , with the two cytokines inducing comparable levels of stat phosphorylation. mathematical and statistical modelling of il- and il- signaling identified stat binding to gp , and stat binding to il- ra, as the main dynamical processes contributing to sustained pstat by il- . mutation of tyr on il- ra decreased il- -induced stat phosphorylation by % but had limited effect on stat phosphorylation. strong receptor/stat coupling by il- initiated a unique gene expression program, which required sustained stat phosphorylation and irf expression and was enriched in classical interferon-stimulated genes. interestingly, the stat/receptor coupling exhibited by il- /il- was altered in patients with systemic lupus erythematosus (sle). il- /il- induced a more potent stat activation in sle patients than in healthy controls, which correlated with higher stat expression in these patients. partial inhibition of jak activation by sub-saturating doses of tofacitinib specifically lowered the levels of stat activation by il- . our data show that receptor and stat concentrations critically contribute to shaping cytokine responses and generate functional pleiotropy in health and disease.

biorxiv preprint, this version posted january , . doi: https://doi.org/ . / . . . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by . international license.
introduction il- and il- both have intricate functions regulating inflammatory responses ( ). il- is a hetero-dimeric cytokine comprised of p and ebi subunits ( ). il- exerts its activities by binding gp and il- rα receptor subunits on the surface of responsive cells, triggering the activation of the jak /stat /stat signaling pathway. il- elicits both pro- and anti-inflammatory responses, although the latter activity seems to be the dominant one ( ). il- stimulation inhibits rorgt expression, thereby suppressing th- commitment and limiting subsequent production of pro-inflammatory il- ( , ). moreover, il- induces a strong production of anti-inflammatory il- in (tbet+ and foxp -) tr- cells ( - ), further contributing to limiting the inflammatory response. il- engages a hexameric receptor complex comprised of two copies each of il- ra, gp and il- ( ), triggering the activation, as il- does, of the jak /stat /stat signaling pathway. however, in contrast to il- , il- is known as a paradigm pro-inflammatory cytokine ( , ). il- inhibits lineage differentiation to treg cells ( ) while promoting th- ( , ), thus supporting its pro-inflammatory role. how il- and il- elicit opposite immuno-modulatory activities despite activating almost identical signaling pathways is currently not completely understood. the relative and absolute stat activation levels seem to have intricate roles, which lead to a strong signaling and functional plasticity by cytokines. although il- robustly activates stat , it is capable of mounting a considerable stat response as well ( ). moreover, in the absence of stat , il- induces a strong stat response comparable to ifng – a prototypic stat activating cytokine ( ). likewise, the absence of stat potentiates the stat response for il- , which normally elicits a strong stat response, enabling it to mount an il- -like response ( ). 
furthermore, negative feedback mechanisms like socss and phosphatases have been described as critical players influencing stat and stat phosphorylation kinetics and thereby shaping their signal integration for gp -utilizing cytokines ( - ). yet, how all these molecular components are integrated by a given cell to produce the desired response is still an open question. among the il- /il- cytokine family, il- exhibits a unique stat activation pattern. the majority of gp -engaging cytokines activate preferentially stat , with activation of stat being an accessory or balancing component ( , ). il- , however, triggers stat and stat activation with high potency ( ). indeed, different studies have shown that il- responses rely on either stat ( - ) or stat activation ( , ). moreover, recent transcriptomics studies showed that in the absence of stat , il- and il- lost more than % of target gene induction. yet, stat was the main factor driving the specificity of the il- versus the il- response, highlighting a critical interplay of stat and stat engagement ( ). while the biological responses induced by il- and il- have been extensively studied ( , ), the very initial steps of signal activation and kinetic integration by these two cytokines have not been comprehensively analysed. since the different biological outcomes elicited by il- and il- are most likely encoded in the early events of cytokine stimulation, here we specifically aimed to identify the molecular determinants underlying functional selectivity by il- in human t-cells. we asked how a defined cytokine stimulus is propagated in time over multiple layers of signaling to produce the desired response. 
to this end, we probed il- and il- signaling at different scales, ranging from cell surface receptor assembly and early stat / effector activation to an unbiased and quantitative multi-omics approach: phospho-proteomics after early cytokine stimulation, kinetics of transcriptomic changes and alteration of the t-cell proteome upon prolonged cytokine exposure. il- and il- induced similar levels of assembly of their respective receptor complexes, which resulted in comparable phosphorylation of stat by the two cytokines. il- , on the other hand, triggered a more sustained stat phosphorylation. to decipher the molecular events which determine sustained stat phosphorylation by il- , we mathematically modelled the stat and stat signaling kinetics induced by each of these cytokines. we identified differential binding of stat and stat to il- ra and gp , respectively, as the main factor contributing to a sustained stat activation by il- . at the transcriptional level, il- triggered the expression of a unique gene program, which strictly required the cooperative action between sustained pstat and irf expression to drive the induction of an interferon-like gene signature that profoundly shaped the t-cell proteome. interestingly, our mathematical models of il- and il- signaling predicted that changes in receptor and stat expression could fundamentally change the magnitude and timescale of the il- and il- responses. we found high levels of stat expression in sle patients when compared to healthy donors, which correlated with biased stat responses induced by il- and il- in these patients. 
strikingly, we could specifically inhibit stat activation by il- using suboptimal doses of the jak inhibitor tofacitinib. this could provide a new strategy to specifically target individual stats engaged by cytokines. results: il- induces a more sustained stat activation than hypil- in human th- cells il- and il- are critical immuno-modulatory cytokines. while il- engages a hexameric surface receptor comprised of two molecules of il- ra and two molecules of gp to trigger the activation of stat and stat transcription factors (figure a), il- binds gp and il- ra to trigger activation of the same stat molecules (figure a). despite sharing a common receptor subunit, gp , and activating similar signaling pathways, these two cytokines exhibit non-redundant immuno-modulatory activities, with il- eliciting a potent pro-inflammatory response and il- acting more as an anti-inflammatory cytokine. here, we set out to investigate the molecular rules that determine the functional specificity elicited by il- and il- using human th- cells as a model experimental system. due to the challenging recombinant expression of the human il- , we have recombinantly produced a murine single-chain variant of il- (p and ebi ) which cross-reacts with the human receptors and triggers potent signaling, comparable to the signaling output produced by commercial human il- ( ) (supp. fig. a). in addition, we have used a linker-connected single-chain fusion protein of il- ra and il- termed hyperil- (hypil- ) ( ) to diminish il- signaling variability due to changes in il- ra expression during t cell activation ( ). 
cd + t cells from human buffy coat samples were isolated by magnetic activated cell sorting (macs) and grown under th- polarizing conditions. th- cells were then used to study in vitro signaling by il- and il- (supp. fig. b). we took advantage of a barcoding methodology allowing high-throughput multiparameter flow cytometry to perform detailed dose/response and kinetics studies induced by hypil- and il- in th- cells ( ) (supp. fig. b). dose-response experiments with il- and hypil- on th- cells showed concentration-dependent phosphorylation of stat and stat . phosphorylation of stat / was more sensitive to activation by il- with an ec of ~ pm compared to ~ pm for hypil- (figure b). despite this difference in sensitivity, both cytokines yielded the same activation amplitude for pstat . for pstat , however, we observed a significantly reduced maximal amplitude for hypil- relative to il- (figure b). we next performed kinetic studies to assess whether the poor stat activation by hypil- resulted from different activation kinetics. for stat , we saw the peak of phosphorylation after ~ - minutes, followed by a gradual decline. both cytokines exhibited an almost identical sustained pstat profile, with ~ % of activation still seen after h of continuous stimulation. interestingly, il- not only activated stat with higher amplitude but also in a more sustained manner than hypil- (figure c). this could be better appreciated when pstat levels were normalized to maximal mfi for each cytokine, with il- clearly inducing a more sustained phosphorylation of stat than hypil- (supp. fig. c). the same phenotype was observed in other t-cell subsets of activated pbmcs (supp. fig. d). as cell surface gp levels are significantly reduced upon t-cell activation ( ), we next investigated whether the transient stat activation profile induced by hypil- resulted from limited availability of gp . 
for that we generated an rpe cell clone stably expressing ten times higher levels of gp on its surface (figure d, right panel). stimulation of this rpe clone with hypil- resulted in a more sustained activation of stat , with very little effect on stat activation kinetics when compared to rpe wild type cells, suggesting that gp receptor density does not contribute to the transient stat activation kinetics elicited by hypil- (figure d). ligand-induced cell-surface receptor assembly by il- and hypil- we next investigated whether il- and hypil- elicited differential cell surface receptor engagement that could explain their distinct signaling output. for that, we measured the dynamics of receptor assembly in the plasma membrane of live cells by simultaneous dual-colour total internal reflection fluorescence (tirf) imaging. rpe cells were chosen as a model experimental system since they do not express endogenous il- ra (supp. fig. e). we used previously described rpe gp ko cells (supp. fig. a) ( ) to transfect and express tagged variants of il- ra and gp , to allow quantitative site-specific fluorescence cell surface labelling by dye-conjugated nanobodies (nbs) (figure e) as recently described in ( ). for both il- ra and gp we found a random distribution and unhindered lateral diffusion of individual receptor monomers (figure f). single molecule co-localization combined with co-tracking analysis was then used to identify correlated motion of il- ra and gp which was taken as a readout for receptor heterodimer formation ( ) (figure f, figure supp. movie ). 
in the resting state, we did not observe pre-assembly of il- ra and gp . however, after stimulation with il- we found substantial heterodimerization (figure f & g, supp. fig. b, figure supp. movie & ). at elevated laser intensities, bleaching analysis of individual complexes confirmed a one-to-one ( : ) complex stoichiometry of il- ra and gp , whereas single-molecule förster resonance energy transfer (fret) further corroborated close molecular proximity of the two receptor chains (figure h). we also observed association and dissociation events of receptor heterodimers, pointing to a dynamic equilibrium between monomers and dimers as proposed for other heterodimeric cytokine receptor systems ( , ) (figure supp. movie ). to measure homodimerization of gp by hypil- , we stochastically labelled gp with equal concentrations of the same nb species conjugated to either of the two dyes ( ). we saw strong homodimerization of gp after stimulation with hypil- (figure g, supp. fig. b, figure supp. movie ). homodimerization was confirmed either by single-color dual-step bleaching or dual-color single-step bleaching as shown for other homodimeric cytokine receptors (supp. fig. c) ( ). for both cytokine receptor systems, we saw a cytokine-induced reduction of the diffusion mobility, which has been ascribed to increased friction of receptor dimers diffusing in the plasma membrane. however, we note that hypil- stimulation impaired diffusion of gp more strongly than il- did, possibly indicating faster receptor internalization (supp. fig. d). based on the dimerization data, we were able to calculate the two-dimensional equilibrium dissociation constants ($K_D^{2D}$) according to the law of mass action for a dynamic monomer-dimer equilibrium: for il- -induced heterodimerization of il- ra and gp , we calculated a two-dimensional kd of ~ . µm- . in activated t-cells with high levels and a significant excess of il- ra over gp , this $K_D^{2D}$ ensures strong receptor assembly by il- ( ). 
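As a rough illustration of the law-of-mass-action calculation described above, the following Python sketch converts surface densities into a two-dimensional dissociation constant and back. The function names, the example densities in the usage note and the homodimer convention (monomer density squared over dimer density) are assumptions made for illustration. They are not taken from the authors' analysis code.

```python
import math


def kd_2d_heterodimer(total_a, total_b, dimer):
    """2D dissociation constant for A + B <-> AB.

    All densities are in molecules per square micrometre; the result
    has the same unit. `dimer` is the measured density of AB complexes.
    """
    if dimer <= 0:
        raise ValueError("dimer density must be positive")
    free_a = total_a - dimer
    free_b = total_b - dimer
    return free_a * free_b / dimer


def kd_2d_homodimer(total, dimer):
    """2D dissociation constant for M + M <-> MM (assumed convention:
    K = [M]^2 / [MM]); each dimer consumes two monomers."""
    if dimer <= 0:
        raise ValueError("dimer density must be positive")
    free = total - 2.0 * dimer
    return free * free / dimer


def heterodimer_density(total_a, total_b, kd):
    """Equilibrium AB density for given totals and K_D (inverse of
    kd_2d_heterodimer), from the physical root of the quadratic
    (A - x)(B - x) = kd * x."""
    s = total_a + total_b + kd
    return (s - math.sqrt(s * s - 4.0 * total_a * total_b)) / 2.0
```

With hypothetical densities, `heterodimer_density(10.0, 1.0, 0.5)` is larger than `heterodimer_density(1.0, 1.0, 0.5)`. This mirrors the argument that an excess of one receptor chain drives assembly of the limiting chain at a fixed dissociation constant.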
the two-dimensional kd for gp homodimerization by hypil- was ~ . µm- . this higher affinity is most likely due to the two high-affinity binding sites engaged in the hexameric receptor complex ( ). however, in t-cells the expression of gp can be particularly low, thus probably limiting hypil- . taken together, these experiments marked ligand-induced receptor assembly as the initial step triggering downstream signaling for both il- and hypil- , with no obvious differences in their receptor activation mechanism which could support the observed more sustained stat activation elicited by il- . mathematical and statistical analysis of hypil- and il- induced stat kinetic responses to gain further insight into the molecular rules and kinetics that define il- sustained stat phosphorylation, we developed two mathematical models of the initial steps of hypil- and il- receptor-mediated signaling, respectively. the mathematical model for each cytokine considers the following events: i) cytokine association and dissociation to a receptor chain (figure a, supp. fig. a and b, top panel), ii) cytokine-induced dimer association and dissociation (supp. fig. a and b, bottom panel), iii) stat (or stat ) binding and unbinding to dimer (supp. fig. c and d), iv) stat (or stat ) phosphorylation when bound to dimer (supp. fig. c and d), v) internalisation/degradation of complexes (supp. fig. e and f), and vi) dephosphorylation of free stat (or stat ) (supp. fig. g). 
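The events listed above can be sketched as a small system of rate equations. The Python example below is a deliberately reduced toy version and not the authors' model. It assumes that events i) and ii) are lumped into a pre-formed pool of signaling-competent dimers, that phosphorylation releases the STAT from the dimer, and that internalisation is first order. All parameter values and the function name `simulate_pstat` are hypothetical.

```python
def simulate_pstat(k_on, k_off, q=1.0, beta=0.1, d=0.05,
                   dimer0=1.0, stat0=10.0, t_end=100.0, dt=0.01):
    """Forward-Euler integration of a toy dimer/STAT scheme.

    k_on, k_off : STAT binding / unbinding to the dimer (event iii)
    q           : phosphorylation of dimer-bound STAT (event iv)
    beta        : internalisation of free and STAT-bound dimers (event v)
    d           : dephosphorylation of free pSTAT (event vi)
    Returns (times, free_pstat) as two lists of equal length.
    """
    dimer, stat, bound, pstat = dimer0, stat0, 0.0, 0.0
    times, trace = [0.0], [0.0]
    n_steps = int(round(t_end / dt))
    for step in range(1, n_steps + 1):
        bind = k_on * dimer * stat   # STAT docks onto a free dimer
        unbind = k_off * bound       # STAT leaves without being modified
        phos = q * bound             # bound STAT is phosphorylated and released
        d_dimer = -bind + unbind + phos - beta * dimer
        d_bound = bind - unbind - phos - beta * bound
        d_stat = -bind + unbind + d * pstat
        d_pstat = phos - d * pstat
        dimer += dt * d_dimer
        bound += dt * d_bound
        stat += dt * d_stat
        pstat += dt * d_pstat
        times.append(step * dt)
        trace.append(pstat)
    return times, trace


if __name__ == "__main__":
    # Same on-rate, different off-rates: a stand-in for strong vs weak
    # STAT binding to a receptor chain.
    _, strong = simulate_pstat(k_on=1.0, k_off=0.1)
    _, weak = simulate_pstat(k_on=1.0, k_off=10.0)
    print(f"peak pSTAT, slow unbinding: {max(strong):.2f}")
    print(f"peak pSTAT, fast unbinding: {max(weak):.2f}")
```

Even in this reduced form, changing only the STAT/receptor unbinding rate changes the amplitude of the pSTAT response, while internalisation and dephosphorylation make the response transient. A stiff-aware ODE solver would be preferable to fixed-step Euler for any quantitative use.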
details of model assumptions, model parameters and parameter inference have been provided in the material and methods under mathematical models and bayesian inference. we first wanted to explore if there existed a potential feedback mechanism in the way in which receptor molecules are internalised/degraded over time. to this end, and for each cytokine model, we considered two hypotheses: hypothesis assumes that receptor complexes (supp. fig. e and f) are internalised with rate proportional to the concentration of the species in which they are contained (e.g., different dimer types), and hypothesis , that receptor complexes are internalised with rate proportional to the product of the concentration of the species in which they are contained and the sum of the concentrations of free phosphorylated stat and stat . hypothesis is consistent with a negative feedback mechanism in which pstat molecules translocate to the nucleus, where they increase the production of negative feedback proteins such as socs . as described in the material and methods (mathematical models and bayesian inference) we made use of the rpe experimental data set to carry out mathematical model selection for the two different hypotheses. we found that hypothesis could explain the data better than hypothesis , with a probability of %. this result can be seen in figure b, in which we plot, for different values of the distance threshold between the mathematical model output and the data (see mathematical models and bayesian inference in material and methods, for details), the relative probability of each hypothesis, where hypothesis is denoted 𝐻# and hypothesis is denoted 𝐻". it can be observed that for smaller values of the distance threshold, which indicate better support from the data to the mathematical model, the relative probability of hypothesis is higher than that of hypothesis . 
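The threshold-dependent relative probabilities described above are characteristic of rejection-based approximate Bayesian computation (ABC). The sketch below shows that general idea on a toy problem. It is an assumption that this resembles the authors' procedure in outline. The two candidate models, the uniform prior, the distance measure and all names are hypothetical stand-ins, not the internalisation hypotheses themselves.

```python
import math
import random


def model_h1(k, times):
    """Toy stand-in for one hypothesis: exponential decay."""
    return [math.exp(-k * t) for t in times]


def model_h2(k, times):
    """Toy stand-in for the competing hypothesis: hyperbolic decay."""
    return [1.0 / (1.0 + k * t) for t in times]


def distance(simulated, data):
    """Largest absolute deviation between model output and data."""
    return max(abs(s - y) for s, y in zip(simulated, data))


def abc_model_probabilities(data, times, thresholds, n_draws=5000, seed=0):
    """Relative probability of each model per distance threshold.

    Each draw picks a model with equal prior weight and a rate from a
    uniform prior on [0, 2]; a draw is accepted for a threshold if its
    distance to the data does not exceed that threshold. Returns a dict
    mapping threshold to {"H1": p, "H2": p}, or None if nothing is accepted.
    """
    rng = random.Random(seed)
    models = {"H1": model_h1, "H2": model_h2}
    draws = []
    for _ in range(n_draws):
        name = rng.choice(["H1", "H2"])
        k = rng.uniform(0.0, 2.0)
        draws.append((name, distance(models[name](k, times), data)))
    result = {}
    for eps in thresholds:
        accepted = [name for name, dist in draws if dist <= eps]
        if not accepted:
            result[eps] = None
            continue
        result[eps] = {m: accepted.count(m) / len(accepted) for m in models}
    return result


if __name__ == "__main__":
    times = list(range(11))
    data = model_h1(0.5, times)  # synthetic data generated from H1
    for eps, probs in abc_model_probabilities(data, times, [1.0, 0.2, 0.05]).items():
        print(eps, probs)
```

As the threshold shrinks, only parameter draws that reproduce the data closely are retained, so the accepted fraction belonging to each model becomes an increasingly stringent estimate of its relative support.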
we then made use of this result to explore the mathematical models for both cytokines under hypothesis , in particular we performed parameter calibration. to this end (and as described in material and methods under mathematical models and bayesian inference), we carried out bayesian inference together with the mathematical models (hypothesis ) and the experimental data sets to quantify the reaction rates (see supp. fig. ) and initial molecular concentrations (see table and table ). the bayesian parameter calibration of the two models of cytokine signaling allows one to quantify the observed kinetics of pstat / phosphorylation induced by hypil- and il- in rpe and th- cells (figure c). substantial differences in stat association rates to and dissociation rates from the dimeric complexes were inferred to critically contribute to defining pstat / kinetics. figure d shows the kernel density estimates (kdes) for the posterior distributions of the rate constants and initial concentrations in the models. 𝑘$% & denotes the rate at which stat𝑖 binds to gp and 𝑘$' & denotes the rate at which stat𝑖 binds to il- ra, for 𝑖 ∈ { , }. our results indicate that stat and stat exhibit different binding preferences towards il- ra and gp , respectively. while stat exhibits stronger binding to il- ra than gp (𝑘#' & > 𝑘#% & ), stat exhibits stronger binding to gp than il- ra, (𝑘(%& > 𝑘(' & ) in agreement with previous observations ( ). 
il- rα cytoplasmic domain is required for sustained pstat kinetics the bayesian inference carried out with the experimental data and the mathematical models clearly indicated statistically significant differences in the binding rates of stat /stat to gp and il- ra, to account for the different phosphorylation kinetics exhibited by hypil- and il- . thus, we next investigated whether the more sustained stat activation by il- resulted from its specific engagement of il- ra. for that, we used rpe cells, which do not express il- ra (supp. fig. e), to systematically dissect the contribution of the il- ra cytoplasmic domain to the differential pstat activation by il- . il- ra’s intracellular domain is very short and only encodes two tyr susceptible to be phosphorylated in response to il- stimulation, i.e., tyr and tyr (figure a). we mutated these two tyr to phe to analyse their contribution to il- induced signaling. we stably expressed wt il- ra as well as different il- ra tyr mutants in rpe cells with comparable cell surface expression levels (figure b). importantly, this reconstituted experimental system mimicked the pstat / activation kinetics of t-cells (supp. fig. a). as the endogenous gp expression levels remain unaltered, all generated clones exhibited very comparable responses to hypil- (figure b, bottom panels). il- triggered comparable levels of stat and stat activation in rpe cells reconstituted with il- ra wt and il- ra y f mutant, suggesting that this tyr residue does not contribute to signaling by this cytokine (figure b and supp. fig. b). in rpe cells reconstituted with the il- ra y f or y f-y f mutants, il- stimulation resulted in % of the stat activation, but only % of the stat activation levels induced by this cytokine relative to il- ra wt (figure b) ( ). these observations suggest a tight coupling of stat phosphorylation to one of the receptor chains; namely, il- ra with pstat and gp with pstat , respectively. 
we next tested how the cytoplasmic domains of gp and il- ra shape the pstat kinetic profiles. thus, we generated a stable rpe clone expressing a chimeric construct comprised of the extracellular and transmembrane domain of il- ra but the cytoplasmic domain of gp (figure c, supp. fig. a). again, as both cell lines express unaltered endogenous gp levels, they exhibited comparable responses to hypil- (figure c). strikingly, this domain-swap resulted in a transient pstat kinetic response by il- comparable to hypil- stimulation. stat activation on the other hand remained unaltered, suggesting that the cytoplasmic domain of il- ra is essential for a sustained pstat response but not for pstat . two plausible scenarios could explain the observed pstat / activation differential by hypil- and il- : i) il- ra-jak complex phosphorylates stat faster than gp -jak complex or ii) pstat is more quickly dephosphorylated in the il- /gp receptor homodimer. in the latter case, pstat deactivation by constitutively expressed phosphatases could be an additional factor of regulation. indeed, shp- has been described to bind to gp and shape il- responses ( ). however, our bayesian inference results (together with the mathematical models and the experimental data) identified the stat/receptor association rates as the only rates that could account for the greater and more sustained activation of stat by il- . we note (as described in the material and methods) that the phosphorylation rate, denoted by q, of stat and stat when bound to a dimer (homo- or hetero-) has been assumed to be independent of the stat type and the receptor chain. moreover, the model also included dephosphorylation of free pstat molecules, and predicted that the rates at which these reactions occur (𝑑# and 𝑑() had rather similar posterior distributions, hence arguing against the potential role of phosphatases to specifically target stat upon hypil- stimulation. 
to distinguish between the two plausible scenarios, we next determined the rates of pstat / dephosphorylation by blocking jak activity upon cytokine stimulation making use of the jak inhibitor tofacitinib in rpe cells. tofacitinib was added minutes after stimulation with either cytokine and pstat and pstat levels were measured at the indicated times. jak inhibition markedly shortened the pstat / activation profiles induced by both cytokines (figure d, supp. fig. b). the relative dephosphorylation rates could then be determined by the signal intensity ratio of +/- tofacitinib. even though pstat levels were more affected by jak inhibition than those of pstat , the observed relative changes were nearly identical for il- and hypil- . these findings were also confirmed for th- cells (supp. fig. c & d) and indicate that selective phosphatase activity cannot serve as an explanation for the pstat / differential by hypil- and il- , in agreement with our mathematical modelling predictions. similarly, we tested whether neosynthesis of feedback inhibitors such as socs ( ) would selectively impair signaling by hypil- but not by il- . to this end we pre-treated cells with cycloheximide (chx) and followed the pstat / kinetics induced by the two cytokines (supp. fig. a & b). chx treatment resulted in more sustained pstat activity for both cytokines. to our surprise, stat phosphorylation by il- was even more sustained while pstat levels induced by il- remained unaffected. 
these observations exclude that feedback inhibitors selectively impair stat activation kinetics by hypil- and thus do not account for the faster stat dephosphorylation kinetics observed under hypil- stimulation. overall our data from the chimera and mutant experiments, which were not used in the bayesian calibration, provide strong and independent support, as well as validation, to the mathematical models of hypil- and il- signaling, and point to the differential association/dissociation of stat and stat to il- ra and gp , respectively, as the main factor defining stat phosphorylation kinetics in response to hypil- and il- stimulation. unique and overlapping effects of il- and hypil- on the th- phosphoproteome thus far, we have investigated the differential activation of stat /stat induced by hypil- and il- . next, we asked whether il- and il- induced the activation of additional and specific intracellular signaling programs that could contribute to their unique biological profiles. to this end, we investigated the il- and hypil- activated signalosome using quantitative mass-spectrometry-based phospho-proteomics. macs-isolated cd + were polarized into th- cells and expanded in vitro for stable isotope labelling by amino acids in cell culture (silac). cells were then stimulated for min with saturating concentrations of il- , hypil- or left untreated. samples were enriched for phosphopeptides (ti-imac), subjected to mass spectrometry and raw files analysed by maxquant software (supp. fig. a). in total we could quantify ~ phosphopeptides from proteins, identified across all conditions (unstimulated, il- , hypil- ) for at least two out of three tested donors. for il- and hypil- we detected similar numbers of significantly upregulated ( vs. ) and downregulated ( vs. ) phosphorylation events (figure a) and systematically categorized them in context with their cellular location and ascribed biological functions (supp. fig. b & c) ( ). 
the two cytokines shared approximately half of the upregulated and one third of the downregulated phospho-peptides (supp. fig. a) but also exhibited differential target phosphorylation (figure b and supp. fig. b). as expected, we found multiple members of the stat protein family among the top phosphorylation hits by the two cytokines, validating our study (figure b & c). in line with our previous observations, we detected the same relative amplitudes for tyrosine phosphorylated stat and stat . in addition to tyrosine-phosphorylation, we detected robust serine-phosphorylation on s for stat and stat (figure c). while ps-stat activity correlated with py-stat with il- being more potent than hypil- , this was not the case for stat . despite an identical py-stat phosphorylation profile, hypil- induced a ~ % higher ps-stat relative to il- (figure c). these results were corroborated, following the phosphorylation kinetics of ps-stat and ps-stat by flow-cytometry (figure d). given the overlapping phospho-proteomic changes, gene ontology (go) analysis associated several sets of phosphopeptides with biological processes that were mostly shared between both cytokines (figure e, supp. fig. c). a large set of phospho-peptides was linked to transcription initiation (including jak/stat signaling) or mrna modification (figure e). interestingly, il- stimulation was associated with negative regulation of rna polymerase ii, whereas a positive regulation was detected for hypil- . 
a closer look into the functional regulation of rna-pol ii activity by the two cytokines revealed that multiple proteins involved in this process were differentially regulated by hypil- and il- (figure f). while positive regulators of rna-pol ii transcription, such as negative elongation factor a (nelfa), ppm g, rchy and pol ra, were much more phosphorylated in response to hypil- than il- , negative regulators of rna-pol ii transcription, such as larp , were much more engaged by il- treatment than by hypil- (figure f). interestingly, in a previous study we linked rna-pol ii regulation with the levels of stat s phosphorylation induced by hypil- via recruitment of cdk to stat dependent genes ( ). our phospho-proteomic analysis thus suggests that il- and hypil- recruit different transcriptional complexes that ultimately could contribute to gene expression specificity by the two cytokines. additionally, we identified several interesting il- -specific phosphorylation targets. one example was ubiquitin protein ligase e component n-recognin (ubr ). phosphorylated ubr leads to ubiquitination and subsequent degradation of rorgc ( ), the key transcription factor required for th- lineage commitment, thus limiting th- differentiation (supp. fig. d). a second example is pak , which phosphorylates and stabilizes foxp leading to higher levels of treg cells (supp. fig. d) ( ). moreover, il- stimulation led to a very strong phosphorylation of bcl -associated agonist of cell death (bad), a critical regulator of t-cell survival and a well-known substrate of the pak kinase ( ). overall, our data show a large overlap between the il- and il- signaling program, with a strong focus on jak/stat signaling. however, il- engages additional signaling intermediaries that could contribute to its unique immuno-modulatory activities. further studies will be required to assess how these il- specific signaling pockets contribute to shape il- responses. 
kinetic decoupling of gene induction programs depends on sustained stat activation and irf expression by il- next, we investigated how the different kinetics of stat activation induced by hypil- and il- ultimately modulated gene expression by these two cytokines. to this end, we performed rna-seq analysis of th- cells stimulated with hypil- or il- for h, h and h to obtain a dynamic perspective of gene regulation. we identified ~ shared genes that could be quantified for all three donors and throughout all tested experimental conditions. in a first step, we compared how similar the gene programs induced by hypil- and il- were. principal component analysis (pca) was run for a subset of genes, found to be significantly up- (total ~ ) or downregulated (total ~ ) by either of the experimental conditions (p value ≤ . , fold change ≥ + or ≤ - ). at one hour of stimulation hypil- and il- induced very similar gene programs, with the two cytokines clustering together in the pca analysis regardless of whether we focused on the subsets of upregulated or downregulated genes (figure a). however, the similarities between the two cytokines changed dramatically in the course of continuous stimulation. while the two cytokines induced the downregulation of comparable gene programs at h and h stimulation, as denoted by the close clustering in the pca analysis (figure a, right panel) and the fraction of shared genes (~ %, figure b, supp. fig. a-c, supp. fig. a), this was not observed for upregulated genes. 
although the two cytokines induced comparable gene upregulation programs after h of stimulation (~ % shared genes), this trend almost completely disappeared at later stimulation times (figure a & b, supp. fig. b). this is well reflected by the absolute numbers of up- or downregulated genes observed for il- and hypil- (figure c). stimulation with both cytokines yielded a similar trend of gene downregulation (figure c, right panel). however, while hypil- stimulation resulted in a spike of gene upregulation at h that quickly disappeared at later stimulation times, il- stimulation was able to increase the number of upregulated genes beyond h of stimulation and maintained it even after h (figure c, left panel). this “kinetic decoupling” of gene induction seems to have a striking functional relevance. gene set enrichment analysis (gsea) ( ) identified several reactome pathways to be enriched for il- over the course of stimulation, most of them linked with interferon signaling and immune responses (figure d). in contrast, for hypil- stimulation no pathway enrichment was detected. most importantly, the vast majority of il- -induced genes that were associated with these pathways belonged to genes upregulated by il- treatment that have been previously linked to stat activation ( , ) (supp. fig. c). although hypil- treatment resulted in the induction of some of these genes, their expression was very transient in time, in agreement with the short stat activation kinetic profile exhibited by hypil- (supp. fig. b & c). next, we performed cluster analysis to find further similarities and discrepancies between the gene expression programs engaged by hypil- and il- (figure e). since genes downregulated by il- and hypil- showed overall good similarity throughout the whole kinetic series, we mainly focused on differences in upregulated gene induction. we identified three functionally relevant gene clusters.
the first gene cluster corresponds to genes that are transiently and equally induced by hypil- and il- . these genes peak after one hour and return to basal levels after h and h of stimulation (figure e). interestingly, this cluster contains classical il- -induced and stat -dependent genes, such as members of the nfkb and jun/fos transcriptional complex ( ), as well as the feedback inhibitor suppressor of cytokine signaling (socs ) ( ) and t-cell early activation marker cd (figure e). a second cluster of genes corresponded to genes that were persistently activated by il- but only transiently by hypil- (figure e). among these genes we found classical stat -dependent genes, such as socs , programmed cell death ligand (pdl = cd ) ( ) and members of the interferon-induced protein with tetratricopeptide repeats (ifit) family. the third cluster of genes corresponded to genes exhibiting strong and sustained activation by il- after h and h stimulation but no activation by hypil- at all. this “ nd wave” of gene induction by il- was almost exclusively comprised of classical interferon stimulated genes (isgs) (supp. fig. c), such as stat & , guanylate binding protein (gbp ), gbp , & , and irf & . it is worth mentioning that genes in the third cluster appear to require persistent stat activation ( , ) and were the basis for the ifn signature identified in our reactome pathway analysis. still, we were surprised by the magnitude of this nd gene wave. even though il- exerts a sustained pstat kinetic profile, pstat levels were down to ~ % of maximal amplitude after h of stimulation. we reasoned that additional factors could further amplify the stat response for il- but not for hypil- . within the st wave of stat -dependent genes,
we also spotted the transcription factor interferon response factor (irf ), which was continuously induced throughout the kinetic series in response to il- but spiked only transiently after h of hypil- stimulation (figure e). irf expression was shown to prolong pstat kinetics ( ) and to be required for il- -dependent tr- differentiation and function ( ). we confirmed the kinetics of irf protein expression by flow cytometry and showed higher and more sustained protein levels after il- stimulation relative to hypil- (figure a). next, we tested in our rpe cell system whether sirna-mediated knockdown of irf would alter the gene induction profiles of certain stat or stat -dependent marker genes. in rpe cells reconstituted with il- ra, irf protein levels peaked around h after stimulation with il- , and transfection with irf -targeting sirna knocked down expression by > % (figure b). importantly, knockdown of irf did not alter the overall kinetics of pstat and pstat activation (figure c). induction of the stat -dependent genes stat , gbp and oas as well as the stat -dependent gene socs was followed by rt-qpcr (figure d). interestingly, up to h of stimulation, the gene induction curves were identical for control- and irf -sirna treated cells. later than h, that is, when irf protein levels are peaking, gene induction was decreased by - % in the absence of irf . strikingly, expression of socs , a classical stat -dependent reporter gene, was transient and independent of irf levels, highlighting that irf selectively amplifies stat -dependent gene induction.
taken together, our data support a scenario whereby il- , by exhibiting a kinetic decoupling of stat and stat activation, is capable of triggering independent gene expression waves, which ultimately contribute to shaping its distinct biology.
il- -induced stat response drives global proteomic changes in th- cells
next, we aimed to uncover how the distinct gene expression programs engaged by hypil- and il- ultimately relate to alterations of the th- cell proteome. for that, we continuously stimulated silac labelled th- cells for h with saturating doses of il- and hypil- and compared quantitative proteomic changes to unstimulated controls (figure a). we quantified ~ proteins present in all three biological replicates and in all tested conditions (unstimulated/il- /hypil- ). both cytokines downregulated a similar number of proteins (il- : , hypil- : ) (figure b), with approximately half of them being shared by the two cytokines, mimicking our observations in the rna-seq studies (figure c, supp. fig. a). with upregulated proteins, il- was almost twice as potent as hypil- ( proteins), with very little overlap. among the proteins upregulated by il- but not hypil- , we detected several proteins with described immune-modulatory functions on t-cells. one of these proteins was transforming growth factor b (tgf-b), which is a key regulator with pleiotropic functions on t-cells ( ). tgf-b has been identified to act synergistically with il- to induce il- secretion from tr- cells, thus accounting for one of the key anti-inflammatory functions of il- ( ). on the other hand, we also found the selplg-encoded protein rsgl- , which is critically required for efficient migration and adhesion of th- cells to inflamed intestines ( , ). interestingly, we found larp moderately upregulated by il- . this negative regulator of rna pol ii was also identified in our phospho-target screening and selectively engaged by il- (figure f).
il- and hypil- share ~ % of downregulated proteins, but without strong functional patterns. both cytokines downregulated several proteins related to mitotic cell cycle (lig , csnk b, psmb ) and mrna processing and splicing (ncbp , pcbp , nudt ) ( ). strikingly, a significant number (~ %) of proteins upregulated by il- belong to the group of isgs (figure b & c, supp. fig. b). this particular set of proteins, including stat , stat , mx dynamin like gtpase (mx ), interferon stimulated gene (isg ) or poly(adp-ribose) polymerase family member (parp ), was not markedly altered by hypil- . of note, the overall expression patterns of the most significantly altered proteins are congruent with the gene induction patterns observed after h and h (figure d & e, supp. fig. b). similarly, gsea reactome analysis again identified pathways associated with interferon signaling and cytokine/immune system but failed to detect any significant functional enrichment by hypil- (figure e, supp. fig. b & c). finally, we correlated rnaseq-based gene induction patterns with detected proteomic changes. to our surprise, we only found a relatively low number of shared hits. however, the identified proteins belong exclusively to a group upregulated by il- (figure f). they are all located in the “ nd gene wave” cluster and all of them are regulated by isgs (figure e). taken together, these results provide compelling evidence that sustained pstat activation by il- accounts for its gene induction and proteomic profiles, thus giving a mechanistic explanation for the diverse biological outcomes of il- and il- .
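the comparison of transcriptomic and proteomic hits described above amounts to intersecting two identifier lists. a minimal sketch is given below, assuming each dataset has already been reduced to a list of gene identifiers; the function name and the identifiers in the usage note are hypothetical and do not come from the study.

```python
def overlap_summary(rna_hits, protein_hits):
    """Summarise the overlap between two hit lists of gene identifiers."""
    rna_hits, protein_hits = set(rna_hits), set(protein_hits)
    shared = rna_hits & protein_hits
    union = rna_hits | protein_hits
    return {
        "shared": sorted(shared),
        # fraction of each list that is also found in the other one
        "fraction_of_rna": len(shared) / len(rna_hits) if rna_hits else 0.0,
        "fraction_of_protein": len(shared) / len(protein_hits) if protein_hits else 0.0,
        # jaccard index: shared hits relative to all hits in either list
        "jaccard": len(shared) / len(union) if union else 0.0,
    }
```

for example, `overlap_summary(["gene_a", "gene_b", "gene_c"], ["gene_c", "gene_d"])` reports `gene_c` as the only shared hit. a low jaccard index combined with a functionally coherent shared set is the pattern reported in the text.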
our observations are in good agreement with previous findings in cancer cells, showing that particularly the involvement of stat activation is responsible for proteomic remodeling by il- ( ).
receptor and stat concentrations determine the nature of the il- /il- response
our data suggest that stat molecules compete for binding to a limited number of phospho-tyr motifs in the intracellular domains of cytokine receptors. a direct consequence derived from this hypothesis is that cells can adjust and change their responses to cytokines by altering their concentrations of specific stats or receptor molecules. to assess to what degree immune cells differ in their expression of cytokine receptors and stats, we investigated levels of il- ra, gp , il- ra, stat and stat protein expression across different immune cell populations, making use of the immunological proteomic resource (immpres - http://immpres.co.uk) database. strikingly, the level of expression of these proteins changes dramatically across the populations studied (figure a), suggesting that these cells could potentially produce very different responses to hypil- and il- stimulation. in order to quantify (and predict) how changes in expression levels of different proteins modify the kinetics of pstat, we made use of the two mathematical models of hypil- and il- stimulation and the parameters inferred with bayesian methods. our mathematical models could accurately reproduce the experimental results generated across our study, i.e., signaling by the il- ra chimeric and il- ra-y f mutant receptors and dose/response studies (supp. fig. a-c), making use of the posterior parameter distributions generated from the bayesian parameter calibration. having developed mathematical models which are able to accurately explain the experimental data (supp. fig. b and c) and reproduce independent experiments (fig.
b and c), we then sought to use the models to predict pstat signaling kinetics under different concentration regimes of receptors and stats. to simplify the simulations, we focused our analysis on gp and stat proteins, two of the proteins that greatly vary in the different immune populations (figure a). as baseline values for the concentrations [𝐺𝑃 ( )], [𝐼𝐿 𝑅𝑎( )], [𝑆𝑇𝐴𝑇 ( )] and [𝑆𝑇𝐴𝑇 ( )] we used approximately the median values from the posterior distributions for each parameter: [𝐺𝑃 ( )] = nm, [𝐼𝐿 𝑅𝑎( )] = nm and [𝑆𝑇𝐴𝑇 ( )] = [𝑆𝑇𝐴𝑇 ( )] = nm. to see the effect of varying gp concentrations on pstat signaling, we decreased the initial concentration of gp and simulated the model using the accepted parameter sets from the abc-smc to inform the other parameter values. a tenfold reduction in gp concentration ([𝐺𝑃 ( )] = . 𝑛𝑀) resulted in a striking loss in pstat levels induced by hypil- , with very little effect on pstat levels induced by this cytokine (figure b). pstat / kinetics induced by il- , however, were not affected by this decrease in gp concentration (figure b). interestingly, the hypil- signaling profile predicted by our model at low gp concentrations strongly resembles the one induced by hypil- in th- cells (figure c), where very low levels of gp are found, further confirming the robustness of the predictions generated by our mathematical models. when the concentration of stat was increased by a factor of ten ([𝑆𝑇𝐴𝑇 ( )] = nm), both hypil- and il- induced significantly higher levels of pstat activation (figure b).
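the competition argument behind these simulations can be illustrated with a deliberately simplified toy model. the sketch below is not the model calibrated in the study: it assumes two generic stat species (called `a` and `b` here) that bind a single shared pool of receptor phospho-tyr sites with different on-rates, are phosphorylated while bound and are dephosphorylated at a common rate. all names and rate constants are arbitrary, hypothetical values chosen only to show the qualitative behaviour.

```python
def simulate(r_total, stat_a_total, stat_b_total,
             kon_a=1.0, kon_b=0.1, koff=0.1, kphos=1.0, kdeph=0.1,
             t_end=20.0, dt=0.001):
    """Toy model: two STATs compete for a limited pool of receptor sites.

    Returns the phosphorylated amounts (p_a, p_b) at t_end, obtained by
    forward Euler integration. All parameters are illustrative assumptions.
    """
    c_a = c_b = 0.0  # receptor-bound STAT a / STAT b
    p_a = p_b = 0.0  # phosphorylated, released STAT a / STAT b
    for _ in range(int(round(t_end / dt))):
        free_sites = r_total - c_a - c_b
        free_a = stat_a_total - c_a - p_a
        free_b = stat_b_total - c_b - p_b
        # binding competes for the same free sites; bound STAT either
        # dissociates unmodified (koff) or leaves phosphorylated (kphos)
        d_c_a = kon_a * free_a * free_sites - (koff + kphos) * c_a
        d_c_b = kon_b * free_b * free_sites - (koff + kphos) * c_b
        d_p_a = kphos * c_a - kdeph * p_a
        d_p_b = kphos * c_b - kdeph * p_b
        c_a += dt * d_c_a
        c_b += dt * d_c_b
        p_a += dt * d_p_a
        p_b += dt * d_p_b
    return p_a, p_b
```

comparing `simulate(1.0, 10.0, 10.0)` with `simulate(1.0, 100.0, 10.0)` shows that raising the total amount of one stat increases its own phosphorylation while lowering that of its competitor, and `simulate(0.1, 10.0, 10.0)` shows that a tenfold smaller receptor pool lowers both. a shared-site toy of this kind cannot reproduce receptor-specific coupling, which the full models in the study account for.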
pstat levels were not affected for hypil- stimulation but were decreased for il- stimulation (figure b), further indicating the competitive nature of the binding of stat and stat to il- ra and gp . overall, our mathematical model predicts that changes in gp and stat expression produce a substantial remodeling of the hypil- and il- signalosome, which ultimately could lead to aberrant responses.
stat protein levels in sle patients modify hypil- and il- signaling responses
stat is a classical ifn responsive gene and stat levels are highly increased in environments rich in ifns ( ). thus, we next asked whether stat levels would be increased in sle patients, an example of a disease where ifns have been shown to correlate with a poor prognosis, making use of available gene expression datasets ( ). we did not find differences in the expression of gp , il- ra or il- ra in sle patients (figure c). however, we detected a significant increase in the levels of stat and stat transcripts in these patients when compared to healthy controls, with the increase in stat expression being significantly more pronounced (figure c). since our mathematical model predicted that increases in stat expression could significantly change cytokine-induced cellular responses by hypil- and il- , we next experimentally tested this prediction. for that, we primed th- cells with ifna overnight to increase total stat levels (and to a lesser extent stat ) in these cells (supp. fig. a). while both hypil- and il- induced comparable levels of pstat in primed and non-primed th- cells, levels of pstat induced by the two cytokines were significantly upregulated in primed th- cells, resulting in a biased stat response and confirming our model predictions (figure d). we next investigated whether this biased stat activation by hypil- and il- observed in ifna -primed th- cells was also present in sle patients.
for that, we collected pbmcs from six sle patients and five age-matched healthy controls and measured stat and stat expression, as well as pstat and pstat induction by hypil- and il- after min treatments in cd t cells. importantly, comparable results to those obtained with ifn-primed th- cells were obtained, with signaling bias towards pstat in cd + t cells from sle patients stimulated with hypil- and il- (figure e, supp. fig. b & c), further supporting the idea that stat concentrations play a critical role in defining cytokine responses in autoimmune disorders. our data show that stat and stat compete for phospho-tyr motifs in gp , with stat having an advantage resulting from its tighter affinity to gp . finally, we asked whether crippling jak activity by using sub-saturating doses of jak inhibitors could differentially affect stat and stat activation by hypil- and therefore rescue the altered cytokine responses found in sle patients. to test this, rpe and th- cells were stimulated with saturating concentrations of hypil- and titrated concentrations of tofacitinib, a clinically approved jak inhibitor. strikingly, tofacitinib inhibited hypil- induced pstat more efficiently than pstat in both rpe cells and th- cells (figure f). at nm concentration, tofacitinib inhibited pstat levels induced by hypil- by %, while it inhibited pstat levels by only % (figure f), an effect that we did not observe for il- stimulation (supp. fig. d).
overall, our results show that the changes in stat concentrations found in autoimmune disorders shape cytokine signaling responses and could contribute to disease progression.
discussion:
cytokine pleiotropy is the ability of a cytokine to exert a wide range of biological responses in different cell types. this functional pleiotropy has made the study of cytokine biology extremely challenging given the strong cross-talk and shared usage of key components of their signaling pathways, leading to a high degree of signaling plasticity, yet still allowing functional selectivity ( , ). here we aimed to identify the underlying determinants that define cytokine functional selectivity by comparing il- and il- at multiple scales, ranging from cell surface receptors to proteomic changes. we show that il- triggers a more sustained stat phosphorylation than il- , via a high affinity stat /il- ra interaction centered around tyr on il- ra. this in turn results in a more sustained irf expression induced by il- , which leads to the upregulation of a second wave of gene expression unique to il- and comprised of classical isgs. we go one step further and show that this strong receptor/stat coupling is altered in autoimmune disorders where stat concentrations are often dysregulated. increased expression of stat in sle patients biases hypil- and il- responses towards stat activation, further contributing to the worsening of the disease. by using suboptimal doses of the jak inhibitor tofacitinib we show that specific stat proteins engaged by a given cytokine can be targeted.
overall, our study highlights a new layer of cytokine signaling regulation, whereby stat affinity to specific cytokine receptor phospho-tyr motifs controls stat phosphorylation kinetics and the identity of the gene expression program engaged, ultimately ensuring the generation of functional diversity through the use of a limited set of signaling intermediaries. the tight coupling of one receptor subunit to one particular stat that we have identified in our study is a rather unusual phenomenon for heterodimeric cytokine receptor complexes, which was first suggested by owaki et al. ( ). generally, the entire signaling output driven by a cytokine-receptor complex emanates from a dominant receptor subunit, which carries several tyr residues susceptible to being phosphorylated ( , ). this in turn results in competition between different stats for binding to shared phospho-tyr motifs in the dominant receptor chain, leading to different kinetics of stat phosphorylation as observed for il- stimulation ( ) (figure b). moreover, this localized signaling quantum allows phosphatases and feedback regulators, induced upon cytokine stimulation, to act in synergy to reset the system to its basal state, generating a very synchronous and coordinated signaling wave. although very effective, this molecular paradigm has its limitations. stat competition for the same pool of phospho-tyr makes the system very sensitive to changes in stat concentration. ifng primed cells, which exhibit increased stat levels, trigger an ifng-like stat response upon il- stimulation ( ). il- anti-inflammatory properties are lost in cells with high levels of stat expression, as a result of a pro-inflammatory environment rich in ifns ( ). indeed, we show that stat transcript levels are increased in crohn’s disease and sle patients and that they contribute to altered il- responses. strikingly, il- appears to have evolved away from this general model of cytokine signaling activation.
our results show that stat activation by il- is tightly coupled to il- ra, while stat activation by this cytokine mostly depends on gp . this decoupled stat and stat activation by il- is possible thanks to the presence of a putative high affinity stat binding site on il- ra that resembles the one present in ifngr ( ). as a result of this, il- can trigger sustained and independent phosphorylation of both stat and stat . this unique feature of il- allows it to induce robust responses in dynamic immune environments. indeed, our mathematical models of cytokine signaling and bayesian inference, together with the experimental observations, show that changes in receptor concentration minimally affected pstat / induced by il- , while they fundamentally alter il- responses. overall, our data show that cytokine responses are versatile and adapt to the continuously changing cell proteome, highlighting the need to measure cytokine receptor and stat expression levels, in addition to cytokine levels, in disease environments to better understand and predict altered responses elicited by dysregulated cytokines. in recent years, it has become apparent that the stability of the cytokine-receptor complex influences signaling identity by cytokines ( ). short-lived complexes activate less efficiently those stat molecules that bind with low affinity to phospho-tyr motifs in a given cytokine receptor ( ). our current results further support this kinetic discrimination mechanism for stat activation.
our statistical inference identified differences in stat recognition of the cytokine receptor phospho-tyr motifs as one of the major determinants of stat phosphorylation kinetics. this parameter alone was sufficient to explain transient and sustained stat phosphorylation induced by il- and il- , respectively, without the need to invoke the action of phosphatases or negative feedback regulators such as socss. indeed, our results indicate that the rate of stat dephosphorylation is similar between the il- and il- systems, suggesting that phosphatases do not contribute to these early kinetic differences. moreover, blocking protein translation, and therefore the upregulation of negative feedback regulators by il- treatment, did not result in a more sustained stat phosphorylation by il- , again indicating that the transient kinetics of stat phosphorylation by il- are encoded at the receptor level and do not require further regulation. however, recent reports have found that the amplitude of stat phosphorylation in response to il- is regulated by levels of ptpn expression, suggesting that phosphatases can play additional roles in shaping il- responses beyond controlling the kinetics of stat activation ( ). stat phosphorylation levels by il- , on the other hand, were significantly more sustained in the absence of protein translation, suggesting that negative feedback mechanisms are required to downmodulate signaling emanating from high affinity stat-receptor interactions. overall, our results suggest that while phosphatases and negative feedback regulators play an important role in maintaining cytokine signaling homeostasis ( ), the kinetics of stat activation appear to be already encoded at the level of receptor engagement, thus ensuring maximal efficiency and signal robustness. cytokine signaling plasticity can occur at the level of receptor activation.
in the past years, a scenario has emerged suggesting that the absolute number of signaling active receptor complexes is a critical determinant for signal output integration. accordingly, specific biological responses were shown to be tuned either by abundance of cell surface receptors ( , ) or by the level of receptor assembly ( , , ). here, we show for the first time il- -induced dimerization of il- ra and gp at the cell surface of live cells, in good agreement with previous studies on heterodimeric cytokine receptor systems ( , ). for il- , the receptor subunits il- ra and gp can be expressed at different ratios, as seen for naïve vs. activated t-cells ( ) as well as intestinal cells ( ). on t-cells, particularly after activation, il- ra is expressed in strong excess over gp , rendering gp the limiting factor for receptor complex assembly ( ). interestingly, we observe that in addition to a faster kinetic of stat phosphorylation, hypil- treatment induces a lower maximal amplitude in pstat activation in t cells. this is in stark contrast to our results in rpe cells, where a high abundance of gp (~ - copies of cell surface gp ) is found. in these cells both cytokines elicited similar amplitudes of stat phosphorylation. our results suggest that surface receptor density, in synergy with stat binding dynamics to phospho-tyr motifs on cytokine receptors, acts to define the amplitude and kinetics of stat activation in response to cytokine stimulation.
the distinct stat and stat kinetic profiles induced by il- and il- are the prerequisite for time-correlated decoupling of genetic programs: a “shared gp /stat -dependent wave” and an il- -unique “il- ra/stat -dependent wave”. however, pstat levels induced by il- at h were down to ~ % of maximal amplitude, suggesting that additional factors would be required to amplify the initial stat response elicited by il- . we observed that il- induces the expression of an early wave of classical stat -dependent genes, which is also shared by il- . however, while il- induces the upregulation of these genes throughout the entire duration of the experiment, il- only resulted in a transient spike. we reasoned that this additional factor required for il- signal amplification would be among these early stat -dependent genes. among this set of genes we found the transcription factor irf , which had been shown to act as a feedback amplifier for pstat activity ( ). importantly, irf protein levels have been shown to be upregulated in response to il- and ifng but not to il- stimulation in hepatocytes ( ). irf plays a key role in chromatin accessibility, which is critically required for il- -induced differentiation of tr cells and subsequent il- secretion ( ). here, we show that the contribution of irf to stat - but not stat -dependent genes is a generic feature of il- signaling. this readily explains the significant transcriptomic overlap of il- with type i ( ) or type ii interferons ( ) after long-term stimulation with these cytokines. along this line, it is not surprising that il- , beyond its well-described effects on t-cell development, can also mount a considerable antiviral response as shown in hepatic cells and pbmcs ( , ).
our results suggest that by modulating the kinetics of stat phosphorylation, cytokines can modulate the expression of accessory transcription factors, such as irf , that act in synergy with stats to fine-tune gene expression and provide functional diversity.
acknowledgments
we thank members of the moraga, molina-parís, piehler and mitra laboratories for helpful advice and discussion. we thank g. hikade and h. kenneweg for technical support, c. p. richter for providing software for single-molecule image analysis, r. kurre (integrated bioimaging facility osnabrück) for support with fluorescence microscopy and the fingerprints proteomics facility (dundee) for support with the mass spectrometry data. this work was supported by the stg, ls , wellcome-trust- /z/ /z (im ep), erc- -stg grant (im jmf ep pkf), embo (sw – ), dfg (sfb , p /z, jp), national heart, lung and blood institute (k hl , mk) and contrat de plan etat région hauts de france and institut pour la recherche sur le cancer de lille (sm sg). cmp and gl were supported by h , quantii. pj is supported by the epsrc, astrazeneca and smith institute (smith institute case studentship, award reference ). numerical work was undertaken on arc , which is part of the high performance computing facilities at the university of leeds, uk.
competing interests
the authors declare that they have no competing interests.
material and methods
protein expression and purification:
murine il- was cloned as a linker-connected single-chain variant (p +ebi ) as described in ( ). human hyperil- (hypil- ) and murine single-chain il- were cloned into the pacgp -a vector (bd biosciences) in frame with an n-terminal gp signal sequence and a c-terminal hexahistidine tag, and produced using the baculovirus expression system, as described in ( ). baculovirus stocks were prepared by transfection and amplification in spodoptera frugiperda (sf ) cells grown in sf ii media (invitrogen) and protein expression was carried out in suspension trichoplusia ni (high five) cells grown in insectxpress media (lonza). purification was performed using the method described in ( ). for il- , the cells were pelleted by centrifugation at rpm, prior to a precipitation step through addition of tris ph . , cacl and nicl to final concentrations of mm, mm and mm respectively. the precipitate formed was then removed through centrifugation at rpm. nickel-nta agarose beads (qiagen) were added and the target proteins purified through batch binding followed by column washing in hbs-hi buffer (hbs buffer supplemented to mm nacl and % glycerol, ph . ). elution was performed using hbs-hi buffer plus mm imidazole. final purification was performed by size exclusion chromatography on an enrich sec column (biorad), again equilibrated in hbs-hi. concentration of the purified sample was carried out using kda millipore amicon-ultra spin concentrators. for hypil- , proteins were purified likewise, but in mm hepes (ph . ) containing mm nacl. recombinant cytokines were purified to greater than % homogeneity.
for cell surface labeling, the anti-gfp nanobody (nb) “enhancer” and “minimizer” were used, which bind megfp with subnanomolar binding affinity ( ). nb was cloned into pet- a with an additional cysteine at the c-terminus for site-specific fluorophore conjugation in a : fluorophore:nanobody stoichiometry. furthermore, a (pas) sequence to increase protein stability and a his-tag for purification were fused at the c-terminus. protein expression in e. coli rosetta (de ) and purification by immobilized metal ion affinity chromatography were carried out according to standard protocols. purified protein was dialyzed against hepes ph . and reacted with a two-fold molar excess of dy maleimide (dyomics), atto maleimide (at ) and atto rho maleimide (rho ) (atto-tec gmbh), respectively. after h, a -fold molar excess (with respect to the maleimide) of cysteine was added to quench excess dye. protein aggregates and free dye were subsequently removed by size exclusion chromatography (sec). a labeling degree of . - : fluorophore:protein was achieved as determined by uv/vis spectrophotometry.

cd + t cell purification and th- differentiation: human buffy coats were obtained from the scottish blood transfusion service and peripheral blood mononuclear cells (pbmcs) of healthy donors were isolated from buffy coat samples by density gradient centrifugation according to manufacturer’s protocols (lymphoprep, stemcell technologies). from each donor, x pbmcs were used for isolation of cd + t-cells. cells were decorated with anti-cd fitc antibodies (biolegend, # ) and isolated by magnetic separation according to manufacturer’s protocols (macs miltenyi) to a purity > % cd +. freshly isolated resting cd + t cells ( x per donor) were activated under th- polarizing conditions using immunocult™ human cd /cd t cell activator (stemcell, cat# ) following manufacturer instructions for days in rpmi- , % v/v fbs, u/ml penicillin-streptomycin (gibco) in the presence of the cytokines il- (novartis, # , ng/ml), anti-il- antibody ( ng/ml, bd biosciences, # ), il- ( ng/ml, biolegend, # ). after three days of priming, cells were expanded for another days in the presence of il- ( ng/ml).

human sle patient samples: this study was authorized by the french competent authority dealing with research on human biological samples, namely the french ministry of research. the authorization number is ech / . to issue such authorization, the ministry of research has sought the advice of an independent ethics committee, namely the “comité de protection des personnes,” which voted positively, and all patients gave their written informed consent. healthy volunteers were recruited to serve as control individuals. blood samples from healthy volunteers and patients were collected in heparinized tubes (bd vacutainer , bd biosciences san jose, ca, usa) and pbmc samples were isolated using ficoll (pancoll, pan biotech #p - ) density gradient centrifugation. the isolated pbmcs were washed with pbs and the remaining red blood cells were lysed using rbc lysis buffer (ack lysing buffer, gibco #a - ) with incubation for min at room temperature. cells were washed in pbs, resuspended in ml of freezing medium (with dmso, pan biotech, #p - ) and transferred to a cryotube. cryotubes were placed in a freezing container (nalgene) at - °c and then transferred into a liquid nitrogen container for long-term storage.
classification and demographic information about sle patients and healthy controls: sle patients were included if they fulfilled the american college of rheumatology (acr) classification criteria (hochberg mc. updating the american college of rheumatology revised criteria for the classification of systemic lupus erythematosus ( )). exclusion criteria were current intake of mg or more of prednisone or equivalent and/or use of immunosuppressants within the previous months before inclusion. use of hydroxychloroquine was not an exclusion criterion. patients were mostly in clinical remission, half with biological remission, half with persistent anti-native dna autoantibodies. all sle patients and healthy controls were females between and years old.

(phospho-) proteomics: for (phospho-) proteomic experiments, th- cells from each donor were split into three different conditions after initial expansion: light silac media ( mg/ml l-lysine k (sigma, #l ) and mg/ml l-arginine r (sigma, #a )), medium silac media ( mg/ml l-lysine u- c k (ckgas, #clm- - . ) and mg/ml l-arginine u- c r (ckgas, #clm- - . )) and heavy silac media ( . mg/ml l-lysine u- c ,u- n k (ckgas, #cnlm- -h- . ) and . mg/ml l-arginine u- c ,u- n r (ckgas, #cnlm- -h- . )) prepared in rpmi silac media (thermo scientific, # ) supplemented with % dialyzed fbs (hyclone, #sh . ), ml l-glutamine (invitrogen, # ), ml pen/strep (invitrogen, # ), ml mem vitamin solution (thermo scientific, # ), ml selenium-transferrin-insulin (thermo scientific, # ) and expanded in the presence of ng/ml il- and ng/ml anti-il for another days in order to achieve complete labelling. media was exchanged every two days. incorporation of the medium and heavy versions of lysine and arginine was checked by mass spectrometry and samples with an incorporation greater than % were used. after expansion, cells were starved without il- for hours before stimulation with nm il- or nm hypil- for minutes (phosphoproteomics) or h (global proteomic changes).
cells were then washed three times in ice-cold pbs, mixed in a : : ratio, resuspended in sds-containing lysis buffer ( % sds in mm triethylammonium bicarbonate buffer (teab)) and incubated on ice for min to ensure cell lysis. then, cell lysates were centrifuged at g for minutes at + °c and the supernatant was transferred to a clean tube. protein concentration was determined using a bca protein assay kit (thermo, # ), and mg of protein per experiment were reduced with mm dithiothreitol (dtt, sigma, #d ) for h at °c and alkylated with mm iodoacetamide (iaa, sigma, #i ) for min at rt. protein was then precipitated using six volumes of chilled (- °c) acetone overnight. after precipitation, the protein pellet was resuspended in ml of mm teab and digested with trypsin ( : w/w, thermo, # ) overnight at °c. then, samples were cleared by centrifugation at g for min at + °c, and peptide concentration was quantified with a quantitative colorimetric peptide assay (thermo, # ). phosphopeptide enrichment in the peptide fractions generated as described above was carried out using magresyn ti-imac following manufacturer instructions ( bscientific, mrtim ).

high ph reverse phase fractionation for phosphoproteomics: samples were dissolved in μl of mm ammonium formate buffer ph . and peptides were fractionated using high ph rp chromatography. a c column from waters (xbridge peptide beh, Å, . µm . x mm, ireland) with a guard column (xbridge, c , . µm, . x mm, waters) was used on an ultimate hplc (thermo-scientific).
buffers a and b used for fractionation consisted, respectively, of mm ammonium formate in milliq water (buffer a) and mm ammonium formate in % acetonitrile (buffer b); both buffers were adjusted to ph . with ammonia. fractions were collected using a wps- fc autosampler (thermo-scientific) at min intervals. column and guard column were equilibrated with % buffer b for min at a constant flow rate of . ml/min and a constant temperature of °c. samples ( µl) were loaded onto the column at . ml/min, and the separation gradient ran from % buffer b to % b in min, then from % b to % b within min and finally from % b to % b in min. the column was washed for min at % buffer b and equilibrated at % buffer b for min as mentioned above. the fraction collection started min after injection and stopped after min (total of fractions, µl each). each peptide fraction was acidified immediately after elution from the column by adding to µl % formic acid to each tube in the autosampler. the total number of fractions concatenated was set to . the content of fractions from each set was dried prior to further analysis.

lc-ms/ms analysis: lc-ms analysis was done at the fingerprints proteomics facility (university of dundee). analysis of peptide readout was performed on a q exactive™ plus mass spectrometer (thermo scientific) coupled with a dionex ultimate rs (thermo scientific). the lc buffers used were the following: buffer a ( . % formic acid in milli-q water (v/v)) and buffer b ( % acetonitrile and . % formic acid in milli-q water (v/v)). dried fractions were resuspended in µl, % formic acid and aliquots of μl of each fraction were loaded at μl/min onto a trap column ( μm × cm, pepmap nanoviper c column, μm, Å, thermo scientific) equilibrated in . % tfa. the trap column was washed for min at the same flow rate with . % tfa and then switched in-line with a thermo scientific resolving c column ( μm × cm, pepmap rslc c column, μm, Å). the peptides were eluted from the column at a constant flow rate of nl/min with a linear gradient from % buffer b to % buffer b in min then from % buffer b to % buffer b in min, and finally from % buffer b to % buffer b in min. the column was then washed with % buffer b for min and re-equilibrated in % buffer b for min. the column was kept at a constant temperature of °c. the q exactive plus was operated in data dependent positive ionization mode. the source voltage was set to . kv and the capillary temperature was °c. a scan cycle comprised ms scan (m/z range from - , ion injection time of ms, resolution and automatic gain control (agc) x ) acquired in profile mode, followed by sequential dependent ms scans (resolution ) of the most intense ions fulfilling predefined selection criteria (agc x , maximum ion injection time ms, isolation window of . m/z, fixed first mass of m/z, spectrum data type: centroid, intensity threshold x , exclusion of unassigned, singly and > charged precursors, peptide match preferred, exclude isotopes on, dynamic exclusion time s). the hcd collision energy was set to % of the normalized collision energy. mass accuracy was checked before the start of sample analysis.

mass spectrometry data analysis: q exactive plus mass spectrometer .raw files were analyzed, and peptides and proteins quantified, using maxquant ( ) with the built-in search engine andromeda ( ). all settings were set as default, except for the minimal peptide length of , and the andromeda search engine was configured for the uniprot homo sapiens protein database (release date: _ ).
only peptide and protein ratios quantified in at least two out of the three replicates were considered, and the p-values were determined by student’s t test and corrected for multiple testing using the benjamini–hochberg procedure (benjamini and hochberg, ).

plasmid constructs: for single molecule fluorescence microscopy, a monomeric non-fluorescent (y f) variant of egfp was n-terminally fused to gp . this tag (mxfpm) was engineered to specifically bind anti-gfp nanobody “minimizer” (agfp-minb). this construct was inserted into a modified version of psems- m (covalys) using a signal peptide of igk. the orf was linked to a neomycin resistance cassette via an ires site. a mxfpe-il- ra construct was designed likewise but is recognized by agfp nanobody “enhancer” (mxfpe). the chimeric construct mxfp-il- ra (ecd & tmd)-gp (icd) was a fusion construct of il- ra (aa - ) and gp (aa - ).

cell lines and media: hela cells were grown in dmem containing % v/v fbs, penicillin-streptomycin, and l-glutamine ( mm). rpe cells were grown in dmem/f containing % v/v fbs, penicillin-streptomycin, and l-glutamine ( mm). rpe cells were stably transfected with mxfpe-il- ra, mutants and the chimeric construct using the pei method according to standard protocols. using g selection ( . mg/ml), individual clones were selected, proliferated and characterized. for comparing receptor cell surface expression levels of stable clones expressing variants of il- ra, cells were detached using pbs+ mm edta, spun down ( g, min) and incubated with “enhancer” agfp-ennbdy ( nm, min on ice). after incubation, cells were washed with pbs and run on a cytometer.

flow cytometry staining and antibodies: for measuring dose-response curves of stat / phosphorylation (either th- cells or rpe clones), -well plates were prepared with µl of cell suspensions at x cells/ml/well for th- and x cells/ml/well for rpe . the latter were detached using accutase (sigma). cells were stimulated with a set of different concentrations to obtain dose-response curves. to this end, cells were stimulated for min at °c with the respective cytokines followed by pfa fixation ( %) for min at rt. for kinetic experiments, cell suspensions were stimulated with a defined, saturating concentration of cytokines ( nm il- , nm hypil- , nm wt-il- ) in a reverse order so that all cell suspensions were pfa-fixed ( %) simultaneously. for pstat / kinetic experiments under jak inhibition, tofacitinib ( μm, stratech, #s -sel) was added after min of stimulation and cells were pfa-fixed in the correct order. after fixation ( min at rt), cells were spun down at g for min at °c. cell pellets were resuspended and permeabilized in ice-cold methanol and kept for min on ice. after permeabilization, cells were fluorescently barcoded according to ( ). in brief: using two nhs-dyes (pacificblue, # , dylight , # , thermo scientific), individual wells were stained with a combination of different concentrations of these dyes. after barcoding, cells were pooled and stained with anti-pstat alexa (cell signaling technologies, # ) and anti-pstat alexa (biolegend, # ) at a : dilution in pbs+ . %bsa for h at rt. t-cells were also stained with anti-cd alexafluor ( : , biolegend, # ), anti-cd pe ( : , biolegend, # ), anti-cd brilliantviolet ( : , biolegend, # ). cells were analyzed on a flow cytometer (beckman coulter, cytoflex s) and individual cell populations were identified by their barcoding pattern. mean fluorescence intensity (mfi) of pstat and pstat was measured for all individual cell populations.
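the barcoding step described above can be illustrated with a short sketch. it assumes that each well receives a unique pair of dye levels (one level per nhs-dye) and that pooled cells are later assigned back to their well from that pair. the function name `barcode_map`, the level labels and the well names are hypothetical and are not taken from the protocol; the number of levels actually used is not assumed here.

```python
from itertools import product


def barcode_map(levels_dye_a, levels_dye_b, samples):
    """Assign each sample a unique (dye A level, dye B level) pair.

    Hypothetical helper: the levels stand in for the different dye
    concentrations used to stain individual wells before pooling.
    """
    codes = list(product(levels_dye_a, levels_dye_b))
    if len(samples) > len(codes):
        raise ValueError("not enough dye combinations for the number of samples")
    return dict(zip(samples, codes))


# Illustrative values only: 3 levels of one dye x 2 levels of the other = 6 codes.
wells = ["w1", "w2", "w3", "w4", "w5", "w6"]
mapping = barcode_map(["low", "mid", "high"], ["low", "high"], wells)

# After pooling and acquisition, the dye pair identifies the well of origin.
decode = {code: well for well, code in mapping.items()}
```

in this toy layout `decode[("mid", "high")]` gives back the well that was stained with that dye pair, which is the lookup needed to separate the pooled populations again.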
for measuring total stat levels, methanol-permeabilized cells were stained with anti-stat alexa ( : , biolegend, # ) or anti-stat apc ( : , biolegend, # ). for measuring total irf levels, methanol-permeabilized cells were stained with anti-irf alexa ( : , biolegend, # ). for measuring cell surface levels of gp , cells were detached with accutase (sigma) and stained with anti-gp apc ( : , biolegend, # ) for h on ice.

rna transcriptome sequencing: human th- cells from three donors each (stemcell technologies) were cultivated and stimulated as described above. cells were washed in hank’s balanced salt solution (hbss, gibco) and snap frozen for storage. rna was isolated using the rneasy kit (qiagen) according to manufacturer’s protocol. all rna / ratios were above . . μg of rna was used from each sample. transcriptomic analysis was done by novogene as follows. sequencing libraries were generated using nebnext® ultra™ rna library prep kit for illumina® (neb, usa) following manufacturer’s recommendations and index codes were added to attribute sequences to each sample. briefly, mrna was purified from total rna using poly-t oligo-attached magnetic beads. fragmentation was carried out using divalent cations under elevated temperature in nebnext first strand synthesis reaction buffer ( x). first strand cdna was synthesized using random hexamer primer and m-mulv reverse transcriptase (rnase h-). second strand cdna synthesis was subsequently performed using dna polymerase i and rnase h. remaining overhangs were converted into blunt ends via exonuclease/polymerase activities. after adenylation of the ’ ends of dna fragments, nebnext adaptors with hairpin loop structure were ligated to prepare for hybridization. in order to select cdna fragments of preferentially ~ bp in length, the library fragments were purified with ampure xp system (beckman coulter, beverly, usa). then μl user enzyme (neb, usa) was used with size-selected, adaptor-ligated cdna at °c for min followed by min at °c before pcr. then pcr was performed with phusion high-fidelity dna polymerase, universal pcr primers and index (x) primer. finally, pcr products were purified (ampure xp system) and library quality was assessed on the agilent bioanalyzer system.

rna sequencing data analysis: primary data analysis for quality control, mapping to reference genome and quantification was conducted by novogene as outlined below.

quality control: raw data (raw reads) of fastq format were first processed through in-house scripts. in this step, clean data (clean reads) were obtained by removing reads containing adapter and poly-n sequences and reads with low quality from raw data. at the same time, q , q and gc content of the clean data were calculated. all the downstream analyses were based on the clean data with high quality.

mapping to reference genome: reference genome and gene model annotation files were downloaded from genome website browser (ncbi/ucsc/ensembl) directly. paired-end clean reads were mapped to the reference genome using hisat software. hisat uses a large set of small gfm indexes that collectively cover the whole genome. these small indexes (called local indexes), combined with several alignment strategies, enable rapid and accurate alignment of sequencing reads.

quantification: htseq was used to count the number of reads mapped to each gene, including known and novel genes. the rpkm of each gene was then calculated based on the length of the gene and the read count mapped to this gene.
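as a worked illustration of the quantification step, the standard rpkm definition (reads per kilobase of exon model per million mapped reads) can be written as a small function. this is a minimal sketch of the textbook formula, not the novogene pipeline itself; the function name `rpkm` and the example numbers are assumptions chosen for illustration.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads.

    RPKM = read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Illustrative numbers only: 500 reads on a 2,000 bp gene in a library
# of 10 million mapped reads.
example_value = rpkm(500, 2_000, 10_000_000)
```

dividing by gene length and by library size is what makes values comparable between genes of different length and between samples sequenced to different depth.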
rpkm (reads per kilobase of exon model per million mapped reads) accounts for the effect of both sequencing depth and gene length on the read count and is currently the most commonly used method for estimating gene expression levels. for each identified gene, the fold change was calculated as the ratio of cytokine stimulated/unstimulated expression levels within each donor and an unpaired, two-tailed t test was applied to calculate p values. genes were considered to be significantly altered if: p value ≤ . , and log fold change ≥ + or ≤ − . genes with an rpkm of less than in two or more donors were excluded from analysis so as to remove genes with abundance near detection limit. genes without annotated function were also removed. functional annotation of genes (kegg pathways, go terms) was done using david bioinformatics resource functional annotation tool ( , ). the clustered heatmap was generated using the pheatmap package in r studio.

sirna-mediated knockdown of irf in rpe cells: a set of four irf -sirnas were purchased from dharmacon and tested individually to determine levels of knockdown achieved. the sirna providing the highest level of irf . knockdown (horizon, lq- - - , sirna # : ugaacucccugccagauau) was subsequently used in all the experiments. rpe -il ra cells were plated in -well dishes ( . x cells per well) and transfected the next day with irf -sirna or control-gapdh sirna (horizon, d- - - ) (dharmacon) using dharmafect transfection reagent (dharmacon) following the manufacturer’s instructions for h. at different timepoints of il- ( nm) or hypil- ( nm) stimulation, samples were collected from one -well each. cells were trypsinized and each sample was spun down and pellets snap-frozen in liquid nitrogen for subsequent rna isolation ( %) or pfa-fixed for total irf staining ( %) by flow cytometry.

real-time quantitative pcr: cells were subject to rna isolation using the qiagen rneasy kit. rna ( ng) was reverse transcribed to complementary dna (cdna) using an iscript cdna synthesis kit (biorad, # ), which was used as template for quantitative pcr. powertrack™ sybr green master mix (takara, #a ) was used for the reaction with the following primers:

target | forward primer | reverse primer | amplicon size
b-actin | catgtacgttgctatccaggc | ctccttaatgtcacgcacgat | bp
stat | ctagtggagtggaagcggag | caccacaaacgagctctgaa | bp
gbp | tcctcggattattgctcggc | cctttgcgcttcagcctttt | bp
oas | gaaggcagctcacgaaacc | aggcctcagcctcttgtg | bp
socs | gtccccccagaagagcctatta | ttgacggtcttccgacagagat |

b-actin was used as housekeeping gene for normalization. each sirna knockdown experiment was performed in three replicates with each sample for qpcr being done in two technical replicates.

mathematical models and bayesian inference: we developed two new mathematical models, making use of ordinary differential equations (odes), for the initial steps of cytokine-receptor binding, dimer formation and signal activation by hypil- and il- , respectively; namely, a set of odes for the hypil- system and a separate set of odes for the il- system (see end of this section for the set of odes included in each model). these odes describe the rate of change of the concentration for each molecular species considered in the receptor-ligand systems (hypil- and il- ) over time. by solving these odes, a time-course for the concentration of total (free and bound) phosphorylated stat and stat can be obtained and compared to the experimental data (supp. fig. b & c). the hypil- and il- mathematical models differ due to the reactions involved in the formation of the signaling dimer for each cytokine. under stimulation with hypil- , two hypil- bound gp monomers are required to form the homodimer (supp. fig. a), whereas under il- stimulation, we assume that il- binds to the il- ra chain and not to gp (supp. fig. b) and hence the heterodimer is composed of an il- molecule bound to an il- ra monomer and one gp chain. in the mathematical models, we assume that upon formation of the dimers (homo- or heterodimer), these receptor chains become immediately phosphorylated. the models do not consider jak molecules explicitly. we assume that these molecules are constitutively bound to their corresponding receptor chains and that they phosphorylate immediately upon receptor phosphorylation (dimer formation). after the formation of the dimer, which we denote by $D_{6}$ or $D_{27}$, formed by hypil- or il- respectively, the biochemical reactions included in each mathematical model are similar, and are summarized as follows. table provides a description of the rates for each reaction considered in each (and both) mathematical model(s). in what follows we assume mass action kinetics for all the reactions. a free cytoplasmic unphosphorylated stat or stat molecule can bind to either receptor chain in the dimer, provided that the intracellular tyrosine residue of the receptor in the dimer is free (supp. fig. c & d). the stat or stat molecule can subsequently dissociate from the receptor chain in the dimer or can become phosphorylated (with rate 𝑞) whilst bound to the dimer.
we have assumed that the rate of stat or stat phosphorylation when bound does not depend on the stat type ( or ) or on the receptor chain (supp. fig. c & d). phosphorylated stat (pstat ) and stat (pstat ) molecules can dissociate from the dimer. once free in the cytoplasm, they can then dephosphorylate (supp. fig. g). we have assumed that this rate of stat dephosphorylation only depends on the concentration of the respective pstat type, free in the cytoplasm. we note that no allostery has been considered in the models and hence, phosphorylated and unphosphorylated stat molecules dissociate from the receptor with the same rate (supp. fig. c & d). finally, any molecular species containing receptor molecules can be removed from the system, due to internalisation or degradation, via one of two hypothesised mechanisms (supp. fig. e & f):

• hypothesis (h ): receptors (free or bound, phosphorylated or unphosphorylated) are internalised/degraded with a rate proportional to the concentration of the species in which they are contained, or
• hypothesis (h ): receptors (free or bound, phosphorylated or unphosphorylated) are internalised/degraded with a rate proportional to the product of the concentration of the species in which they are contained and the sum of the concentrations of free cytoplasmic phosphorylated stat and stat .

we note that hypothesis assumes that receptor molecules (free or bound, phosphorylated or unphosphorylated) are being internalised/degraded as part of the natural cellular trafficking cycle. hypothesis is consistent with a potential feedback mechanism, whereby the free cytoplasmic pstat molecules would migrate to the nucleus and increase the production of negative feedback proteins, such as socs , which down-regulate cytokine signaling.
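the difference between the two removal mechanisms can be made concrete with a small sketch. it only encodes the two proportionality statements above: a rate proportional to the species concentration, and a rate proportional to the species concentration times the summed free phosphorylated stat concentrations. the function names, the argument names `beta` and `gamma`, and the numbers are illustrative assumptions, not fitted values.

```python
def removal_rate_constant_trafficking(beta, species_conc):
    """Removal proportional to the concentration of the receptor-containing species."""
    return beta * species_conc


def removal_rate_pstat_feedback(gamma, species_conc, free_pstat_a, free_pstat_b):
    """Removal proportional to the species concentration multiplied by the
    sum of the two free cytoplasmic phosphorylated stat concentrations."""
    return gamma * (free_pstat_a + free_pstat_b) * species_conc


# Illustrative numbers only (concentrations in nm, arbitrary rate constants).
no_signal = removal_rate_pstat_feedback(0.01, 10.0, 0.0, 0.0)
with_signal = removal_rate_pstat_feedback(0.01, 10.0, 2.0, 3.0)
constitutive = removal_rate_constant_trafficking(0.1, 10.0)
```

under the feedback form, removal is zero before any stat has been phosphorylated and grows as free phosphorylated stat accumulates, whereas the constitutive form removes receptors at the same per-molecule rate at all times.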
thus, the internalisation/degradation rate of receptor molecules (free or bound, phosphorylated or unphosphorylated) under hypothesis increases with the total amount of free cytoplasmic phosphorylated stat and stat , to account for this surface receptor down-regulation. a depiction of the reactions in both the hypil- and il- mathematical models and under each hypothesis is given in supp. fig. where a), c), e) and g) describe the hypil- model and b), d), f) and g) describe the il- model. in this figure, 𝑖 ∈ { , } so that the reactions shown can either involve stat or stat . above or below the reaction arrows is a symbol which represents the rate at which the reaction occurs (under the assumption of mass action kinetics). the notation for the rate constants and initial concentrations in the models, along with their descriptions and units, is given in table .

parameter | description | unit
$r_{1,6}^{+}$, $r_{1,27}^{+}$ | rate of receptor-ligand binding | nm⁻¹ s⁻¹
$r_{1,6}^{-}$, $r_{1,27}^{-}$ | rate of receptor-ligand dissociation | s⁻¹
$r_{2,6}^{+}$, $r_{2,27}^{+}$ | rate of monomers binding to form a dimer | nm⁻¹ s⁻¹
$r_{2,6}^{-}$, $r_{2,27}^{-}$ | rate of dissociation of the dimer | s⁻¹
$k_{ia}^{+}$ | rate of stat𝑖 binding to gp | nm⁻¹ s⁻¹
$k_{ib}^{+}$ | rate of stat𝑖 binding to il- ra | nm⁻¹ s⁻¹
$k_{ia}^{-}$ | rate of stat𝑖 dissociating from gp | s⁻¹
$k_{ib}^{-}$ | rate of stat𝑖 dissociating from il- ra | s⁻¹
$q$ | rate of stat phosphorylation on the dimer | s⁻¹
$d_{i}$ | rate of free pstat𝑖 dephosphorylation | s⁻¹
$\beta_{6}$, $\beta_{27}$ | rate of receptor internalisation/degradation under hypothesis | s⁻¹
$\gamma_{6}$, $\gamma_{27}$ | rate of receptor internalisation/degradation under hypothesis | nm⁻¹ s⁻¹
$[R_{1}(0)]$ | initial concentration of gp | nm
$[R_{2}(0)]$ | initial concentration of il- rα | nm
$[S_{i}(0)]$ | initial concentration of stat𝑖 | nm

table : notation, definitions and units for the parameter values used in the mathematical models, where 𝑖 ∈ { , } so that stat𝑖 corresponds to stat or stat .

the hypil- mathematical model was formulated based on reactions involving the following species:
• $L_{6}$ = hypil- ,
• $R_{1}$ = gp ,
• $C_{1}$ = gp - hypil- monomer,
• $D_{6}$ = phosphorylated gp - hypil- - hypil- - gp homodimer,
• $S_{1}$ = unbound cytoplasmic unphosphorylated stat ,
• $S_{3}$ = unbound cytoplasmic unphosphorylated stat ,
• $D_{6} \cdot S_{1}$ = dimer bound to stat ,
• $D_{6} \cdot S_{3}$ = dimer bound to stat ,
• $D_{6} \cdot pS_{1}$ = dimer bound to pstat ,
• $D_{6} \cdot pS_{3}$ = dimer bound to pstat ,
• $S_{1} \cdot D_{6} \cdot S_{1}$ = dimer bound to two molecules of stat ,
• $pS_{1} \cdot D_{6} \cdot S_{1}$ = dimer bound to two molecules of stat , one of which is phosphorylated,
• $pS_{1} \cdot D_{6} \cdot pS_{1}$ = dimer bound to two molecules of pstat ,
• $S_{3} \cdot D_{6} \cdot S_{3}$ = dimer bound to two molecules of stat ,
• $pS_{3} \cdot D_{6} \cdot S_{3}$ = dimer bound to two molecules of stat , one of which is phosphorylated,
• $pS_{3} \cdot D_{6} \cdot pS_{3}$ = dimer bound to two molecules of pstat ,
• $S_{1} \cdot D_{6} \cdot S_{3}$ = dimer bound to one molecule of stat and one of stat ,
• $pS_{1} \cdot D_{6} \cdot S_{3}$ = dimer bound to one molecule of pstat and one of stat ,
• $S_{1} \cdot D_{6} \cdot pS_{3}$ = dimer bound to one molecule of stat and one of pstat ,
• $pS_{1} \cdot D_{6} \cdot pS_{3}$ = dimer bound to one molecule of pstat and one of pstat ,
• $pS_{1}$ = unbound cytoplasmic phosphorylated stat ,
• $pS_{3}$ = unbound cytoplasmic phosphorylated stat .
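to show how mass action kinetics turns a reaction into ode terms, the following sketch integrates a single reversible binding step, receptor + ligand ⇌ complex, which is the simplest building block of models of this kind. it is a toy example under stated assumptions: the rate constants, initial concentrations and the fixed-step runge-kutta integrator are illustrative choices made to keep the code dependency-free, and none of them are taken from the models described here.

```python
def binding_rhs(state, k_on, k_off):
    """Mass-action right-hand side for R + L <-> C."""
    r, l, c = state
    net_forward = k_on * r * l - k_off * c  # binding minus dissociation
    return (-net_forward, -net_forward, net_forward)


def rk4_step(state, dt, k_on, k_off):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return tuple(b + h * s for b, s in zip(base, slope))

    s1 = binding_rhs(state, k_on, k_off)
    s2 = binding_rhs(shifted(state, s1, dt / 2), k_on, k_off)
    s3 = binding_rhs(shifted(state, s2, dt / 2), k_on, k_off)
    s4 = binding_rhs(shifted(state, s3, dt), k_on, k_off)
    return tuple(
        x + dt / 6 * (a1 + 2 * a2 + 2 * a3 + a4)
        for x, a1, a2, a3, a4 in zip(state, s1, s2, s3, s4)
    )


def simulate(state, k_on, k_off, dt, n_steps):
    """Advance the state n_steps times and return the final concentrations."""
    for _ in range(n_steps):
        state = rk4_step(state, dt, k_on, k_off)
    return state


# Illustrative values: 1 nm receptor, 2 nm ligand, no complex at time zero.
r_end, l_end, c_end = simulate((1.0, 2.0, 0.0), k_on=1.0, k_off=0.5, dt=0.01, n_steps=5000)
```

two checks follow directly from the mass action form: total receptor (`r_end + c_end`) and total ligand (`l_end + c_end`) are conserved, and at long times binding and dissociation balance, so `k_on * r_end * l_end` equals `k_off * c_end`.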
the initial reactions in the hypil- signaling pathway can then be described by the odes ( ) – ( ), under the law of mass action, where the terms involving the parameter $\beta_{6}$ apply only to the model under hypothesis and the terms involving the parameter $\gamma_{6}$ apply only to the model under hypothesis . square brackets around a species denote the concentration of this species with unit nm, and “⋅” implies a reaction bond between two molecules/species. the odes are valid for any time 𝑡, with 𝑡 ≥ 0, but time has been omitted in the species concentration for ease of notation. we note here that, for example, $[R_{1}] = [R_{1}](t)$ for all 𝑡 ≥ 0.
/ 𝑑[𝑅 ] 𝑑𝑡 = −𝑟 , + [𝑅 ][𝐿)] + 𝑟 , − [𝐶 ] − 𝛽 [𝑅 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑅 ] ( ) 𝑑[𝐿)] 𝑑𝑡 = −𝑟 , + [𝑅 ][𝐿)] + 𝑟 , − [𝐶 ] ( ) 𝑑[𝐶 ] 𝑑𝑡 = 𝑟 , + [𝑅 ][𝐿)] − 𝑟 , − [𝐶 ] − 𝑟 , + [𝐶 ] + 𝑟 , − [𝐷 ] − 𝛽 [𝐶 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐶 ] ( ) 𝑑[𝐷 ] 𝑑𝑡 = 𝑟 , + [𝐶 ] − 𝑟 , − [𝐷 ] − 𝑘 𝑎 + [𝐷 ][𝑆 ] + 𝑘 𝑎 − ([𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ]) − 𝑘 𝑎 + [𝐷 ][𝑆 ] + 𝑘 𝑎 − ([𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ]) − 𝛽 [𝐷 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ] ( ) 𝑑[𝑆 ] 𝑑𝑡 = −𝑘 𝑎 + [𝑆 ]( [𝐷 ] + [𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ]) + 𝑘 𝑎 − ([𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ]) + 𝑑 [𝑝𝑆 ] ( ) 𝑑[𝑆 ] 𝑑𝑡 = −𝑘 𝑎 + [𝑆 ]( [𝐷 ] + [𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ]) + 𝑘 𝑎 − ([𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆# ⋅ 𝐷) ⋅ 𝑆(]) + 𝑑 [𝑝𝑆 ] ( ) 𝑑[𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ] − 𝑘 𝑎 − [𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 + [𝐷 ⋅ 𝑆 ][𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 + [𝐷 ⋅ 𝑆 ][𝑆 ] + 𝑘 𝑎 − [𝑆# ⋅ 𝐷 ⋅ 𝑆(] − 𝑞[𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝐷 ⋅ 𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ⋅ 𝑆 ] ( ) 𝑑[𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ] − 𝑘 𝑎 − [𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 + [𝐷 ⋅ 𝑆 ][𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 + [𝐷 ⋅ 𝑆 ][𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑞[𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑝𝑆# ⋅ 𝐷) ⋅ 𝑆(] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝐷 ⋅ 𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ⋅ 𝑆 ] ( ) 𝑑[𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = −𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑝𝑆# ⋅ 𝐷) ⋅ 𝑆(] + 𝑞[𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝐷 ⋅ 𝑝𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ⋅ 𝑝𝑆 ] ( ) 𝑑[𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = −𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑞[𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝐷 ⋅ 𝑝𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ⋅ 𝑝𝑆 ] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑞[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛽 [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑞[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛽 [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] − 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑞[𝑆) ⋅ 𝐷* ⋅ 𝑆)] − 
𝑞[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑆)] − 𝛽*[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑆)] ( ) .cc-by . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . / −𝛾*([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑆)] 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] − 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑞[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑞[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛽 [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑞[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] −𝛽*[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆)] − 𝛾*([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆)] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑞[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] −𝛽*[𝑝𝑆+ ⋅ 𝐷* ⋅ 𝑝𝑆+] − 𝛾*([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆+ ⋅ 𝐷* ⋅ 𝑝𝑆+] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 + [𝑆 ⋅ 𝐷 ][𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑞[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛽 [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑞[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] −𝑘+,- [𝑝𝑆) ⋅ 𝐷* ⋅ 𝑆+] − 𝑞[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑆+] − 𝑘),- [𝑝𝑆) ⋅ 𝐷* ⋅ 𝑆+] −𝛽*[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑆+] − 𝛾*([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑆+] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑞[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑝𝑆 ] −𝑘),- [𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆+] − 𝑞[𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆+] − 𝑘+,- [𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆+] −𝛽*[𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆+] − 𝛾*([𝑝𝑆)] + [𝑝𝑆+])[𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆+] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑞([𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ]) −[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆+](𝑘),- + 𝑘+,- ) − 𝛽*[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆+] −𝛾*([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆) ⋅ 𝐷* ⋅ 𝑝𝑆+] ( ) 𝑑[𝑝𝑆 ] 𝑑𝑡 = 𝑘 𝑎 − ([𝐷 ⋅ 𝑝𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ]) − 𝑑 [𝑝𝑆 ] ( ) 𝑑[𝑝𝑆 ] 𝑑𝑡 = 𝑘 𝑎 − ([𝐷 ⋅ 𝑝𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ]) − 𝑑 [𝑝𝑆 ] ( ) similarly, and with some species in common with the hypil- model, the il- model has been formulated based on reactions involving the following species: • 𝐿"* = il- , • 𝑅# = gp , • 𝑅" = il- ra, • 𝐶" = il- ra - il- monomer, • 𝐷"* = phosphorylated il- ra - il- 
- gp heterodimer,
• 𝑆# = unbound cytoplasmic unphosphorylated stat ,
• 𝑆( = unbound cytoplasmic unphosphorylated stat ,
• 𝑆# ⋅ 𝐷"* = dimer bound to stat via 𝑅#,
• 𝑆( ⋅ 𝐷"* = dimer bound to stat via 𝑅#,
• 𝑝𝑆# ⋅ 𝐷"* = dimer bound to pstat via 𝑅#,
• 𝑝𝑆( ⋅ 𝐷"* = dimer bound to pstat via 𝑅#,
• 𝐷"* ⋅ 𝑆# = dimer bound to stat via 𝑅",
• 𝐷"* ⋅ 𝑆( = dimer bound to stat via 𝑅",
• 𝐷"* ⋅ 𝑝𝑆# = dimer bound to pstat via 𝑅",
• 𝐷"* ⋅ 𝑝𝑆( = dimer bound to pstat via 𝑅",
• 𝑆# ⋅ 𝐷"* ⋅ 𝑆# = dimer bound to two molecules of stat ,
• 𝑝𝑆# ⋅ 𝐷"* ⋅ 𝑆# = dimer bound to two molecules of stat , one of them phosphorylated on 𝑅#,
• 𝑆# ⋅ 𝐷"* ⋅ 𝑝𝑆# = dimer bound to two molecules of stat , one of them phosphorylated on 𝑅",
• 𝑝𝑆# ⋅ 𝐷"* ⋅ 𝑝𝑆# = dimer bound to two molecules of pstat ,
• 𝑆( ⋅ 𝐷"* ⋅ 𝑆( = dimer bound to two molecules of stat ,
• 𝑝𝑆( ⋅ 𝐷"* ⋅ 𝑆( = dimer bound to two molecules of stat , one of them phosphorylated on 𝑅#,
• 𝑆( ⋅ 𝐷"* ⋅ 𝑝𝑆( = dimer bound to two molecules of stat , one of them phosphorylated on 𝑅",
• 𝑝𝑆( ⋅ 𝐷"* ⋅ 𝑝𝑆( = dimer bound to two molecules of pstat ,
• 𝑆# ⋅ 𝐷"* ⋅ 𝑆( = dimer bound to stat via 𝑅# and stat via 𝑅",
• 𝑆( ⋅ 𝐷"* ⋅ 𝑆# = dimer bound to stat via 𝑅" and stat via 𝑅#,
• 𝑝𝑆# ⋅ 𝐷"* ⋅ 𝑆( = dimer bound to pstat via 𝑅# and stat via 𝑅",
• 𝑆( ⋅ 𝐷"* ⋅ 𝑝𝑆# = dimer bound to pstat via 𝑅" and stat via 𝑅#,
• 𝑆# ⋅ 𝐷"* ⋅ 𝑝𝑆( = dimer bound to stat via 𝑅# and pstat via 𝑅",
• 𝑝𝑆( ⋅ 𝐷"* ⋅ 𝑆# = dimer bound to stat via 𝑅" and pstat via 𝑅#,
• 𝑝𝑆# ⋅ 𝐷"* ⋅ 𝑝𝑆( = dimer bound to pstat via 𝑅# and pstat via 𝑅",
• 𝑝𝑆( ⋅ 𝐷"* ⋅ 𝑝𝑆# = dimer bound to pstat via 𝑅" and pstat via 𝑅#,
• 𝑝𝑆# = unbound cytoplasmic phosphorylated stat ,
• 𝑝𝑆( = unbound cytoplasmic phosphorylated stat .

again, under the law of mass action, the initial reactions in the il- signaling pathway can be described by the odes ( ) – ( ).

𝑑[𝑅 ]/𝑑𝑡 = −𝑟 , + [𝐶 ][𝑅 ] + 𝑟 , − [𝐷 ] − 𝛽 [𝑅 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑅 ] ( )
𝑑[𝑅 ]/𝑑𝑡 = −𝑟 , + [𝑅 ][𝐿 ] + 𝑟 , − [𝐶 ] − 𝛽 [𝑅 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑅 ] ( )
𝑑[𝐿 ]/𝑑𝑡 = −𝑟 , + [𝑅 ][𝐿 ] + 𝑟 , − [𝐶 ] ( )
𝑑[𝐶 ]/𝑑𝑡 = 𝑟 , + [𝑅 ][𝐿 ] − 𝑟 , − [𝐶 ] − 𝑟 , + [𝐶 ][𝑅 ] + 𝑟 , − [𝐷 ] − 𝛽 [𝐶 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐶 ] ( )
𝑑[𝐷 ]/𝑑𝑡 = 𝑟 , + [𝐶 ][𝑅 ] − 𝑟 , − [𝐷 ] − (𝑘 𝑎 + + 𝑘 𝑏 + )[𝐷 ][𝑆 ] + 𝑘 𝑎 − ([𝑆 ⋅ 𝐷 ] + [𝑝𝑆 ⋅ 𝐷 ]) + 𝑘 𝑏 − ([𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ]) − (𝑘 𝑎 + + 𝑘 𝑏 + )[𝐷 ][𝑆 ] + 𝑘 𝑎 − ([𝑆 ⋅ 𝐷 ] + [𝑝𝑆 ⋅ 𝐷 ]) + 𝑘 𝑏 − ([𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ]) − 𝛽 [𝐷 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ] ( )
/ 𝑑[𝑆 ] 𝑑𝑡 = −𝑘 𝑎 + [𝑆 ]([𝐷 ] + [𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ] + [𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ]) + 𝑘 𝑎 − ([𝑆 ⋅ 𝐷 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ]) − 𝑘 𝑏 + [𝑆 ]([𝐷 ] + [𝑆 ⋅ 𝐷 ] + [𝑝𝑆 ⋅ 𝐷 ] + [𝑆 ⋅ 𝐷 ] + [𝑝𝑆 ⋅ 𝐷 ]) + 𝑘 𝑏 − ([𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ]) + 𝑑 [𝑝𝑆 ] ( ) 𝑑[𝑆 ] 𝑑𝑡 = −𝑘 𝑎 + [𝑆 ]([𝐷 ] + [𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ] + [𝐷 ⋅ 𝑆 ] + [𝐷 ⋅ 𝑝𝑆 ]) + 𝑘 𝑎 − ([𝑆 ⋅ 𝐷 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ]) − 𝑘 𝑏 + [𝑆 ]([𝐷 ] + [𝑆 ⋅ 𝐷 ] + [𝑝𝑆 ⋅ 𝐷 ] + [𝑆 ⋅ 𝐷 ] + [𝑝𝑆 ⋅ 𝐷 ]) + 𝑘 𝑏 − ([𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ]) + 𝑑 [𝑝𝑆 ] ( ) 𝑑[𝑆 ⋅ 𝐷 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ] − 𝑞[𝑆 ⋅ 𝐷 ] − 𝑘 𝑏 + [𝑆 ][𝑆 ⋅ 𝐷 ] + 𝑘 𝑏 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑏 + [𝑆 ][𝑆 ⋅ 𝐷 ] + 𝑘 𝑏 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑘 𝑏 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑏 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝑆 ⋅ 𝐷 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑆 ⋅ 𝐷 ] ( ) 𝑑[𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑏 + [𝑆 ][𝐷 ] − 𝑘 𝑏 − [𝐷 ⋅ 𝑆 ] − 𝑞[𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛽 [𝐷 ⋅ 𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ⋅ 𝑆 ] ( ) 𝑑[𝑆 ⋅ 𝐷 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ] − 𝑞[𝑆 ⋅ 𝐷 ] − 𝑘 𝑏 + [𝑆 ][𝑆 ⋅ 𝐷 ] + 𝑘 𝑏 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑏 + [𝑆 ][𝑆 ⋅ 𝐷 ] + 𝑘 𝑏 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑘 𝑏 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑏 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝑆 ⋅ 𝐷 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑆 ⋅ 𝐷 ] ( ) 𝑑[𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑏 + [𝑆 ][𝐷 ] − 𝑘 𝑏 − [𝐷 ⋅ 𝑆 ] − 𝑞[𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝛽 [𝐷 ⋅ 𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ⋅ 𝑆 ] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ] 𝑑𝑡 = −𝑘 𝑏 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] + 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑏 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] + 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑞[𝑆 ⋅ 𝐷 ] − 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ] + 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝑝𝑆 ⋅ 𝐷 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑝𝑆 ⋅ 𝐷 ] ( ) .cc-by . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. 
it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . / 𝑑[𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = −𝑘 𝑎 + [𝐷 ⋅ 𝑝𝑆 ][𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝑘 𝑎 + [𝐷 ⋅ 𝑝𝑆 ][𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑞[𝐷 ⋅ 𝑆 ] − 𝑘 𝑏 − [𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝐷 ⋅ 𝑝𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ⋅ 𝑝𝑆 ] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ] 𝑑𝑡 = −𝑘 𝑏 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] + 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] − 𝑘 𝑏 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] + 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + 𝑞[𝑆 ⋅ 𝐷 ] − 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ] + 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝑝𝑆 ⋅ 𝐷 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝑝𝑆 ⋅ 𝐷 ] ( ) 𝑑[𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = −𝑘 𝑎 + [𝐷 ⋅ 𝑝𝑆 ][𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝑘 𝑎 + [𝐷 ⋅ 𝑝𝑆 ][𝑆 ] + 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑞[𝐷 ⋅ 𝑆 ] − 𝑘 𝑏 − [𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + 𝑘 𝑎 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] − 𝛽 [𝐷 ⋅ 𝑝𝑆 ] − 𝛾 ([𝑝𝑆 ] + [𝑝𝑆 ])[𝐷 ⋅ 𝑝𝑆 ] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] +𝑘) [𝑆) ⋅ 𝐷 ][𝑆)] − 𝑘) - [𝑆) ⋅ 𝐷 ⋅ 𝑆)] − 𝑞[𝑆) ⋅ 𝐷 ⋅ 𝑆)] −𝛽 [𝑆) ⋅ 𝐷 ⋅ 𝑆)] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑆) ⋅ 𝐷 ⋅ 𝑆)] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑏 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] − 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] +𝑞[𝑆) ⋅ 𝐷 ⋅ 𝑆)] − 𝑞[𝑝𝑆) ⋅ 𝐷 ⋅ 𝑆)] − 𝑘),- [𝑝𝑆) ⋅ 𝐷 ⋅ 𝑆)] −𝛽 [𝑝𝑆) ⋅ 𝐷 ⋅ 𝑆)] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆) ⋅ 𝐷 ⋅ 𝑆)] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑝𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] +𝑞[𝑆) ⋅ 𝐷 ⋅ 𝑆)] − 𝑞[𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆)] − 𝑘) - [𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆)] −𝛽 [𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆)] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆)] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑞([𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ]) −[𝑝𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆)](𝑘),- + 𝑘) - ) − 𝛽 [𝑝𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆)] −𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆)] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] +𝑘+ [𝑆+ ⋅ 𝐷 ][𝑆+] − 𝑘+ - [𝑆+ ⋅ 𝐷 ⋅ 𝑆+] − 𝑞[𝑆+ ⋅ 𝐷 ⋅ 𝑆+] −𝛽 [𝑆+ ⋅ 𝐷 ⋅ 𝑆+] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑆+ ⋅ 𝐷 ⋅ 𝑆+] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑏 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] − 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] +𝑞[𝑆+ ⋅ 𝐷 ⋅ 𝑆+] − 𝑞[𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑆+] − 𝑘+,- [𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑆+] −𝛽 [𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑆+] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑆+] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑝𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] ( ) .cc-by . 
international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . / +𝑞[𝑆+ ⋅ 𝐷 ⋅ 𝑆+] − 𝑞[𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆+] − 𝑘+ - [𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆+] −𝛽 [𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆+] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆+] 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑞([𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ]) −[𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆+](𝑘+,- + 𝑘+ - ) − 𝛽 [𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆+] −𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆+] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] +𝑘+ [𝑆) ⋅ 𝐷 ][𝑆+] − 𝑘+ - [𝑆) ⋅ 𝐷 ⋅ 𝑆+] − 𝑞[𝑆) ⋅ 𝐷 ⋅ 𝑆+] −𝛽 [𝑆) ⋅ 𝐷 ⋅ 𝑆+] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑆) ⋅ 𝐷 ⋅ 𝑆+] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑆 ] +𝑘) [𝑆+ ⋅ 𝐷 ][𝑆)] − 𝑘) - [𝑆+ ⋅ 𝐷 ⋅ 𝑆)] − 𝑞[𝑆+ ⋅ 𝐷 ⋅ 𝑆)] −𝛽 [𝑆+ ⋅ 𝐷 ⋅ 𝑆)] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑆+ ⋅ 𝐷 ⋅ 𝑆)] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑏 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] − 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] +𝑞[𝑆) ⋅ 𝐷 ⋅ 𝑆+] − 𝑞[𝑝𝑆) ⋅ 𝐷 ⋅ 𝑆+] − 𝑘),- [𝑝𝑆) ⋅ 𝐷 ⋅ 𝑆+] −𝛽 [𝑝𝑆) ⋅ 𝐷 ⋅ 𝑆+] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆) ⋅ 𝐷 ⋅ 𝑆+] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] 𝑑𝑡 = 𝑘 𝑏 + [𝑝𝑆 ⋅ 𝐷 ][𝑆 ] − 𝑘 𝑏 − [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] +𝑞[𝑆+ ⋅ 𝐷 ⋅ 𝑆)] − 𝑞[𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑆)] − 𝑘+,- [𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑆)] −𝛽 [𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑆)] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑆)] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑝𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] +𝑞[𝑆) ⋅ 𝐷 ⋅ 𝑆+] − 𝑞[𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆+] − 𝑘+ - [𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆+] −𝛽 [𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆+] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆+] ( ) 𝑑[𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑘 𝑎 + [𝑆 ][𝐷 ⋅ 𝑝𝑆 ] − 𝑘 𝑎 − [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] +𝑞[𝑆+ ⋅ 𝐷 ⋅ 𝑆)] − 𝑞[𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆)] − 𝑘) - [𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆)] −𝛽 [𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆)] − 𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆)] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑞([𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ]) −[𝑝𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆+](𝑘),- + 𝑘+ - ) − 𝛽 [𝑝𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆+] −𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆) ⋅ 𝐷 ⋅ 𝑝𝑆+] ( ) 𝑑[𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] 𝑑𝑡 = 𝑞([𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ]) −[𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆)](𝑘+,- + 𝑘) - ) − 𝛽 [𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆)] −𝛾 ([𝑝𝑆)] + [𝑝𝑆+])[𝑝𝑆+ ⋅ 𝐷 ⋅ 𝑝𝑆)] ( ) .cc-by . 
𝑑[𝑝𝑆 ]/𝑑𝑡 = 𝑘 𝑎 − ([𝑝𝑆 ⋅ 𝐷 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ]) + 𝑘 𝑏 − ([𝐷 ⋅ 𝑝𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ]) − 𝑑 [𝑝𝑆 ] ( )
𝑑[𝑝𝑆 ]/𝑑𝑡 = 𝑘 𝑎 − ([𝑝𝑆 ⋅ 𝐷 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ]) + 𝑘 𝑏 − ([𝐷 ⋅ 𝑝𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ] + [𝑝𝑆 ⋅ 𝐷 ⋅ 𝑝𝑆 ]) − 𝑑 [𝑝𝑆 ] ( )

as in the hypil- model, the terms in equations ( ) - ( ) involving the parameter 𝛽"* apply only to the model under hypothesis and the terms involving the parameter 𝛾"* apply only to the model under hypothesis .

we now describe how the experimental data (supp. fig. b and c) were used to parameterise the mathematical models described above. since the experimental outputs are levels of pstat and pstat as a function of time under hypil- and il- stimulation (supp. fig. b and c), we consider two model outputs of interest for the hypil- and il- mathematical models, which are proportional to the experimental data in supp. figure b and c; namely, the sum of all molecular species (variables) containing phosphorylated stat (free or bound) ([𝑝𝑆#]-,., for 𝑗 ∈ { , }) and the sum of all species (variables) containing phosphorylated stat (free or bound) ([𝑝𝑆(]-,., for 𝑗 ∈ { , }).
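as a reduced illustration of how mass-action odes of the kind listed above are assembled and integrated, the following sketch treats a single reversible binding step, receptor + ligand ⇌ complex. it is not the authors' code: the function names, the rate constants, the initial concentrations and the fixed-step runge-kutta integrator are all assumptions chosen for illustration, and the full models contain many more species.

```python
# Minimal mass-action sketch for one reversible step: R + L <-> C.
# All names and numerical values are hypothetical, not taken from the paper.

def mass_action_rhs(state, r_on, r_off):
    """Time derivatives (d[R]/dt, d[L]/dt, d[C]/dt) under the law of mass action."""
    receptor, ligand, complex_ = state
    net_binding = r_on * receptor * ligand - r_off * complex_
    return (-net_binding, -net_binding, net_binding)


def rk4_step(state, dt, r_on, r_off):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return tuple(b + h * s for b, s in zip(base, slope))

    k1 = mass_action_rhs(state, r_on, r_off)
    k2 = mass_action_rhs(shifted(state, k1, dt / 2), r_on, r_off)
    k3 = mass_action_rhs(shifted(state, k2, dt / 2), r_on, r_off)
    k4 = mass_action_rhs(shifted(state, k3, dt), r_on, r_off)
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(state, r_on, r_off, dt, n_steps):
    """Integrate from the initial state for n_steps steps of size dt."""
    for _ in range(n_steps):
        state = rk4_step(state, dt, r_on, r_off)
    return state


# Example: concentrations in nM, time in s (illustrative values only).
final_receptor, final_ligand, final_complex = simulate(
    (1.0, 10.0, 0.0), r_on=0.01, r_off=0.1, dt=0.1, n_steps=5000
)
```

in this toy system the totals [R] + [C] and [L] + [C] are conserved, which gives a quick sanity check on any implementation of the larger models.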
the concentrations of the two model outputs of interest at any time 𝑡 are given by

[𝑝𝑆#]-,)(𝑡) = [𝐷) ⋅ 𝑝𝑆#](𝑡) + [𝑝𝑆# ⋅ 𝐷) ⋅ 𝑆#](𝑡) + [𝑝𝑆# ⋅ 𝐷) ⋅ 𝑝𝑆#](𝑡) + [𝑝𝑆# ⋅ 𝐷) ⋅ 𝑆(](𝑡) + [𝑝𝑆# ⋅ 𝐷) ⋅ 𝑝𝑆(](𝑡) + [𝑝𝑆#](𝑡), ( )
[𝑝𝑆(]-,)(𝑡) = [𝐷) ⋅ 𝑝𝑆(](𝑡) + [𝑝𝑆( ⋅ 𝐷) ⋅ 𝑆(](𝑡) + [𝑝𝑆( ⋅ 𝐷) ⋅ 𝑝𝑆(](𝑡) + [𝑝𝑆( ⋅ 𝐷) ⋅ 𝑆#](𝑡) + [𝑝𝑆( ⋅ 𝐷) ⋅ 𝑝𝑆#](𝑡) + [𝑝𝑆(](𝑡), ( )

for the hypil- model, and by

[𝑝𝑆#]-,"*(𝑡) = [𝑝𝑆# ⋅ 𝐷"*](𝑡) + [𝐷"* ⋅ 𝑝𝑆#](𝑡) + [𝑝𝑆# ⋅ 𝐷"* ⋅ 𝑆#](𝑡) + [𝑆# ⋅ 𝐷"* ⋅ 𝑝𝑆#](𝑡) + [𝑝𝑆# ⋅ 𝐷"* ⋅ 𝑝𝑆#](𝑡) + [𝑝𝑆# ⋅ 𝐷"* ⋅ 𝑆(](𝑡) + [𝑆( ⋅ 𝐷"* ⋅ 𝑝𝑆#](𝑡) + [𝑝𝑆# ⋅ 𝐷"* ⋅ 𝑝𝑆(](𝑡) + [𝑝𝑆( ⋅ 𝐷"* ⋅ 𝑝𝑆#](𝑡) + [𝑝𝑆#](𝑡), ( )
[𝑝𝑆(]-,"*(𝑡) = [𝑝𝑆( ⋅ 𝐷"*](𝑡) + [𝐷"* ⋅ 𝑝𝑆(](𝑡) + [𝑝𝑆( ⋅ 𝐷"* ⋅ 𝑆(](𝑡) + [𝑆( ⋅ 𝐷"* ⋅ 𝑝𝑆(](𝑡) + [𝑝𝑆( ⋅ 𝐷"* ⋅ 𝑝𝑆(](𝑡) + [𝑝𝑆( ⋅ 𝐷"* ⋅ 𝑆#](𝑡) + [𝑆# ⋅ 𝐷"* ⋅ 𝑝𝑆(](𝑡) + [𝑝𝑆# ⋅ 𝐷"* ⋅ 𝑝𝑆(](𝑡) + [𝑝𝑆( ⋅ 𝐷"* ⋅ 𝑝𝑆#](𝑡) + [𝑝𝑆(](𝑡), ( )

for the il- model.

having developed two mathematical models for the stimulation of the experimental system with hypil- and il- , our objective was then to parameterise these models using approximate bayesian computation sequential monte carlo (abc-smc). firstly, a bayesian model selection was carried out to determine which hypothesis (mechanism) of internalisation/degradation of receptor molecules is most likely given the data. once a hypothesis has been selected, the abc-smc method, together with the experimental data, allows one to obtain posterior distributions for each of the parameter values and initial concentrations in the mathematical models. in this way, we can learn which reactions and parameters in the models cause the differential signaling by pstat observed when stimulating with hypil- and il- . the experimental data we used to compare with the mathematical model
outputs was the mean relative fluorescence intensity of total phosphorylated stat and total phosphorylated stat in both rpe and th- cells (supp. figure b and c). we normalised the data to obtain dimensionless values, which can be compared with the mathematical model outputs. firstly, we constructed a linear model for the fluorescence intensity (background fluorescence) of antibodies for phosphorylated stat and stat in unstimulated cells. we subtracted the value of this linear model at each time point from the corresponding fluorescence intensity in hypil- and il- stimulated cells, for each repeat of the experiment and each cell type. denoting by 𝑓 the experimental fluorescence intensity, 𝑓(𝑟,𝑖,𝑡𝑝,𝑗,𝑑) corresponds to the fluorescence intensity for the 𝑟th repeat, 𝑟 ∈ 𝑅 = { , , , }, with antibody for stat𝑖, 𝑖 ∈ 𝐼 = { , }, at time point 𝑡𝑝 ∈ 𝑇𝑃 = { 𝑚𝑖𝑛, 𝑚𝑖𝑛, 𝑚𝑖𝑛, 𝑚𝑖𝑛, 𝑚𝑖𝑛, 𝑚𝑖𝑛, 𝑚𝑖𝑛, 𝑚𝑖𝑛}, under stimulation by cytokine il-𝑗 (hypil-𝑗 when 𝑗 = ), with 𝑗 ∈ 𝐽 = { , }, and in cell type 𝑑 ∈ 𝐷 = {rpe ,th- }. each data point 𝑑𝑎𝑡𝑎(𝑟,𝑖,𝑡𝑝,𝑗,𝑑) to be used in the bayesian inference and bayesian model selection was then computed as

𝑑𝑎𝑡𝑎(𝑟,𝑖,𝑡𝑝,𝑗,𝑑) = 𝑓(𝑟,𝑖,𝑡𝑝,𝑗,𝑑) / 𝑓(𝑟,𝑖,𝑡𝑝 = 𝑚𝑖𝑛,𝑗 = ,𝑑).

to compare the model output, 𝑠𝑖𝑚, with the data, the output was normalised in the same way as the data, i.e.,

𝑠𝑖𝑚(𝑖,𝑡𝑝,𝑗,𝑑) = [𝑝𝑆$]-,.(𝑡𝑝,𝑑) / [𝑝𝑆$]-,"*( 𝑚𝑖𝑛,𝑑),

where [𝑝𝑆$]-,.(𝑡𝑝,𝑑) denotes the total concentration of phosphorylated stat𝑖 at time 𝑡𝑝 (see equations - ) when considering cell type 𝑑. in this way, the experimental data and the mathematical model outputs are comparable. the similarity between the model output and the data points is then computed by introducing a distance measure 𝛿(𝑠𝑖𝑚,𝑑𝑎𝑡𝑎).
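before turning to the distance measure itself, the normalisation just described (fit a linear background model to unstimulated cells, subtract it, then divide by a reference intensity) can be sketched as follows. this is a minimal illustration under stated assumptions: the function names and the example numbers are hypothetical, the least-squares line is one plausible reading of "linear model", and the choice of reference point is left to the caller because the reference time and cytokine are not legible in the text.

```python
# Sketch of background subtraction and normalisation for one repeat,
# one antibody and one cell type. All names and values are hypothetical.

def fit_line(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def subtract_background(times, trace, unstimulated):
    """Subtract the fitted unstimulated (background) line from a stimulated trace."""
    intercept, slope = fit_line(times, unstimulated)
    return [f - (intercept + slope * t) for t, f in zip(times, trace)]


def normalise(trace, reference_value):
    """Divide a background-corrected trace by a reference intensity."""
    return [value / reference_value for value in trace]


# Example with made-up intensities at three time points (minutes).
times = [0.0, 10.0, 20.0]
unstimulated = [1.0, 2.0, 3.0]
stimulated = [1.0, 6.0, 11.0]
corrected = subtract_background(times, stimulated, unstimulated)
dimensionless = normalise(corrected, reference_value=corrected[1])
```

the same `normalise` step would be applied to the simulated totals, so that data and model output are expressed relative to the same reference condition.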
here, this distance measure was chosen as a generalisation of the euclidean distance, where

$$\delta_d(\mathrm{sim},\mathrm{data})^2 = \sum_{i \in I}\sum_{tp \in TP}\sum_{j \in J}\bigl(\mathrm{sim}(i,tp,j,d) - \mu_{\mathrm{data}}(i,tp,j,d)\bigr)^2,$$

for 𝑑 ∈ 𝐷 = {rpe ,th- }, where $\mu_{\mathrm{data}}(i,tp,j,d)$ is the mean of the four repeats of the data and is given by

$$\mu_{\mathrm{data}}(i,tp,j,d) = \frac{1}{4}\sum_{r \in R}\mathrm{data}(r,i,tp,j,d).$$

to carry out the bayesian model selection and bayesian parameter inference, prior beliefs about the parameters were first defined. each of the parameters (reaction rates) and initial concentrations in the model was sampled from a prior distribution, where the distribution was informed by experimental data or values from the literature, when possible. the choice of prior distributions is given in table .

parameter | prior distribution | reference
𝑟#,) & | for 𝑟 ∼ 𝑁(− , . ) | *
𝑟#,) , | for 𝑟 ∼ 𝑁(− . , . ) | *
𝑟#,"* & | for 𝑟 ∼ 𝑁(− . , . ) | *
𝑟#,"* , | for 𝑟 ∼ 𝑁(− . , . ) | *
𝑟",$ & for 𝑗 ∈ { , } | for 𝑟 ∼ 𝑈𝑛𝑖𝑓(− , ) | ( )
𝑟",$ , for 𝑗 ∈ { , } | for 𝑟 ∼ 𝑈𝑛𝑖𝑓(− , ) | ( )
𝑘$% & ,𝑘$' & for 𝑖 ∈ { , } | for 𝑟 ∼ 𝑈𝑛𝑖𝑓(− , ) | **
𝑘$% , ,𝑘$' , for 𝑖 ∈ { , } | for 𝑟 ∼ 𝑈𝑛𝑖𝑓(− , ) | **
𝑞 | for 𝑟 ∼ 𝑈𝑛𝑖𝑓(− , ) | assumed
𝑑$ for 𝑖 ∈ { , } | for 𝑟 ∼ 𝑈𝑛𝑖𝑓(− ,− ) | ***
β. for 𝑗 ∈ { , } | for 𝑟 ∼ 𝑈𝑛𝑖𝑓(− ,− ) | †
[𝑅#( )] | 𝑁( . , . ) | ‡
[𝑅"( )] | 𝑁( . , . ) | ‡
[𝑆#( )] | 𝑁( , ) | ( )
[𝑆(( )] | 𝑁( , ) | ( )

table : prior distributions assigned to each parameter and initial concentration in the model.
* these distributions are centred around measurements obtained from cell surface receptor quantification experiments.
** these distributions were derived based on 𝐾/ values obtained from the literature ( ).
*** these distributions are based on values derived from experimental data in which the cells were treated with tofacitinib.
† these distributions were based on values derived from experimental data in which the cells were treated with cycloheximide.
‡ these distributions were based on computations involving approximate cell sizes and average numbers of molecules per cell.

we then used the prior distributions from table to carry out a bayesian model selection to determine which hypothesis is most likely given the rpe data for both hypil- and il- signaling. we ran ) simulations for each mathematical model (hypil- and il- ) and for each hypothesis, sampling model parameters from their prior distributions. we then computed a summary statistic for varying values of 𝛿 :#,∗, the distance threshold between the mathematical model and data at which parameters are accepted (or rejected) in the abc. finally, we computed 𝑓(𝐻_𝑘), the number of accepted parameter sets for hypothesis 𝑘, where a parameter set is accepted if it results in a distance value less than or equal to the distance threshold 𝛿 :#,∗. this allowed us to compute the relative probability, 𝑝(𝐻_𝑘), for each hypothesis, as defined by the following equation:

𝑝(𝐻_𝑘|δ :#,∗) = 𝑓(𝐻_𝑘|δ :#,∗) / (𝑓(𝐻#|δ :#,∗) + 𝑓(𝐻"|δ :#,∗)), for 𝑘 ∈ { , }.

the results of the model selection analysis for rpe are shown in figure d, where the relative probability of hypothesis increases as 𝛿 :#,∗ tends to , whilst the relative probability of hypothesis decreases as a function of 𝛿 :#,∗. we hence concluded that the experimental data together with the mathematical models for hypil- and il- signaling provide greater support for hypothesis (around %) than for hypothesis (around %). we note that as the distance threshold, 𝛿 :#,∗, is increased, both hypotheses become equally likely, as is to be expected.
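the model-selection step above can be sketched as a simple counting exercise: compute the distance of every prior simulation to the mean data, accept those at or below the threshold, and compare the accepted counts between hypotheses. the sketch below is an illustration only; the function names, the dictionary layout keyed by (𝑖, 𝑡𝑝, 𝑗) tuples and the example numbers are assumptions, and it presumes the same number of simulations was run for each hypothesis, as the text describes.

```python
import math

# Sketch of ABC rejection-based model selection. Names and data are hypothetical.

def mean_over_repeats(repeats):
    """Pointwise mean of several repeats, each a dict keyed by (i, tp, j)."""
    n = len(repeats)
    return {key: sum(rep[key] for rep in repeats) / n for key in repeats[0]}


def distance(sim, mean_data):
    """Euclidean-type distance between model output and mean data for one cell type."""
    return math.sqrt(sum((sim[key] - mean_data[key]) ** 2 for key in mean_data))


def relative_probabilities(distances_by_hypothesis, threshold):
    """Share of accepted parameter sets (distance <= threshold) per hypothesis."""
    accepted = {
        name: sum(1 for d in dists if d <= threshold)
        for name, dists in distances_by_hypothesis.items()
    }
    total = sum(accepted.values())
    if total == 0:
        # Nothing accepted at this threshold: the ratio is undefined.
        return {name: None for name in accepted}
    return {name: count / total for name, count in accepted.items()}


# Example: three prior simulations per hypothesis (made-up distances).
example_distances = {"H_a": [0.1, 0.5, 2.0], "H_b": [0.4, 3.0, 4.0]}
tight = relative_probabilities(example_distances, threshold=1.0)
loose = relative_probabilities(example_distances, threshold=10.0)
```

with the tight threshold two of the three accepted sets come from the first hypothesis; with the loose threshold every simulation is accepted and both hypotheses receive probability 0.5, which mirrors the remark that the hypotheses become equally likely as the threshold grows.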
given the results of the model selection, the bayesian parameter inference for the mathematical models of hypil- and il- signaling was only carried out for hypothesis . we used the abc sequential monte carlo (abc-smc) approach ( ) to obtain posterior distributions for the parameters in table , making use of the prior distributions in table . all model parameters in table were estimated for the rpe data set. a subset of the parameters, which we would expect to vary with cell type, was then estimated for the th- data set. in particular, the parameters not estimated for th- were sampled from the posterior distributions obtained via the abc-smc for rpe , and the parameters estimated separately for th- were: 𝑞, 𝑑#, 𝑑(, 𝛽), 𝛽"*, [𝑅#( )], [𝑅"( )], [𝑆#( )] and [𝑆(( )]. to further validate the two mathematical models of cytokine signaling, we aimed to reproduce additional experimental results making use of the posterior parameter predictions from the rpe data abc-smc. firstly, in order to replicate the experimental dose response curve seen in supp. fig. a, we ran both models using the accepted parameter sets from the abc-smc for different values of cytokine concentration, within the range [ , – "] log nm. the results of this analysis are shown in supp. fig. b. we also modified the mathematical models to allow them to describe the il- rα-gp chimera experiments (fig. c).
in particular, a new mathematical model for the chimera experiments was developed as follows: it consisted of the odes from the il- model which are involved in the formation of the dimer (equations ( ) – ( )) and the odes from the hypil- model post-dimer formation (equations ( ) – ( )), in which 𝐷) was replaced by 𝐷"*. the ode for the il- induced dimer in the chimera model was as follows:

𝑑[𝐷"*]/𝑑𝑡 = 𝑟","*^+ [𝐶"][𝑅#] − 𝑟","*^− [𝐷"*] − 𝑘#%^+ [𝐷"*][𝑆#] + 𝑘#%^− ([𝑆# ⋅ 𝐷"*] + [𝑝𝑆# ⋅ 𝐷"*]) − 𝑘(%^+ [𝐷"*][𝑆(] + 𝑘(%^− ([𝑆( ⋅ 𝐷"*] + [𝑝𝑆( ⋅ 𝐷"*]) − β"*[𝐷"*].

we simulated both the original mathematical model of il- and the chimera model using the accepted parameter sets from the abc-smc. the results can be seen in supp. fig. a. finally, we focussed on one of the mutant varieties of il- rα, y f, and sought to reproduce the results of fig. b making use of the mathematical model of il- signaling. since the mutation decreases the affinity of stat to il- rα, we fixed the association and dissociation rates of stat to the il- rα chain, 𝑘#'^+ and 𝑘#'^−, at values which resulted in a high µm affinity. the specific values chosen were 𝑘#'^+ = ,> nm- s- and 𝑘#'^− = # s- , which yields an affinity of " µm. the rate 𝑘#'^− was chosen as approximately the median of the posterior distribution for this parameter from the abc-smc, and the rate 𝑘#'^+ was then decreased substantially in order to raise the dissociation constant (the ratio of the dissociation rate to the association rate), i.e. to weaken the binding. we simulated the mathematical model of il- signaling using the accepted parameter sets from the abc-smc, but with the rates 𝑘#'^+ and 𝑘#'^− fixed as described above. the pointwise medians and % credible intervals of these simulations are plotted in supp. fig. c, as well as the simulations for the wt, without altering any of the parameter values from the posterior distributions.
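the pointwise medians and credible intervals mentioned above can be computed directly from the ensemble of simulated trajectories, one trajectory per accepted parameter set. the sketch below is an illustration under assumptions: the function names are hypothetical, percentiles use simple linear interpolation, and the interval level is a required argument because the percentage used in the paper is not legible in the text.

```python
# Sketch: pointwise median and equal-tailed interval across simulated trajectories.
# Names are hypothetical; the interval level must be supplied by the caller.

def percentile(sorted_values, q):
    """Linear-interpolation percentile of pre-sorted values, with q in [0, 1]."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def pointwise_summary(trajectories, level):
    """Return (lower, median, upper) per time point for an equal-tailed interval."""
    tail = (1 - level) / 2
    summary = []
    for values_at_time in zip(*trajectories):
        ordered = sorted(values_at_time)
        summary.append(
            (
                percentile(ordered, tail),
                percentile(ordered, 0.5),
                percentile(ordered, 1 - tail),
            )
        )
    return summary


# Example: five made-up trajectories sampled at two time points.
ensemble = [[0, 1], [1, 3], [2, 5], [3, 7], [4, 9]]
bands = pointwise_summary(ensemble, level=0.5)
```

each trajectory would come from running the model once with one accepted parameter set, so the band reflects posterior uncertainty in the parameters rather than measurement noise.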
altering the binding affinity of stat to il- rα in this way in the mathematical model allows us to generate results which replicate reasonably well the experimental observations for the y f mutant in figure b.

live-cell dual-color single-molecule imaging studies: single molecule imaging experiments were carried out by total internal reflection fluorescence (tirf) microscopy with an inverted microscope (olympus ix ) equipped with a triple-line total internal reflection (tir) illumination condenser (olympus) and a back-illuminated electron-multiplying (em) ccd camera (ixon du d, x pixel, andor technology) as recently described ( - ). a x magnification objective with a numerical aperture of . (uapo / . tirfm, olympus) was used for tir illumination. all experiments were carried out at room temperature in medium without phenol red supplemented with an oxygen scavenger and a redox-active photoprotectant to minimize photobleaching ( ). for heterodimerization experiments of il- ra and gp , cell surface labeling of rpe gp ko, co-transfected with mxfpe-il- ra and mxfpm-gp , was achieved by adding agfp-ennbrho and agfp-minbdy to the medium at equal concentrations ( nm) and incubating for at least min prior to stimulation with il- ( nm) or hypil- ( nm). for homodimerization experiments with mxfpm-gp , agfp-minbdy and agfp-minbrho ( ) were used for cell surface receptor labelling as described above. the nanobodies were kept in the bulk solution during the whole experiment in order to ensure high equilibrium binding to mxfp-gp .
for simultaneous dual color acquisition, agfp-nbrho was excited by a nm diode-pumped solid-state laser at . mw (~ w/cm ) and agfp-nbdy by a nm laser diode at . mw (~ w/cm ). fluorescence was detected using a spectral image splitter (dualview, optical insight) with a dcxr dichroic beam splitter (chroma) in combination with the bandpass filter / (semrock) for detection of rho and / (chroma) for detection of dy , dividing each emission channel into x pixel. image stacks of frames were recorded at ms/frame. single molecule localization and single molecule tracking were carried out using the multiple-target tracing (mtt) algorithm ( ) as described previously ( ). step-length histograms were obtained from single molecule trajectories and fitted by a two-fraction mixture model of brownian diffusion. average diffusion constants were determined from the slope ( - steps) of the mean square displacement versus time lapse diagrams. immobile molecules were identified by the density-based spatial clustering of applications with noise (dbscan) algorithm as described recently ( ). for comparing diffusion properties and for co-tracking analysis, immobile particles were excluded from the data set. prior to co-localization analysis, imaging channels were aligned with sub-pixel precision by using a spatial transformation. to this end, a transformation matrix was calculated based on a calibration measurement with multicolour fluorescent beads (tetraspeck microspheres . mm, invitrogen) visible in both spectral channels (cp tform of type ‘affine’, the mathworks matlab a). individual molecules detected in both spectral channels were regarded as co-localized if a particle was detected in both channels of a single frame within a distance threshold of nm radius. for single molecule co-tracking analysis, the mtt algorithm was applied to this dataset of co-localized molecules to reconstruct co-locomotion trajectories (co-trajectories) from the identified population of co-localizations.
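the frame-wise co-localization criterion described above (a localization in each channel, in the same frame, within a distance threshold) can be sketched as below. this is a simplified illustration, not the mtt-based pipeline used in the paper: the function name, the per-frame dictionary layout and the greedy nearest-neighbour pairing are assumptions, channel alignment is presumed to have been applied already, and the threshold is given in arbitrary units because its value is not legible in the text.

```python
import math

# Sketch: frame-wise co-localization of aligned localizations from two channels.
# Data layout and greedy pairing rule are hypothetical simplifications.

def colocalised_pairs(channel_a, channel_b, max_distance):
    """Pair localizations from two channels that lie within max_distance in a frame.

    Each channel maps a frame index to a list of (x, y) positions.
    Returns a list of (frame, index_in_a, index_in_b) tuples; every
    localization is used in at most one pair.
    """
    pairs = []
    for frame, points_a in sorted(channel_a.items()):
        points_b = channel_b.get(frame, [])
        used_b = set()
        for index_a, (xa, ya) in enumerate(points_a):
            best_index, best_distance = None, max_distance
            for index_b, (xb, yb) in enumerate(points_b):
                if index_b in used_b:
                    continue
                separation = math.hypot(xa - xb, ya - yb)
                if separation <= best_distance:
                    best_index, best_distance = index_b, separation
            if best_index is not None:
                used_b.add(best_index)
                pairs.append((frame, index_a, best_index))
    return pairs


# Example with made-up coordinates in two frames.
channel_a = {0: [(0.0, 0.0), (10.0, 10.0)], 1: [(5.0, 5.0)]}
channel_b = {0: [(0.05, 0.0), (20.0, 20.0)], 1: [(5.5, 5.0)]}
matches = colocalised_pairs(channel_a, channel_b, max_distance=0.1)
```

the resulting pairs would then be linked across frames by a tracking algorithm to obtain co-trajectories; that linking step is not shown here.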
for the co-tracking analysis, only trajectories with a minimum of steps (~ ms) were considered in order to robustly remove random receptor co-localizations ( ). for heterodimerization experiments of mxfpe-il- ra and mxfpm-gp , the relative fraction of dimerized receptors was calculated from the number of co-trajectories relative to the number of il- ra trajectories. gp was expressed in moderate excess (~ . - fold), so that maximal receptor assembly was not limited by abundance of the low-affinity subunit gp . for homodimerization experiments with gp , the relative fraction of co-tracked molecules was determined with respect to the absolute number of trajectories and corrected for gp stochastically double-labelled with the same fluorophore species as follows:

$$AB^{*} = \frac{AB}{2 \times \left[\left(\frac{A}{A+B}\right) \times \left(\frac{B}{A+B}\right)\right]}, \qquad \text{rel. co-locomotion} = \frac{2 \times AB^{*}}{A+B},$$

where a, b, ab and ab* are the numbers of trajectories observed for rho , dy , co-trajectories and corrected co-trajectories, respectively.

the two-dimensional equilibrium dissociation constants ($K_D^{2D}$) were calculated according to the law of mass action for a monomer-dimer equilibrium:

heterodimerization (il- ra+gp ):

𝐾_D^2D = ([𝐺𝑃 ] − (𝛼 × [𝐼𝐿 𝑅𝑎])) × ([𝐼𝐿 𝑅𝑎] − (𝛼 × [𝐼𝐿 𝑅𝑎])) / (𝛼 × [𝐼𝐿 𝑅𝑎])

or

𝐾_D^2D = [𝐺𝑃 ] × (1/𝛼 − 1) + [𝐼𝐿 𝑅𝑎] × (𝛼 − 1)

with: 𝛼 = fraction of il bound il ra in complex with gp

homodimerization (gp +gp ):

$$K_D^{2D} = \frac{[M]^2}{[D]} = \frac{([M]_0 - 2[D])^2}{[D]}$$

𝐾_D^2D = ([𝐺𝑃 ] − 2 × (𝛼 × [𝐺𝑃 ]))² / (2 × (𝛼 × [𝐺𝑃 ]))

with: 𝛼 = fraction of gp homodimers relative to [gp ]/

where [M] and [D] are the concentrations of the monomer and the dimer, respectively, and [M]_0 is the total receptor concentration.

references:
. j. j. o'shea, r. plenge, jak and stat signaling molecules in immunoregulation and immune-mediated disease. immunity , - ( ).
. s. pflanz et al., il- , a heterodimeric cytokine composed of ebi and p protein, induces proliferation of naive cd + t cells. immunity , - ( ).
. h. yoshida, c. a. hunter, the immunobiology of interleukin- . annu rev immunol , - ( ).
. j. s. stumhofer et al., interleukin negatively regulates the development of interleukin -producing t helper cells during chronic inflammation of the central nervous system. nat immunol , - ( ).
. c. diveu et al., il- blocks rorc expression to inhibit lineage commitment of th cells. j immunol , - ( ).
. d. c. fitzgerald et al., suppression of autoimmune inflammation of the central nervous system by interleukin secreted by interleukin -stimulated t cells. nat immunol , - ( ).
. j. s. stumhofer et al., interleukins and induce stat -mediated t cell production of interleukin . nat immunol , - ( ).
. c. pot, l. apetoh, a. awasthi, v. k. kuchroo, induction of regulatory tr cells and inhibition of t(h) cells by il- . semin immunol , - ( ).
. m. j. boulanger, d. c. chow, e. e. brevnova, k. c. garcia, hexameric structure and assembly of the interleukin- /il- alpha-receptor/gp complex. science , - ( ).
. s. rose-john, interleukin- family cytokines. cold spring harb perspect biol , ( ).
. c. a. hunter, s.
a. jones, il- as a keystone cytokine in health and disease. nature immunology , - ( ).
. t. korn et al., il- controls th immunity in vivo by inhibiting the conversion of conventional t cells into foxp + regulatory t cells. proc natl acad sci u s a , - ( ).
. a. kimura, t. kishimoto, il- : regulator of treg/th balance. eur j immunol , - ( ).
. g. w. jones et al., loss of cd + t cell il- r expression during inflammation underlines a role for il- trans signaling in the local maintenance of th cells. j immunol , - ( ).
. c. rolvering et al., crosstalk between different family members: il recapitulates ifn gamma responses in hcc cells, but is inhibited by il -type cytokines. bba-mol cell res , - ( ).
. a. p. costa-pereira et al., mutational switch of an il- response to an interferon-gamma-like response. p natl acad sci usa , - ( ).
. j. schmitz, m. weissenbach, s. haan, p. c. heinrich, f. schaper, socs exerts its inhibitory function on interleukin- signal transduction through the shp recruitment site of gp . journal of biological chemistry , - ( ).
. h. yasukawa et al., il- induces an anti-inflammatory response in the absence of socs in macrophages. nat immunol , - ( ).
. b. a. croker et al., socs negatively regulates il- signaling in vivo. nat immunol , - ( ).
. c. brender et al., suppressor of cytokine signaling regulates cd t-cell proliferation by inhibition of interleukins and . blood , - ( ).
. a. camporeale, v. poli, il- , il- and stat : a holy trinity in auto-immunity? front biosci (landmark ed) , - ( ).
. g. regis, s. pensa, d. boselli, f. novelli, v.
poli, ups and downs: the stat :stat seesaw of interferon and gp receptor signalling. semin cell dev biol , - ( ). . s. lucas, n. ghilardi, j. li, f. j. de sauvage, il- regulates il- responsiveness of naive cd (+) t cells through stat -dependent and -independent mechanisms. p natl acad sci usa , - ( ). . s. kamiya et al., an indispensable role for stat in il- -induced t-bet expression but not proliferation of naive cd (+) t cells. journal of immunology , - ( ). . a. takeda et al., cutting edge: role of il- /wsx- signaling for induction of t-bet through activation of stat during initial th commitment. journal of immunology , - ( ). . c. neufert et al., il- controls the development of inducible regulatory t cells and th cells via differential effects on stat . eur j immunol , - ( ). . t. owaki et al., stat is indispensable to il- -mediated cell proliferation but not to il- -induced th differentiation and suppression of proinflammatory cytokine production. journal of immunology , - ( ). . k. hirahara et al., asymmetric action of stat transcription factors drives transcriptional outputs and cytokine specificity. immunity , - ( ). . s. oniki et al., interleukin- and interleukin- exert quite different antitumor and vaccine effects on poorly immunogenic melanoma. cancer res , - ( ). . m. fischer et al., i. a bioactive designer cytokine for human hematopoietic progenitor cell expansion. nat biotechnol , - ( ). . h. h. oberg, d. wesch, s. grussel, s. rose-john, d. kabelitz, differential expression of cd and cd mediates different stat- phosphorylation in cd +cd - and cd high regulatory t cells. int immunol , - ( ). . p. o. krutzik, m. r. clutter, a. trejo, g. p. nolan, fluorescent cell barcoding for multiplex flow cytometry. curr protoc cytom chapter , unit ( ). . u. a. betz, w. muller, regulated expression of gp and il- receptor alpha chain in t cell maturation and activation. int immunol , - ( ). . j. 
martinez-fabregas et al., kinetics of cytokine receptor trafficking determine signaling and functional selectivity. elife , ( ). . c. gorby et al., engineered il- variants elicit potent immunomodulatory effects at low ligand doses. sci signal , ( ). . v. ruprecht, weghuber, j., wieser, s., schütz, g. j, in advances in planar lipid bilayers and liposomes. ( ), vol. ,, pp. - . . i. moraga et al., instructive roles for cytokine-receptor binding parameters in determining signaling and functional potency. science signaling , ( ). . s. wilmes et al., receptor dimerization dynamics as a regulatory valve for plasticity of type i interferon signaling. j cell biol , - ( ). . s. wilmes et al., mechanism of homodimeric cytokine receptor activation and dysregulation by oncogenic mutations. science , - ( ). . i. moraga et al., tuning cytokine receptor signaling by re-orienting dimer geometry with surrogate ligands. cell , - ( ). . s. pflanz et al., wsx- and glycoprotein constitute a signal-transducing receptor for il- . j immunol , - ( ). .cc-by . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . / . m. wiederkehr-adam et al., characterization of phosphopeptide motifs specific for the src homology domains of signal transducer and activator of transcription (stat ) and stat . j biol chem , - ( ). . a. pradhan, q. t. lambert, l. n. griner, g. w. reuther, activation of jak -v f by components of heterodimeric cytokine receptors. j biol chem , - ( ). . h. kim, t. s. hawley, r. g. hawley, h. baumann, protein tyrosine phosphatase (shp- ) moderates signaling by gp but is not required for the induction of acute- phase plasma protein genes in hepatic cells. 
mol cell biol , - ( ). . d. w. huang, b. t. sherman, r. a. lempicki, systematic and integrative analysis of large gene lists using david bioinformatics resources. nat protoc , - ( ). . j. bancerek et al., cdk kinase phosphorylates transcription factor stat to selectively regulate the interferon response. immunity , - ( ). . s. rutz et al., deubiquitinase duba is a post-translational brake on interleukin- production in t cells. nature , - ( ). . k. l. o'hagan, s. d. miller, h. phee, pak is essential for the function of foxp +regulatory t cells through maintaining a suppressive treg phenotype. sci rep- uk , ( ). . d. z. ye, j. field, pak signaling in cancer. cell logist , - ( ). . y. liao, j. wang, e. j. jaehnig, z. shi, b. zhang, webgestalt : gene set analysis toolkit with revamped uis and apis. nucleic acids res , w -w ( ). . j. satoh, h. tabunoki, a comprehensive profile of chip-seq-based stat target genes suggests the complexity of stat -mediated gene regulatory mechanisms. gene regul syst bio , - ( ). . i. rusinova et al., interferome v . : an updated database of annotated interferon- regulated genes. nucleic acids res , d - ( ). . h. n. suh et al., role of interleukin- in the control of dna synthesis of hepatocytes: involvement of pkc, p / mapks, and ppardelta. cell physiol biochem , - ( ). . a. v. villarino et al., il- limits il- production during th differentiation. j immunol , - ( ). . k. hirahara et al., interleukin- priming of t cells controls il- production in trans via induction of the ligand pd-l . immunity , - ( ). . x. hu et al., sensitization of ifn-gamma jak-stat signaling during macrophage activation. nat immunol , - ( ). . v. francois-newton, m. livingstone, b. payelle-brogard, g. uze, s. pellegrini, usp establishes the transcriptional and anti-proliferative interferon alpha/beta differential. biochem j , - ( ). . k. zenke, m. muroi, k. i. tanamoto, irf supports dna binding of stat by promoting its phosphorylation. immunol cell biol , - ( ). . k. 
karwacz et al., critical role of irf and batf in forming chromatin landscape during type regulatory cell differentiation. nat immunol , - ( ). . a. yoshimura, y. wakabayashi, t. mori, cellular and molecular basis for the regulation of inflammation by tgf-beta. j biochem , - ( ). . a. awasthi et al., a dominant function for interleukin in generating interleukin - producing anti-inflammatory t cells. nat immunol , - ( ). . j. b. brown et al., p-selectin glycoprotein ligand- is needed for sequential recruitment of t-helper (th ) and local generation of th t cells in dextran sodium sulfate (dss) colitis. inflamm bowel dis , - ( ). .cc-by . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . / . m. matsumoto et al., cd collaborates with p-selectin glycoprotein ligand- to mediate e-selectin-dependent t cell migration into inflamed skin. j immunol , - ( ). . d. n. slenter et al., wikipathways: a multifaceted pathway database bridging metabolomics to other omics research. nucleic acids res , d -d ( ). . a. petretto et al., proteomic analysis uncovers common effects of ifn-gamma and il- on the hla class i antigen presentation machinery in human cancer cells. oncotarget , - ( ). . l. h. wong, i. hatzinisiriou, r. j. devenish, s. j. ralph, ifn-gamma priming up- regulates ifn-stimulated gene factor (isgf ) components, augmenting responsiveness of ifn-resistant melanoma cells to type i ifns. j immunol , - ( ). . m. tokuyama et al., ervmap analysis reveals genome-wide transcription of human endogenous retroviruses. proc natl acad sci u s a , - ( ). . c. garbers et al., plasticity and cross-talk of interleukin -type cytokines. cytokine growth factor rev , - ( ). 
. s. kang, m. narazaki, h. metwally, t. kishimoto, historical overview of the interleukin- family cytokine. j exp med , ( ). . r. umeshita-suyama et al., characterization of il- and il- signals dependent on the human il- receptor alpha chain : redundancy of requirement of tyrosine residue for stat activation. int immunol , - ( ). . o. w. nadeau et al., the proximal tyrosines of the cytoplasmic domain of the beta chain of the type i interferon receptor are essential for signal transducer and activator of transcription (stat) activation. evidence that two stat sites are required to reach a threshold of interferon alpha-induced stat tyrosine phosphorylation that allows normal formation of interferon-stimulated gene factor . j biol chem , - ( ). . m. n. sharif et al., ifn-alpha priming results in a gain of proinflammatory function by il- : implications for systemic lupus erythematosus pathogenesis. j immunol , - ( ). . d. richter et al., ligand-induced type ii interleukin- receptor dimers are sustained by rapid re-association within plasma membrane microcompartments. nat commun , ( ). . j. p. twohig et al., activation of naive cd (+) t cells re-tunes stat signaling to deliver unique cytokine responses in memory cd (+) t cells. nat immunol , - ( ). . p. c. heinrich et al., principles of interleukin (il)- -type cytokine signalling and its regulation. biochem j , - ( ). . d. levin, d. harari, g. schreiber, stochastic receptor expression determines cell fate upon interferon treatment. mol cell biol , - ( ). . i. moraga, d. harari, g. schreiber, g. uze, s. pellegrini, receptor density is key to the alpha /beta interferon differential activities. mol cell biol , - ( ). . c. c. m. ho et al., decoupling the functional pleiotropy of stem cell factor by tuning c-kit signaling. cell , - e ( ). . p. charlot-rabiega, e. bardel, c. dietrich, r. kastelein, o. devergne, signaling events involved in interleukin (il- )-induced proliferation of human naive cd + t cells and b cells. 
j biol chem , - ( ). . j. diegelmann, t. olszak, b. goke, r. s. blumberg, s. brand, a novel role for interleukin- (il- ) as mediator of intestinal epithelial barrier protection mediated via differential signal transducer and activator of transcription (stat) protein .cc-by . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . / signaling and induction of antibacterial and anti-inflammatory proteins. journal of biological chemistry , - ( ). . h. bender et al., interleukin- displays interferon-gamma-like functions in human hepatoma cells and hepatocytes. hepatology , - ( ). . t. imamichi, j. yang, w. huang da, b. sherman, r. a. lempicki, interleukin- induces interferon-inducible genes: analysis of gene expression profiles using affymetrix microarray and david. methods mol biol , - ( ). . j. m. fakruddin et al., noninfectious papilloma virus-like particles inhibit hiv- replication: implications for immune control of hiv- infection by il- . blood , - ( ). . a. c. frank et al., interleukin- , an anti-hiv- cytokine, inhibits replication of hepatitis c virus. j interferon cytokine res , - ( ). . s. l. laporte et al., molecular and structural basis of cytokine receptor pleiotropy in the interleukin- / system. cell , - ( ). . j. b. spangler, i. moraga, k. m. jude, c. s. savvides, k. c. garcia, a strategy for the selection of monovalent antibodies that span protein dimer interfaces. j biol chem , - ( ). . a. kirchhofer et al., modulation of protein properties in living cells using nanobodies. nat struct mol biol , - ( ). . m. c. hochberg, updating the american college of rheumatology revised criteria for the classification of systemic lupus erythematosus. 
arthritis rheum , ( ). . j. cox, m. mann, maxquant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. nat biotechnol , - ( ). . j. cox et al., andromeda: a peptide search engine integrated into the maxquant environment. j proteome res , - ( ). . p. o. krutzik, g. p. nolan, fluorescent cell barcoding in flow cytometry allows high- throughput drug screening and signaling profiling. nat methods , - ( ). . w. huang da, b. t. sherman, r. a. lempicki, bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. nucleic acids res , - ( ). . w. huang da, b. t. sherman, r. a. lempicki, systematic and integrative analysis of large gene lists using david bioinformatics resources. nat protoc , - ( ). . n. kozer et al., exploring higher-order egfr oligomerisation and phosphorylation--a combined experimental and theoretical approach. mol biosyst , - ( ). . d. n. itzhak, s. tyanova, j. cox, g. h. borner, global, quantitative and dynamic mapping of protein subcellular localization. elife , ( ). . t. toni, d. welch, n. strelkowa, a. ipsen, m. p. stumpf, approximate bayesian computation scheme for parameter inference and model selection in dynamical systems. j r soc interface , - ( ). . j. vogelsang et al., a reducing and oxidizing system minimizes photobleaching and blinking of fluorescent dyes. angew chem int ed engl , - ( ). . a. kirchhofer et al., modulation of protein properties in living cells using nanobodies. nat struct mol biol , -u ( ). . a. serge, n. bertaux, h. rigneault, d. marguet, dynamic multiple-target tracing to probe spatiotemporal cartography of cell membranes. nat methods , - ( ). . c. you et al., receptor dimer stabilization by hierarchical plasma membrane microcompartments regulates cytokine signaling. sci adv , e ( ). . f. roder, a. lubk, d. wolf, t. niermann, noise estimation for off-axis electron holography. ultramicroscopy , - ( ). 
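The monomer-dimer quantities defined above (α, the monomer and dimer concentrations, and the total receptor concentration) can be related numerically. The following is a minimal sketch, assuming the textbook law of mass action for M + M ⇌ D, K_D = [M]²/[D], and the mass balance [M]₀ = [M] + 2[D]. These forms, the function names and the example values are illustrative assumptions and not necessarily the exact expression used in this work.

```python
import math


def dimer_concentration(total: float, kd: float) -> float:
    """Equilibrium dimer concentration [D] for M + M <-> D.

    Assumes Kd = [M]^2 / [D] and total = [M] + 2[D] (assumed textbook forms).
    Substituting [M] = total - 2[D] gives the quadratic
    4[D]^2 - (4*total + Kd)[D] + total^2 = 0; the smaller root is physical.
    """
    if total < 0 or kd <= 0:
        raise ValueError("total must be >= 0 and kd must be > 0")
    b = 4.0 * total + kd
    discriminant = b * b - 16.0 * total * total
    return (b - math.sqrt(discriminant)) / 8.0


def dimer_fraction(total: float, kd: float) -> float:
    """Fraction of receptors found in dimers: alpha = 2[D] / total."""
    if total == 0:
        return 0.0
    return 2.0 * dimer_concentration(total, kd) / total


def kd_from_fraction(total: float, alpha: float) -> float:
    """Invert the relation: Kd from a measured dimer fraction alpha (0 < alpha < 1)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    dimer = alpha * total / 2.0
    monomer = total - 2.0 * dimer
    return monomer * monomer / dimer


if __name__ == "__main__":
    # Hypothetical values in any consistent concentration (or surface density) unit.
    print(dimer_fraction(total=1.0, kd=1.0))
    print(kd_from_fraction(total=1.0, alpha=0.5))
```

Under these assumptions, a measured dimer fraction together with the total receptor concentration is enough to estimate a dissociation constant, and the two functions are inverses of each other.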
figure legends:

figure : cytokine receptor activation by il- and (hyp)il- :
a) cartoon model of stepwise assembly of the il- and hypil- -induced receptor complex and subsequent activation of stat and stat .
b) dose-dependent phosphorylation of stat and stat as a response to il- and hypil- stimulation in th- cells, normalized to maximal il- stimulation. data was obtained from three biological replicates with two technical replicates each, showing mean ± std dev.
c) phosphorylation kinetics of stat and stat followed after stimulation with saturating concentrations of il- ( nm) and hypil- ( nm) or unstimulated th- cells, normalized to maximal il- stimulation. data was obtained from five biological replicates with two technical replicates each, showing mean ± std dev.
d) top: phosphorylation kinetics of stat and stat followed after stimulation with hypil- ( nm) or left unstimulated, comparing wt rpe and rpe gp ko reconstituted with high levels of mxfpm-gp (= x [gp ]). data was normalized to maximal stimulation levels of wt rpe . left: cell surface gp levels comparing rpe gp ko, wt rpe and rpe gp ko stably expressing mxfpm-gp measured by flow cytometry.
data was obtained from one biological replicate with two technical replicates each, showing mean ± std dev. bottom right: cell surface levels of gp measured by flow cytometry for indicated cell lines.
e) cartoon model of cell surface labeling of mxfp-tagged receptors by dye-conjugated anti-gfp nanobodies (nb) and identification of receptor dimers by single molecule dual-colour co-localization.
f) raw data of dual-colour single-molecule tirf imaging of mxfpe-il- rαnb-rho and gp nb-dy after stimulation with il- . particles from the insets (il- ra: red & gp : blue) were followed by single molecule tracking ( frames ~ . s) and trajectories > steps ( ms) are displayed. receptor heterodimerization was detected by co-localization/co-tracking analysis.
g) relative number of co-trajectories observed for heterodimerization of il- rα and gp as well as homodimerization of gp for unstimulated cells or after indicated cytokine stimulation. each data point represents the analysis from one cell with a minimum of cells measured for each condition. *p < . , **p ≤ . , ***p ≤ . ; n.s., not significant.
h) stoichiometry of the il- -induced receptor complex revealed by bleaching analysis. left: intensity traces of mxfpe-il rαnb-rho and gp nb-dy were followed until fluorophore bleaching. middle: merged imaging raw data for selected timepoints. right: overlay of the trajectories for il- rα (red) and gp (blue).

figure : mathematical modelling results in rpe and th- cells.
a) simplified cartoon model of il- /hypil- signal propagation layers and coverage of the mathematical modelling approach.
b) model selection results showing the relative probabilities of each hypothesis, for different values of the distance threshold, 𝛿∗, in rpe cells.
c) pointwise median and % credible intervals of the predictions from the mathematical model, calibrated with the experimental data, using the posterior distributions for the parameters from the abc-smc.
d) kernel density estimates of the posterior distributions for the parameters 𝑝 ∈ {𝑟#,. & ,𝑟#,. , ,𝑟",. & ,𝑟",. , ,𝑘$% & ,𝑘$% , ,𝑘$' & ,𝑘$' , ,𝑞,𝑑$,𝛽., [𝑅#( )],[𝑅"( )],[𝑆#( )],[𝑆(( )]} in the mathematical models where 𝑗 ∈ { , } and 𝑖 ∈ { , }.

figure : il- rα cytoplasmic domain is required for sustained pstat kinetics.
a) representation of the cytoplasmic domain of il- rα with its highlighted tyrosine residues y and y .
b) stat and stat phosphorylation kinetics of rpe clones stably expressing wt and mutant il- rα after stimulation with il- ( nm, top panels) or after stimulation with hypil- ( nm, bottom panels), normalized to maximal levels of wt il- rα stimulated with il- (top) or hypil- (bottom). data was obtained from three experiments with two technical replicates each, showing mean ± std dev. bottom right: cell surface levels of the il- rα variants measured by flow cytometry for the indicated cell lines.
c) cytoplasmic domain of il- rα is required for sustained pstat activation. left: cartoon representation of receptor complexes. right: stat and stat phosphorylation kinetics of rpe clones stably expressing wt il- rα and il- rα-gp chimera after stimulation with il- ( nm, top panels) or after stimulation with hypil- ( nm, bottom panels). data was normalized to maximal levels for each cytokine and cell line. data was obtained from two experiments, each with technical replicates, showing mean ± std dev.
d) phosphatases do not account for differential pstat / activity induced by il- and hypil- . left: schematic representation of workflow using jak inhibitor tofacitinib.
right: mfi ratio of tofacitinib-treated and non-treated rpe mxfpe-il- rα cells for pstat and pstat after stimulation with il- ( nm) and hypil- ( nm). data was obtained from two experiments with two technical replicates each, showing mean ± std dev.

figure : unique and overlapping effects of il- and hypil- on the phosphoproteome of th- cells.
a) volcano plot of the phospho-sites regulated (p value ≤ . , fold change ≥+ . or ≤- . ) by il- (left) and hypil- (right). data was obtained from three biological replicates.
b) heatmap representation (examples) of shared and differentially up- (left) and downregulated (right) phospho-sites after il- and hypil- stimulation. data represents the mean (log ) fold change of three biological replicates.
c) tyrosine and serine phosphorylation of selected stat proteins after stimulation with il- (red) and hypil- (blue). *p < . , **p ≤ . , ***p ≤ . ; n.s., not significant.
d) ps -stat and ps -stat phosphorylation kinetics in th- cells after stimulation with il- or hypil- , normalized to maximal il- stimulation. data was obtained from three biological replicates with two technical replicates each, showing mean ± std dev.
e) go analysis “biological processes” of the phospho-sites regulated by il- (red) and hypil- (blue) represented as bubble-plots.
f) phosphorylation of target proteins associated with stat /cdk transcription initiation complex after stimulation with il- (blue) and hypil- (red) and schematic representation of transcription regulation of rna polymerase ii with identified phospho-sites (red flags).

figure : kinetic decoupling of gene induction programs depends on sustained stat activation by il- .
a) principal component analysis for genes found to be significantly upregulated (left) or downregulated (right) for at least one of the tested conditions (time & cytokine). data was obtained from three biological replicates.
b) kinetics of gene induction shared between il- and hypil- (relative to il- ) for upregulated genes (red) or downregulated genes (green).
c) kinetics of gene numbers induced after il- and hypil- stimulation for upregulated genes (left) and downregulated genes (right).
d) gsea reactome analysis of selected pathways with significantly altered gene induction in response to il- or hypil- stimulation. data represents the mean (log ) fold change of three biological replicates.
e) cluster analysis comparing the gene induction kinetics after il- or hypil- stimulation. gene induction heatmaps for example genes as well as induction kinetics (mean) are shown for highlighted gene clusters. data represents the mean (log ) fold change of three biological replicates.

figure : il- -induced upregulation of irf amplifies induction of stat -dependent genes.
a) kinetics of irf protein expression as a response to continuous il- and hypil- stimulation in th- cells. data was obtained from three biological replicates with two technical replicates each, showing mean ± std dev. dotted line indicates baseline level.
b) kinetics of irf protein expression and sirna-mediated irf knockdown in rpe il- rα cells stimulated with il- ( nm). data was obtained from one representative experiment with two technical replicates each, normalized to maximal irf induction ( h), showing mean ± std dev.
c) kinetics of stat (left) and stat (right) phosphorylation after sirna-mediated irf knockdown in rpe il- rα cells stimulated with il- ( nm). data was obtained from one representative experiment with two technical replicates each, showing mean ± std dev.
d) kinetics of gene induction (stat , gbp , oas , socs ) followed by rt qpcr in rpe il- rα cells stimulated with il- ( nm) with and without sirna-mediated knockdown of irf . data was obtained from three experiments with two technical replicates each, showing mean ± sem.

figure : il- -induced stat response drives global proteomic changes in th- cells.
a) workflow for quantitative silac proteomic analysis of th- cells continuously stimulated ( h) with il- ( nm), hypil- ( nm) or left untreated.
b) global proteomic changes in th- cells induced by il- (left) or hypil- (right) represented as volcano plots. proteins significantly up- or downregulated are highlighted in red (p value ≤ . , fold change ≥+ . or ≤- . ). isg-encoded proteins significantly altered by il- are highlighted in yellow. data was obtained from three biological replicates.
c) venn diagrams comparing unique upregulated (left) and downregulated (right) proteins by il- (blue) and hypil- (red) as well as shared altered proteins. isg-encoded proteins are highlighted in yellow.
d) heatmaps of the top up- and downregulated proteins by il- compared to hypil- . data represents the mean (log ) fold change of three biological replicates.
e) heatmap representation and enrichment plot of proteins identified by gsea reactome pathway enrichment analysis “cytokine signaling and immune system” induced by il- . data represents the mean (log ) fold change of three biological replicates.
f) correlation of il- and hypil- -induced rna-seq transcript levels (≥+ or ≤- fc) with quantitative proteomic data (≥+ . or ≤- . fc).
data represents the mean (log ) fold change of three biological replicates.

figure : receptor and stat concentrations determine the nature of the cytokine response.
a) copy numbers of indicated proteins determined for different t-cell subsets using mass-spectrometry-based proteomics (immpres, http://immpres.co.uk).
b) model predictions for varying levels of stat and stat (left panel) or il- rα and gp (right panel) for phosphorylation kinetics of stat and stat .
c) gene expression profiles determined by rnaseq analysis comparing indicated genes of a cohort of sle risk patients with a cohort of healthy controls. data obtained from: proc natl acad sci u s a , - . *p < . , **p ≤ . , ***p ≤ . ; n.s., not significant.
d) dose-dependent phosphorylation of stat and stat as a response to il- and hypil- stimulation in naive and ifnα -primed ( nm, h) th- cells, normalized to maximal il- stimulation (ctrl). data was obtained from four biological replicates with two technical replicates each, showing mean ± std dev.
e) phosphorylation of stat (left) and stat (right) as a response to il- ( nm, min) and hypil- ( nm, min) stimulation in healthy control (ctrl) and sle patient cd + t-cells. data was obtained from five healthy control donors ( ) and six sle patients. *p < . , **p ≤ . , ***p ≤ . ; n.s., not significant.
f) tofacitinib titration to inhibit stat and stat phosphorylation by hypil- ( nm, min) in th- cells (left) and rpe cells stably expressing wt il- rα (right).

supp. figure :
a) comparison of dose-dependent phosphorylation (stat / ) of purchased il- and mil- sc in activated cd + cells, normalized to maximal mfi levels. data was obtained from one (purchased) or two (mil- sc) biological replicates with two technical replicates each, showing mean ± std dev.
b) schematic workflow of t-cell isolation, th differentiation, fluorescence barcoding and gating strategy for high throughput flow cytometry.
c) phosphorylation kinetics of stat and stat followed after stimulation with il- ( nm) and hypil- ( nm) or unstimulated th cells. data (from fig. c) was normalized to maximal mfi levels for each cytokine. data was obtained from five biological replicates with two technical replicates each, showing mean ± std dev.
d) phosphorylation kinetics of stat and stat in activated pbmcs (cd +, cd +) followed after stimulation with il- ( nm) and hypil- ( nm) or unstimulated cells. data was normalized to maximal il- stimulation. data was obtained from two biological replicates with two technical replicates each, showing mean ± std dev.
e) dose-response experiments in wt rpe cells for pstat (left) and pstat (right), stimulated with il- or hypil- , normalized to maximal hypil- stimulation. data was obtained from one representative experiment with two technical replicates each, showing mean ± std dev.

supp. figure :
a) dose-response experiments for pstat and pstat comparing rpe gp ko cells (left), wt rpe (middle) and rpe mxfpe-il ra (right) after stimulation with il- or hypil- , normalized to maximal hypil- stimulation. data was obtained from one representative experiment with two technical replicates each, showing mean ± std dev.
b) ligand-induced receptor dimerization: top panel: dual-colour co-tracking of il- rα and gp in the absence (top) and presence (bottom) of il- ( nm). trajectories ( frames, ~ . s) of individual mxfpe-il rαnb-rho (red) and gp nb-dy (blue) and co-trajectories (magenta) are shown for a representative cell.
bottom panel: dual-colour co-tracking of gp in the absence (top) and presence (bottom) of hypil- ( nm). trajectories ( frames, ~ . s) of individual mxfpe-il rαnb-rho (red) and gp nb-dy (blue) and co-trajectories (magenta) are shown for a representative cell.
c) top: cartoon model of cell surface labeling of mxfp-tagged gp by dye-conjugated anti-gfp nanobodies (nb) and formation of single-colour homodimers (left) or dual-colour homodimers (right). below: examples for intensity traces of single-colour dual-step bleaching (left) or dual-colour single-step bleaching (right). insets show raw data for selected timepoints and corresponding trajectories.
d) top: comparison of diffusion coefficients (d) for mxfpe-il- rαnb-rho (red) and mxfpm-gp nb-dy (blue) in presence and absence of il- stimulation ( nm), as well as co-trajectories after il- stimulation (magenta). bottom: comparison of diffusion coefficients for mxfpm-gp nb-rho (red) in presence and absence of hypil- stimulation ( nm), as well as co-trajectories after hypil- stimulation (magenta). each data point represents the analysis from one cell with a minimum of cells measured for each condition. *p < . , **p ≤ . , ***p ≤ . ; n.s., not significant.

supp. figure :
a) reactions involving ligand binding and dimerization in the hypil- model.
b) reactions involving ligand binding and dimerization in the il- model.
c) reactions involving the stat molecules (S_j for j ∈ { , }) in the hypil- model.
d) reactions involving the stat molecules (S_j for j ∈ { , }) in the il- model.
e) reactions involving receptor internalisation/degradation in the hypil- model. here 𝐻 = 𝛽) and 𝐻 = 𝛾)([𝑝𝑆 ] + [𝑝𝑆 ]).
f) reactions involving receptor internalisation/degradation in the il- model. here 𝐻 = 𝛽"* and 𝐻 = 𝛾"*([𝑝𝑆 ] + [𝑝𝑆 ]).
g) dephosphorylation of (S_j for j ∈ { , }) in the cytoplasm. this reaction occurs in both models.
h) key for the molecules in the reactions.

supp. figure :
a) stat (left) and stat (right) phosphorylation kinetics of rpe clones stably expressing wt il- rα after stimulation with il- or after stimulation with hypil- , normalized to maximal il- stimulation. data was obtained from three experiments with two technical replicates each, showing mean ± std dev.
b) dose-response experiments for pstat (left) and pstat (right) in rpe cells stably expressing wt il- rα or tyrosine-mutants after stimulation with il- , normalized to maximal stimulation of wt il- rα. data was obtained from one representative experiment with two technical replicates each, showing mean ± std dev.

supp. figure :
a) dose-response experiments for pstat (left) and pstat (right) in rpe cells stably expressing wt il- rα or il- rα-gp chimera after stimulation with il- . data was normalized to maximal stimulation of wt il- rα. data was obtained from one representative experiment with two technical replicates each, showing mean ± std dev.
b) stat (left) and stat (right) phosphorylation kinetics in rpe il- rα cells stimulated with il- or hypil- with and without jak inhibition by tofacitinib. data was normalized to maximal il- stimulation. data was obtained from two experiments with two technical replicates each, showing mean ± std dev.
c) stat (left) and stat (right) phosphorylation kinetics in th- cells stimulated with il- or hypil- with and without jak inhibition by tofacitinib. data was normalized to maximal il- stimulation. data was obtained from two biological replicates with two technical replicates each, showing mean ± std dev.
d) mfi ratio of tofacitinib-treated and non-treated th- cells for pstat (left) and pstat (right) after stimulation with il- ( nm) and hypil- ( nm). data was obtained from two biological replicates with two technical replicates each, showing mean ± std dev.

supp. figure : a) stat (left) and stat (right) phosphorylation kinetics in rpe il- rα cells stimulated with il- or hypil- with and without pretreatment with cycloheximide (chx). data was normalized to maximal il- stimulation. data was obtained from two experiments with two technical replicates each, showing mean ± std dev. b) stat (left) and stat (right) phosphorylation kinetics in th cells stimulated with il- or hypil- with and without pretreatment with cycloheximide (chx). data was normalized to maximal il- stimulation. data was obtained from two biological replicates with two technical replicates each, showing mean ± std dev.

supp. figure : a) workflow for quantitative silac phospho-proteomic analysis of th- cells stimulated ( min) with il- ( nm), hypil- ( nm) or left untreated. b) schematic representation of the main go terms regulated by il as inferred from our p-proteomics studies. red represents downregulated p-sites and blue represents upregulated p-sites upon il stimulation of human primary th- cells. c) schematic representation of the main go terms regulated by hyil as inferred from our p-proteomics studies. red represents downregulated p-sites and blue represents upregulated p-sites upon hyil stimulation of human primary th- cells.

supp. figure : a) venn diagrams comparing the numbers of unique upregulated (left) and downregulated (right) phospho-sites by il- (blue) and hypil- (red) as well as the number of shared phospho-sites. b) list of most strongly altered phospho-sites (downregulated: green; upregulated: red) in response to il- (left) or hypil- (right). c) go analysis “cellular location” and “up keywords” of the phospho-sites regulated by il (red) and hypil- (blue) represented as bubble-plots. d) phosphorylation of target proteins related to treg functions and schematic representation of their activity on t-cells.

supp. figure : a) kinetics of gene induction in th- cells induced by il- represented as volcano plots. genes significantly up- or downregulated are highlighted in red (p value ≤ . , fold change ≥+ or ≤- ). data was obtained from three biological replicates. b) kinetics of gene induction in th- cells induced by hypil- represented as volcano plots. genes significantly up- or downregulated are highlighted in red (p value ≤ . , fold change ≥+ or ≤- ). data was obtained from three biological replicates. c) kinetics of gene induction in th- cells induced by hypil- represented as volcano plots. genes identified to be significantly up- or downregulated by il- are highlighted in red (p value ≤ . , fold change ≥+ or ≤- ). data was obtained from three biological replicates.

supp. figure : a) gene induction kinetics represented as pie-charts, separated for upregulated genes (top panel) and downregulated genes (bottom panel). b) kinetics of isg induction (examples) as heatmap representation comparing il- with hypil- (top) and gsea reactome pathway enrichment “ifn signaling” for genes induced by il- after h (bottom). data represents the mean (log ) fold change of three biological replicates. c) heatmaps of the top up- and downregulated genes by il- compared to hypil- for h, h and h. data represents the mean (log ) fold change of three biological replicates.
d) kinetics of irf protein expression as a response to continuous il- and hypil- stimulation in th- cells. data was obtained from three biological replicates with two technical replicates each, showing mean ± std dev.

supp. figure : a) pie charts of proteomic changes (unique & shared) for upregulated (left) and downregulated (right) proteins in response to il- or hypil- stimulation in th- cells. b) left: gsea reactome pathway enrichment analysis “interferon signaling” for proteins induced by il- . middle: heatmap representation of pathway-associated proteins comparing il- with hypil- stimulation. data represents the mean (log ) fold change of three biological replicates. right: localization of the identified proteins in the context of the data distribution of il- -induced proteomic changes. pathway-associated proteins are highlighted for il- (blue) and hypil- (red) as well as non-significant data distribution (grey). c) left: gsea reactome pathway enrichment analysis “cytokine signaling and immune system” for proteins induced by il- . middle: heatmap representation of pathway-associated proteins comparing il- with hypil- stimulation. data represents the mean (log ) fold change of three biological replicates. right: localization of the identified proteins in the context of the data distribution of il- -induced proteomic changes. pathway-associated proteins are highlighted for il- (blue) and hypil- (red) as well as non-significant data distribution (grey). d) average intensity distribution of untreated proteomic data.
top up- and downregulated proteins (≥ + x or ≤ - x change) altered by il- (left) or hypil- (right) stimulation are indicated.

supp. figure : a) pointwise median and % credible intervals of the wt and chimera mathematical models, using the posterior distributions for the parameters from the abc-smc. b) dose-response curve in rpe using the posterior distributions from the abc-smc and varying the concentrations of hypil- and il- in the model. c) pointwise median and % credible intervals of the wt mathematical model and simulations of a mutant model with 𝑘#' & = ,> nm- s- and 𝑘#' , = m s- , using the posterior distributions for the parameters from the abc-smc for the other parameters.

supp. figure : a) fold induction of total stat and stat levels in th- measured by flow cytometry. data was obtained from two biological replicates. b) total levels of stat and stat measured in cd + by flow cytometry for healthy control (ctrl) and lupus patients (sle). data was obtained from five (ctrl) and six (sle) biological replicates. *p < . , **p ≤ . , ***p ≤ . ; n.s., not significant. c) ratio of pstat and pstat after il- ( min, nm) or hypil- ( min, nm) stimulation measured in cd + by flow cytometry for healthy control (ctrl) and lupus patients (sle). data was obtained from five (ctrl) and six (sle) biological replicates normalized to mean ratio of healthy control samples. d) tofacitinib titration to inhibit stat and stat phosphorylation by il- ( nm) in th- cells (left) and rpe cells stably expressing wt il- rα (right).

supp.
movie : single-molecule co-tracking as a readout for dimerization of cytokine receptors. cell surface labelling of mxfpe-il- rα by enbrho (left, top) and mxfpm-gp by mnbdy (left, bottom) after stimulation with il- ( nm). in the overlay of the zoomed section of both spectral channels (mxfpe-il- rαrho : red, mxfpm-gp dy : blue), yellow lines indicate co-locomotion of il- rα and gp (≥ steps). acquisition frame rate: hz, playback: real time.

supp. movie : dynamics of il- -induced receptor assembly. formation of a single-molecule heterodimer of mxfpe-il- rαrho (red) and mxfpm-gp dy (blue) in presence of il- . yellow lines indicate co-locomotion of il- rα and gp (≥ steps). acquisition frame rate: hz, playback: real time with break at time of receptor dimerization.

supp. movie : ligand-induced heterodimerization of il- rα and gp . overlay of the two spectral channels (mxfpe-il- rαrho : red, mxfpm-gp dy : blue) in absence (left) or presence (right) of il- ( nm). yellow lines indicate co-locomotion of il- rα and gp (≥ steps). acquisition frame rate: hz, playback: real time.

supp. movie : ligand-induced homodimerization of gp . overlay of the two spectral channels (mxfpm-gp rho : red, mxfpm-gp dy : blue) in absence (left) or presence (right) of hypil- ( nm). yellow lines indicate co-locomotion of il- rα and gp (≥ steps). acquisition frame rate: hz, playback: real time.
[figure panels for the main figures (fig.) and supplementary figures (supp. fig.) omitted. recoverable content: pstat rel. mfi plotted against time / min and against c / log nm; receptor expression histograms; co-localization and bleaching traces (fluorescence int. / a.u. against time / s); diffusion coefficients (d / µm s- ); volcano plots (fold change / log against p value / -lg); heatmaps of gene and protein fold change (fc / log ) over time / h; gsea reactome enrichment plots (interferon signalling; cytokine signalling and immune system); silac phospho-proteomics workflow; and phospho-site lists grouped by go term.]
a mammalian methylation array for profiling methylation levels at conserved sequences

adriana arneson, amin haghani, michael j. thompson, matteo pellegrini, soo bin kwon, ha vu, caesar z. li, ake t. lu, bret barnes, kasper d. hansen, wanding zhou, charles e. breeze, jason ernst#, steve horvath#

affiliations: bioinformatics interdepartmental program, university of california, los angeles, ca , usa; department of biological chemistry, university of california, los angeles, los angeles, california, usa; dept. of human genetics, david geffen school of medicine, university of california los angeles, los angeles, ca , usa; molecular, cell and developmental biology, university of california los angeles, los angeles, ca , usa; dept.
of biostatistics, fielding school of public health, university of california los angeles, los angeles, ca , usa; illumina, inc, illumina way, san diego, ca , usa; department of biostatistics, johns hopkins bloomberg school of public health, baltimore, maryland, usa; department of genetic medicine, johns hopkins school of medicine, baltimore, maryland, usa; van andel research institute, grand rapids, michigan, usa; altius institute for biomedical sciences, seattle, wa, usa; eli and edythe broad center of regenerative medicine and stem cell research at university of california, los angeles, los angeles, california, usa; computer science department, university of california, los angeles, los angeles, california, usa; .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / department of computational medicine, university of california, los angeles, los angeles, california, usa. jonsson comprehensive cancer center, university of california, los angeles, los angeles, california, usa; molecular biology institute, university of california, los angeles, los angeles, california, usa. # joint senior authorship correspondence: shorvath@mednet.ucla.edu and jason.ernst@ucla.edu summary infinium methylation arrays are widely used to robustly measure methylation of dna in humans. however, such arrays are not available for the vast majority of non-human mammals. moreover, even if species-specific arrays were available, probe differences between them would confound cross-species comparisons. 
to address these challenges, we developed the mammalian methylation array, a single custom infinium array that measures cytosine methylation levels of over thousand cpg sites that are well conserved across species within the mammalian class. by design, the probes on the array tolerate cross-species mutations. to design the array, we developed the conserved methylation array probe selector (cmaps) algorithm, which takes as input a multi-species sequence alignment and probe design constraints. a greedy search algorithm was used to identify oligonucleotide sequences (probes) with high coverage across different mammalian species. we annotate the probes on the array with respect to genes in different species and provide details on the sequence context including cpg island status and chromatin states. our calibration experiments demonstrate the high fidelity of this array in humans, rats, and mice. the mammalian methylation array has several strengths: it applies to all mammalian species even those that have not yet been sequenced, it provides deep coverage of specific cytosines facilitating the development of highly robust epigenetic biomarkers, and it covers highly conserved cpgs which greatly increases the probability that biological insights gained in one species will readily translate to others. the mammalian methylation array is expected to find many applications in preclinical studies, comparative biology, and epigenetic studies of aging and development.
introduction
methylation of dna by the attachment of a methyl group to cytosines is one of the most widely studied epigenetic modifications in vertebrates, due to its implications in regulating gene expression across many biological processes including disease (ooi et al., ; robertson, ; smith and meissner, ). a variety of different assays have been proposed for measuring dna methylation including microarray based methylation arrays (bibikova et al., , ) and sequencing based assays such as whole genome bisulfite sequencing (wgbs) (cokus et al., ; lister et al., ) and reduced representation bisulfite sequencing (rrbs) (meissner et al., ). despite the availability of sequencing based assays, array based technology remains widely used for measuring dna methylation due to its low cost and high reproducibility and reliability (pidsley et al., ). the first human methylation array (illumina infinium k) was introduced by illumina inc in (bibikova et al., ), which was followed by the k (bibikova et al., ) and epic arrays with larger coverage (pidsley et al., ). more recently, illumina released a mouse methylation array (infinium mouse methylation beadchip) that profiles over k markers across diverse murine strains. it will probably not be economical to develop similar methylation arrays for less frequently studied mammalian species (e.g. elephants or marine mammals) due to insufficient demand. moreover, even if costs were no impediment, species-specific arrays would likely be sub-optimal in comparative studies across different species as the measurement platforms would be different.
to address these challenges, we developed a single mammalian methylation array designed to measure dna methylation across mammals. the array targets cpgs for which the cpg and flanking sequence are highly conserved across many mammals so that the methylation of many of these cpgs can be measured in each mammal. the design repurposes the degenerate base technology (originally used by illumina infinium probes to tolerate within-human variation) to tolerate cross-species mutations across mammalian species. to select the specific probe sequences, including tolerated mutations, that appear on the array we developed the conserved methylation array probe selector (cmaps). cmaps takes as input a multiple sequence alignment to a reference genome and a set of probe design constraints, and selects a set of probe sequences including tolerated mutations, which can be used to query methylation in many species. we apply cmaps to select over thousand cpgs for the mammalian methylation array, which we complemented with close to two thousand known human biomarker cpgs. we characterize the cpgs on the mammalian methylation array with various genomic annotations. further, we use calibration data to evaluate the fidelity of individual probes in humans, mice, and rats. cmaps has led to the design of the mammalian methylation array, which will facilitate the study of cytosine methylation at conserved loci across all mammal species.
results
designing the mammalian methylation array
the cmaps algorithm is designed to select a set of illumina infinium array probes such that for a target set of species many probes are expected to work in each species (methods). array probes are sequences of length bp flanking a target cpg based on the human reference genome. selecting sequences present in the human reference genome increases the likelihood that measurements in other species will transfer to human.
the mammalian methylation array adapts the degenerate base technology for tolerating human snps so that probes can tolerate a limited number of cross-species mutations. the cmaps algorithm takes as input a multiple-species sequence alignment to a reference genome. cmaps uses these inputs to select the cpgs to target on the array. as part of selecting the cpgs, cmaps also selects the probe sequence design to target them, including the specific set of degenerate bases. for designing the mammalian methylation array, cmaps was applied to the subset of mammals within a -way alignment of vertebrate genomes with the human genome (haeussler et al., ), but we note the cmaps method is general. in designing a probe for a cpg, cmaps considers multiple different options. one option is the type of probe. illumina’s current methylation array technology allows up to two types of probes: infinium i and infinium ii. the latter is newer technology requiring only one silica bead to query the methylation of a cpg, while the former requires two beads. by requiring only one bead, infinium ii probes allow more cpgs to be interrogated under fixed array capacity limits, though infinium i probes are better able to query cpgs in cpg rich regions (bibikova et al., ). another option for each of these two types of probes is whether the probe is on the forward or reverse genomic strand, giving four total combinations of options for probe type and strand for each cpg. in addition, cmaps has options for the position and nucleotide identity of the tolerated mutations, which correspond to degenerate bases.
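the four probe type and strand combinations, and the bead cost that motivates preferring infinium ii probes under a fixed capacity, can be written down in a few lines of python. this is only an illustrative sketch: the names `PROBE_TYPES`, `STRANDS`, `probe_design_options` and `beads_used` are hypothetical and not part of cmaps; the bead counts (two for infinium i, one for infinium ii) are taken from the text above.

```python
from itertools import product

# hypothetical labels for the two probe types and two strands described in the text
PROBE_TYPES = ("infinium_i", "infinium_ii")
STRANDS = ("forward", "reverse")

# beads needed per probe, as stated in the text: infinium i uses two, infinium ii uses one
BEADS_PER_PROBE = {"infinium_i": 2, "infinium_ii": 1}


def probe_design_options():
    """Return every (probe type, strand) combination considered for one cpg."""
    return list(product(PROBE_TYPES, STRANDS))


def beads_used(n_infinium_i, n_infinium_ii):
    """Total beads consumed by a mix of infinium i and infinium ii probes."""
    return (n_infinium_i * BEADS_PER_PROBE["infinium_i"]
            + n_infinium_ii * BEADS_PER_PROBE["infinium_ii"])


if __name__ == "__main__":
    for probe_type, strand in probe_design_options():
        print(probe_type, strand)
```

for a fixed bead budget, every infinium i probe displaces two infinium ii probes, which is the trade-off the text describes against the better performance of infinium i probes in cpg rich regions.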
the array's degenerate base technology allows up to three degenerate bases per probe sequence, which are positions that can be designed to tolerate variation in the sequence being interrogated. for some probes fewer than three degenerate bases could be designed, which was determined based on a design score computed by illumina for each probe and, in the case of infinium ii probes, also the number of cpgs within the probe sequence. cmaps uses a greedy algorithm to select the tolerated mutations for each combination of probe type and strand. the algorithm aims to maximize the number of species in the alignment in which the probe is expected to work based on local alignment information alone, that is, without considering how uniquely mappable the probe is across the genome. a probe for a cpg is expected to work in a non-human species based on local alignment information if there are no differences in the alignment between the human genome sequence and the other species, excluding those accounted for by the probe’s degenerate bases (figure a, methods). for each cpg site in the human genome, cmaps retained for further consideration the infinium i probe out of the two options (forward or reverse of the cpg) which had the greater number of species for which the probe was expected to work, and likewise for infinium ii. we next applied a series of rules to identify a reduced subset of candidate probes.
first, we included all , infinium ii probes that were expected to work in mouse (based on the mm genome), which maximizes the expected array utility for one of the most widely used model organisms. for the remaining set of cpgs not selected in the previous step, we sorted them in descending order of the number of species for which an infinium ii probe was expected to work. we then added the top , cpg sites for a total of , cpg sites. next, we ranked the cpgs targeted on the illumina epic array (pidsley et al., ) in descending order of the number of species for which a probe targeting the cpg is expected to work. for this the probe was required to be of the same probe type and strand as on the epic array, but used the degenerate bases picked by the cmaps algorithm. the probe was allowed to differ in terms of degenerate base positions, as epic probes typically do not account for degenerate bases across species. from this ranking we selected the top , cpg sites that had not already been picked based on the earlier criteria. lastly, we sorted the cpg sites in descending order of the number of species they can target and picked the top , cpgs targeted by infinium i probes that had not already been included. the infinium i probes were selected to allow querying cpg dense regions such as cpg islands, as cpgs do not count towards the limited number of positions of variation as they do for infinium ii probes. this resulted in a set targeting , cpgs (figure b).
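the tiered selection rules described above (all mouse-compatible infinium ii probes, then the top-ranked remaining infinium ii cpgs, then epic cpgs, then infinium i cpgs) can be sketched in python as follows. this is a minimal illustration and not the authors' code: the function and argument names are hypothetical, the tier sizes are left as parameters because the actual counts are not recoverable from this text, and ties are broken by cpg identifier purely to make the sketch deterministic.

```python
def select_top(species_counts, already_selected, n):
    """Top n cpgs by descending species count, skipping ones already selected.

    species_counts: dict mapping cpg id -> number of species in which a probe
    for that cpg is expected to work.
    """
    remaining = [cpg for cpg in species_counts if cpg not in already_selected]
    # sort by descending count; ties broken by cpg id (an assumption of this sketch)
    remaining.sort(key=lambda cpg: (-species_counts[cpg], cpg))
    return remaining[:n]


def tiered_selection(inf2_counts, inf2_mouse_ok, epic_counts, inf1_counts,
                     n_inf2, n_epic, n_inf1):
    """Apply the four selection tiers in order and return the selected cpg ids."""
    selected, chosen = [], set()

    def add(cpgs):
        for cpg in cpgs:
            if cpg not in chosen:
                chosen.add(cpg)
                selected.append(cpg)

    add(sorted(inf2_mouse_ok))                        # tier 1: infinium ii probes working in mouse
    add(select_top(inf2_counts, chosen, n_inf2))      # tier 2: best remaining infinium ii cpgs
    add(select_top(epic_counts, chosen, n_epic))      # tier 3: best remaining epic cpgs
    add(select_top(inf1_counts, chosen, n_inf1))      # tier 4: best remaining infinium i cpgs
    return selected


if __name__ == "__main__":
    # toy inputs, not real data
    picked = tiered_selection(
        inf2_counts={"cg_a": 50, "cg_b": 40, "cg_c": 30, "cg_d": 10},
        inf2_mouse_ok={"cg_c"},
        epic_counts={"cg_b": 35, "cg_e": 20, "cg_f": 5},
        inf1_counts={"cg_g": 60, "cg_a": 55, "cg_h": 15},
        n_inf2=1, n_epic=1, n_inf1=1,
    )
    print(picked)
```

each tier only sees cpgs that earlier tiers left unselected, which mirrors the "not already picked" wording of the rules.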
for some of these , cpgs, the sequence of the probe targeting it can map to multiple locations in a genome, which could result in a confounded signal coming from multiple cpg sites. this issue is compounded by individual probes corresponding to multiple sequences reflecting different possible combinations of the degenerate bases. to identify a subset of probes less susceptible to such confounders, for high quality genomes, we computed for each probe how many of its versions map uniquely in that genome (see methods). we then filtered cpgs down by requiring that all versions of a probe targeting a cpg map uniquely in at least % of the species they are expected to target out of the high quality genomes, unless the probe is expected to target at least mammals from the alignment, in which case the mapping criterion was discarded. this reduced the set of candidate cpgs to , cpgs. we added probes targeting cpgs to the mammalian methylation array based on their utility for human biomarker studies (supplementary data). these probes, which were previously implemented in human illumina infinium arrays (epic, k, k), were selected due to their utility for human biomarker studies estimating age, blood cell counts, or the proportion of neurons in brain tissue (guintivano et al., ; hannum et al., ; horvath, ; horvath and levine, ; horvath et al., ; houseman et al., ; levine et al., ). the final manufactured mammalian methylation array measures cytosine methylation levels of , cytosines: , of these cytosines are followed by a guanine (cpgs) and are followed by another nucleotide (non-cpgs). the probe identifiers (cg numbers) of of these cytosines end with either ". " or ". ", i.e. these are duplicate probes for genomic locations. a detailed analysis of the infinium probe context of the mammalian array and relation to human and mouse arrays is presented in supplementary figure s .
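the mappability filter described earlier in this passage can be sketched in python as follows, under stated assumptions. a probe with degenerate bases corresponds to several concrete sequences ("versions"); the filter keeps a cpg if all versions map uniquely in a sufficient fraction of the high quality genomes it is expected to target, or if the probe targets enough species for the criterion to be waived. the function names, the input layout, and both thresholds (`min_unique_fraction`, `exempt_species_count`) are hypothetical placeholders, since the actual values are not recoverable from this text; rejecting a probe with no expected high quality genomes is also an assumption of this sketch.

```python
from itertools import product


def probe_versions(reference, tolerated):
    """Enumerate the concrete sequences a probe with degenerate bases stands for.

    reference: probe sequence from the reference genome.
    tolerated: dict mapping 0-based position -> set of alternate nucleotides.
    """
    choices = [[base] + sorted(tolerated.get(i, ()))
               for i, base in enumerate(reference)]
    return ["".join(bases) for bases in product(*choices)]


def passes_mappability_filter(unique_by_species, expected_species,
                              n_expected_in_alignment,
                              min_unique_fraction, exempt_species_count):
    """Decide whether a cpg survives the mappability filter.

    unique_by_species: dict species -> True if all versions map uniquely there.
    expected_species: high quality genomes the probe is expected to target.
    n_expected_in_alignment: species in the alignment the probe is expected to target.
    """
    if n_expected_in_alignment >= exempt_species_count:
        return True  # broadly conserved probe: mapping criterion waived
    if not expected_species:
        return False  # assumption: nothing to evaluate means the probe is rejected
    n_unique = sum(1 for s in expected_species if unique_by_species.get(s, False))
    return n_unique / len(expected_species) >= min_unique_fraction


if __name__ == "__main__":
    print(probe_versions("ACGT", {1: {"T"}}))
```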
the mammalian methylation array's focus on highly conserved regions led to an array that is distinct from other currently available infinium arrays that focus on specific species. for example, the mammalian array only shares probes with the illumina mousemethylation array and only cpgs with the illumina epic array.
mappability analysis
all cpgs profiled on the mammalian methylation array apply to humans, but only a subset of these cpgs applies to other species. when conducting analyses in a specific species it can thus be desirable to restrict analyses to the subset of cpgs that apply in that species. one approach for doing this is simply to omit cpgs whose detection p-values from normalization methods (methods) are insignificant. this approach has the advantage of being applicable to species that have not yet been sequenced. mapping sequences to genomes has the added benefit of providing a candidate position of the sequence in the target genome from which other information about the cpg can be inferred, such as the nearest gene or cpg island status. we have mapped the array cpgs to species, which also provides a candidate position from which a gene for the cpg can be associated. as expected, the closer a species is to humans, the more cpgs map to the genome of this species. over k cpgs on the array map to most placental mammalian genomes (eutherians, figure a, supplementary data). roughly k cpgs map to most non-placental mammalian genomes (marsupials), such as kangaroos or opossums. far fewer cpgs map to egg laying mammalian genomes (monotremes), such as platypus (figure ).
a cpg that is adjacent to a given gene in humans may not map to a position adjacent to the corresponding (orthologous) gene in another species. between k and k cpgs (over %) were assigned to human orthologous genes based on their mapped position in most phylogenetic orders (rodents, bats, carnivores, figure b,c and supplementary data). these numbers surrounding orthologous genes are probably overly conservative (i.e. lower than the true numbers) because we found the majority of cpgs (about %) that do not map to orthologous genes in the non-human species are located in intergenic regions outside of promoters (methods), which suggests that one of the gene assignments was inaccurate.
chromosome and gene region coverage of array
we analyzed the chromosome and gene region coverage of the mammalian methylation array for human and mouse. the mammalian methylation array has substantial coverage of all chromosomes (human, - ; and mouse, - probes per chromosome), with the exception of chry that only has probes in both species (supplementary figure s a). when we assign the probes to the closest gene neighbor, around % of the probes are proximal to a gene in both of these species (supplementary figure s b). the remaining % of probes are neither aligned to a promoter nor a gene body. the distribution of gene region and the distances to transcriptional start sites are comparable between human and mouse (supplementary figure s b). cpgs on the mammalian array cover human and mouse genes when each cpg is assigned uniquely to its closest gene neighbor (supplementary figure s c).
the gene coverage is uneven: while on average a gene is covered by cpgs, some genes are covered by as many as cpgs. in mouse, % of cpgs ( , ) were assigned to human orthologous genes (supplementary figure s d), suggesting many cpg measurements from the array in mice will be informative to humans (and vice versa).
gene sets represented in mammalian array
we analyzed gene set enrichments of all genes that are represented on the mammalian array using great (mclean et al., ). significant gene sets included those found to play a role in development, growth, transcriptional regulation, metabolism, cancer, mortality, aging, and survival (supplementary figure s ). we also used the tissueenrich (jain and tuteja, ) software to analyze gene expression (methods). the majority of mammalian methylation array probes (~ %) are adjacent to genes that are expressed in all considered human and mouse tissues (supplementary figure s a,b). however, the mammalian array also contains cpgs that are adjacent to genes that are expressed in a tissue-specific manner, notably testis and cerebral cortex (supplementary figure s c).
cpg island and methylation status
we analyzed the cpg island and dna methylation properties of cpgs on the mammalian array. in general, an average of ( %) of probes in the mammalian array are located in cpg islands depending on the species (figure a).
we used a cpg island detection algorithm (gcluster software (li et al., )) that additionally provided several species-level quantitative measures for each cpg island including the length, gc content, and cpg density that we provide as a resource (supplementary data). we also analyzed the dna methylation levels in human for fractional methylation called from whole genome bisulfite sequencing data across human tissues (roadmap epigenomics consortium et al., ) (supplementary figure ). this confirmed that the mammalian methylation array targets cpgs across a wide range of fractional methylation levels.
chromatin state annotation of array probes
we analyzed the overlap of human cpgs targeted on the mammalian methylation array with chromatin states for cell and tissue types. the cpgs cover all available chromatin states including different types of promoters (including bivalent promoters), regions repressed by polycomb group proteins, transcription start and end sites, and enhancer regions (figure b). among enhancers, cpgs had greater overlap with brain and neurosphere than other tissue groups. in addition to analyzing the array cpgs' overlap with cell and tissue specific chromatin states, we also analyzed them for a universal chromatin state annotation, which provides a single annotation to the genome per position based on data from more than cell and tissue types (vu and ernst, ) (supplementary figure s ). this revealed the greatest enrichment for bivalent promoter states and also strong enrichment for other promoter states and a state associated with polycomb repression.
while the mammalian methylation array was specifically designed to profile cpgs in highly conserved stretches of dna based on sequence conservation, we assessed whether there was also evidence of conservation at the functional genomics level using human-mouse lecif scores (kwon and ernst, ). the human-mouse lecif score quantifies evidence of conservation between human and mouse at the functional genomics level using chromatin state and other functional genomic annotations. probes on the array generally had higher lecif scores than regions that align between human and mouse (figure c).
mammalian array study of calibration data
to validate the accuracy of the mammalian methylation array we applied it to synthetic dna methylation samples for three species: human (n= arrays), mouse (n= ), and rat (n= ), where the methylation levels were known. the dna samples from human, mouse and rat were engineered such that the fractional methylation at all cpg sites in their genomes was approximately %, %, %, % and % (methods). the calibration data thus allow us to define a benchmark annotation measure “proportionmethylated” (with ordinal values , . , . , . , ). the distribution of the intensity of the probes in each human sample is roughly centered around the benchmark measure (proportionmethylated) (figure a). however, as expected, the distributions of all the probes in the mouse and rat samples show somewhat different patterns compared to the human samples, likely because many probes in the design of our array do not map to these genomes (figure b-c). we also evaluate these for each species after removing the probes that were not designed to map to that species, and normalizing the array data using the sesame package, which defines beta (relative intensity) values for each probe (zhou et al., ). after this procedure, we see sharper peaks close to and , though the quantification of absolute methylation levels is somewhat degraded around the beta value . as we move away from humans (figure d-f). additionally, for each species, we computed the correlation between the dna methylation levels of each cpg and the benchmark variable "proportionmethylated" across the arrays. high positive correlations would be evidence for the accuracy of the array, which is indeed what we observe. cpgs that map to the human, mouse, and rat genome have a median pearson correlation of r= . with an interquartile range of [ . , . ], r= . with iqr=[ . , . ], and r= . with iqr=[ . , . ] with the benchmark variable proportionmethylated in the respective species. the numbers of cpgs on the mammalian array that pass a given correlation threshold (irrespective of the mappability to a given species) are reported in table . we also compare the sesame normalization with the "noob" normalization that is implemented in the minfi r package (aryee et al., ; triche et al., ) (table ). we find that sesame slightly outperforms minfi when it comes to the number of cpgs that exceed a given correlation threshold with proportionmethylated.
comparison with the human epic methylation array study in calibration data
we compared the mammalian methylation array to the human epic methylation array, which profiles k cpgs in the human genome, for non-human samples. some of the epic array probes are expected to apply to the mouse and rat genomes as well (needhamsen et al., ). to facilitate a comparison between the mammalian methylation array and the human epic array for non-human samples we applied the latter to calibration data from mouse (n= arrays) and rat (n= ).
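the per-cpg evaluation used throughout these calibration comparisons, correlating each cpg's beta values with the benchmark variable proportionmethylated and counting the cpgs above a threshold, can be sketched in plain python. this is an illustrative sketch only: the function names and data layout are hypothetical, the benchmark levels and beta values below are made-up placeholders rather than the study's values, and treating a cpg with constant beta values as failing the threshold is an assumption of this sketch.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def count_passing(betas_by_cpg, proportion_methylated, threshold):
    """Count cpgs whose beta values correlate with the benchmark at or above threshold.

    betas_by_cpg: dict cpg id -> list of beta values, one per calibration array,
    in the same order as proportion_methylated.
    """
    # a nan correlation compares as False, so constant probes never pass
    return sum(1 for betas in betas_by_cpg.values()
               if pearson(betas, proportion_methylated) >= threshold)


if __name__ == "__main__":
    benchmark = [0.0, 0.1, 0.4, 0.7, 1.0]          # placeholder methylation levels
    betas = {
        "cg_good": [0.05, 0.12, 0.41, 0.66, 0.93],  # tracks the benchmark
        "cg_flat": [0.5, 0.5, 0.5, 0.5, 0.5],       # uninformative probe
        "cg_inverse": [0.9, 0.7, 0.5, 0.2, 0.1],    # anti-correlated probe
    }
    print(count_passing(betas, benchmark, threshold=0.9))
```

because pearson correlation is invariant to shifts and positive rescaling of the beta values, a probe can pass this check even when its absolute methylation estimates are offset, which is why the text separately discusses absolute calibration.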
the same engineered dna methylation data were analyzed on the human epic array as on the mammalian methylation array above. in particular, we were able to correlate each cpg on the epic array with a benchmark measure (proportionmethylated) in mice and rats (table ). only (out of k) cpgs on the human epic array exceed a correlation of . with proportionmethylated in mice. by contrast, cpgs on the mammalian array exceed the same correlation threshold in mice. similarly, the mammalian array outperforms the epic array in rats: only cpgs on the epic array exceed a correlation of . with proportionmethylated compared with cpgs on the mammalian array. the results are similar for the correlation thresholds of . and . (table ). the epic array contains cpgs that were also prioritized by the cmaps algorithm based on high levels of conservation, excluding the cpgs from human biomarker studies. out of these shared cpgs, and cpgs map to the mouse and rat genome, respectively. while human epic probes target the same cpg, the corresponding mammalian probe is typically different from the epic probe due to differences in probe type (type i versus type ii probe), dna strand, or the handling of cross-species mutations through degenerate bases. in the following comparison, we limited the analysis to the and probes when analyzing calibration data from mice or rats, respectively. we find that the mammalian array probes are better calibrated than the corresponding epic array probes when applied to mouse and rat calibration data according to two different analyses that focus on shared cpgs between the two platforms.
first, the mammalian array outperforms the epic array when considering mean methylation levels across the shared cpgs (figure ). second, when correlating each of the shared cpgs with the benchmark value proportionmethylated we observe a median correlation of . for both mouse and rat calibration data generated on the epic array. for the same probes we observe median correlations of . and . for mouse and rat calibration data generated on the mammalian array (sesame normalization), respectively. we are distributing the methylation data and results from our calibration data analysis in three species (supplementary data). these calibration results will allow users to select cytosines whose methylation has a high correlation with the benchmark data in human, mouse or rat.
discussion
the mammalian methylation array, which was enabled by the cmaps algorithm for selecting conserved probes, is applicable to all mammals and hence drives down the cost per chip through economies of scale. the mammalian methylation array has unique strengths: it applies to all mammalian species even those that have not yet been sequenced, it provides deep coverage of specific cytosines which is a prerequisite for developing robust epigenetic biomarkers, and its focus on highly conserved cpgs increases the chances that findings in one species will translate to those in another species. we expect that the mammalian methylation array is particularly well suited for dna methylation based biomarker studies in mammals.
our calibration data demonstrate that the array largely leads to high quality measurements in three species: human, mouse and rat. our calibration data show that the mammalian methylation array greatly outperforms the human epic chip when it comes to high fidelity measurement applications to mice and rats. the array thus should be preferable for most non-human applications unless high-fidelity measurements are not needed, in which case the larger content of the epic array may make it preferable. the mammalian methylation array has several limitations. first, only a fraction of genes in a given species are represented by cpgs. second, it focuses on cpgs in highly conserved stretches of dna and hence does not cover parts that are specific to a given species. third, it provides worse coverage in more distal species, particularly in marsupials than in placental mammals (eutherians). finally, the calibration data suggest there are some shifts in the absolute methylation levels detected for intermediate methylation levels, but the relative order is preserved. the correct relative ordering of beta values is of primary importance in most statistical tests and analyses. several software tools have been adapted for use with the mammalian methylation array that range from normalization to higher level gene enrichment analysis. software tools for generating normalized data include sesame and the minfi r package (aryee et al., ; zhou et al., ).
the eforge software (breeze et al., ), which has been adapted for the use with the mammalian array, facilitates chromatin state analysis and transcription factor binding site analysis. many researchers will be interested in genome coordinates of the mammalian cpgs in different species. toward this end, we provide genome coordinates in species. this list of species will increase as more high quality genomes become available. detailed gene annotations in many species are available including details on gene region (e.g. exon, promoter, prime untranslated region) and cpg island status (supplementary data). for human and mice we provide chromatin state annotations (ernst and kellis, ; gorkin et al., ; roadmap epigenomics consortium et al., ; vu and ernst, ) and the lecif score on evidence of conservation at the functional genomics level between human and mouse(kwon and ernst, ). in other articles, we will describe the application of the mammalian methylation array to many different mammalian species. these upcoming studies will demonstrate that the mammalian methylation array is useful for many applications that involve mammalian species. methods conserved methylation array probe selector (cmaps) given a multi-species sequence alignment and reference genome, for each cg site and each of the four different possible probe designs, cmaps computes an estimate of the number of species from the alignment that could be targeted if the use of degenerate base technology is optimized for tolerated mutations. the four probe designs involve each combination of probe type (infinium i vs. infinium ii), and whether the probe sequence is on the forward or reverse dna strand. for .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . 
each probe option, cmaps conducts a greedy search to select tolerated mutations, including position and allele, that maximize species coverage for the probe. the maximum number of degenerate bases that can be included in a probe is a function of a design score provided by illumina inc. for infinium ii probes only, cpgs present in the probe sequence count as if they are a degenerate base. more specifically, the algorithm for determining the number of species and selecting the mutations to handle performs the following steps for each probe design:

1. let m be the maximum number of degenerate bases that can be designed into a specific probe, based on the design score
2. for each species s in the alignment, let ms be the number of mismatches in the alignment between that species and the human reference sequence of the probe
   a. if ms > m or the species does not have the target cpg, continue to next species
   b. if ms <= m,
      i. for each mismatch in species s, add each degenerate position to a multiset p
      ii. add the species to a set f of feasible species to target with this probe
3. for all |p| choose m combinations of possible degenerate positions:
   a. for each unique position in the combination
      i. for each possible alternate nucleotide count the number of species in f that contain that alternate nucleotide
      ii. pick the top k alternate nucleotides based on the count in i., where k is the number of occurrences of the current position in s
   b. compute the number of species that match the human reference when accounting for the degenerate substitutions handled in 3a
4. select the combination of positions in s that maximizes 3b
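the steps above can be sketched in python as follows. this is a minimal illustration under stated assumptions, not the cmaps implementation: the function name `cmaps_coverage` and its arguments are hypothetical, k is interpreted as the number of times a position occurs in the current combination, fewer than m positions are used when the multiset holds fewer than m entries, and the infinium ii rule that counts in-probe cpgs as degenerate bases is not modeled.

```python
from collections import Counter
from itertools import combinations


def cmaps_coverage(human, aligned, has_cpg, m):
    """Return (number of species covered, {position: set of alternate bases}).

    human   -- human reference probe sequence
    aligned -- dict: species name -> aligned sequence of the same length
    has_cpg -- dict: species name -> True if the target CpG is present
    m       -- maximum number of degenerate bases allowed (step 1)
    """
    multiset_p = []   # step 2.b.i: mismatch positions, repeats kept
    feasible = {}     # step 2.b.ii: species -> {position: alternate base}
    for species, seq in aligned.items():
        mismatches = {i: b for i, (a, b) in enumerate(zip(human, seq)) if a != b}
        if len(mismatches) > m or not has_cpg[species]:
            continue  # step 2.a
        multiset_p.extend(mismatches)
        feasible[species] = mismatches

    best_count, best_design = -1, {}
    size = min(m, len(multiset_p))  # assumption: use fewer positions if |P| < m
    # step 3: every distinct combination drawn from the multiset
    for combo in sorted(set(combinations(sorted(multiset_p), size))):
        design = {}
        for pos, k in Counter(combo).items():  # k = occurrences of pos in combo
            alt_counts = Counter(mm[pos] for mm in feasible.values() if pos in mm)
            design[pos] = {base for base, _ in alt_counts.most_common(k)}
        # step 3.b: species whose every mismatch is absorbed by the design
        covered = sum(
            all(base in design.get(pos, ()) for pos, base in mm.items())
            for mm in feasible.values()
        )
        if covered > best_count:  # step 4
            best_count, best_design = covered, design
    return best_count, best_design


# toy alignment (made-up sequences, for illustration only)
human = "ACGTACGT"
aligned = {"human": "ACGTACGT", "mouse": "ACGTACGA", "rat": "ACGTACGA",
           "dog": "TCGTACGT", "cow": "TCGAACGA"}
has_cpg = {name: True for name in aligned}
print(cmaps_coverage(human, aligned, has_cpg, m=1))  # -> (3, {7: {'A'}})
```

in the toy data, one degenerate base at the last position covers human, mouse and rat; raising m to 2 adds the dog, and m = 3 is needed before the cow sequence becomes feasible.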
our procedure for selecting the specific targeted cpgs and probe designs is described in the main text. we note that of the cpgs selected for the mammalian methylation array based on the conservation criteria (using the sequence alignment) overlap with the human biomarker cpgs. the design of the probes targeting them could differ, however. the probe names of different probes targeting the same cpg are distinguished by extensions ". " and ". ". for example, cg . and cg . target the same cytosine but use different probe chemistry. the array contains four probes that measure cytosines that are not followed by a guanine selected by human biomarkers, which are indicated with a "ch" instead of a "cg".

the cmaps algorithm was applied with human hg as the reference genome, using the multiz alignment of vertebrates with the hg human genome downloaded from the ucsc genome browser (haeussler et al., ; rosenbloom et al., ). for the purpose of designing the mammalian array, only the mammalian species in this alignment were considered and for the mappability analysis. however, the current version of the mappability analysis provides genome coordinates for species. the mammalian methylation array includes an additional human snp markers (whose probe names start with "rs" for human studies), which can be used to detect plate map errors when dealing with multiple tissue samples collected from the same person. finally, the mammalian array also adopted a standard suite of probes from the illumina epic array for measuring bisulfite conversion efficiency in humans.

mapping probes to genomic coordinates

we used two different approaches for mapping probes to genomes. the first approach (bsbolt software) was primarily used in designing the array.
subsequently, we adopted a second mappability approach (quasr software) that allowed us to map more probes to more species.

mappability approach : bsbolt

for version of our mappability analysis (i.e. for designing the array), we applied the bsbolt mapping approach to high quality genomes from: baboon (papham ), cat (felcat ), chimp (pantro ), cow (bostau ), dog (canfam ), gibbon (nomleu ), green monkey (chlsab ), horse (equcab ), human (hg ), macaque (macfas ), marmoset (caljac ), mouse (mm ), rabbit (orycun ), rat (rn ), rhesus monkey (rhemac ), sheep (oviari ). we used the bsbolt software package (farrell et al., ) from https://github.com/nuttylogic/bsbolt to perform the alignments. for each species' genome sequence, bsbolt creates an 'in silico' bisulfite-treated version of the genome. as many of the currently available genomes are in a low quality assembly state (e.g. thousands of contigs or scaffolds), we used the utility "threader" (which can be found in bsbolt's forebear bsseeker (guo et al., ) as a standalone executable) to reformat these fasta files into concatenated and padded pseudo-chromosomes. the set of nucleotide sequences of the designed probes, which includes degenerate base positions, was explicitly expanded into a larger set of nucleotide sequences representing every possible combination of those degenerate bases. for infinium i probes, which have both a methylated and an unmethylated version of the probe sequence, only the methylated version was used, as bsbolt's version of the genome treats all cg sites as methylated.
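the expansion of degenerate positions into explicit sequences can be illustrated with a short python sketch. it assumes that degenerate positions are written with standard iupac nucleotide codes, which the text does not state; the helper name `expand_degenerate` is hypothetical.

```python
from itertools import product

# standard IUPAC nucleotide codes and the bases each one stands for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(probe):
    """List every concrete sequence encoded by a probe with degenerate bases."""
    choices = (IUPAC[ch] for ch in probe.upper())
    return ["".join(bases) for bases in product(*choices)]


print(expand_degenerate("ACRT"))  # -> ['ACAT', 'ACGT']
```

the number of expanded sequences is the product of the number of bases allowed at each degenerate position, which is why a modest number of probes can expand into a much larger set of sequences to align.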
the initial k probe sequences resulted in a set of , sequences to be aligned against the various species genomes. we then ran bsbolt with parameters align -m -db [path to bisulfite-treated genome] -bt bowtie -bt -p -bt -k -bt -l -f [probe sequence file] -o [alignment output file] -s to align the enlarged set of probe sequences to each prepared genome. as we were not interested in the final bsbolt style output, we made a small modification to the code to retain its temporary output of alignment results in "sam" format. from these files, we collected only alignments where the entire length of the probe perfectly matched the genome sequence (i.e. the cigar string ' m' and flag xm= ). then, for each genome we collapsed all the sequence variant alignments for each probeid down to a list of loci for that genome and for that probe.

mappability approach : quasr

for version of our mappability analysis, we aligned the probe sequences to all available mammalian genomes in the ensembl and ncbi refseq databases using the quasr package (gaidatzis et al., ). the fasta sequence files for each genome were downloaded from these public databases. the alignment assumed that the dna has been subjected to a bisulfite conversion treatment. for each species' genome sequence, quasr creates an in silico bisulfite-treated version of the genome. the probes were aligned to these bisulfite-treated genome sequences, so that a c-t difference is not considered a mismatch.
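the idea of aligning against an in silico bisulfite-treated genome can be illustrated with a small python sketch. it is a simplified model, not what quasr or bowtie do internally: it handles the forward strand only, uses exact string search instead of an aligner, and the function names are hypothetical.

```python
def bisulfite_convert(seq):
    """Forward-strand in silico conversion: every C is read as T."""
    return seq.upper().replace("C", "T")


def find_hits(probe, genome):
    """Start positions where probe and genome agree once C and T are equated."""
    converted_probe = bisulfite_convert(probe)
    converted_genome = bisulfite_convert(genome)
    hits = []
    start = converted_genome.find(converted_probe)
    while start != -1:
        hits.append(start)
        start = converted_genome.find(converted_probe, start + 1)
    return hits


# toy genome: "CGTT" at position 2 and "TGTT" at position 6 both match
print(find_hits("CGTT", "AACGTTTGTT"))  # -> [2, 6]
```

because both the probe and the genome are converted before comparison, a c in one and a t in the other at the same position no longer count as a mismatch, which is the behaviour described above.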
the alignment was run with quasr (a wrapper for bowtie ) with parameters -k --strata --best -v and bisulfite = "undir" to align the enlarged set of probe sequences to each prepared genome. from these files, we collected the best candidate unique alignment to the genome. additionally, the estimated cpg coordinates at the end of each probe were used to extract the sequence from each genome fasta file and exclude any probes with mismatches in the target cpg location.

genomic loci annotations

gene annotations (gff ) for each genome considered were also downloaded from the same sources as the genome. following the alignment, the cpgs were annotated to genes based on the distance to the closest transcriptional start site using the chipseeker package (yu et al., ). the genomic location of each cpg was categorized as either intergenic region, ' utr, ' utr, promoter (minus kb to plus bp from the nearest tss), exon, or intron. the unique region assignment is prioritized as follows: exons, promoters, introns, ' utr, ' utr, and intergenic. additional genomic annotations, including human ortholog ensembl id, were extracted from the biomart ensembl database (yates et al., ). the candidate gene for each probe was compared with the human orthologous ensembl id to examine the similarity of the alignment with that in human. for each probe, we examined whether the assigned species ensembl id is identical to the human-to-other-species orthologous ensembl id in the human mappability file.
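the prioritized region assignment described in this section can be sketched in python as below. this is an illustration only and not the chipseeker implementation; the region labels and the function name `assign_region` are hypothetical, and the relative order of the two utr classes is an assumption because the digits are missing from the text.

```python
# highest priority first: exons, promoters, introns, the two utr classes,
# then intergenic. the order of "5' utr" before "3' utr" is an assumption.
REGION_PRIORITY = ["exon", "promoter", "intron", "5' utr", "3' utr", "intergenic"]


def assign_region(overlapping):
    """Pick one region label for a CpG from all regions it overlaps."""
    for region in REGION_PRIORITY:
        if region in overlapping:
            return region
    return "intergenic"  # nothing annotated overlaps the CpG


print(assign_region({"intron", "promoter"}))  # -> promoter
```

a cpg that overlaps several annotated features therefore receives exactly one label, and a cpg that overlaps none is labeled intergenic.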
orthologous comparison with human was done for genomes that could be matched to the human genome by "targetspecies_homolog_associated_gene_name" in biomart using the getlds() function. cell and tissue specific chromatin state annotations were based on the -state chromhmm model based on imputed data for -marks (ernst and kellis, ; roadmap epigenomics consortium et al., ). the chromatin state annotations from a chromhmm model that was not specific to a single cell or tissue type were from (vu and ernst, ). we also provide in the annotation files of the array chromhmm chromatin state annotations for mouse from (gorkin et al., ). the human-mouse lecif score was from (kwon and ernst, ).

cpg island annotation

we called cpg islands using the "gcluster" algorithm (gómez-martín et al., ). with its default parameters, this algorithm uses clustering methods to identify the sequences that have high g+c content and cpg density. besides cpg island status, this algorithm calculated several other attributes including length, gc content, and cpg density for each defined island. the outcome of this algorithm was a bed file that was used to annotate the probes using the "annotatr" package in r by checking the overlap of the aligned probes and cpg island genomic coordinates.

human dna methylation distribution
we downloaded the fraction methylated values based on whole genome bisulfite sequencing data from different cell and tissue types from the roadmap epigenomics consortium (http://egg .wustl.edu/roadmap/data/bydatatype/dnamethylation/wgbs/fractionalmethylation.tar.gz) (roadmap epigenomics consortium et al., ). for each cpg, we averaged the fractional methylation values across the roadmap samples.

great analysis

we applied the great analysis software tool (mclean et al., ) to conduct gene set enrichments for genes near cpgs on the array in human and mouse. the great software performs both a binomial test (over genomic regions) and a hypergeometric test over genes when using a whole genome background. we performed the enrichment based on default settings (proximal: . kb upstream, . kb downstream, plus distal: up to , kb) for gene sets associated with go terms, msigdb, panther and kegg pathway. to avoid large numbers of multiple comparisons, we restricted the analysis to the gene sets with between and , genes. we report nominal p values and two adjustments for multiple comparisons: bonferroni correction and the benjamini-hochberg false discovery rate.

tissue enrichment analysis

the enrichment of tissue specific genes was done with the tissueenrich r package (jain and tuteja, ) using the teenrichment() function, limited to the human protein atlas (uhlén et al., ) and mouse encode (yue et al., ) databases.

normalization methods
r software scripts implementing normalization methods can be accessed through our webpage (see the section on data availability). two software scripts are currently available for extracting beta values from raw signal intensities, based on minfi and sesame, respectively. both methods use the noob method (triche et al., ) for background subtraction. the two scripts evaluate each probe's hybridization and extension performance using normalization control probes and infinium-i probe out-of-band measurements (the poobah method (zhou et al., )), respectively. users can use the detection p-values for each cpg to filter out non-significant methylation readouts from probes unlikely to work in the target species.

calibration data

we generated methylation data on two different platforms: the mammalian methylation array (horvathmammalmethylchip ) and the human epic methylation array. the dna samples from each species were enzymatically manipulated so that they would exhibit %, %, %, % and % methylation at each cpg location, respectively. we purchased premixed dna standards from epigendx inc (products - h-premixhuman, - m-premixmouse, and standard - r-premixrat premixed calibration standard). the variable "proportionmethylated" (with ordinal values , . , . , . , ) can be interpreted as a benchmark for each cpg that maps to the respective genome. thus, the dna methylation levels of each cpg are expected to have a high positive correlation with proportionmethylated across the array measurements from a given species. the mammalian array was applied to synthetic dna data from species: human (n= mammalian arrays), mouse (n= ), and rat (n= ). similarly, the human epic array was applied to calibration data from mouse (n= epic arrays) and rat (n= ). thus, we applied epic arrays and epic arrays per value ( , . , . , . ,
) of proportionmethylated in our mouse and rat studies, respectively. the epic array data were normalized using the noob method (r function preprocessnoob in minfi).

data availability

the mammalian methylation array (horvathmammalmethylchip ) is registered at the ncbi gene expression omnibus (geo) as platform gpl . the chip manifest file, calibration data, supplementary data, and r software scripts are or will be available from https://github.com/shorvath/mammalianmethylationconsortium/ or the gene expression omnibus.

acknowledgements and funding

this work was supported by the paul g. allen frontiers group (sh) and nsf career award # , national institutes of health (dp da ) and a jccc-bscrc ablon scholars award (je).

conflict of interest statement

the regents of the university of california is the sole owner of a provisional patent application directed at this invention for which aa, je and sh are named inventors. sh is a founder of the non-profit epigenetic clock development foundation, which plans to license several patents from his employer uc regents, and distributes the mammalian methylation array. bret barnes is an employee of illumina inc, which manufactures the mammalian methylation array. the other authors declare no conflicts of interest.
no. cpgs whose correlation with proportionmethylated > threshold
species threshold mammal+sesame mammal+minfi epic+minfi
mouse . , , ,
mouse . , , ,
mouse . , ,
rat . , , ,
rat . , , ,
rat . , ,
human . , , na
human . , , na
human . , , na

table . correlating dna methylation levels with calibration data. we evaluated the mammalian methylation array with two different software methods for normalization: sesame and minfi (noob normalization). the epic array data were only normalized with the noob normalization method in minfi. as indicated in the first column, the dna samples came from three species: human (n= arrays), mouse (n= ), and rat (n= ). for each species, the "artificial" chromosomes exhibited on average %, %, %, % and % methylation at each cpg location. thus, the variable "proportionmethylated" (with ordinal values , . , . , . , ) can be considered as benchmark/gold standard. the table reports the number of cpgs for which the pearson correlation with proportionmethylated was greater than the correlation threshold (second column).

figures

figure . overview of mammalian methylation array design process. (a) toy example of multiple sequence alignment at a cpg site considered by the cmaps algorithm. the orange coloring highlights the cpg being targeted.
positions where other species have alignment that matches the human sequence are in dark blue; positions where other species have alignment that does not match the human sequence are in neon yellow; positions where other species have no alignment are in grey. (b) flowchart detailing the selection of probes on the array by the cmaps algorithm. a small fraction of probes designed were dropped during the manufacturing process.

figure . cpg and gene coverage of probes on the mammalian methylation array across different phylogenetic orders. (a) probe localization based on the quasr package (gaidatzis et al., ). the rows correspond to different phylogenetic orders. the phylogenetic orders are ordered based on the phylogenetic tree and increasing distance to human. the boxplots report the median number of mapped probes across species from the given phylogenetic order. (b) the number of probes mapped to human orthologous genes for a subset of genomes (methods). (c) percentage of the probes associated with human orthologous genes among mapped probes in these species.

figure . cpg island and chromatin state analysis of mammalian methylation probes.
we characterize the cpgs located on the mammalian methylation array regarding (a) cpg island status in different phylogenetic orders, (b) chromatin state analysis, and (c) lecif score of evidence of human-mouse conservation at the functional genomics level. (a) the boxplots report the median number (and interquartile range) of cpgs that map to cpg islands in mammalian species of a given phylogenetic order (x-axis). the notch around the median depicts the % confidence interval. (b) the heatmap visualizes the chromhmm chromatin state annotations of the location of the cpgs on the array (rows) in different human tissues (columns) (ernst and kellis, , ). the colors correspond to human chromatin states as detailed in the right panel. the probes in the left panel heatmap are ordered by the chromatin state with the maximum median frequency across human cell and tissue types. the right panel indicates the distribution of chromatin states in each tissue type represented on the mammalian methylation array. (c) comparison of distribution of lecif score for probes on the array and aligning bases between human and mouse. the lecif score has been binned as shown on the x-axis, and the fraction of probes or aligning bases with scores in that bin are shown on the y-axis.

figure . distribution of probe intensities within sample, colored by the expected percentage of methylation at each site. (a-c) distribution of beta values (relative intensity) of all probes on the array before normalization for (a) human samples, (b) mouse samples, and (c) rat samples.
(d-f) distribution of probe intensity after sesame normalization and restricting probes to those that cmaps designed to (d) the human genome in human samples, (e) the mouse genome in mouse samples, and (f) the rat genome in rat samples.

figure . calibration data: mean methylation across probes shared between the human epic array and the mammalian array. the mammalian methylation array contained probes targeting the same cpg that can also be found on the human epic array that were not included based on being human biomarkers. however, the mammalian array probes were engineered differently than epic probes so that they would more likely work across mammals. by applying both array types to calibration data, we are able to compare the calibration of the overlapping probes in mice (a,c) and rats (b,d). upper panels (a,b) and lower panels (c,d) present the results for the mammalian array and the epic array, respectively. each panel plots the benchmark measure (proportionmethylated, x-axis) versus the mean value across roughly cpgs that map to mice (a,c) and roughly cpgs that map to rats (b,d). the mean methylation (y-axis) was formed across a subset of cpgs that i) are present on the human epic array, ii) are present on the mammalian array, and iii) apply to the respective species according to the mappability analysis genome coordinate file.
[figure panels: (a) mouse, mammalian array, sesame; (b) rat, mammalian array, sesame; (c) mouse dna, epic array; (d) rat dna, epic array. x-axis: proportionmethylated. y-axis: mean methylation. each panel title reports a correlation (cor) and a p value.]

supplementary figures

supplementary figure s : comparison of probe context between the illumina epic, k and the mammalian methylation array: (a) analysis of cpg and non-cpg (ch) probes, (b) color channel assignment, (c) type i and type ii probes, and (d) next base reveals similar percentages across probes from these three array platforms. color channel assignment and probe basepair context are important for dna methylation array analysis and the similarity between these different arrays can facilitate extension of published analysis and normalization methods. analysis of type i and type ii probes shows a slightly lower percentage of type i probes for the mammalian methylation array. type i probes assay dna methylation using one color channel and two bead types, i.e. one unmethylated bead type and one methylated bead type.
conversely, type ii probes assay dna methylation using one bead type and two color channels indicating methylated and unmethylated cytosines. adjustment for dna methylation signal detected by these different probe types is one of the most important steps in dna methylation array normalization, and a sufficient number of type i probes were included in the mammalian methylation array to facilitate the extension of published data normalization methods. (e) comparison of shared and non-shared probes between the mammalian methylation array and mousemethylation array loci reveals shared probes. (f) comparison of shared and non-shared probes between the epic, k and the mammalian methylation array. comparative analysis was performed using illumina probe ids, which are unique to each probe. intersection of ids between arrays reveals over , probes that are common to all platforms (center). these probes can be used to follow up published human epigenome-wide association study (ewas) results in model organisms such as mouse (mus musculus) or rat (rattus norvegicus), or across a range of other species, including all primates and other mammals.

supplementary figure s . chromosome and gene region analysis of mammalian methylation probes in humans and mice. the analysis is based on mapping probes on the mammalian methylation array to the human (hg ) and mouse (mm ) genome using the quasr package (gaidatzis et al., ). (a) the number of probes per human and mouse chromosome.
(b) the left panel reports the percentage of probes that are located in different gene regions (promoters, ' utr, ' utr, introns, exons) in humans and mice. the right panel reports the distribution of the probes relative to the nearest transcriptional start site. (c) histogram of cpg number in different gene regions in human and mouse genomes (as defined in the legend of panel d). (d) alignment to orthologous genes between humans and mice. the colors indicate the mapped gene region in the mouse genome. the unique region assignment is prioritized as follows: exons, promoters, introns, ' utr, ' utr.

summary figure s . great gene set enrichment analysis of all probes on the mammalian methylation array. the figure shows the top enriched pathway based on gene-level enrichment analysis for genes proximal to probes using great . the two columns correspond to enrichment analysis for human (hg ) and mouse (mm ) genomes, respectively, using the whole genome as background. the top five enriched datasets from each category (canonical pathways, diseases, gene ontology, human and mouse phenotypes, and upstream regulators) were selected and further filtered for significance at p < - . the category is indicated by the shape, the number of genes by the size of the shape, and the significance of the enrichment by the color scale.
supplementary figure s . human and mouse tissue-specific probes on mammalian methylation array. characterization of the tissue specificity of cpg probes on the mammalian methylation array using the human protein atlas (uhlén et al., ) and mouse encode gene expression data (yue et al., ). the left and right panels report results for human and mouse genomes, respectively. each probe is mapped to the closest gene while other genes in the flanking region are ignored in this analysis. the number of genes (a) and the number of cpg probes (b) versus a categorical measure of tissue specificity. the categories on the y-axis are defined in the tissueenrich software as follows. "tissue enriched" labels genes with an expression level greater than (tpm or fpkm) that also have at least five-fold higher expression levels in a particular tissue compared to all other tissues. "group enriched" labels genes with an expression level greater than (tpm or fpkm) that also have at least five-fold higher expression levels in a group of - tissues compared to all other tissues, and that are not considered tissue enriched. "tissue enhanced" labels genes with an expression level greater than (tpm or fpkm) that also have at least five-fold higher expression levels in a particular tissue compared to the average levels in all other tissues, and that are not considered tissue enriched or group enriched. (c) the number of tissue-enriched genes represented on the mammalian array vs background in the human and mouse transcriptome.
supplementary figure s . distribution of dna methylation levels. distribution of average fractional methylation across cell and tissue types (roadmap epigenomics consortium et al., ) at cpg sites on the array (blue) and all sites in the genome (red).

supplementary figure s . mammalian methylation array enrichment for universal chromatin state annotations. (left) distribution of probe overlap with a universal chromatin state annotation by the stacked modeling approach of chromhmm applied to data from more than cell or tissue types (vu and ernst, ). (right) the same as left, but showing the fold enrichments of the state relative to a uniform background. the strongest enrichment is seen for some bivalent promoter states. a full characterization of the states can be found in (vu and ernst, ).

references

aryee, m.j., jaffe, a.e., corrada-bravo, h., ladd-acosta, c., feinberg, a.p., hansen, k.d., and irizarry, r.a. ( ).
minfi: a flexible and comprehensive bioconductor package for the analysis of infinium dna methylation microarrays. bioinformatics , – . bibikova, m., le, j., barnes, b., saedinia-melnyk, s., zhou, l., shen, r., and gunderson, k.l. ( ). genome-wide dna methylation profiling using infinium® assay. epigenomics , – . bibikova, m., barnes, b., tsan, c., ho, v., klotzle, b., le, j.m., delano, d., zhang, l., schroth, g.p., gunderson, k.l., et al. ( ). high density dna methylation array with single cpg site resolution. genomics , – . breeze, c.e., reynolds, a.p., van dongen, j., dunham, i., lazar, j., neph, s., vierstra, j., bourque, g., teschendorff, a.e., stamatoyannopoulos, j.a., et al. ( ). eforge v . : updated analysis of cell type-specific signal in epigenomic data. bioinformatics , – . cokus, s.j., feng, s., zhang, x., chen, z., merriman, b., haudenschild, c.d., pradhan, s., nelson, s.f., pellegrini, m., and jacobsen, s.e. ( ). shotgun bisulphite sequencing of the arabidopsis genome reveals dna methylation patterning. nature , – . ernst, j., and kellis, m. ( ). chromhmm: automating chromatin-state discovery and characterization. nat. methods , – . ernst, j., and kellis, m. ( ). large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues. nat. biotechnol. , – . farrell, c., thompson, m., tosevska, a., oyetunde, a., and pellegrini, m. ( ). bisulfite bolt: a bisulfite sequencing analysis platform. biorxiv . . . . gaidatzis, d., lerch, a., hahne, f., and stadler, m.b. ( ). quasr: quantification and annotation of short reads in r. bioinformatics , – . gómez-martín, c., lebrón, r., oliver, j.l., and hackenberg, m. ( ). prediction of cpg islands as an intrinsic clustering property found in many eukaryotic dna sequences and its relation to dna methylation. methods mol. biol. clifton nj , – . gorkin, d.u., barozzi, i., zhao, y., zhang, y., huang, h., lee, a.y., li, b., chiou, j., wildberg, a., ding, b., et al. ( ). 
an atlas of dynamic chromatin landscapes in mouse fetal development. nature , – . guintivano, j., aryee, m.j., and kaminsky, z.a. ( ). a cell epigenotype specific model for the correction of brain cellular heterogeneity bias and its application to age, brain region and major depression. epigenetics , – . guo, w., fiziev, p., yan, w., cokus, s., sun, x., zhang, m.q., chen, p.-y., and pellegrini, m. ( ). bs-seeker : a versatile aligning pipeline for bisulfite sequencing data. bmc genomics , . haeussler, m., zweig, a.s., tyner, c., speir, m.l., rosenbloom, k.r., raney, b.j., lee, c.m., lee, b.t., hinrichs, a.s., gonzalez, j.n., et al. ( ). the ucsc genome browser database: update. nucleic acids res. , d –d . hannum, g., guinney, j., zhao, l., zhang, l., hughes, g., sadda, s., klotzle, b., bibikova, m., fan, j.-b., gao, y., et al. ( ). genome-wide methylation profiles reveal quantitative views of human aging rates. mol. cell , – . horvath, s. ( ). dna methylation age of human tissues and cell types. genome biol. , r . horvath, s., and levine, a.j. ( ). hiv- infection accelerates age according to the epigenetic clock. j. infect. dis. , – . horvath, s., oshima, j., martin, g.m., lu, a.t., quach, a., cohen, h., felton, s., matsuyama, m., lowe, d., kabacik, s., et al. ( ). epigenetic clock for skin and blood cells applied to hutchinson gilford progeria syndrome and ex vivo studies. aging , – . houseman, e.a., accomando, w.p., koestler, d.c., christensen, b.c., marsit, c.j., nelson, h.h., wiencke, j.k., and kelsey, k.t. ( ).
dna methylation arrays as surrogate measures of cell mixture distribution. bmc bioinformatics , . jain, a., and tuteja, g. ( ). tissueenrich: tissue-specific gene enrichment analysis. bioinforma. oxf. engl. , – . kwon, s.b., and ernst, j. ( ). learning a genome-wide score of human-mouse conservation at the functional genomics level. biorxiv . . . . levine, m.e., lu, a.t., quach, a., chen, b.h., assimes, t.l., bandinelli, s., hou, l., baccarelli, a.a., stewart, j.d., li, y., et al. ( ). an epigenetic biomarker of aging for lifespan and healthspan. aging , – . li, x., chen, f., and chen, y. ( ). gcluster: a simple-to-use tool for visualizing and comparing genome contexts for numerous genomes. bioinforma. oxf. engl. , – . lister, r., pelizzola, m., dowen, r.h., hawkins, r.d., hon, g., tonti-filippini, j., nery, j.r., lee, l., ye, z., ngo, q.-m., et al. ( ). human dna methylomes at base resolution show widespread epigenomic differences. nature , – . mclean, c.y., bristor, d., hiller, m., clarke, s.l., schaar, b.t., lowe, c.b., wenger, a.m., and bejerano, g. ( ). great improves functional interpretation of cis-regulatory regions. nat. biotechnol. , – . meissner, a., gnirke, a., bell, g.w., ramsahoye, b., lander, e.s., and jaenisch, r. ( ). reduced representation bisulfite sequencing for comparative high-resolution dna methylation analysis. nucleic acids res. , – . needhamsen, m., ewing, e., lund, h., gomez-cabrero, d., harris, r.a., kular, l., and jagodic, m. ( ). usability of human infinium methylationepic beadchip for mouse dna methylation studies. bmc bioinformatics , .
ooi, s.k.t., qiu, c., bernstein, e., li, k., jia, d., yang, z., erdjument-bromage, h., tempst, p., lin, s.-p., allis, c.d., et al. ( ). dnmt l connects unmethylated lysine of histone h to de novo methylation of dna. nature , – . pidsley, r., zotenko, e., peters, t.j., lawrence, m.g., risbridger, g.p., molloy, p., van djik, s., muhlhausler, b., stirzaker, c., and clark, s.j. ( ). critical evaluation of the illumina methylationepic beadchip microarray for whole-genome dna methylation profiling. genome biol. , . roadmap epigenomics consortium, kundaje, a., meuleman, w., ernst, j., bilenky, m., yen, a., heravi-moussavi, a., kheradpour, p., zhang, z., wang, j., et al. ( ). integrative analysis of reference human epigenomes. nature , – . robertson, k.d. ( ). dna methylation and human disease. nat. rev. genet. , – . rosenbloom, k.r., armstrong, j., barber, g.p., casper, j., clawson, h., diekhans, m., dreszer, t.r., fujita, p.a., guruvadoo, l., haeussler, m., et al. ( ). the ucsc genome browser database: update. nucleic acids res. , d –d . smith, z.d., and meissner, a. ( ). dna methylation: roles in mammalian development. nat. rev. genet. , – . triche, t.j., weisenberger, d.j., van den berg, d., laird, p.w., and siegmund, k.d. ( ). low-level processing of illumina infinium dna methylation beadarrays. nucleic acids res. , e . uhlén, m., fagerberg, l., hallström, b.m., lindskog, c., oksvold, p., mardinoglu, a., sivertsson, Å., kampf, c., sjöstedt, e., asplund, a., et al. ( ). proteomics. tissue-based map of the human proteome. science , . vu, h., and ernst, j. ( ). universal annotation of the human genome through integration of over a thousand epigenomic datasets. biorxiv . . . . yates, a.d., achuthan, p., akanni, w., allen, j., allen, j., alvarez-jarreta, j., amode, m.r., armean, i.m., azov, a.g., bennett, r., et al. ( ). ensembl . nucleic acids res. , d –d . yu, g., wang, l.-g., and he, q.-y. ( ).
chipseeker: an r/bioconductor package for chip peak annotation, comparison and visualization. bioinformatics , – . yue, f., cheng, y., breschi, a., vierstra, j., wu, w., ryba, t., sandstrom, r., ma, z., davis, c., pope, b.d., et al. ( ). a comparative encyclopedia of dna elements in the mouse genome. nature , – . zhou, w., triche, t.j., jr, laird, p.w., and shen, h. ( ). sesame: reducing artifactual detection of dna methylation by infinium beadchips in genomic deletions. nucleic acids res. , e –e .

periodicity in the embryo: emergence of order in space, diffusion of order in time

bradly alicea, ujjwal singh

keywords: periodicity, dynamical systems, c. elegans, zebrafish, developmental biology, modeling and simulation

abstract

does embryonic development exhibit characteristic temporal features? this is quite apparent in evolution, where evolutionary change has been shown to occur in bursts of activity. using two animal models (nematode, caenorhabditis elegans, and zebrafish, danio rerio) and simulated data, we demonstrate that temporal heterogeneity exists in embryogenesis at the cellular level, and may have functional consequences. cell proliferation and division events from cell tracking data are analyzed to characterize specific features in each model species. simulated data are then used to understand what role this variation might play in producing variation in the adult phenotype. this goes beyond a molecular characterization of developmental regulation to provide a quantitative result at the phenotypic scale of complexity.
openworm foundation, boston, ma usa. balicea@openworm.org. orthogonal research and education laboratory, champaign, il usa. iiit delhi, delhi, india.

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc . international license (http://creativecommons.org/licenses/by-nc/ . /).

introduction

while the case for the effects of "tempo and mode" [ ] has been made for the evolutionary process, a similar relationship between phenotypic change, time, and space may also exist in development. one obvious way to investigate this is to examine the expression and sequence variation of genes associated with cell cycle and developmental patterning [ ]. however, there is a potentially more compelling top-down explanation. we will use two model organisms to demonstrate how periodicity becomes less synchronized over developmental time and space. in the case of the nematode caenorhabditis elegans, a comparison of embryonic and postembryonic cells (developmental and terminally-differentiated cell birth times acquired from [ ]) reveals two general patterns. for the zebrafish (danio rerio), comparisons within and between embryogenesis stages based on measurements of cell nuclei in the animal hemisphere [ ] reveal patterns at multiple scales. one of the most notable signatures is burstiness [ , ], or a large number of events occurring in a short period of time. these bursts can be either periodic or aperiodic, and these statistical features define the temporal nature of development, potentially in a universal manner across species.

based on two species and a computational model, we predict that periodic changes in the frequency of new cells over developmental time represent cell proliferation without functional distinction. we also analyze the intervals between bursts in cell division (and cell differentiation in the case of c. elegans). these bursts are derived from both time-series segmentation and decomposition in the frequency domain. we show that these results consistently point to great temporal variation at the cellular level, which may play a role in shaping morphogenesis. in addition, these changes in frequency and periodicity over time result in spatial variation (supplemental figure ). to characterize spatial variation, we utilize embryo networks [ ]. embryo networks are complex networks based on the relative proximity of cells as they divide and migrate during the developmental process. the resulting network topologies provide information not only about spatial variation, but also about cellular interactions and other signaling connections [ , ]. the existence of network structure in the form of modules or regions of dense connectivity can reveal a great deal about the unfolding of lineage trees in time.

returning to the first prediction, we can create computational summaries of cell division events to model the proliferation of cells over time. we call these computational models numeric embryos. numeric embryos can be used to model the distribution of branching events in a lineage tree over time, independent of cell identity or spatial context. approximating this distribution provides us with a periodic time-series that tells us something about the speed of embryogenesis: how quickly can different underlying distributions of cell division produce a phenotype with many undifferentiated cells? the rate at which developmental cells are produced could affect the rate of overall development, as we will see in an example from zebrafish.
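the numeric embryo idea described above can be sketched in a few lines of python. this is a minimal illustration under stated assumptions, not the authors' scilab implementation: the function names (`poisson_sample`, `numeric_embryo`), the poisson mean of 4.0, the regular interval of 5, the 60-unit duration, and the rule that intervals are at least one time unit are all hypothetical choices made for the example.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def numeric_embryo(total_units, next_interval):
    """Return a 0/1 vector with a 1 at each time unit where a division occurs.

    next_interval is a zero-argument callable that returns the gap (in time
    units) until the next division; gaps are forced to be at least 1
    (an assumption of this sketch).
    """
    events = [0] * total_units
    t = max(1, next_interval())
    while t < total_units:
        events[t] = 1  # any division at this time unit is recorded as one
        t += max(1, next_interval())
    return events


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Regular scheme: one division every 5 time units (illustrative value).
regular = numeric_embryo(60, lambda: 5)

# Variable scheme: Poisson-distributed gaps with an illustrative mean of 4.
noisy = numeric_embryo(60, lambda: poisson_sample(4.0, rng))

print("regular divisions:", sum(regular))
print("poisson divisions:", sum(noisy))
```

comparing `sum(regular)` with `sum(noisy)` over the same duration gives a rough sense of how the interval distribution changes the number of division events produced per unit of developmental time.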
finally, we predict that the emergence and subsequent changes in spatiotemporal periodicity at the cellular level lead to regulatory phase transitions. for example, there is a one-to-one correspondence between cell division and waves of differentiation after the syncytial stage in drosophila melanogaster [ ]. in a similar fashion, amphibians exhibit a decay of synchrony of division [ , ] that corresponds to differentiation wave activity [ ]. based on data analysis, modeling, and literature review, we anticipate that further investigation could uncover whether, in regulating embryos, mitosis and cell differentiation are correlated. in interpreting the data, we discuss the potential applicability of holtzer's quantal mitosis hypothesis [ , ] as it relates to the process of differentiation relative to the proliferation of developmental (undifferentiated) cells.

methods

all materials are located on github: https://github.com/orthogonal-research-lab/periodicity-in-the-embryo. this repository includes processed data, supplemental materials, and associated code.

secondary datasets

the c. elegans and d. rerio data sets were acquired from the systems science of biological dynamics database (http://ssbd.qbic.riken.jp/). the c. elegans (nematode) data [ ] is based on cell tracking of the nucleus, pmid: . the d. rerio (zebrafish) data [ ] is likewise based on cell tracking of the nucleus, pmid: . the cell tracking data is used to determine the total number of new cells (cell birth time) present at a particular time step. for the c. elegans data, cell births correspond to minutes of developmental time, and windows of size five ( minutes of developmental time) are used for the time-series plots and histograms. since lineage trees and the nature of developmental
cell identification are different in zebrafish, cell births correspond to the number of observed cells at discrete points in developmental time. windows representing a certain number of cells in the embryo observed at a given sampling point are used instead of directly converting this process to minutes of developmental time.

zebrafish developmental stages

estimates and calculations of d. rerio developmental stages are derived from [ ] and the zfin zebrafish developmental staging series web resource (https://zfin.org/zf_info/zfbook/stages/). where applicable, embryo stages are approximated from the number of cells observed at any given point in developmental time.

peak-finding method

for both the c. elegans and d. rerio data, a peak-finding method is used to evaluate periodicity and to generate data points representing distinct bursts of cell birth. briefly, local peaks in the cell division series are discovered by finding the highest value within an interval of data points around each candidate peak. the data are then visually inspected to ensure that minor fluctuations that form local maxima were not selected. using this segmentation method, we are able to define intervals between peaks in a way that allows the aperiodic regions of our series to be compared to the highly periodic regions. the peak-finding results are supplemented by a fast fourier transform (fft) analysis of cell divisions in c.
elegans embryo (supplemental figure ), cell differentiation events in c. elegans embryo (supplemental figure ), and the time series of cell divisions in zebrafish embryo (supplemental figure ). the power spectra largely confirm the nature of our interval and peak analysis. while the analysis of zebrafish reveals a power spectrum at a single scale, the c. elegans embryo reveals a power spectrum of multiple time scales for both cell divisions and differentiations.

embryo networks

the full methodology for constructing and evaluating embryo networks can be found in [ ]. briefly, embryo networks are complex networks constructed from the locations of cells in an embryo. nodes are centroids representing cell nuclei, and edges represent the spatial (euclidean) distance between cells in a three- (static) or four- (dynamic) dimensional graph. all nuclei are plotted in embryo space, which is a coordinate system normalized to the center point between all cell locations in a complete embryo. for example, an edge of length . represents two centroids at opposite edges of the embryo space. a distance threshold is then applied to the length of each edge: in this paper, a distance threshold of . is used, excluding all but the cell nuclei in very close proximity to each other.

numeric embryo

numeric embryos are statistical summaries of the type of information acquired from our secondary datasets, but in a more generic manner. numeric embryos are based on generated pseudo data and are meant to capture the structure of hypothetical developmental scenarios. all analyses of our pseudo data were conducted using scilab . (paris, france). each numeric embryo consists of one or more vectors describing rounds of cell division in the embryo. briefly, each minute of developmental time is represented by either a zero or a positive non-zero value. for
purposes of temporal comparison, all non-zero values are thresholded to one. to generate cell division intervals of different sizes, we start with a uniform distribution (division events occur every n minutes) and then compare this with a distribution generated using the grand function in scilab. for the poisson distribution, we use 𝜆 = . (except where otherwise noted), while for the binomial distribution, we use parameters n = . and p = . . this produces intervals that are variable over developmental time.

results

our analysis will proceed from c. elegans to zebrafish, to a comparison of the two species, then to a network analysis, and finally to a simulation of cell division in development. first, we plot the developmental cell division dynamics in c. elegans and zebrafish in figures and , respectively, and cell differentiation in c. elegans in figure . we then examine the intervals between cell division events (c. elegans) and relative frequency of birth rates across development (zebrafish) in figures and , respectively. focusing on the peaks (maximum of bursts of cell births) shown in figures and , figure shows the distribution of intervals between peak values for c. elegans and zebrafish. figure helps us extend this finding from temporal dynamics to connectivity between cells and spatial distributions of newly-born cells. we conclude with an investigation of how the intervals found between cell divisions can be modeled using various statistical distributions, as shown in figure .
these simulations (called numeric embryos) can reveal properties related to the speed of development, particularly the linear and nonlinear accumulation of cells.

caenorhabditis elegans example

to understand the temporal nature of cell division and differentiation, we start by looking at patterns in c. elegans development over time. figure shows a time series of such events from zygote to adulthood. we are particularly interested in potential spikes or bursts of events in a short period of time. figure shows the fluctuations in embryonic division (figure , top) and differentiation (figure , bottom) events. differentiation events occurring after minutes of developmental time (postembryonic development) occur in a long series of bursts, likely corresponding to the differentiation of seam cells. this can be contrasted with the burstiness that occurs in embryonic development, which is similar to the burstiness of division events.

figure shows the intervals between cell division events across embryonic development in c. elegans. this plot confirms an exponential distribution with a long tail, presumably representing intervals in postembryonic development. yet this plot is also sparse, yielding only distinct intervals of cell division throughout all of c. elegans development. this is likely due to the deterministic nature of c. elegans development along with the relatively small number of cells. supplemental figures and reveal the power spectrum for cell division and cell differentiation in c. elegans, respectively. to compare, contrast, and understand these trends further, we now turn to the embryonic development of the zebrafish (d. rerio).
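the windowed counts and the intervals between division events discussed for c. elegans can be sketched as follows. this is a minimal illustration, assuming birth times are given in minutes; the `birth_times` values are made up for the example and are not taken from the dataset, and the function names are hypothetical. the window size of five follows the methods.

```python
from collections import Counter


def births_per_window(birth_times, window=5):
    """Count cell births falling in consecutive windows of `window` minutes."""
    counts = Counter(int(t // window) for t in birth_times)
    n_windows = int(max(birth_times) // window) + 1
    return [counts.get(i, 0) for i in range(n_windows)]


def division_intervals(birth_times):
    """Gaps between successive distinct birth times (simultaneous births collapse)."""
    distinct = sorted(set(birth_times))
    return [later - earlier for earlier, later in zip(distinct, distinct[1:])]


# Made-up birth times (minutes); several cells may be born at the same time.
birth_times = [0, 17, 17, 35, 36, 52, 52, 52, 90]

series = births_per_window(birth_times, window=5)
intervals = division_intervals(birth_times)
interval_histogram = Counter(intervals)

print(series)
print(intervals)
```

a sparse `interval_histogram` (few distinct interval values) is what one would expect from a small, deterministic lineage, whereas a broad spread of values would point to more variable timing.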
figure . developmental cell births in the nematode c. elegans. cell divisions occur according to developmental time (minutes). the timeline ranges from fertilized egg (zygote) to adulthood. embryonic division events (blue), differentiation events (red).

zebrafish

in figure (top), we observe six regular bursts of cell division, followed by aperiodic cell division behavior. this transition in periodicity is observed after the embryo reaches cells in size (figure , bottom). we do not observe this transition in c. elegans embryos, which may have to do with the more regulative nature of zebrafish embryogenesis [ ]. changes in periodicity may also have to do with the establishment of spatial differentiation beyond the axial variability observed in c. elegans.

to better understand the nature of periodicity in zebrafish, we examined the distribution of intervals between birth times. figure and supplemental figure confirm the bursty nature of cell division in zebrafish, in that most sampling time points feature only a few cell births, while a small number of sampling time points represent a large number of cells born. for example, a large majority of sampling time points feature fewer than new cells per time point. by contrast, there are also single sampling points where over cells are born at a single time. in terms of the
power spectrum shown in supplemental figure , there is a very high amplitude at very low frequencies, perhaps related to the significant noise and aperiodicity in the later part of the time-series shown in figure .

figure . the interval between cell division events across embryonic development in c. elegans.

considering the cell divisions for the first period of zebrafish embryogenesis, we conduct an interval analysis for each oscillation of the data shown in figure for c. elegans (top) and d. rerio (bottom). these are measured from peak to peak as described in the methods. for the c. elegans data (figure , top), our analysis yields a roughly unimodal distribution, with a mean peak interval of - minutes. in pre-hatch c. elegans embryogenesis, there are many quick bursts of cell division, as confirmed in figure (top). this results in bursty behavior that is regular and perhaps even periodic. by contrast, an analysis of our zebrafish data yields three interval groups (figure , bottom): the greatest number of oscillations occurs at a period of - minutes, while a smaller number of oscillations occur with periods from - . there is also a longer -minute interval between oscillations. this is consistent with the shift from periodic bursts to aperiodic but still bursty behavior later in zebrafish development shown in figure . this multimodal distribution of peaks points to a more complex process at play, something that might be better understood by investigating morphogenesis as a spatial process.

embryo networks: an example from zebrafish

another way to identify the consequences of bursts in cell division timing and other non-uniform temporal phenomena is to utilize embryo networks. an embryo network was constructed (figure , top) for cells born during our sampling time
points of d. rerio embryogenesis. the resulting circular graph demonstrates a high degree of modularity, but only across part of the graph.

figure . cell births in zebrafish embryos during embryogenesis up to the gastrula stage. instead of developmental time, relative developmental progress is plotted as all cells observed in the embryo at each sampling time point. for figure , bottom: periodic region (red), aperiodic region (unshaded).

a three-dimensional plot (figure , bottom) demonstrating the position of each cell born during these stages of development shows that the highest degrees of connectivity are clustered in the center of the embryo, while cells that are disconnected based on our connectivity threshold exist on the edges of the embryo. importantly, it appears that cells are more densely clustered toward the center of the embryo at the earliest stages of development. these dense clusters are likely the product of cell division fluctuations shown in figures and .

figure . relative frequency of birth rate across developmental time in d. rerio. histogram demonstrates the distribution of cells born during a single sampling time point.
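the embryo network construction used above (centroids as nodes, euclidean distance between nuclei, and a distance threshold that keeps only nearby pairs) can be sketched as follows. this is a minimal sketch, not the published method: the normalization here (centre on the mean position, scale so the largest pairwise distance equals 1) is an assumption standing in for the embryo-space coordinate system, and the function names, example coordinates, and threshold of 0.1 are hypothetical.

```python
import math
from itertools import combinations


def normalize(points):
    """Centre 3-D points on their mean and scale the largest pairwise distance to 1.

    Assumes at least two distinct points.
    """
    n = len(points)
    centre = [sum(p[axis] for p in points) / n for axis in range(3)]
    centred = [tuple(p[axis] - centre[axis] for axis in range(3)) for p in points]
    span = max(math.dist(a, b) for a, b in combinations(centred, 2))
    return [tuple(coord / span for coord in p) for p in centred]


def embryo_network(points, threshold):
    """Return (edges, degree): pairs of nuclei closer than `threshold`, and node degrees."""
    coords = normalize(points)
    edges = [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(coords), 2)
        if math.dist(a, b) <= threshold
    ]
    degree = [0] * len(coords)
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return edges, degree


# Made-up nucleus centroids: three close together and one far away.
nuclei = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10)]
edges, degree = embryo_network(nuclei, threshold=0.1)

print(edges)
print(degree)
```

nodes with a degree of zero correspond to cells that are disconnected under the chosen threshold, while high-degree nodes mark dense clusters of nuclei.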
numeric embryo experiments

a numeric embryo (or perhaps more accurately a numeric one) allows us to understand the fundamental features of cell division events relative to the efficiency of their timing. is one timing scheme superior to another? we know that in real (biological) lineage trees cell divisions do not occur at a completely regular rate. are there advantages in one particular statistical signature over another, particularly when comparing it to an artificial (regular) scheme? table shows a summary of how this simulation is constructed.

table . an example of our numeric simulation, with variable and sample values. we use the uniform distribution as the basis for poisson noise, which helps to execute things a bit faster on average. compare this to uniform division times such as a division event occurring once every units of time. generated poisson interval represents the size of the interval between division events, while division interval represents when the event occurs in developmental time.

table fields: developmental time unit; division time (au); generated poisson interval; division interval.

our timing data can be modeled as branches of a binary tree which are generated every n units of developmental time. the intervals between $n_1, n_2, n_3, \ldots, n_t$ are determined by a probability distribution, which can be uniform (every branching event occurring at completely regular intervals), or a poisson distribution (where branching events are distributed in an exponential fashion).

figure . interval size of peaks in cell division for all developmental cells in c.
elegans (top) and the first minutes of zebrafish (bottom). c. elegans sampling time points correspond to most of the pre-hatch developmental period ( minutes post-fertilization), while the zebrafish sampling time points correspond roughly to the period between the zygote and the oblong/sphere stages of the blastula.

figure . top: an embryo network for the d. rerio embryo at the cell stage (all cells born during the zygote and cleavage stages), with edges. the edge threshold is an embryo distance of . . bottom: cells in developmental location color-coded by status in the network. white: all cells not above the threshold, red: all source cells with at least one edge to another cell. blue: all destination cells with at least one edge to another cell. red and blue are equivocal. black: all cells with more than eight edges to other cells.

the graphs in figure tell us that by modeling division events using a poisson distribution, we can achieve the same number of divisions in fewer developmental time units. figure (top) shows a uniform distribution of division events, while figure (bottom) shows the uniform case as compared to other distributions (exponential, poisson, and binomial).
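the construction described for the numeric embryo (draw an interval for each division, then accumulate the intervals into division times) can be sketched in a few lines of python. this is a minimal illustration under stated assumptions, not the simulation used to produce the figures: the function names, the regular step of 3 time units, the poisson mean of 3.0, and the seed are all hypothetical choices, and the poisson draw uses knuth's multiplication method so that only the standard library is needed.

```python
import math
import random
from itertools import accumulate


def regular_intervals(n_events, step):
    """Intervals for the regular scheme: one division every `step` time units."""
    return [step] * n_events


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) integer using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def poisson_intervals(n_events, lam, seed=0):
    """Intervals between divisions, drawn independently from Poisson(lam).

    `lam` and `seed` are illustrative parameters, not values from the paper.
    """
    rng = random.Random(seed)
    return [poisson_draw(lam, rng) for _ in range(n_events)]


def division_times(intervals):
    """Cumulative sum of intervals: when each division occurs in developmental time."""
    return list(accumulate(intervals))


if __name__ == "__main__":
    n = 10  # hypothetical number of division events
    regular = division_times(regular_intervals(n, step=3))
    noisy = division_times(poisson_intervals(n, lam=3.0, seed=1))
    print("regular scheme:", regular)
    print("poisson scheme:", noisy)
```

the last entry of each list is the developmental time needed to reach n divisions under that scheme, which is the quantity the cumulative plots compare. a poisson-drawn interval can be zero, in which case two divisions fall on the same time point.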
the poisson distribution yields the “fastest” time relative to the number of divisions produced. by contrast, the binomial distribution yields the lowest number of divisions (hence is the slowest method examined). however, none of these methods produce orders-of-magnitude differences in division rate, which is what would be expected from a bursty signature.

discussion

in this paper, we examine the periodicity of cell proliferation and division using three model systems: zebrafish (danio rerio), nematode (caenorhabditis elegans), and a simulated embryo. when we refer to periodicity in development, we mean events that recur over time: regular pulses of cell proliferation events within a short period of time. this leads us to propose a principle of development based on timing. there can also be a spatial component of developmental periodicity. this includes signatures of time-independent spatial periodicity such as tilings and other repeatable patterns across space.

interpretation of figures

we interpret figures and in a number of ways. the first is by looking at components of variation over time. we measure this in terms of the interval between cell birth times in c. elegans (figure ) and the frequency of cell birth rates in zebrafish (figure ). we also focus on intervals between other features in the time-series such as peaks for both species in figure . in investigating peak intervals, we discover a similar distribution of cell division events between species in figures and , but a difference between species when looking at specific time-series features (figure ). the reason for this is clear: features such as peaks (magnitude) have a different underlying mechanism than events such as cell division. while both are linked to the lineage tree, magnitude differences are linked to the synchronization of cell division due to deterministic timing.
with deterministic timing, synchronized cell divisions produce a lot of cells at any one point in developmental time, but little fluctuation between time points. in the case of stochastic timing, a lot of cells can be produced with a great degree of fluctuation between time points.

there are a number of ways to interpret the embryo network and -d plot shown in figure . one interpretation is that in zebrafish, the phenotype is built from the inside out, with densely-packed cells representing fledgling anatomical structures such as the notochord and heart. these clusters may be linked to rounds of cell division (occurring in temporal bursts), while cell divisions occurring during the inter-burst intervals may contribute to cells at the outer edge of the embryo, perhaps representing the ectoderm layer [ , ]. in this way, temporal bursts of cell division lead to a spatial hierarchy of cell differentiation.

figure . comparison of cumulative cell division events and the speed of division generated by a numeric embryo. top: uniform only (blue). bottom: uniform (blue), exponential (orange), poisson (gray), and binomial (yellow).

this spatial hierarchy involves a number of evolutionary and biophysical constraints that have been demonstrated in a number of experimental settings. for example, physical confinement affects the overall axial alignment and geometry of an embryo [ ]. this includes our zebrafish embryo network. other types of fishes (astyanax, see [ ]) exhibit morphological changes in neural crest cell proliferation based on evolutionary changes due to ecological constraints.
in c. elegans, asymmetrical cells (or daughter cells with significantly different volumes) result from physical constraints and make up % of c. elegans developmental cell divisions [ , ]. asymmetric cell divisions set up key cell-cell interactions [ ] that are highlighted by the edges of embryo networks. finally, by comparing nematic alignment of liquid crystals to spindles of mitotic cells, phase transitions in actively dividing cells are found to result from the timing of centrosome separation [ ].

figure provides an introduction to the numeric embryo concept. in this figure, we focus exclusively on the timing component of lineage trees. this is essentially a version of the developmental time series shown for zebrafish and c. elegans, but with the temporal fluctuations smoothed out. these fluctuations are replaced with a cumulative sum of all cell division events occurring over a certain period of time. it is also apparent that comparisons between different distributions do not yield an appreciable difference in developmental speed (or the accumulation of x cells over a certain period of time). in figure , all simulations were run for iterations.

to explore the potential of the poisson distribution further, we examine how this distribution approximates cumulative cell division (as was done in figure ) for three values of λ ( . , . , and . ). the results of this experiment are shown in supplemental figure . as this parameter value is increased, the number of cells per developmental time point increases while the interval between cell divisions decreases.
while the function derived from λ = . is always slowest, the functions derived from λ = . and λ = . are similar for the first timepoints, then diverge to reveal that λ = . clearly results in both faster cell divisions and a larger number of total cells after iterations.

broader questions

we can ask what it means when embryogenetic systems exhibit multiple pulses of cell proliferation from division events. in particular, the intervals between pulses provide information about the generative mechanisms behind production of the embryo. our inquiry is well suited to quantitative interpretation, particularly in terms of characterizing "bursty" behaviors. these bursty behaviors are non-normally distributed generative processes [ ] that describe the tempo and mode of development. while tempo and mode is generally an evolutionary phenomenon, these concepts also yield a model of developmental regulation that is explicitly temporal.

our results also suggest that developmental regulation is not simply a molecular mechanism. our network analysis demonstrates a connection between the spatiotemporal dynamics of cell division, cell differentiation, and a systems-level view of timing. for example, we have found that the structure and timing of interactions shape embryo network coherence signaling [ ], which in turn is an indicator of diffusion between developmental cells that share network connections. while it is not discussed in this paper, gene expression fluctuations and stochastic noise in gene expression drive heterogeneity in division timing and even timing of differentiation [ , ]. in particular, a focus on the molecular biology of the cell cycle across groups of developmental cells [ , ] can provide more information about how fluctuations work in general at the single-cell level. yet single cells acting in synchrony (or in the aggregate) define the patterns observed in our empirical data.
one way to generalize our results to a broader cross-species context is to examine related phenomena such as mitotic bookmarking [ ], in which heritable regulatory information is transmitted from mother to daughter cells in a cell lineage. our approach is also quite valuable [see ] for understanding this particular scale of the biological organism.

to understand these results more fully in the context of groups of cells producing mean behaviors, we can appeal to the quantal mitosis hypothesis. quantal mitosis involves changes in gene expression, in which cell fate depends upon mitosis. this is also a gene expression-related memory mechanism that is widespread in development [ ]. in cases of an observed wave or peak in cell divisions at a certain point in developmental time, mitosis provides an opportunity to change gene expression [ ], and ultimately serves as a collective signal for changes in cell fate [ ]. finally, the way in which we decompose the spatiotemporal dynamics of the embryo might be useful as a supplement to reaction-diffusion models of morphogenesis [ ]. future work will involve extending this type of analysis to other species, in addition to developing our numerical models to include explicitly spatial phenomena.

acknowledgements

we would like to thank members of the devoworm group for their support and feedback, particularly susan crawford-young. thanks also go to the openworm foundation for their institutional support.

supplemental figures

supplemental figure . example of an embryo network from the -cell c. elegans embryo built using cell tracking data.
data are shown in the context of a cartoon showing the anterior end of the embryo. different colored edges represent cells born at different generations of the lineage tree (levels).

supplemental figure . frequency-domain plot of cell division event frequencies in the c. elegans embryo. all events greater than an amplitude of are shown in red, while all events greater than an amplitude of are shown in blue.

supplemental figure . frequency-domain plot of cell differentiation event frequencies in the c. elegans embryo. all events greater than an amplitude of are shown in red, while all events greater than an amplitude of are shown in blue.

supplemental figure . frequency-domain plot of cell division event frequencies in the zebrafish embryo.

supplemental figure . comparison of cumulative cell division events and the speed of division generated by a numeric embryo for the poisson distribution at three different values of λ. blue: λ = . , black: λ = . , red: λ = . .
references

[ ] simpson, g.g. ( ). tempo and mode in evolution. columbia university press, new york.
[ ] ogura, y. & sasakura, y. ( ). developmental control of cell-cycle compensation provides a switch for patterned mitosis at the onset of chordate neurulation. developmental cell, ( ), p - . doi: . /j.devcel. . .
[ ] bhatla, n. ( ). an interactive visualization of the c. elegans cell lineage. wormweb, wormweb.org/celllineage
[ ] keller, p.j., schmidt, a.d., wittbrodt, j., & stelzer, e.h.k. ( ). reconstruction of zebrafish early embryonic development by scanned light sheet microscopy. science, ( ), - . doi: . /science. .
[ ] barabasi, a.l. ( ). the origin of bursts and heavy tails in human dynamics. nature, ( ), - .
[ ] abney, d.h., dale, r., louwerse, m.m., and kello, c.t. ( ). the bursts and lulls of multimodal interaction. cognitive science, ( ), - .
[ ] alicea, b. and gordon, r. ( ). cell differentiation processes as spatial networks: identifying four-dimensional structure in embryogenesis. biosystems, , - .
[ ] alicea, b. ( ). the emergent connectome in caenorhabditis elegans embryogenesis. biosystems, , - .
[ ] alicea, b. ( ). raising the connectome: the emergence of neuronal activity and behavior in c. elegans. frontiers in cellular neuroscience, doi: . /fncel. . .
[ ] foe, v.e. & alberts, b.m. ( ). studies of nuclear and cytoplasmic behaviour during the five mitotic cycles that precede gastrulation in drosophila embryogenesis. journal of cell science, , - .
[ ] boterenbrood, e.c., narraway, j.m. & hara, k. ( ). duration of cleavage cycles and asymmetry in the direction of cleavage waves prior to gastrulation in xenopus laevis. roux's archives of developmental biology, ( ), - .
[ ] boterenbrood, e.c. & narraway, j.m. ( ).
the direction of cleavage waves and the regional variation in the duration of cleavage cycles on the dorsal side of the xenopus laevis blastula. roux's archives of developmental biology, , - .
[ ] gordon, n.k. & gordon, r. ( ). embryogenesis explained. world scientific publishing, singapore.
[ ] holtzer, h., rubinstein, n., fellini, s., yeoh, g., chi, j., birnbaum, j. & okayama, m. ( ). lineages, quantal cell cycles, and the generation of cell diversity. quarterly reviews of biophysics, ( ), - .
[ ] holtzer, h., biehl, j., antin, p., tokunaka, s., sasse, j., pacifici, m. & holtzer, s. ( ). quantal and proliferative cell cycles: how lineages generate cell diversity and maintain fidelity. progress in clinical and biological research, , - .
[ ] bao, z., murray, j.i., boyle, t., ooi, s.l., sandel, m.j., and waterston, r.h. ( ). automated cell lineage tracing in caenorhabditis elegans. pnas, ( ), - .
[ ] keller, p.j., schmidt, a.d., wittbrodt, j., and stelzer, e.h.k. ( ). reconstruction of zebrafish early embryonic development by scanned light sheet microscopy. science, ( ), - .
[ ] kimmel, c.b., ballard, w.w., kimmel, s.r., ullmann, b., and schilling, t.f. ( ). stages of embryonic development of the zebrafish. developmental dynamics, , - .
[ ] raible, d.w. and eisen, j.s. ( ). regulative interactions in zebrafish neural crest. development, , - .
[ ] menon, t., borbora, a.s., kumar, r., and nair, s. ( ). dynamic optima in cell sizes during early development enable normal gastrulation in zebrafish embryos. developmental biology, ( - ), - .
[ ] shah, g., thierbach, k., schmid, b., waschke, j., reade, a., hlawitschka, m., roeder, i., scherf, n., and huisken, j. ( ). multi-scale imaging and analysis identify pan-embryo cell dynamics of germ layer formation in zebrafish. nature communications, , .
[ ] desmaison, a., guillaume, l., triclin, s., weiss, p., ducommun, b., and lobjois, v. ( ). impact of physical confinement on nuclei geometry and cell division dynamics in d spheroids. scientific reports, , . doi: . /s - - - .
[ ] yoshizawa, m., hixon, e., and jeffery, w.r. ( ). neural crest transplantation reveals key roles in the evolution of cavefish development. integrative and comparative biology, ( ), - .
[ ] fickentscher, r. and weiss, m. ( ). physical determinants of asymmetric cell divisions in the early development of caenorhabditis elegans. scientific reports, , . doi: . /s - - - .
[ ] alicea, b. and gordon, r. ( ). quantifying mosaic development: towards an evo-devo postmodern synthesis of the evolution of development via differentiation trees of embryos [invited]. biology, ( ), .
[ ] leoni, m., manyuhina, o.v., bowick, m.j., and marchetti, m.c. ( ). defect driven shapes in nematic droplets: analogies with cell division. soft matter, , - . doi: . /c sm f
[ ] bono, r., blanca, m.j., arnau, j., and gómez-benito, j. ( ). non-normal distributions commonly used in health, education, and social sciences: a systematic review. frontiers in psychology, , . doi: . /fpsyg. .
[ ] akbarpour, m. and jackson, m. ( ). diffusion in networks and the virtue of burstiness. pnas, ( ), e -e .
[ ] ben-moshe, s. and itzkovitz, s. ( ). bursting through the cell cycle. elife, , e .
[ ] wang, h., yuan, z., liu, p., and zhou, t. ( ). division time-based amplifiers for stochastic gene expression. molecular biosystems, ( ), - . doi: . /c mb a.
[ ] csikasz-nagy, a. ( ). computational systems biology of the cell cycle. briefings in bioinformatics, ( ), - . doi: . /bib/bbp .
[ ] dangarh, p., pandey, n., vinod, p.k. ( ). modeling the control of meiotic cell divisions: entry, progression, and exit. biophysical journal, ( ), - . doi: . /j.bpj. . . .
[ ] festuccia, n., gonzalez, i., owens, n., and navarro, p. ( ). mitotic bookmarking in development and stem cells. development, , - .
[ ] alfieri, r., merelli, i., mosca, e., and milanesi, l. ( ). a data integration approach for cell cycle analysis oriented to model simulation in systems biology. bmc systems biology, , . doi: . / - - - .
[ ] halley-stott, r.p., jullien, j., pasque, v., and gurdon, j. ( ). mitosis gives a brief window of opportunity for a change in gene transcription. plos biology, ( ), e . https://doi.org/ . /journal.pbio.
[ ] perez-carrasco, r., beentjes, c. and grima, r. ( ). effects of cell cycle variability on lineage and population measurements of messenger rna abundance. journal of the royal society interface, .
[ ] green, j.b.a. and sharpe, j. ( ). positional information and reaction-diffusion: two big ideas in developmental biology combine. development, , - ; doi: . /dev. .
fibrinolysis influences sars-cov- infection in ciliated cells

yapeng hou , yan ding , hongguang nie , *, hong-long ji

department of stem cells and regenerative medicine, college of basic medical science, china medical university, shenyang, liaoning , china.
department of cellular and molecular biology, university of texas health science center at tyler, tyler, tx , usa.

*address correspondence to hgnie@cmu.edu.cn

(which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . .

abstract

rapid spread of covid- has caused an unprecedented pandemic worldwide, and an inserted furin site in sars-cov- spike protein (s) may account for increased transmissibility. plasmin, and other host proteases, may cleave the furin site of sars-cov- s protein and subunits of epithelial sodium channels (enac), resulting in an increase in virus infectivity and channel activity.
given the importance of enac in the regulation of airway surface and alveolar fluid homeostasis, whether sars-cov- will share and strengthen the cleavage network with enac proteins at the single-cell level warrants urgent consideration. to address this issue, we analyzed single-cell rna sequencing (scrna-seq) datasets, and found that plau (encoding urokinase plasminogen activator), scnn g (enac), and ace (sars-cov- receptor) were co-expressed in alveolar epithelial, basal, club, and ciliated epithelial cells. the relative expression levels of plau, tmprss , and ace were significantly upregulated in severe covid- patients and sars-cov- infected cell lines, as determined using the seurat and deseq r packages. moreover, the increments in plau, furin, tmprss , and ace were predominantly observed in different epithelial cells and leukocytes. accordingly, sars-cov- may share and strengthen the enac fibrinolytic proteases network in ace positive airway and alveolar epithelial cells, which may expedite virus entry into the susceptible cells and bring about enac associated edematous respiratory condition.

keywords: sars-cov- ; plasmin; enac; covid- ; furin

introduction

sars-cov- infection leads to covid- , with pathogenesis and clinical features similar to those of sars, and the virus shares the same receptor, angiotensin-converting enzyme (ace ), with sars-cov to enter host cells (zhou et al. , li and zheng ). by comparison, the transmission ability of sars-cov- is much stronger than that of sars-cov, owing to its different affinity for ace (wrapp and wang ). the fusion capacity of coronavirus via the spike protein (s protein) determines infectivity (wrapp and wang , kam et al. b).
highly virulent avian and human influenza viruses bearing a furin site (rxxr) in the haemagglutinin have been described (coutard et al. ). cleavage of the furin site enhances the entry ability of ebola, hiv, and influenza viruses into host cells (claas et al. ). consisting of receptor-binding (s ) and fusion domains (s ), coronavirus s protein needs to be primed through cleavage at the s /s site and the s ’ site for membrane fusion (jaimes et al. , huggins ). the newly inserted furin site in sars-cov- s protein significantly facilitated membrane fusion, leading to enhanced virulence and infectivity (xia et al. , wang, qiu, et al. ). plasmin, which is upregulated in the populations vulnerable to covid- (ji et al. ), cleaves the furin site in sars-cov s protein (kam et al. b). however, whether plasmin cleaves the newly inserted furin site in the sars-cov- s protein remains obscure.

plasmin cleaves the furin site of human subunit of epithelial sodium channels (enac) as demonstrated by lc-ms and functional assays (zhao, ali, and nie , sheng et al. ). very recently, it has been proposed that the global pandemic of covid- may partially be driven by the targeted mimicry of enac α subunit by sars-cov- (gentzsch and rossier , muhanna et al. ). enac are located at the apical side of the airway and alveolar cells, acting as a critical system to maintain airway surface and alveolar fluid homeostasis (ji et al. , matalon, bartoszewski, and collawn ). the luminal fluid is required for keeping normal ciliary beating to expel inhaled pathogens, allergens, and pollutants and for migration of immune cells that release pro-inflammatory cytokines and chemokines (hou et al. a). the plasmin family and ace are expressed in the respiratory epithelium (nie et al. , hanukoglu and hanukoglu , kam et al. a). however, whether the plasmin system and enac are involved in the fusion of sars-cov- into host cells is unknown.
this study aims to determine whether plau, scnn g, and ace are co-expressed in the airway and lung epithelial cells and whether sars-cov- infection alters their expression at the single-cell level. we found that these genes, especially plau, were significantly upregulated in epithelial cells of severe/moderate covid- patients and sars-cov- infected cell lines, mainly owing to ciliated cells. we conclude that the most susceptible cells for sars-cov- infection could be the ones co-expressing these genes and sharing plasmin-mediated cleavage.

results

furin sites are identified in both virus and host enac proteins

a furin site was located at the s proteins of sars-cov- from arginine- to serine- (rrar|s), and a similar site was also seen in the s protein of hcov-oc , mers, and hcov-hku coronavirus (fig. a). in addition, the highly conserved rxxr motif existed in the hemagglutinin protein of influenza h n , herpes, ebola, hiv, dengue, hepatitis b, west nile, marburg, zika, epstein-barr, and respiratory syncytial virus (rsv). the furin site (rkrr|e) was found in the gating relief of inhibition by proteolysis (grip) domain of the extracellular loop of the mouse, rat, and human enac (fig. b). the similarity of these furin sites is - %.

respiratory cells co-express plau, scnn g, and ace

to identify subpopulations of cells co-expressing plau, scnn g, and ace , we analyzed scrna-seq datasets using the nferx scrna-seq platform (https://academia.nferx.com/) (supplementary table ). all three genes were co-expressed in the following cells ranked by the expression level of plau from high to low: club cells, goblets, basal cells, at cells, ciliated cells, fibroblasts, mucous cells, deuterosomal cells, and at cells (fig.
c), which is supported by previous studies (sungnak et al. , wang et al. , hanukoglu and hanukoglu ). these results suggest that the cell populations co-expressing plau-enac-ace may be more susceptible to sars-cov- infection than others. in addition, the top ten ranked cell sub-populations expressing plau, scnn g, or ace alone are listed in supplementary table .

to compare the transcripts of the proteases in different lung epithelial cells, we analyzed the lung dataset from gene expression omnibus (geo) with seurat, and the cells were annotated by their specific markers (supplementary fig. a). the data showed that all these proteases were expressed in at cells, including plau, furin, prss (trypsin), elane (elastase), prtn (myeloblastin), cela (elastase- ), cela a (elastase- a), ctrc (chymotrypsin-c), tmprss (transmembrane protease serine ), and tmprss (transmembrane protease serine ) (supplementary fig. b). in at cells, the protease expression levels in descending order are: tmprss > furin > tmprss > plau > cela > elane > prss > prtn > ctrc > cela a. for plau, the high to low order is basal > club > ciliated > at > at .

the expression levels of proteases (plau, furin, tmprss , plg), ace , and scnn g in cell types co-expressing ace , scnn g, and plau were compared in fig. . the club cells showed the highest expression level of plau, and ace , scnn g, tmprss , furin, and plg showed a higher expression level in club cells compared with other cell types. of note, ciliated cells ranked second and seventh in expression of plau and ace , respectively.
expression levels of plau, scnn g, and ace in sars-cov- infection

to detect potential changes in the cell populations that co-express plau, scnn g, and ace , we analyzed the scrna-seq datasets of bronchoalveolar lavage fluid (balf) cells, which are mainly composed of epithelial cells and leukocytes. three groups were studied: healthy controls, moderate covid- patients, and severe covid- patients. the expression level and the percentage of total cells expressing plau and furin were significantly upregulated in the severe group compared with controls (p < . ), and the expression levels of ace , tmprss , scnn g, and plg were also slightly upregulated (fig. a and b). the expression levels of plau, furin, tmprss , and ace and the number of cells are profiled in fig. a. the data showed that these genes were upregulated in covid- patients, and the number of cells expressing these upregulated genes increased in a largely severity-dependent manner. plau was significantly elevated in the severe group (p < . ), and the other genes also showed an increasing trend (fig. b). the increases in plau (alveolar epithelial, basal, and ciliated cells), plg (basal cells), furin (alveolar epithelial, basal, and ciliated cells), tmprss (basal and ciliated cells), scnn g (alveolar epithelial and basal cells), and ace (alveolar epithelial, basal, and club cells) were predominantly observed in different cell types. in particular, a significant increase in plau expression was seen in ciliated cells, while the expression of the measured genes declined in goblet cells of covid- patients (fig. c). in addition, similar changes of these genes in leukocytes are shown in supplementary fig. . to corroborate the results in covid- patients, we analyzed bulk-seq data of human respiratory epithelial cell lines infected with sars-cov- : a , calu- , and nhbe (blanco-melo et al. ).
plau transcript was significantly upregulated in all three cell lines after sars-cov- infection (multiplicity of infection = ) (fig. , p < . ). however, tmprss was only upregulated in infected calu- cells (p < . ), as also shown in a recent study (xu et al. ). similar to sars and mers, sars-cov- infection also increased the expression level of ace in a cells (p < . ) (smith et al. ). although sars-cov- did not change the mrna level of scnn g significantly in these cell lines as that for influenza virus, researchers should pay more attention to the post-translational modification of enac (hou et al. b).

discussion

the novel coronavirus, sars-cov- , was identified as the causative agent for a series of atypical respiratory diseases, and the disease termed covid- was officially declared a pandemic by the world health organization on march , (pollard, morran, and nestor-kalinoski ). sars-cov- has a great impact on human health all over the world, and its virulence and pathogenicity may be related to the inserted furin site. whilst the sars-cov- s ’ cleavage site has a similar sequence motif to sars-cov and would thus be suitable for cleavage by trypsin-like proteases, insertions of additional arginine residues at the sars-cov- s /s site (rrar|s) clearly generate a furin cleavage site (zhou et al. ). interestingly, this difference has been implicated in the viral transmissibility of sars-cov- (anand et al. ). our data support the finding that furin sites (rrar|s) exist not only in human viruses but also in the -subunit of enac, which is highly expressed in alveolar epithelial cells and is a substrate for cleavage by plasmin.
plasmin has also been reported to cleave the furin site in viral envelope proteins and thereby enhance the virulence and pathogenicity of viruses (sidarta-oliveira et al. ). sars-cov- has evolved a unique s /s cleavage site, absent in any previous coronavirus sequenced, resulting in the striking mimicry of an identical furin-cleavable peptide on αenac, a protein critical for the homeostasis of airway surface liquid (anand et al. ). taken together, these findings indicate that sars-cov- infection will hijack the enac proteolytic network, which is associated with the edematous respiratory condition (fig. ) (chen et al. , zhao, ali, and nie ). our data showed that the respiratory cells co-expressing the sars-cov- receptor, enac (scnn g), and plasmin family members were mainly alveolar type Ⅰ/Ⅱ, basal, club, and ciliated cells. plg (plasminogen) expression in different cell types is not shown because it is too low to be detected in many lung scrna-seq datasets. of note, ciliated cells are the predominant contributor to plau upregulation in severe covid- patients. as expected, plau levels, as well as tmprss , are upregulated in respiratory epithelial cell lines after sars-cov- infection, supporting the idea that sars-cov- can facilitate ace -mediated viral entry via tmprss spike glycoprotein priming (roberts et al. ). enhanced plau expression induced by sars-cov- infection will activate plasminogen, which may ease sars-cov- invasion by cleaving the s protein. the scrna-seq data of bronchoalveolar lavage fluid cells from covid- patients do not show a difference in the expression of scnn g (enac), which is considered to be regulated by plasmin through proteolytic hydrolysis. enac activity is determined not only by mrna/protein expression but also by cell proteases. once enac is biosynthesized and trafficked to the golgi, it is likely to be modified by an intracellular protease (furin).
after insertion into the plasma membrane, enac will encounter the opportunity for full proteolytic activation of the channel by extracellular proteases (elastase, plasmin, chymotrypsin, and trypsin) (thibodeau and butterworth ). intriguingly, the plg gene also did not show a difference between covid- patients and healthy controls, indicating that hyperfibrinolysis in covid- patients may be induced by enhanced urokinase (ji et al. ). additional analysis of clinical studies or animal models is urgently needed to further explore the relationship between plasmin, enac, and sars-cov- receptors at the protein level. an increased incidence of thrombotic events has been reported in covid- , and tissue plasminogen activator (tpa) has been tried to treat stroke in covid- patients (vinayagam and sattu ). we did not analyze the changes of plat in balf cells of covid- patients because tpa (plat) is generally expressed in endothelial cells. similarly, the beneficial effects of plasmin on alveolar fluid clearance and novel mechanisms underlying the cleavage of human enacs at multiple sites by plasmin have been described in our recent studies (zhao, ali, and nie ). new drugs that regulate the upa/upa receptor (upar) system have been demonstrated to help treat the severe complications of pandemic covid- (d'alonzo, de fenza, and pavone ). amiloride, a prototypic inhibitor of enac, can be an ideal candidate for covid- patients, supporting the view that enac is a downstream target of plasmin and is involved in luminal fluid absorption in sars-cov- infection (adil, narayanan, and somanath ).
considering the two diametrically different therapeutic regimes in practice to address the complicated coagulopathic changes in covid- , fibrinolytic (alteplase, tpa) (bona et al. , ly et al. , wang, hajizadeh, et al. , barrett et al. , christie et al. , papamichalis et al. , poor et al. , arachchillage et al. ) and antifibrinolytic therapies (nafamostat and tranexamic acid) (asakura and ogawa , doi et al. , thierry ), our data provide new and comprehensive information on fibrinolysis-related therapy targeting plasmin(ogen) as a promising approach to combat covid- .

methods

alignment of furin sites in viral and enac proteins

the sequences of enac proteins (rat, mouse, and human) and human viruses were acquired from uniprot (https://www.uniprot.org/). the accession numbers were p dtc (for sars-cov- ), p (hiv), p (h n ), a a g xeb (ebola), a a ayz (mers), p (epstein-barr), p (herpes), p (dengue), p (hepatitis), q q p (west nile), a a b w (zika), p (respiratory syncytial virus), p (marburg), p (hcov-oc ), a a h h (hcov-hku ), p (human enac), q wu (mouse enac), and p (rat enac). alignment was performed using the jalview software (version: . . . ). the d structures of sars-cov- s (pdb id: x a) and enac (pdb id: bqn) were downloaded from the protein data bank (http://www.rcsb.org/) and modified.

co-expression profiles of enac, ace , and proteases

we performed a systematic expression profiling of ace and enac across published human single-cell rna sequencing (scrna-seq) studies comprising ~ . million cells using the nferx single-cell platform (https://academia.nferx.com/) (anand et al. ). the mean expression of plau, scnn g, and ace in a given cell population (mean cp k) was z-score normalized (to ensure standard deviation = and mean ~ for all the genes) to obtain relative expression profiles across all the samples. the expression of plau,
scnn g, and ace in the respiratory system was analyzed and graphed as heatmaps using the r package pheatmap.

acquisition, filtering, and processing of scrna-seq data

the datasets downloaded from the gene expression omnibus were filtered for integration. the lung scrna-seq dataset ( healthy controls in gse ) was filtered by total number of reads (nreads > , ), number of detected genes ( < ngenes < , ), and mitochondrial percentage (mito.pc < . ). the balf scrna-seq dataset was composed of healthy controls and moderate and severe covid- patients in gse , and healthy control in gsm . these datasets were filtered by total number of reads (nreads > , ), number of detected genes ( < ngenes < , ), and mitochondrial percentage (mito.pc < . ). finally, a filtered gene-barcode matrix of all samples was integrated with seurat v to remove batch effects across different donors as described previously (stuart et al. ).

dimensionality reduction and clustering

the filtered gene-barcode matrix was first normalized using the ‘lognormalize’ method in seurat v. with default parameters. the top , variable genes were then identified using the ‘vst’ method in the seurat findvariablefeatures function. principal component analysis (pca) was performed using the top , variable genes. then uniform manifold approximation and projection for dimension reduction (umap) or t-distributed stochastic neighbor embedding (tsne) was performed on the top principal components for visualizing the epithelial cells. meanwhile, graph-based clustering was performed on the pca-reduced data with seurat v. . the resolution was set to . and . for the lung and balf datasets, respectively, to obtain a finer result. the markers used for balf cell annotation are shown in the bubble plot in supplementary fig. .
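the cell filtering and ‘lognormalize’ steps described above can be sketched in a few lines of numpy. this is a minimal illustration, not the authors' seurat code: the function names, the toy matrix, the gene labels, and every threshold below are hypothetical (the thresholds in the text did not survive extraction), and the scale factor of 10,000 is assumed to match the seurat default.

```python
import numpy as np


def filter_cells(counts, gene_names, min_reads, min_genes, max_genes,
                 max_mito_frac, mito_prefix="MT-"):
    """Return a boolean mask over cells (columns) that pass all three QC filters.

    counts is a genes x cells matrix of raw counts. All thresholds are
    caller-supplied; none of them are taken from the paper.
    """
    counts = np.asarray(counts, dtype=float)
    n_reads = counts.sum(axis=0)          # total counts per cell
    n_genes = (counts > 0).sum(axis=0)    # detected genes per cell
    is_mito = np.array([g.upper().startswith(mito_prefix) for g in gene_names])
    mito_frac = counts[is_mito].sum(axis=0) / np.maximum(n_reads, 1.0)
    return (
        (n_reads > min_reads)
        & (n_genes > min_genes)
        & (n_genes < max_genes)
        & (mito_frac < max_mito_frac)
    )


def log_normalize(counts, scale_factor=10_000.0):
    """Divide each cell by its total count, scale, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    totals = np.maximum(counts.sum(axis=0, keepdims=True), 1.0)
    return np.log1p(counts / totals * scale_factor)


# Toy genes x cells matrix with hypothetical gene labels.
gene_names = ["geneA", "geneB", "geneC", "MT-geneD"]
counts = np.array([
    [5, 0, 1],
    [3, 1, 0],
    [2, 0, 1],
    [0, 0, 8],
])

# Hypothetical thresholds chosen only so the toy data exercises each filter.
keep = filter_cells(counts, gene_names, min_reads=5, min_genes=1,
                    max_genes=4, max_mito_frac=0.2)
normalized = log_normalize(counts[:, keep])
print(keep)
print(normalized.round(3))
```

in the toy data, the second cell fails the read-depth filter and the third fails the mitochondrial filter, so only the first cell is normalized. a real analysis would operate on sparse matrices and would continue with variable-gene selection, pca, and clustering as described in the text.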
differentiation of gene expression levels

differential gene expression in balf cells among the healthy, moderate, and severe groups was assessed using the wilcoxon test (‘wilcox’) in seurat v. (findmarkers function). then, we divided balf cells into epithelial cells and leukocytes and compared gene expression levels among their subgroups. both epithelial cells and leukocytes were re-clustered to detect the differences in gene expression of all cell types between healthy controls and severe/moderate covid- patients. bulk-seq data (gse ) were analyzed for differentially expressed genes in respiratory epithelial cell lines using deseq with the wald test and benjamini-hochberg correction (blanco-melo et al. , love, huber, and anders ). differences were considered significant if p < . .

acknowledgment

this study was supported by nsfc , nih grants hl , hl , and hl , and aha awards aha grnt and aha grnt . we are grateful to yunlai zhou (yangzhou university) and congxi zhang (gene denovo) for their assistance with bioinformatics.

conflict of interest

the authors declare no conflicts of interest.

references

adil, m. s., s. p. narayanan, and p. r. somanath. . "is amiloride a promising cardiovascular medication to persist in the covid- crisis?" drug discov ther no. ( ): - . doi: . /ddt. . . anand, p., a. puranik, m. aravamudan, and a. j. venkatakrishnan. . "sars-cov- strategically mimics proteolytic activation of human enac." elife no. :e . doi: . /elife. . arachchillage, d.
j., a. stacey, f. akor, m. scotz, and m. laffan. . "thrombolysis restores perfusion in covid- hypoxia." no. ( ):e -e . doi: . /bjh. . asakura, h., and h. ogawa. . "potential of heparin and nafamostat combination therapy for covid- ." j thromb haemost no. ( ): - . doi: . /jth. . barrett, c. d., a. oren-grinberg, e. chao, a. h. moraco, m. j. martin, s. h. reddy, a. m. ilg, r. jhunjhunwala, m. uribe, h. b. moore, e. e. moore, e. n. baedorf-kassis, m. l. krajewski, d. s. talmor, s. shaefi, and m. b. yaffe. . "rescue therapy for severe covid- -associated acute respiratory distress syndrome with tissue plasminogen activator: a case series." j trauma acute care surg no. ( ): - . doi: . /ta. . blanco-melo, d., b. e. nilsson-payant, w. c. liu, s. uhl, d. hoagland, r. moller, t. x. jordan, k. oishi, m. panis, d. sachs, t. t. wang, r. e. schwartz, j. k. lim, r. a. albrecht, and b. r. tenoever. . "imbalanced host response to sars-cov- drives development of covid- ." cell no. ( ): - e . doi: . /j.cell. . . . bona, r. d., a. valbusa, g. malfa, d. r. giacobbe, p. ameri, n. patroniti, c. robba, v. gilad, a. insorsi, m. bassetti, p. pelosi, and i. porto. . "systemic fibrinolysis for acute pulmonary embolism complicating acute respiratory distress syndrome in severe covid- : a case series." eur heart j cardiovasc pharmacother. doi: . /ehjcvp/pvaa . chen, z., r. zhao, m. zhao, x. liang, d. bhattarai, r. dhiman, s. shetty, s. idell, and h. l. ji. . "regulation of epithelial sodium channels in urokinase plasminogen activator deficiency." am j physiol lung cell mol physiol no. ( ):l - . doi: . /ajplung. . . christie, d. b., rd, h. m. nemec, a. m. scott, j. t. buchanan, c. m. franklin, a. ahmed, m. s. khan, c. w. callender, e. a. james, a. b. christie, and d. w. ashley. . "early outcomes with utilization of tissue plasminogen activator in covid- -associated respiratory distress: a series of five cases." j trauma acute care surg no. ( ): - . doi: . /ta. . claas, e. c., a. d. osterhaus, r. 
van beek, j. c. de jong, g. f. rimmelzwaan, d. a. senne, s. krauss, k. f. shortridge, and r. g. webster. . "human influenza a h n virus related to a highly pathogenic avian influenza virus." lancet no. ( ): - . doi: . /s - ( ) - . coutard, b., c. valle, x. de lamballerie, b. canard, n. g. seidah, and e. decroly. . "the spike glycoprotein of the new coronavirus -ncov contains a furin-like cleavage site absent in cov of the same clade." antiviral res no. : . doi: . /j.antiviral. . . d'alonzo, d., m. de fenza, and v. pavone. . "covid- and pneumonia: a role for the upa/upar system." drug discov today no. ( ): - . doi: . /j.drudis. . . . doi, k., m. ikeda, n. hayase, k. moriya, and n. morimura. . "nafamostat mesylate treatment in combination with favipiravir for patients critically ill with covid- : a case series." crit care no. ( ): . doi: . /s - - -z. gentzsch, m., and b. c. rossier. . "a pathophysiological model for covid- : critical importance of transepithelial sodium transport upon airway infection." function (oxf) no. ( ):zqaa . doi: . /function/zqaa . hanukoglu, i., and a. hanukoglu. . "epithelial sodium channel (enac) family: phylogeny, structure-function, tissue distribution, and associated inherited diseases." gene no. ( ): - . doi: . /j.gene. . . . hou, y., y. cui, z. zhou, h. liu, h. zhang, y. ding, h. nie, and h. l. ji. a. "upregulation of the wnk signaling pathway inhibits epithelial sodium channels of mouse tracheal epithelial cells after influenza a infection." front pharmacol no. : . doi: . /fphar. . . hou, yapeng, yong cui, zhiyu zhou, hongfei liu, honglei zhang, yan ding, hongguang nie, and hong-long ji. b. "upregulation of the wnk signaling pathway inhibits epithelial sodium channels of mouse tracheal epithelial cells after influenza a infection." frontiers in pharmacology no. : . doi: . /fphar. . . huggins, d. j. . "structural analysis of experimental drugs binding to the sars-cov- target tmprss ." j mol graph model no. : . doi: . /j.jmgm. . . 
jaimes, j. a., n. m. andre, j. s. chappie, j. k. millet, and g. r. whittaker. . "phylogenetic analysis and structural modeling of sars-cov- spike protein reveals an evolutionary distinct and proteolytically sensitive activation loop." j mol biol no. ( ): - . doi: . /j.jmb. . . . ji, h. l., x. f. su, s. kedar, j. li, p. barbry, p. r. smith, s. matalon, and d. j. benos. . "delta-subunit confers novel biophysical features to alpha beta gamma-human epithelial sodium channel (enac) via a physical interaction." j biol chem no. ( ): - . doi: m [pii] . /jbc.m . ji, h. l., r. zhao, s. matalon, and m. a. matthay. . "elevated plasmin(ogen) as a common risk factor for covid- susceptibility." physiol rev no. ( ): - . doi: . /physrev. . . kam, y. w., y. okumura, h. kido, l. f. ng, r. bruzzone, and r. altmeyer. a. "cleavage of the sars coronavirus spike glycoprotein by airway proteases enhances virus entry into human bronchial epithelial cells in vitro." plos one no. ( ):e . doi: . /journal.pone. . kam, yiu-wing, yuushi okumura, hiroshi kido, lisa f. p. ng, roberto bruzzone, and ralf altmeyer. b. "cleavage of the sars coronavirus spike glycoprotein by airway proteases enhances virus entry into human bronchial epithelial cells in vitro." plos one no. ( ):e -e . doi: . /journal.pone. . li, t., and q. zheng. . "sars-cov- spike produced in insect cells elicits high neutralization titres in non-human primates." no. ( ): - . doi: . / . . . love, m. i., w. huber, and s. anders. . "moderated estimation of fold change and dispersion for rna-seq data with deseq ." genome biol no. ( ): . doi: . /s - - - . ly, a., c. alessandri, e. skripkina, a. meffert, s. clariot, q. de roux, o. langeron, and n. mongardon. .
"rescue fibrinolysis in suspected massive pulmonary embolism during sars-cov- pandemic." resuscitation no. : - . doi: . /j.resuscitation. . . . matalon, s., r. bartoszewski, and j. f. collawn. . "role of epithelial sodium channels in the regulation of lung fluid homeostasis." am j physiol lung cell mol physiol no. ( ):l - . doi: . /ajplung. . . muhanna, d., s. r. arnipalli, s. b. kumar, and o. ziouzenkova. . "osmotic adaptation by na(+)-dependent transporters and ace : correlation with hemostatic crisis in covid- ." no. ( ). doi: . /biomedicines . nie, h. g., t. tucker, x. f. su, t. na, j. b. peng, p. r. smith, s. idell, and h. l. ji. . "expression and regulation of epithelial na+ channels by nucleotides in pleural mesothelial cells." am j respir cell mol biol no. ( ): - . papamichalis, p., a. papadogoulas, p. katsiafylloudis, a. l. skoura, m. papamichalis, e. neou, d. papadopoulos, s. karagiannis, t. zafeiridis, d. babalis, and a. komnos. . "combination of thrombolytic and immunosuppressive therapy for coronavirus disease : a case report." int j infect dis no. : - . doi: . /j.ijid. . . . pollard, c. a., m. p. morran, and a. l. nestor-kalinoski. . "the covid- pandemic: a global health crisis." physiol genomics. doi: . /physiolgenomics. . . poor, h. d., c. e. ventetuolo, t. tolbert, g. chun, g. serrao, a. zeidman, n. s. dangayach, j. olin, r. kohli-seth, and c. a. powell. . "covid- critical illness pathophysiology driven by diffuse pulmonary thrombi and pulmonary endothelial dysfunction responsive to thrombolysis." clin transl med no. ( ). doi: . /ctm . . roberts, k. a., l. colley, t. a. agbaedeng, g. m. ellison-hughes, and m. d. ross. . "vascular manifestations of covid- - thromboembolism and microvascular dysfunction." front cardiovasc med no. : . doi: . /fcvm. . . sheng, s., m. d. carattino, j. b. bruns, r. p. hughey, and t. r. kleyman. . "furin cleavage activates the epithelial na+ channel by relieving na+ self-inhibition." am j physiol renal physiol no. 
( ):f - . doi: . /ajprenal. . . sidarta-oliveira, d., c. p. jara, a. j. ferruzzi, m. s. skaf, w. h. velander, e. p. araujo, and l. a. velloso. . "sars-cov- receptor is co-expressed with elements of the kinin-kallikrein, renin-angiotensin and coagulation systems in alveolar cells." sci rep no. ( ): . doi: . /s - - - . smith, j. c., e. l. sausville, v. girish, m. l. yuan, a. vasudevan, k. m. john, and j. m. sheltzer. . "cigarette smoke exposure and inflammatory signaling increase the expression of the sars-cov- receptor ace in the respiratory tract." dev cell no. ( ): - .e . doi: . /j.devcel. . . . stuart, t., a. butler, p. hoffman, c. hafemeister, e. papalexi, w. m. mauck, rd, y. hao, m. stoeckius, p. smibert, and r. satija. . "comprehensive integration of single-cell data." cell no. ( ): - e . doi: . /j.cell. . . . sungnak, w., n. huang, c. becavin, m. berg, r. queen, m. litvinukova, c. talavera-lopez, h. maatz, d. reichart, f. sampaziotis, k. b. worlock, m. yoshida, j. l. barnes, and h. c. a. lung biological network. . "sars-cov- entry factors are highly expressed in nasal epithelial cells together with innate immune genes." nat med no. ( ): - . doi: . /s - - - . thibodeau, p. h., and m. b. butterworth. . "proteases, cystic fibrosis and the epithelial sodium channel (enac)." cell tissue res no. ( ): - . doi: . /s - - -z. thierry, a. r. . "anti-protease treatments targeting plasmin(ogen) and neutrophil elastase may be beneficial in fighting covid- ." physiol rev no. ( ): - . doi: . /physrev. . . vinayagam, s., and k. sattu. . "sars-cov- and coagulation disorders in different organs." life sci no. : . doi: . /j.lfs. . . wang, i. m., s. stepaniants, y. boie, j. r. mortimer, b. kennedy, m. elliott, s.
hayashi, l. loy, s. coulter, s. cervino, j. harris, m. thornton, r. raubertas, c. roberts, j. c. hogg, m. crackower, g. o'neill, and p. d. paré. . "gene expression profiling in patients with chronic obstructive pulmonary disease and lung cancer." am j respir crit care med no. ( ): - . doi: . /rccm. - oc. wang, j., n. hajizadeh, e. e. moore, r. c. mcintyre, p. k. moore, l. a. veress, m. b. yaffe, h. b. moore, and c. d. barrett. . "tissue plasminogen activator (tpa) treatment for covid- associated acute respiratory distress syndrome (ards): a case series." no. ( ): - . doi: . /jth. . wang, q., y. qiu, j. y. li, z. j. zhou, c. h. liao, and x. y. ge. . "a unique protease cleavage site predicted in the spike protein of the novel pneumonia coronavirus ( -ncov) potentially related to viral transmissibility." virol sin no. ( ): - . doi: . /s - - - . wrapp, d., and n. wang. . "cryo-em structure of the -ncov spike in the prefusion conformation." no. ( ): - . doi: . /science.abb . xia, s., q. lan, s. su, x. wang, w. xu, z. liu, y. zhu, q. wang, l. lu, and s. jiang. . "the role of furin cleavage site in sars-cov- spike protein-mediated membrane fusion in the presence or absence of trypsin." signal transduct target ther no. ( ): . doi: . /s - - - . xu, j., x. xu, l. jiang, k. dua, p. m. hansbro, and g. liu. . "sars-cov- induces transcriptional signatures in human lung epithelial cells that promote lung fibrosis." no. ( ): . doi: . /s - - - . zhao, r., g. ali, and h. g. nie. . "plasmin improves blood-gas barrier function in oedematous lungs by cleaving epithelial sodium channels." br j pharmacol no. ( ): - . doi: . /bph. . zhou, p., x. l. yang, x. g. wang, b. hu, l. zhang, w. zhang, h. r. si, y. zhu, b. li, c. l. huang, h. d. chen, j. chen, y. luo, h. guo, r. d. jiang, m. q. liu, y. chen, x. r. shen, x. wang, x. s. zheng, k. zhao, q. j. chen, f. deng, l. l. liu, b. yan, f. x. zhan, y. y. wang, g. f. xiao, and z. l. shi. . "a pneumonia outbreak associated with a new coronavirus of probable bat origin." nature no. ( ): - . doi: . /s - - - .

figure . targeted molecular mimicry by sars-cov- of human enac and profiling of ace -scnn g-plau/plat co-expression. (a) the cartoon shows the s protein of sars-cov- (pdb id: x a), highlighted in green. the s /s cleavage site required for the activation of sars-cov- is enlarged and highlighted in red. furin/plasmin cleavage sites of common human viruses are shown in a box. (b) the cartoon represents the human enac protein (pdb id: bqn), highlighted in green. the furin/plasmin cleavage site is enlarged and highlighted in red. the cleavage sites of enac in other species are shown in a box. (c) summary of the single-cell transcriptomic co-expression of ace , scnn g (enac), and plau. the heatmap depicts the mean relative expression of each gene across the identified cell populations. the cell types are ranked by decreasing expression of plau. the box highlights the ace , scnn g (enac), and plau co-expressing cell types in the human respiratory system.

figure . expression of proteases, enac, and ace in the human respiratory system. violin plots showing the expression level of plau, plg, furin, tmprss , and scnn g in the nferx platform.
figure . overall expression levels of proteases, ace , and scnn g in balf bulk cells of covid- patients. (a) bubble plot of proteases, ace , and scnn g in balfs of covid- patients. the size of the dots indicates the proportion of cells in the respective cell type having a greater-than-zero expression of these genes, while the color indicates the mean expression of these genes. (b) the gene expression levels of proteases, ace , and scnn g from healthy controls (n = ), moderate cases (n = ) and severe cases (n = ). ***padj < . (wilcoxon test; padj was calculated using bonferroni correction).

figure . transcription levels of proteases, ace , and scnn g in single epithelial cells of covid- patients. (a) bubble plot of sars-cov- receptor (ace ) and proteases in balf epithelial cells of covid- patients. the size of the dots indicates the proportion of cells in the respective cell type having a greater-than-zero expression of these genes, while the color indicates the mean expression of these genes. (b) the gene expression levels of selected proteases and ace in epithelial cells from healthy controls (n = ), moderate (n = ), and severe cases (n = ). (c) the gene expression levels of selected proteases and ace in different epithelial cell types from healthy controls, moderate and severe cases. ***padj < . (wilcoxon test; padj was calculated using bonferroni correction). aec: alveolar epithelial cells.
figure . changes of proteases, ace , and scnn g in respiratory cell lines after sars-cov- infection. normal human bronchial epithelial (nhbe) and alveolar epithelial (a , calu- ) cells were infected with sars-cov- for h (infected), and control cells received culture medium only (mock). the boxplot shows the changes of proteases (plau, furin, and tmprss), scnn g, and ace in a , calu- , and nhbe cells after sars-cov- infection. differentially expressed genes were identified with deseq , ***padj < . , *padj < . (wald test; padj was calculated using the benjamini-hochberg correction).

figure . sars-cov- infection hijacks the enac proteolytic network. under physiological conditions, urokinase activates plasminogen to plasmin, which cleaves γenac, leading to its activation. after infection by sars-cov- , the plau (urokinase) expression level is significantly upregulated, which may help other viruses’ invasion by activating plasminogen to cleave the s protein. the green solid line represents the urokinase, plasminogen, and enac mrna transcripts and activation by plasmin under physiological conditions. the red solid line represents the activation process under infection conditions, while the grey dotted line denotes the repression effects.
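the benjamini-hochberg adjustment named in the methods and in the cell-line figure caption can be written out directly. the sketch below is a generic numpy implementation of the standard step-up procedure, given only to make the adjustment concrete; it is not the deseq code, the function name is hypothetical, and the p-values are made up for illustration.

```python
import numpy as np


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices of p-values, ascending
    scaled = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i
    # Enforce monotonicity: each adjusted value is the minimum of itself
    # and all adjusted values at higher ranks.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted


# Made-up p-values for four hypothetical genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
print(adj_p)
```

with four tests, the two smallest p-values both adjust to 0.02 and the two largest to 0.04, which shows how the monotonicity step ties neighbouring ranks together.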
impact of gene annotation choice on the quantification of rna-seq data

david chisanga , , , , yang liao , , , and wei shi , , , *

olivia newton-john cancer research institute, heidelberg, victoria, , australia, school of cancer medicine, la trobe university, bundoora, victoria, , australia, walter and eliza hall institute of medical research, parkville, victoria, , australia, department of medical biology, the university of melbourne, parkville, victoria, , australia and school of computing and information systems, the university of melbourne, parkville, victoria, , australia

abstract

rna sequencing is currently the method of choice for genome-wide profiling of gene expression. a popular approach to quantify expression levels of genes from rna-seq data is to map reads to a reference genome and then count mapped reads to each gene. gene annotation data, which include chromosomal coordinates of exons for tens of thousands of genes, are required for this quantification process. there are several major sources of gene annotations that can be used for quantification, such as ensembl and refseq databases. however, there is very little understanding of the effect that the choice of annotation has on the accuracy of gene expression quantification in an rna-seq analysis. in this paper, we present results from our comparison of ensembl and refseq human annotations on their impact on gene expression quantification using a benchmark rna-seq dataset generated by the sequencing quality control (seqc) consortium. we show that the use of refseq gene annotation models led to better quantification accuracy, based on the correlation with ground truths including expression data from > real-time pcr validated genes, known titration ratios of gene expression and microarray expression data. we also found that the recent expansion of the refseq annotation has led to a decrease in its annotation accuracy.
finally, we demonstrated that the rna-seq quantification differences observed between different annotations were not affected by the use of different normalization methods.
*to whom correspondence should be addressed. tel: + ; fax: + ; email: wei.shi@onjcri.org.au
biorxiv preprint, this version posted january , . doi: https://doi.org/ . / . . . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license (http://creativecommons.org/licenses/by-nc-nd/ . /).
introduction
gene expression profiling using rna sequencing (rna-seq) is a core activity in molecular biology. comprehensive gene expression analysis in various settings is important for generating hypotheses for ongoing research, investigating drug effects in biological or clinical settings and as a diagnostic tool. a popular approach to gene-level quantification from rna-seq data involves mapping reads to a reference genome and then counting mapped reads associated with each gene [ , , , , ]. the process of counting mapped reads to genes requires a database of known genes. a gene is only quantified if it or its components have genomic coordinates already defined with respect to the genome sequence, in a process called annotation. each genome annotation model uses a different set of annotation techniques and information sources, and as such these annotations vary in the comprehensiveness and accuracy of their annotated genomic features. annotation techniques often include computer-based predictions and/or evidence-based techniques such as manual curation [ , ].
computer-based predictions result in more complex gene models that have a higher proportion of predicted genomic features, while evidence-based gene models are simpler, with fewer genes and isoforms. common annotation models for human and mouse genomes include ensembl [ ], refseq [ ], gencode [ ] and ucsc [ ] annotations. annotations are, therefore, an important component in an rna-seq analysis, as the results depend on what is known in the annotation database.
despite the importance of gene annotations in rna-seq data analysis, very little research has been conducted to examine how differences in annotations affect gene expression quantification, which is crucial for downstream analyses such as discovery of differentially expressed genes and identification of perturbed pathways. previous studies compared the effect of human genome annotations from popular databases including ensembl, gencode and refseq on various aspects of rna-seq analysis and showed that the choice of annotation had an impact on gene-level quantification [ , ]. however, these studies are out of date as they were based on old annotations, and they also lacked a reliable ground truth for assessing the impact of annotation.
major annotation databases have undergone significant expansions over the years, thanks to the wide application of sequencing technologies and the massive amount of sequencing data generated across the world. however, it is unclear whether the quality of gene annotations has been successfully maintained. a recent study suggested that gene annotations have become less accurate and have lagged behind during this expansion [ ]. this can be attributed to errors from sequencing experiments, sequence analysis
or automation in the annotation process. it is important to systematically assess the accuracy of the new gene annotations generated in recent years to ensure that the popular annotation databases can continue to be used by the community for rna-seq analysis.
furthermore, the use of different annotations in different studies makes it difficult for researchers to reproduce the findings from such studies. for example, large consortia such as the european molecular biology laboratory (embl) use ensembl in their studies, while the national center for biotechnology information (ncbi) tends to use refseq. since this can significantly affect gene expression data, there is a need to develop a comprehensive understanding of how these differences in annotations affect gene-level expression quantification.
in this study, we compared three human gene annotations, including a recent ensembl annotation (released in april ), a recent refseq annotation (released in august ) and an old refseq annotation (released in april ), to understand their impact on gene-level expression quantification in an rna-seq data analysis pipeline. although the old refseq annotation is no longer available from the ncbi refseq database, it has been included as part of rsubread, a popular rna-seq quantification toolkit, for quantifying human rna-seq data. we used a benchmark rna-seq dataset generated by the sequencing quality control (seqc/maqc iii) consortium for this evaluation. we show that the use of refseq gene annotations led to better quantification accuracy than the use of the ensembl annotation, based on the correlation with ground truths including expression data from > real-time pcr validated genes, known genome-wide titration ratios of gene expression and microarray gene expression data.
we also show that the older refseq annotation yielded higher quantification accuracy than the recent refseq annotation in our evaluations, suggesting that the recent expansion of and changes to the refseq annotation have led to a decline in annotation accuracy and, in turn, less accurate quantification results. furthermore, we investigated whether any normalization method can mitigate the differences in quantification results caused by the annotation differences. our results show that the quantification differences remained almost the same no matter how the rna-seq data were normalized.
materials and methods
seqc/maqc data
the rna-seq data used for evaluation in this study are a benchmark dataset generated by the sequencing quality control (seqc) project [ ], the third stage of the microarray quality control (maqc) study [ , ]. the seqc dataset includes the universal human reference rna (uhrr) as sample a and the human brain reference rna (hbrr) as sample b. it also includes two other samples, c and d, which are combinations of a and b mixed in the ratios of : in c and : in d, respectively. the samples were sequenced in four replicate paired-end libraries using an illumina hiseq sequencer at the australian genomics research facility (agrf). each library contains ∼ million bp read pairs.
a taqman real-time polymerase chain reaction (rt-pcr) dataset with expression values measured for over , genes, which was generated in the maqc-i study [ ], was used to validate the expression of the rna-seq data in this study.
the expression values were measured for both the uhrr and hbrr samples together with their respective combinations. around – taqman rt-pcr genes, which had matching gene identifiers with expressed rna-seq genes from different annotations, were included for assessing the accuracy of rna-seq quantification. in addition, microarray data generated in the maqc-i study, with samples a to d hybridized to the illumina human- beadchip microarrays, were also used in the assessment. the taqman rt-pcr and illumina microarray datasets are available as part of the bioconductor package ‘seqc’ [ ].
annotations used
three human gene annotations were included in this study: a recent ensembl annotation, a recent refseq annotation and an old refseq annotation. all these annotations were generated based on the human reference genome grch /hg .
the ensembl gene annotation used in this study was generated in april . its version number is . it was downloaded from ftp://ftp.ensembl.org/pub/release- /gtf/homo_sapiens/homo_sapiens.grch . .gtf.gz.
the recent refseq gene annotation used was released by the ncbi in august . its release number is . and it is part of the refseq release version . it was downloaded from the ncbi ftp site ftp://ftp.ncbi.nlm.nih.gov/refseq/h_sapiens/annotation/annotation_releases/ . /gcf_ . _grch . p /gcf_ . _grch .p _genomic.gtf.gz. we refer to this refseq annotation as ‘refseq-ncbi’ in this study.
the old refseq annotation included in this study was released by the ncbi in april . it was released as part of the patch release of the grch /hg genome build. this annotation has also been included in the popular rna-seq quantification toolkit rsubread [ ] as the default annotation used for quantifying human rna-seq data. the inclusion of this old refseq annotation allowed us to investigate how the annotation changes made recently to refseq affect the quantification result of rna-seq data. the
refseq annotation in rsubread is slightly different from the original one in that the overlapping exons from the same gene were collapsed to form a single continuous exon for the gene in the rsubread annotation; however, this difference will not change the gene-level rna-seq quantification result because the set of exonic bases belonging to each gene is the same in the original annotation and the rsubread annotation. as this old refseq annotation is no longer available for download from the ncbi ftp site, we instead used the rsubread annotation in this study and we denote this annotation as ‘refseq-rsubread’.
when matching genes from different annotations, we converted the gene identifiers using the bioconductor package ‘org.hs.eg.db’ [ ] and then compared them to find common genes between annotations.
mapping, quantification and normalization of rna-seq data
analysis of the rna-seq data was performed using the bioconductor r packages rsubread and limma [ , , ]. the human reference genome (grch ) from gencode (version downloaded from ftp://ftp.ebi.ac.uk/pub/databases/gencode/gencode_human/release_ /grch .primary_assembly.genome.fa.gz) was indexed using the buildindex function in rsubread v . . [ ]. sequencing reads were then mapped to the reference genome using the align function in rsubread [ , ]. during the alignment, the ensembl, refseq-ncbi and refseq-rsubread annotations were each supplied as an extra parameter to improve alignment.
gene-level read counts were obtained with featurecounts [ , ], a read count summarization function within the rsubread package.
the ensembl, refseq-ncbi and refseq-rsubread annotations were provided to featurecounts to generate read counts for the genes included in each of these annotations.
the gene-level read counts were transformed using the voom function in limma [ , ] and then normalized using the library size [ ], quantile [ ] and trimmed mean of m-values (tmm) [ ] methods, respectively, prior to performing further analysis. the library size normalization was performed by providing raw read counts to voom and then running voom with the ‘normalize.method’ parameter set to ‘none’. the quantile normalization was performed by providing raw read counts to voom and then running voom with the ‘normalize.method’ parameter set to ‘quantile’. for tmm normalization, we first calculated the tmm normalization factor for each library using the calcnormfactors function in edger [ ]. then we provided raw read counts and the tmm normalization factors to voom and ran it with the ‘normalize.method’ parameter set to ‘none’. the log cpm (log counts per million) values, produced by the voom function for each gene in each library, were converted to log fpkm (log fragments per kilo exonic bases per million mapped fragments) expression values for further analysis.
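the last conversion step above can be illustrated with a short sketch. this is python rather than the r code used in the study, and the function name is hypothetical. it assumes that the log values are base-2 and that fpkm is cpm divided by the effective gene length in kilobases, so that on the log scale the length term is simply subtracted.

```python
import math


def log2_cpm_to_log2_fpkm(log2_cpm, effective_length_bases):
    """Convert a log2-CPM value to log2-FPKM (hypothetical helper).

    Assumes FPKM = CPM / (effective gene length in kilobases), hence
    log2(FPKM) = log2(CPM) - log2(length / 1000).
    """
    if effective_length_bases <= 0:
        raise ValueError("effective gene length must be positive")
    return log2_cpm - math.log2(effective_length_bases / 1000.0)
```

for a gene with an effective length of 2,000 bases, for example, the log fpkm value is exactly one unit lower than its log cpm value.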
titration monotonicity
the rna-seq data from the seqc project have titration monotonicity built into them. a gene is considered to preserve titration monotonicity if its expression follows a ≥ c ≥ d ≥ b when its expression in sample a is greater than or equal to that in sample b, or follows a ≤ c ≤ d ≤ b when its expression in sample a is less than or equal to that in sample b. to test whether titration monotonicity is preserved, equation ( ) was used to compute the expected log fold-change for a gene in the comparison of c vs d given the log fold-change between a vs b:
$$e = \log_2\left(\frac{3 \times 2^{x} + 1}{2^{x} + 3}\right) \qquad ( )$$
where e is the expected log fold-change for c vs d and x is the log fold-change for a vs b. expression levels of genes in the replicates of the same sample were averaged before the fold change of gene expression was calculated between samples.
validation
gene expression data generated using taqman rt-pcr and illumina’s beadchip microarray were used to validate the gene-level quantification results from the rna-seq analysis. pearson correlation coefficients were computed to assess the concordance between the rna-seq quantification data obtained using different annotations and the gene expression data obtained from the rt-pcr and microarray experiments. the genome-wide built-in truth of titration monotonicity of gene expression in the rna-seq data was also used to evaluate the quantification accuracy of rna-seq data generated using different annotations.
access to data and code
the data and analysis code used in this study can be accessed at the following url: https://github.com/shilab-bioinformatics/geneannotation.
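the titration monotonicity rule and the expected fold-change described in the methods can be sketched as follows. this is an illustrative python sketch, not the authors' code. the function names are hypothetical, and the mixing proportions (sample c containing 75% a and sample d containing 25% a) are an assumption taken from the standard seqc design, since the ratios are not legible in this copy of the text. log values are assumed to be base-2.

```python
import math


def expected_log2fc_c_vs_d(x, a_in_c=0.75, a_in_d=0.25):
    """Expected log2 fold-change of C vs D given x = log2(A/B).

    a_in_c and a_in_d are the assumed fractions of sample A in the
    mixtures C and D (assumption: 3:1 and 1:3 mixing).
    """
    a_over_b = 2.0 ** x
    c_level = a_in_c * a_over_b + (1.0 - a_in_c)  # C expression relative to B
    d_level = a_in_d * a_over_b + (1.0 - a_in_d)  # D expression relative to B
    return math.log2(c_level / d_level)


def preserves_titration_monotonicity(a, b, c, d):
    """True if expression follows A >= C >= D >= B or A <= C <= D <= B."""
    return (a >= c >= d >= b) or (a <= c <= d <= b)
```

with these proportions the expression reduces to log2((3 × 2^x + 1) / (2^x + 3)), so a gene with no difference between a and b (x = 0) is expected to show no difference between c and d.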
results
discrepancy between different gene annotations
the ensembl and ncbi refseq annotations are among the most widely used gene annotations for rna-seq gene expression quantification in the field. in this study, we downloaded recent ensembl and refseq annotations and also used an older version of the refseq annotation to assess the impact of gene annotation choice on the accuracy of rna-seq expression quantification. the inclusion of an older refseq annotation allowed us to investigate the accuracy of new annotation data generated in recent years, during which next-generation sequencing data have been used as a new data source for genome-wide annotation generation.
the ensembl annotation used in this study was released in april and it has a version number . the recent refseq annotation included in this study was released in august . we refer to this annotation as ‘refseq-ncbi’ in this study. the older refseq annotation was released in april , and it has also been included as part of the popular rna-seq quantification toolkit ‘rsubread’ for quantifying human rna-seq data. as this annotation is no longer available from the ncbi refseq database, we instead used the rsubread refseq annotation in our evaluations and we denote this annotation as ‘refseq-rsubread’.
as rna-seq gene-level expression quantification is typically performed for genes that contain exons [ , , ], in this study we focused only on the genes that have annotated exons in each annotation. figure a shows that, as expected, the ensembl annotation contains a lot more exon-containing genes than the two refseq annotations. the ensembl annotation is known to contain a large number of computationally predicted genes, whereas refseq genes were mainly annotated based on biological evidence.
however, it is worth noting that the refseq-ncbi annotation still has > , genes that are not included in the ensembl annotation. nearly % of the ensembl genes were found to be absent from both refseq annotations. in total, , common genes were found between the three annotations. most of the genes included in the refseq-rsubread annotation can be found in the refseq-ncbi or ensembl annotations.
we then examined the effective gene lengths in each annotation. the effective length of a gene is the total number of unique bases included in all the exons belonging to the gene. figure b shows the distributions of effective lengths of genes in the three annotations. around half of the ensembl genes have an effective length less than , bases, whereas in the two refseq annotations only ∼ % of the genes are shorter than , bases in length. the median effective gene lengths in refseq-ncbi and refseq-rsubread are ∼ , bases, which is much larger than that in ensembl (∼ , bases). although the ensembl annotation contains a lot more genes than the two refseq annotations, it also contains a much higher percentage of short genes.
figure : concordance and differences between gene annotations. (a) venn diagram showing genes that are common or unique in the ensembl, refseq-ncbi and refseq-rsubread annotations. (b) boxplots showing the distribution of effective gene lengths (log scale) in each annotation. (c) boxplots showing the differences in effective lengths of common genes between each pair of annotations. values shown in the plots are the ratio of effective lengths of the same gene from two different annotations (log scale). (d) the size of the transcriptome calculated from each annotation. shown is the sum of effective gene lengths in each annotation.
we further performed a gene-wise comparison of effective gene lengths using common genes between each pair of annotations. although every annotation contains both longer and shorter genes in comparison to the corresponding genes from other annotations, the ensembl genes were found to have a larger effective length than genes from the two refseq annotations overall (figure c). this is in contrast to the higher proportion of short genes observed in the ensembl annotation (figure b), which indicates that the ensembl genes that are also present in the refseq-ncbi or refseq-rsubread annotations tend to be longer than those ensembl genes that can only be found in the ensembl annotation. although at least half of the genes were found to have a less than -fold ( -fold at log scale) length difference between annotations (figure c), the length differences could be as high as more than -fold ( -fold at log scale). the refseq-ncbi genes seem to
be slightly longer than the corresponding refseq-rsubread genes overall. ensembl and refseq-rsubread were found to be the least concordant annotations among the three annotations being compared.
lastly, we compared the size of the transcriptome represented by each annotation. the transcriptome size of an annotation is computed as the sum of effective gene lengths from all the genes included in that annotation, which also represents the total number of exonic bases that were annotated in an annotation. figure d shows that the ensembl annotation has a larger transcriptome size than both the refseq-ncbi and refseq-rsubread annotations. this is not surprising because the ensembl annotation contains more genes and ensembl genes common to other annotations are also longer in general. refseq-rsubread has a much smaller transcriptome size than refseq-ncbi, indicating a significant expansion of the refseq-ncbi annotation in the past five years. however, it is important to note that the refseq-rsubread annotation is not a subset of the refseq-ncbi annotation, as demonstrated by the existence of refseq-rsubread genes that are absent in the refseq-ncbi annotation, the difference in gene length distribution and the length differences of the same genes between the two annotations (figure a-c). this indicates that not only were new genes added to the refseq annotation during the expansion, but existing genes were also modified. it is against this background that we sought to understand how these differences in the annotations affect the overall gene-level quantification results.
fragments counted to annotated genes
we used a benchmark rna-seq dataset generated by the seqc project [ ] to evaluate the impact of gene annotation on the accuracy of rna-seq expression quantification.
this dataset contains paired-end bp read data generated for four samples including a universal human reference rna sample (sample a), a human brain reference rna sample (sample b), a mixture sample with %a and %b (sample c) and a mixture sample with %a and %b (sample d).
we mapped the rna-seq reads to the human genome grch /hg using the subread aligner [ , ], and then counted the number of mapped fragments (read pairs) assigned to each gene in each annotation using the featurecounts program [ , ]. featurecounts assigns a mapped fragment to a gene if the fragment overlaps any of the exons in the gene. figure shows that across all the libraries, the refseq-rsubread annotation consistently has substantially more fragments assigned to it than the ensembl and refseq-ncbi annotations. this is surprising because refseq-rsubread contains far fewer annotated genes and also has a significantly smaller transcriptome, compared to ensembl and refseq-ncbi (figure a,d).
figure : barplots showing the percentage of fragments successfully assigned to genes in each annotation, out of all the fragments included in each library. the horizontal axis represents the sixteen seqc rna-seq libraries generated from the four samples ‘a’, ‘b’, ‘c’ and ‘d’. each sample has four replicates that are numbered from to .
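the assignment rule described above (a fragment is assigned to a gene if it overlaps any exon of that gene) can be sketched in a few lines. this is a simplified python illustration, not the featurecounts implementation: it treats a fragment as a single interval, ignores chromosome and strand, and leaves a fragment unassigned when it overlaps exons from more than one gene. the function name and the status labels are hypothetical.

```python
def assign_fragment(frag_start, frag_end, gene_exons):
    """Assign one fragment to a gene by exon overlap (simplified sketch).

    gene_exons maps a gene id to a list of (start, end) exon intervals,
    with inclusive coordinates on the same chromosome as the fragment.
    Returns a (status, gene_id) tuple.
    """
    hits = [
        gene
        for gene, exons in gene_exons.items()
        if any(frag_start <= e_end and e_start <= frag_end
               for e_start, e_end in exons)
    ]
    if not hits:
        return ("no_feature", None)   # overlaps no exon of any gene
    if len(hits) > 1:
        return ("ambiguous", None)    # overlaps exons of several genes
    return ("assigned", hits[0])
```

under this rule, an annotation with more overlapping genes produces more ambiguous fragments, and an annotation that misses transcribed regions produces more fragments with no feature.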
we then performed a detailed investigation into the mapping and counting results to find out what enabled refseq-rsubread to achieve a higher percentage of successfully assigned fragments.
although gene annotations were used in mapping reads to the human reference genome, the use of different annotations was not found to affect the number of successfully aligned fragments for each library (supplementary figure s ). we found that when assigning fragments to genes using the ensembl or refseq-ncbi annotation, more fragments could not be assigned because they did not overlap any genes (i.e. they failed to overlap any exons included in any genes), even though these annotations include more genes than the refseq-rsubread annotation (supplementary figure s ). this is particularly the case for the fragment assignment in the human brain reference samples. we also found that the use of the ensembl and refseq-ncbi annotations led to more fragments being unassigned due to assignment ambiguity, i.e. a fragment overlapping more than one gene (supplementary figure s ). this is likely because there are more genes that overlap with each other (i.e. exons from different genes overlap with each other) in the ensembl and refseq-ncbi annotations than in the refseq-rsubread annotation. our investigation revealed that less gene overlap in the refseq-rsubread annotation and better compatibility of this annotation with the mapped fragments have led to more fragments being successfully counted for each library in this dataset. given that both the universal human reference and human brain reference samples used in this study are known to contain a very high number of expressed genes and the rna-seq data generated from these samples are expected to cover most of the human transcriptome, our analysis suggests that the refseq-rsubread annotation is likely to contain more transcribed regions in the genome than the other two annotations in general.
intensity range of gene expression
we examined whether the gene annotation choice has an impact on the range of gene expression levels in the rna-seq data. raw gene counts of the seqc data were converted to log fpkm (log fragments per kilo exonic bases per million mapped fragments) values for all the genes included in each annotation. a prior count of . was added to the raw counts to avoid log-transformation of zero. figure shows that the two refseq annotations exhibit a desirably larger intensity range of gene expression than the ensembl annotation, as shown by the larger boxes in the boxplots. it is surprising to see that the ensembl genes have the smallest intensity ranges in all the libraries, given that the ensembl annotation contains the largest number of genes among the three annotations being examined. in addition to the large intensity range, the refseq-rsubread genes were also found to have a markedly higher median expression level than genes in the refseq-ncbi and ensembl annotations.
figure : boxplots comparing the intensity range of gene expression between the three annotations. all the genes from each annotation were included in the plots. raw read counts of genes were transformed to log fpkm values. a prior count of . was added to raw counts to avoid log-transformation of zero.
gene annotation discrepancy after expression filtering
as it is common practice in an rna-seq data analysis to filter out genes that are lowly expressed or completely absent [ ], we also set out to assess the differences between alternative annotations after excluding such genes. we excluded those genes that failed to have at least . cpm (counts per million) in at least four libraries (each sample has four replicates) in the analysis of the seqc dataset. the expression-filtered data were also used for comparing the accuracy of quantification from using alternative annotations presented in the following sections.
the bar plot in figure a shows that ensembl has significantly more genes (and a higher proportion of genes) filtered out due to low or no expression, compared to refseq-ncbi and refseq-rsubread. after expression filtering, the total numbers of remaining genes from the three annotations became more similar to each other. , genes were found to be common between the three annotations after filtering, accounting for %, % and % of the filtered genes in the ensembl, refseq-ncbi and refseq-rsubread annotations respectively (figure b). almost all the filtered genes in the refseq-rsubread annotation can be found in the other two annotations.
after expression filtering, the median effective gene length increased to ∼ , bases for all annotations (figure c), meaning that a higher proportion of short genes were removed due to low expression in every annotation.
the median effective length of ensembl genes became comparable to, or slightly higher than, those in the two refseq annotations, indicating that the ensembl annotation contained a higher proportion of lowly expressed short genes than the two refseq annotations. when comparing the effective lengths of genes common to all three annotations after filtering, the ensembl genes were found to have the largest median effective length and the refseq-rsubread genes the smallest (figure d). this is not surprising because the ensembl annotation is known to be more aggressive than the refseq annotations and the refseq-rsubread annotation is an old annotation that has not been updated in the last five years. the expression filtering did not seem to affect the distribution of differences in effective gene length between each pair of annotations (using genes common to each pair of annotations), with ensembl and refseq-rsubread remaining the least concordant annotations (figure e and figure c). using genes common to all three annotations after filtering gave distributions of gene length differences between each pair of annotations similar to those obtained using genes common to each pair of annotations (figure f). as before filtering, the gene-wise length comparison performed after filtering also showed that overall the ensembl genes had the largest gene lengths and the refseq-rsubread genes had the shortest gene lengths.

[figure : panels a to f. panel a: gene count before and after filtering for ensembl, refseq-ncbi and refseq-rsubread. panels c and d: log effective gene length per annotation. panels e and f: difference in log effective gene length for refseq-ncbi vs refseq-rsubread, ensembl vs refseq-rsubread and ensembl vs refseq-ncbi.]

figure : concordance and differences between gene annotations after filtering for lowly expressed genes. (a) bar plot showing the differences in the number of genes included in each annotation before and after filtering for lowly expressed genes. (b) venn diagram comparing genes from different annotations after filtering for lowly expressed genes. distributions of effective gene lengths after filtering are shown for all genes in each annotation (c) and for genes that are common between all three annotations (d). distributions of differences of effective gene lengths between annotations after filtering are shown for common genes between each pair of annotations (e) and for genes that are common between all three annotations (f).
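the expression filter described in this section (a gene is kept only if it reaches a minimum cpm in at least four libraries) can be sketched as follows. this is a minimal illustration in plain python, not the code used in the study: the function names, the toy counts and the `min_cpm` value passed at the bottom are hypothetical placeholders, and the threshold in particular is arbitrary rather than the value used by the authors.

```python
def cpm(counts):
    """Convert a gene -> per-library count table to counts per million (CPM)."""
    n_libs = len(next(iter(counts.values())))
    # Library size = total count of each library across all genes.
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_libs)]
    return {
        gene: [1e6 * c / size for c, size in zip(row, lib_sizes)]
        for gene, row in counts.items()
    }


def filter_low_expression(counts, min_cpm, min_libraries=4):
    """Keep genes whose CPM reaches min_cpm in at least min_libraries libraries."""
    table = cpm(counts)
    return {
        gene: counts[gene]
        for gene, row in table.items()
        if sum(value >= min_cpm for value in row) >= min_libraries
    }


# Toy data (hypothetical): three genes, four libraries of 500,000 reads each.
toy_counts = {
    "geneA": [500, 480, 510, 495],
    "geneB": [0, 1, 0, 0],
    "geneC": [499500, 499519, 499490, 499505],
}
# min_cpm=10.0 is an arbitrary placeholder threshold for the toy example.
kept = filter_low_expression(toy_counts, min_cpm=10.0)
print(sorted(kept))
```

in this toy table geneB never reaches the placeholder threshold and is removed, while the other two genes pass in all four libraries. applying the same filter separately to the count table produced with each annotation gives the per-annotation gene sets that are compared in the figure.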
comparison of titration monotonicity preservation

to assess the impact of gene annotation choice on the accuracy of rna-seq quantification, we used as ground truth the inbuilt titration monotonicity in the seqc data, the taqman rt-pcr data and the microarray data generated for the same samples, and evaluated which annotation gives the best correlation between the rna-seq quantification data and the truth. in this section, we compared the ability of ensembl and the two refseq annotations to retain the inbuilt titration monotonicity in the rna-seq dataset. in figure , the reference titration curve depicts the fold change that genes are expected to follow in sample c vs sample d based on the fold change in sample a vs sample b. this is computed using equation ( ) (see materials and methods). we then calculated the mean squared error (mse) between the reference titration monotonicity and the titration monotonicity obtained from each annotation. a smaller mse value means that the generated quantification data are closer to the truth. figure shows that the mse computed for the refseq-rsubread annotation is consistently lower than those computed for the ensembl and refseq-ncbi annotations, regardless of whether filtering was applied or only common genes were included in the comparison. refseq-rsubread was also found to yield comparable or lower mse than the other two annotations when the data were tmm or quantile normalized (supplementary figures s and s ), in addition to the library-size normalized data shown in figure . these results demonstrate that the use of the refseq-rsubread annotation led to better quantification accuracy for the rna-seq data.
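the mse comparison above can be illustrated with a short sketch. the reference curve used in the study is the one given by its equation in materials and methods; the curve below is a simplified stand-in that assumes only the nominal seqc design (sample c is a 3:1 mix and sample d a 1:3 mix of samples a and b) and ignores any correction for differences in mrna content between a and b. the function names and the toy fold changes are hypothetical.

```python
import math


def expected_log2_cd(log2_ab):
    """Expected log2(C/D) for a gene with a given log2(A/B).

    Assumption: C = 3/4 A + 1/4 B and D = 1/4 A + 3/4 B, with no
    mRNA-content correction (an illustrative simplification).
    """
    fold = 2.0 ** log2_ab
    return math.log2((3.0 * fold + 1.0) / (fold + 3.0))


def titration_mse(log2_ab, log2_cd):
    """Mean squared error between observed and expected log2(C/D)."""
    errors = [
        (observed - expected_log2_cd(x)) ** 2
        for x, observed in zip(log2_ab, log2_cd)
    ]
    return sum(errors) / len(errors)


# Hypothetical log2 fold changes (A vs B) for three genes.
observed_ab = [-2.0, 0.0, 2.0]
# Quantification that follows the reference curve exactly.
perfect_cd = [expected_log2_cd(x) for x in observed_ab]
# Quantification with a constant bias of 0.5 on the log2 scale.
biased_cd = [y + 0.5 for y in perfect_cd]

print(titration_mse(observed_ab, perfect_cd))
print(titration_mse(observed_ab, biased_cd))
```

computing this value once per annotation, from the fold changes obtained with that annotation, gives numbers that can be ranked in the way described above: the annotation with the smallest mse tracks the reference titration most closely.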
figure : titration monotonicity plots. the ability of ensembl, refseq-ncbi and refseq-rsubread to retain the titration monotonicity built into the seqc rna-seq data was measured using the mean squared error (mse) between the reference titration and the actual titration obtained from each annotation. the red curve in each plot represents the reference titration calculated using equation ( ). plots in the top row include all the genes available in each annotation. plots in the middle row include those genes that remained after filtering for lowly expressed genes in each annotation. plots in the bottom row include genes that are common to the three annotations after the expression filtering was performed. in each plot, the horizontal axis represents the log fold changes of gene expression between samples a and b and the vertical axis represents the log fold changes of gene expression between samples c and d.

validation against taqman rt-pcr data

the taqman rt-pcr dataset generated in the maqc study [ , ] was used to validate the gene-level quantification results from the rna-seq dataset. this dataset contains measured expression levels for > , genes in the four seqc samples. the aim was to understand how well ensembl and refseq annotated gene expression correlated with the taqman rt-pcr data.

[figure : correlation coefficient for samples a, b, c and d. columns: all genes after filtering, common genes after filtering. rows: library size normalization, quantile normalization, tmm normalization. key: ensembl, refseq-ncbi, refseq-rsubread.]

figure : validation of rna-seq against taqman rt-pcr dataset. shown are pearson correlation coefficients computed from comparing rna-seq data against rt-pcr data, using the rt-pcr genes matched with each individual annotation (left column) or matched with all three annotations (right column). the rows represent the different rna-seq normalization methods used. lowly expressed genes in the rna-seq data were filtered out before the correlation analysis was performed.

the rna-seq data generated from each annotation were filtered to remove lowly expressed genes before being compared to the rt-pcr data. numbers of matched genes between the rt-pcr data and the rna-seq data were , and for ensembl, refseq-ncbi and refseq-rsubread, respectively. rt-pcr genes were found to be common to all the three annotations. the raw taqman rt-pcr data were log-transformed before being compared to the filtered rna-seq data. pearson correlation analysis of the rna-seq gene expression (log fpkm values) and rt-pcr gene expression (log values), using the rt-pcr genes matched with each individual annotation, showed that the refseq-rsubread annotation consistently yielded a higher correlation than the ensembl and refseq-ncbi annotations, across all the samples and the three different normalization methods (left panel in figure ).
the ensembl annotation was found to produce the worst correlation in all these comparisons. when using the rt-pcr genes matched with all three annotations for comparison, refseq-rsubread was again found to yield the highest correlation (right panel in figure ). ensembl and refseq-ncbi were found to produce similar correlation coefficients. taken together, results from this evaluation showed that the use of the refseq-rsubread annotation led to a better concordance in gene expression between the rna-seq data and the rt-pcr data, compared to the use of the ensembl and refseq-ncbi annotations.

validation against microarray data

an illumina beadchip microarray dataset, which was generated by the maqc-i project [ ] for the same samples as in the rna-seq data used in this study, was used to further validate the gene-level rna-seq quantification results obtained from different annotations. the microarray dataset was background corrected and normalized using the 'neqc' function in the limma package [ , ]. microarray genes were then matched to the rna-seq genes included in the filtered rna-seq data. , , , and , microarray genes were found to be matched with rna-seq genes from the ensembl, refseq-ncbi and refseq-rsubread annotations, respectively. , microarray genes were found to be present in all three annotations. for those microarray genes that have more than one probe, a representative probe was selected: the probe with the highest mean expression value across the four samples among all the probes for that gene. a pearson correlation analysis was then performed between microarray data and rna-seq data for each of the three annotations. both rna-seq and microarray data include log expression values of genes.
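the two steps described above, choosing one representative probe per gene and then correlating log expression values between platforms, can be sketched in a few lines. this is an illustrative example only: the probe identifiers, gene names and expression values are made up, and the helper names `representative_probes` and `pearson` are hypothetical rather than taken from the authors' analysis code.

```python
import math


def representative_probes(probe_expr, probe_to_gene):
    """For each gene, keep the probe with the highest mean expression."""
    best = {}
    for probe, values in probe_expr.items():
        gene = probe_to_gene[probe]
        mean = sum(values) / len(values)
        if gene not in best or mean > best[gene][0]:
            best[gene] = (mean, values)
    return {gene: values for gene, (_, values) in best.items()}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical log expression values for samples A, B, C, D (in that order).
probe_expr = {
    "p1": [6.0, 6.2, 6.1, 6.1],
    "p2": [8.0, 8.4, 8.1, 8.3],   # second probe for g1, higher mean
    "p3": [10.0, 9.0, 9.8, 9.2],
    "p4": [5.0, 7.0, 5.5, 6.5],
}
probe_to_gene = {"p1": "g1", "p2": "g1", "p3": "g2", "p4": "g3"}
rnaseq_log = {
    "g1": [3.0, 3.4, 3.1, 3.3],
    "g2": [5.0, 4.0, 4.8, 4.2],
    "g3": [0.0, 2.0, 0.5, 1.5],
}

microarray = representative_probes(probe_expr, probe_to_gene)
genes = sorted(set(microarray) & set(rnaseq_log))  # matched genes only
# Correlation across genes for sample A (index 0).
sample_a_r = pearson(
    [microarray[g][0] for g in genes],
    [rnaseq_log[g][0] for g in genes],
)
print(round(sample_a_r, 6))
```

the correlation is computed across matched genes within one sample, so repeating the last step for each sample, each normalization method and each annotation yields one coefficient per combination, which is the layout of the validation figures.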
figure shows that the use of the refseq-rsubread annotation consistently yielded the highest correlation between rna-seq and microarray data in all the comparisons, no matter which rna-seq normalization method was used and whether all or only common matched genes were included in the evaluation. on the other hand, the use of the ensembl annotation resulted in the worst correlation between rna-seq data and microarray data in all the comparisons.

[figure : correlation coefficient for samples a, b, c and d. columns: all genes after filtering, common genes after filtering. rows: library size normalization, quantile normalization, tmm normalization. key: ensembl, refseq-ncbi, refseq-rsubread.]

figure : validation of rna-seq quantification results against microarray data. shown are pearson correlation coefficients computed from comparing rna-seq data against illumina beadchip microarray data, using the microarray genes matched with each individual annotation (left column) or matched with all three annotations (right column). rows in the plots represent the different rna-seq normalization methods used. lowly expressed genes in the rna-seq data were filtered out before the correlation analysis was performed. for those microarray genes that include more than one probe, a representative probe was selected and used for this analysis.

discussion

the rna-seq technique is currently routinely used for genome-wide profiling of gene expression in the biomedical research field. the analysis of rna-seq data relies on the accurate annotation of genes so that expression levels of genes can be accurately and reliably quantified. there are several major gene annotation sources that have been widely adopted in the field, such as the ensembl and refseq annotations. the ensembl and refseq annotations have been well maintained and are under continuous development. in particular, new gene information collected from next-generation sequencing technologies, such as rna-seq, has been incorporated into the expansion of these annotations in recent years. however, differences between these annotations have raised concerns over the quality and reproducibility of rna-seq data analyses. there are particular concerns regarding the accuracy of new gene annotations generated from the use of the sequencing technologies, due to known errors in the generation and analysis of the sequencing data. to address these concerns, in this study we systematically assessed the differences in rna-seq quantification results attributed to the gene annotation discrepancy. the annotations evaluated in this study included recent ensembl and ncbi refseq annotations and also an older version of the refseq annotation. we compared the recent and old refseq annotations to assess the quality of the new annotations that were added when the sequencing technology was utilized at ncbi for curating refseq gene annotations.
although the ensembl annotation contains significantly more genes than both the recent and old refseq annotations, it was also found to have a much higher proportion of short genes. interestingly, we found that a much higher fraction of these short genes in ensembl were filtered out due to low or no expression in the analysis of the seqc rna-seq dataset, compared to the short genes included in the two refseq annotations. the seqc rna-seq data is a widely used benchmark dataset including the human brain reference rna and universal human reference rna samples, in which a very large number of genes are expressed, so that the entire human transcriptome is well covered.

the use of the refseq-rsubread annotation (the older version of the refseq annotation used in this study) led to substantially more fragments being successfully counted to genes than the use of the refseq-ncbi (the recent refseq annotation used in this study) or ensembl annotations. a detailed investigation revealed that this was because (a) there is less overlap between genes in the refseq-rsubread annotation, leading to less read assignment ambiguity, and (b) the refseq-rsubread annotation contains more genes that are compatible with mapped fragments, even though the transcriptome represented by this annotation is much smaller than those represented by the refseq-ncbi and ensembl annotations. moreover, the quantification data obtained from using refseq-rsubread exhibited a desirably larger intensity range and higher median expression level than the quantification data obtained from using the other two annotations.
the evaluation of quantification accuracy using the genome-wide titration monotonicity truth built into the rna-seq data, the taqman rt-pcr data and the microarray data showed that overall the refseq-ncbi annotation yielded better quantification results than the ensembl annotation. this may not be surprising because the ncbi refseq annotation is a traditionally conservative annotation that is known to be highly accurate, as it uses an evidence-based approach to annotate genes. however, we also found that the refseq-rsubread annotation yielded more accurate quantification results than the refseq-ncbi annotation in almost all the comparisons, which is very surprising. we suspect that this might be due to annotation errors arising from the sequencing data recently utilized in the ncbi refseq annotation generation pipeline. it was reported that sequencing data, including rna-seq data and epigenome sequencing data, started to be utilized by ncbi for curating refseq gene annotations in around [ , ]. between march and july , the number of gene transcripts in the vertebrate mammalian organisms included in the refseq database increased significantly from . million to . million (https://www.ncbi.nlm.nih.gov/refseq/statistics/), a more than twofold increase in just around years. the use of sequencing data for annotation generation is likely to be a significant driver of this rapid expansion of the refseq database. it is known that some errors associated with the generation and analysis of sequencing data are difficult to correct, such as sample contamination, sequencing errors, read mapping errors and read assembly errors. when these errors are carried into the annotation process, they can result in incorrect gene annotations being generated and consequently lead to less accurate quantification of rna-seq data.
conclusion

in conclusion, our findings from this study revealed that the ncbi refseq human gene annotations outperformed the ensembl human gene annotation in the quantification of rna-seq data. however, we also raised concerns over the recent changes made to the refseq database due to the use of sequencing data in the annotation generation process. these changes need to be reviewed and validated to ensure that the refseq database continues to be a reliable and high-quality gene annotation resource for the research community. such a review should be conducted for other gene annotation databases as well.

the findings from this study also have an implication for the quantification of rna-seq data generated by the recently emerged single-cell sequencing technologies. as with the quantification of bulk rna-seq data, an accurate gene annotation is required for quantifying single-cell rna-seq data. it is therefore important to understand if and how the annotation choice impacts the quantification accuracy of single-cell rna-seq data as well.

references

[ ] zhenqiang su, paweł p łabaj, sheng li, jean thierry-mieg, danielle thierry-mieg, wei shi, charles wang, gary p schroth, robert a setterquist, john f thompson, et al. a comprehensive assessment of rna-seq accuracy, reproducibility and information content by the sequencing quality control consortium. nature biotechnology, ( ): , .
[ ] yunshun chen, aaron tl lun, and gordon k smyth.
from reads to genes to pathways: differential expression analysis of rna-seq experiments using rsubread and the edger quasi-likelihood pipeline. f research, : , .
[ ] simon anders, paul t pyl, and wolfgang huber. htseq–a python framework to work with high-throughput sequencing data. bioinformatics, ( ): – , .
[ ] yang liao, gordon k smyth, and wei shi. featurecounts: an efficient general purpose program for assigning sequence reads to genomic features. bioinformatics, ( ): – , .
[ ] yang liao, gordon k smyth, and wei shi. the r package rsubread is easier, faster, cheaper and better for alignment and quantification of rna sequencing reads. nucleic acids research, ( ):e –e , .
[ ] steven l. salzberg. next-generation genome annotation: we still struggle to get it right. genome biology, ( ): , .
[ ] mihaela pertea, alaina shumate, geo pertea, ales varabyou, florian p breitwieser, yu-chi chang, anil k madugundu, akhilesh pandey, and steven l salzberg. chess: a new human gene catalog curated from thousands of large-scale rna sequencing experiments reveals extensive transcriptional noise. genome biology, ( ): , .
[ ] daniel r zerbino, premanand achuthan, wasiu akanni, m ridwan amode, daniel barrell, jyothish bhai, konstantinos billis, carla cummins, astrid gall, carlos garcía girón, et al. ensembl . nucleic acids research, (d ):d –d , .
[ ] nuala a o’leary, mathew w wright, j rodney brister, stacy ciufo, diana haddad, rich mcveigh, bhanu rajput, barbara robbertse, brian smith-white, danso ako-adjei, et al.
reference sequence (refseq) database at ncbi: current status, taxonomic expansion, and functional annotation. nucleic acids research, (d ):d –d , .
[ ] adam frankish, mark diekhans, anne-maud ferreira, rory johnson, irwin jungreis, jane loveland, jonathan m mudge, cristina sisu, james wright, joel armstrong, et al. gencode reference annotation for the human and mouse genomes. nucleic acids research, (d ):d –d , .
[ ] christopher m lee, galt p barber, jonathan casper, hiram clawson, mark diekhans, jairo n gonzalez, angie s hinrichs, brian t lee, luis r nassar, conner c powell, brian j raney, kate r rosenbloom, daniel schmelter, matthew l speir, ann s zweig, david haussler, maximilian haeussler, robert m kuhn, and w j kent. ucsc genome browser enters th year. nucleic acids research, (d ):d –d , .
[ ] po-yen wu, john h phan, and may d wang. assessing the impact of human genome annotation choice on rna-seq expression estimates. bmc bioinformatics, ( ):s , .
[ ] shanrong zhao and baohong zhang. a comprehensive evaluation of ensembl, refseq, and ucsc annotations in the context of rna-seq read mapping and gene quantification. bmc genomics, ( ): , .
[ ] leming shi, gregory campbell, wendell d jones, fabien campagne, zhining wen, stephen j walker, zhenqiang su, tzu-ming chu, federico m goodsaid, lajos pusztai, et al. the microarray quality control (maqc)-ii study of common practices for the development and validation of microarray-based predictive models. nature biotechnology, ( ): – , .
[ ] maqc consortium, leming shi, laura h reid, wendell d jones, richard shippy, janet a warrington, shawn c baker, patrick j collins, francoise de longueville, ernest s kawasaki, et al. the microarray quality control (maqc) project shows inter- and intraplatform reproducibility of gene expression measurements. nature biotechnology, ( ): – , .
[ ] yang liao and wei shi. seqc: rna-seq data generated from seqc (maqc-iii) study, . r package version . . . http://bioconductor.org/packages/release/data/experiment/html/seqc.html.
[ ] marc carlson. org.hs.eg.db: genome wide annotation for human, . r package version . . . https://www.bioconductor.org/packages/release/data/annotation/html/org.hs.eg.db.html.
[ ] matthew e ritchie, belinda phipson, di wu, yifang hu, charity w law, wei shi, and gordon k smyth. limma powers differential expression analyses for rna-sequencing and microarray studies. nucleic acids research, ( ):e –e , .
[ ] wolfgang huber, vincent j carey, robert gentleman, simon anders, marc carlson, benilton s carvalho, hector corrada bravo, sean davis, laurent gatto, thomas girke, et al. orchestrating high-throughput genomic analysis with bioconductor. nature methods, ( ): , .
[ ] yang liao, gordon k smyth, and wei shi. the subread aligner: fast, accurate and scalable read mapping by seed-and-vote. nucleic acids research, ( ):e –e , .
[ ] charity w law, yunshun chen, wei shi, and gordon k smyth. voom: precision weights unlock linear model analysis tools for rna-seq read counts. genome biology, ( ):r , .
[ ] ali mortazavi, brian a williams, kenneth mccue, lorian schaeffer, and barbara wold. mapping and quantifying mammalian transcriptomes by rna-seq. nat methods, ( ): – , .
[ ] benjamin m bolstad, rafael a irizarry, magnus åstrand, and terence p. speed. a comparison of normalization methods for high density oligonucleotide array data based on variance and bias. bioinformatics, ( ): – , .
[ ] mark d robinson and alicia oshlack. a scaling normalization method for differential expression analysis of rna-seq data.
genome biology, ( ):r , .
[ ] mark d robinson, davis j mccarthy, and gordon k smyth. edger: a bioconductor package for differential expression analysis of digital gene expression data. bioinformatics, ( ): – , .
[ ] wei shi, alicia oshlack, and gordon k smyth. optimizing the noise versus bias trade-off for illumina whole genome expression beadchips. nucleic acids research, ( ):e , .
[ ] kim d pruitt, garth r brown, susan m hiatt, françoise thibaud-nissen, alexander astashyn, olga ermolaeva, catherine m farrell, jennifer hart, melissa j landrum, kelly m mcgarvey, et al. refseq: an update on mammalian reference sequences. nucleic acids research, (database issue):d –d , .

a self-supervised machine learning approach for objective live cell segmentation and analysis

michael c. robitaille, jeff m. byers, joseph a. christodoulides, marc p. raphael*
materials science and technology division, u.s. naval research laboratory, washington d.c.
* corresponding author: marc.raphael@nrl.navy.mil

abstract

machine learning algorithms hold the promise of greatly improving live cell image analysis by way of ( ) analyzing far more imagery than can be achieved by more traditional manual approaches and ( ) eliminating the subjective nature of researchers and diagnosticians selecting the cells or cell features to be included in the analyzed data set. currently, however, even the most sophisticated model-based or machine learning algorithms require user supervision, meaning the subjectivity problem is not removed but rather incorporated into the algorithm’s initial training steps and then repeatedly applied to the imagery. to address this roadblock, we have developed a self-supervised machine learning algorithm that recursively trains itself directly from the live cell imagery data, thus providing objective segmentation and quantification. the approach incorporates an optical flow algorithm component to self-label cell and background pixels for training, followed by the extraction of additional feature vectors for the automated generation of a cell/background classification model. because it is self-trained, the software has no user-adjustable parameters and does not require curated training imagery. the algorithm was applied to automatically segment cells from their background for a variety of cell types and five commonly used imaging modalities: fluorescence, phase contrast, differential interference contrast (dic), transmitted light and interference reflection microscopy (irm). the approach is broadly applicable in that it enables completely automated cell segmentation for long-term live cell phenotyping applications, regardless of the input imagery’s optical modality, magnification or cell type.
key words: live cell imaging, segmentation, phenotyping, machine learning, unsupervised, classification

biorxiv preprint (which was not certified by peer review); this version posted january , . ; doi: https://doi.org/ . / . . . the copyright holder for this preprint is the author/funder. this article is a us government work. it is not subject to copyright under usc and is also made available for use under a cc license.

introduction

live cell phenotyping is an information rich experimental approach, capable of providing mechanistic insights into cell biology, guiding drug development and elucidating disease pathologies. the wealth of information available from live cell microscopy results from the fact that there are numerous optical modalities that can be integrated within a given experiment, from fluorescence imaging, which provides spatio-temporal information on specific signaling pathways and organelles, to label-free techniques such as phase contrast and differential interference contrast (dic), which enable the visualization of whole cellular morphologies and dynamics. each of these modalities provides its own outcome measures, which can be viewed as static snapshots or dynamic variations within the four-dimensional space of x, y, z and time. however, compared to genotyping, its synergistic partner technique, live cell phenotyping remains a far more subjective science. the generation of genomic sequencing data and its analysis can now be achieved autonomously by employing a combination of robotics and microfluidics for sample preparation and machine learning algorithms for data collection and interpretation. in contrast, the extraction of quantitative information from live cell imagery by manual means is still commonplace in live cell microscopy, a fact which speaks to the human visual system’s adeptness at detecting small changes and low contrast features with high fidelity.
but with automated live cell microscopes now able to collect high resolution imagery for days on end, the resulting data files can quickly grow to tens of gigabytes, leaving the analyst with an overwhelming amount of imagery to work through. furthermore, if the analyst is not blinded to the experimental design, unconscious bias can creep into the data extraction process.

enter computational algorithms capable of extracting the relevant outcome variables from the imagery in an automated fashion. broadly speaking, the algorithms are often classified as model-based approaches (e.g. cell profiler) and machine learning algorithms (e.g. u-net, ilastik). neither approach is completely autonomous when it comes to cell segmentation: model-based approaches require the manual tuning of multiple parameters, while machine learning requires that the user provide curated data from which the algorithm is trained. once tuned or trained, the software is able to process far more imagery than could be achieved manually, but there is still a human-in-the-loop. the manual contribution has simply been moved to the front end for training purposes and is then continuously reapplied by the algorithm. algorithms that are tuned or trained at the outset can miss relevant features as the cellular phenotypes or background characteristics evolve, inadvertently skewing the analysis. for instance, variations in label intensity (e.g. photobleaching, quenching) or new morphological features that were not present during the initial training (e.g. differentiation, mitosis, blebbing) can go undetected unless the algorithm is retrained with a freshly curated data set or retuned with parameters that capture the offending features. in the same way, temporal variations in the background illumination intensity or homogeneity can also result in improper cell segmentation.
especially concerning is that the user-supervised training process is inherently subjective and can cause unconscious biases to be effectively baked into the extracted data. to optimize objectivity and efficiency, an essential goal is to develop software that can accept imagery from any optical modality, labeled or unlabeled, and extract the cellular features of interest without input from the user. as participants in a synthetic biology real-time reproducibility project administered by the u.s. defense advanced research projects agency (darpa), referred to as independent verification & validation (iv&v), we have recently experienced all of these algorithmic limitations and how they can result in large amounts of data being incorrectly segmented, subjectively segmented, or left unanalyzed due to time constraints. the program involves a wide range of cell types (amoeboid to eukaryotic) from multiple cell biology laboratories; multiple imaging modalities – both fluorescent and label free; and objective magnifications ranging from x to x. the cumbersome process of retraining supervised machine learning software to match this variety of conditions proved impractical, and a human-in-the-loop training step was deemed too subjective. the challenge then was to develop a completely automated segmentation algorithm for live cell microscopy applications. in particular, the image analysis software should be ‘self-supervised’, meaning it trains itself to classify cells versus background and then regularly updates this training so that it can adapt to evolving intensities and morphologies. the software was required to segment a variety of cell types from live cell imagery given the most common imaging modalities as inputs - phase contrast, transmitted light, dic, fluorescence and interference reflection microscopy (irm) – and to do so without user-adjustable parameters or user-selected training imagery.
it was additionally required that the generated models adapt to changing cell phenotypes and lighting conditions for long-term imaging applications (hours to days).

methods

to replace more manual model-based and machine learning training approaches for segmenting cells with an automated, self-supervised algorithm, we took advantage of the one phenotypic feature which is present in live cell microscopy no matter what the modality: motion. from the nanoscale diffusion of proteins and vesicles to the migration of cells that are tens of microns in length, the ever present dynamics captured by live cell microscopy make it ideal for applying optical flow (of) algorithms designed to identify not just spatial intensity features in a given frame but also the variation or ‘flow’ of those features from frame to frame. the central assumption in optical flow algorithms is that the overall image intensity will remain constant if the time difference between frames is reasonably small. this leads to the following time-derivative constraint equation:

$$\frac{d}{dt} I(x, y, t) = 0 \;\rightarrow\; \underbrace{\frac{dx}{dt}}_{u} \frac{\partial I}{\partial x} + \underbrace{\frac{dy}{dt}}_{v} \frac{\partial I}{\partial y} + \frac{\partial I}{\partial t} = 0$$

where $I(x, y, t)$ is the in-plane image intensity at time $t$, and $u$ and $v$ are the optical flow in the x and y directions, respectively. the methods used to solve this constraint equation are matched with the imaging goal, such as reducing jitter in imagery taken from helicopters, aligning medical imagery or, in the case of this study, cell motion segmentation.
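as a quick numerical illustration of the constraint equation, the sketch below builds a smooth synthetic intensity pattern that translates at a known velocity and evaluates u·∂I/∂x + v·∂I/∂y + ∂I/∂t by central differences. the pattern, the velocity values and the function names are assumptions made for this example only; they are not taken from the authors' matlab code.

```python
import math

# Hypothetical velocity of the synthetic pattern, in pixels per frame.
TRUE_U, TRUE_V = 0.5, -0.25

def intensity(x, y, t):
    """Smooth synthetic image intensity translating with (TRUE_U, TRUE_V)."""
    return math.sin(0.3 * (x - TRUE_U * t)) + math.cos(0.2 * (y - TRUE_V * t))

def constraint_residual(x, y, t, u, v, h=1e-4):
    """Central-difference estimate of u*dI/dx + v*dI/dy + dI/dt at (x, y, t)."""
    ix = (intensity(x + h, y, t) - intensity(x - h, y, t)) / (2 * h)
    iy = (intensity(x, y + h, t) - intensity(x, y - h, t)) / (2 * h)
    it = (intensity(x, y, t + h) - intensity(x, y, t - h)) / (2 * h)
    return u * ix + v * iy + it

# The residual vanishes for the true flow and not for a wrong candidate flow.
residual_true = constraint_residual(4.0, 7.0, 2.0, TRUE_U, TRUE_V)
residual_zero_flow = constraint_residual(4.0, 7.0, 2.0, 0.0, 0.0)
print(residual_true, residual_zero_flow)
```

a single constraint per pixel cannot fix both u and v, which is why practical methods add neighbourhood or smoothness assumptions on top of it.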
in testing a range of optical flow algorithms for cell segmentation, we found the farnebäck method to be the most robust due to its sensitivity to object deformation – a natural fit for cells, which are morphologically variable. optical flow assumptions may or may not be met for fluorescence time-lapse imagery applications in which extended time intervals are sometimes employed to avoid phototoxicity or photobleaching. for this reason, it was important that our technique be co-validated with label-free techniques such as transmitted light and phase contrast, which are minimally invasive. overlaying less frequently acquired fluorescence imagery on cells segmented using a label-free imaging channel is then straightforward. furthermore, there has been an increased appreciation for the morphological information label-free approaches can provide as a result of algorithmic-based phenotyping. our approach to self-supervised learning and automated model generation begins with utilizing the farnebäck optical flow method as a means of classification bootstrapping (fig ). typical segmentation strategies involve utilizing static information in a single image at time frame (t), which can have difficulty distinguishing ‘cell’ from ‘background’ pixels in a generalizable manner (fig a). in contrast, our approach begins with an optical flow calculation based on images from consecutive time frames (t-1, t). this enables us to leverage the ubiquitous nature of intracellular motion and build a dynamics-based feature vector: pixels with the highest flow are automatically labeled as ‘cell’ pixels, those with the lowest flow are automatically labeled as ‘background’ pixels, and those that do not fit either category remain unlabeled (fig b,c). we note that this automatic self-labeling is broadly applicable in that it is not dependent on principles of any specific optical modality, cell type, or phenotype.
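the three-way self-labeling rule described above can be sketched as follows. this is a minimal illustration that assumes a precomputed grid of optical flow magnitudes; the threshold values, label codes and function name are hypothetical, since the text does not state how the authors' matlab implementation chooses its thresholds.

```python
CELL, BACKGROUND, UNLABELED = 1, 0, -1

def self_label(flow_magnitude, high_threshold, low_threshold):
    """Label each pixel from its flow magnitude alone.

    Pixels at or above high_threshold become CELL, pixels at or below
    low_threshold become BACKGROUND, and everything in between is left
    UNLABELED for a later classifier to decide.
    """
    labels = []
    for row in flow_magnitude:
        labeled_row = []
        for magnitude in row:
            if magnitude >= high_threshold:
                labeled_row.append(CELL)
            elif magnitude <= low_threshold:
                labeled_row.append(BACKGROUND)
            else:
                labeled_row.append(UNLABELED)
        labels.append(labeled_row)
    return labels

# Toy flow magnitudes (arbitrary units) for a 3 x 4 image patch.
flow = [
    [0.02, 0.05, 0.40, 1.20],
    [0.01, 0.30, 1.50, 1.10],
    [0.03, 0.04, 0.25, 0.90],
]
labels = self_label(flow, high_threshold=1.0, low_threshold=0.1)
print(labels)
```

only the pixels labeled CELL or BACKGROUND would feed the training set; the unlabeled ones are the pixels a trained classifier has to resolve.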
the optical flow based self-labeling approach outputs a set of ‘cell’ and ‘background’ labeled pixels, which are then used to generate additional entropy and gradient feature vectors at each time point. these static feature vectors are used to train and generate a classifier model which, in the final step, is applied to all pixels in the image for cell segmentation. the algorithm is written as a stand-alone matlab script and utilizes functions from the image processing, statistics and machine learning, and computer vision toolboxes.

fig. overview of the optical flow self-labeling strategy. (a) the vast majority of cell segmentation techniques utilize single image frames and the static information contained within as a means to distinguish ‘cell’ from ‘background’, oftentimes represented in a histogram. the self-supervised algorithm utilizes optical flow as a means to self-label pixels in an automated fashion. (b) due to the prevalence of intracellular dynamics in time-lapse live cell imagery, optical flow can be calculated for each pair of consecutive images (t-1, t). the optical flow can then be represented as vectors associated with each pixel (right). (c) the magnitude of the optical flow then offers a means to distinguish cells from their background, as shown in the bivariate histogram which co-plots the pixel intensity of a single image at t against the optical flow vector magnitudes calculated between consecutive images (t-1, t). pixels with the highest flow can be automatically labeled ‘cell’ (left of the green dashed line) and those with the lowest flow can be labeled ‘background’ (right of the yellow dashed line).
pixels that do not meet either criterion remain unlabeled, while the self-labeled pixels are used to create a training data set for classification. time increment: sec, scale bar = µm.

the self-supervised training approach is illustrated in fig using time lapse dic imagery of multiple (top) and a single highlighted (bottom) mda-mb- cell. in the raw imagery (fig a,b), many portions of individual cells appear to blend in with the background. however, when the optical flow self-labeling strategy is applied, the algorithm automatically identifies pixels with high flow magnitude, highlighted as green pixels (fig c,d), which are selected as having the highest probability of correctly being labeled ‘cell’. to automatically label the background, the algorithm over-segments, that is, a liberal (low) optical flow threshold is employed which captures motion not only from the cell but also from nearby background pixels. the algorithm sets these pixel values to zero and labels the pixels in which no significant motion was detected as ‘background’ (fig c,d yellow pixels). once pixels have been labeled ‘cell’ or ‘background’ in this unsupervised manner by optical flow (dynamic features from image pair (t-1, t)), entropy and gradient feature vectors (static features from the image at t) are generated for each of these training pixels using their local neighborhood of pixels (s.i., fig s ). these additional feature vectors are then used to train and generate a naïve bayesian classifier model which is applied to the entire image in a pixel-wise fashion. the information gained from
the entropy and gradient feature vectors enables pixels which were left unlabeled in the optical flow training steps (fig c,d grey pixels) to be classified. the contrast enhanced image (fig b) and model-generated segmentation (fig f, teal pixels) show that the algorithm is able to segment the cell with high fidelity (dic image/segmented boundary overlay, fig g). importantly, this labeling, training and classifying procedure is repeated on each successive pair of (t-1, t) images, enabling the classifier model to adapt to changing backgrounds and phenotypes. by using optical flow to label the highest flow pixels as ‘cell’ and the lowest flow pixels as ‘background’, the labeling process becomes automated (or ‘self-supervised’) and no manual inputs or training images are needed. for extremely low contrast imagery there can be too few training pixels labeled ‘cell’ for robust segmentation to occur given the initial optical flow threshold setting. in such cases, the algorithm calculates the entropy associated with ‘cell’ pixels and iteratively reduces the optical flow threshold until the associated ‘cell’ entropy feature vector is well distinguished from the ‘background’ entropy feature vector.

fig. overview of the automated self-supervised learning algorithm. a. the contrast enhanced dic image of several cells and b. a single highlighted mda-mb- cell illustrate the range of intensities inherent within the cells ( x objective). c. & d. unsupervised learning via optical flow: a high optical flow threshold is used to select only those pixels exhibiting the highest flow magnitudes, which are labeled ‘cell’ (green pixels). similarly, a low optical flow threshold is used to identify pixels with a much wider range of flow magnitudes than the high flow regime. the lowest flow magnitude pixels are labeled ‘background’ (yellow pixels). pixels that exhibit optical flow in between these regimes remain unlabeled (gray pixels). e. & f. supervised learning via self-labeled training data.
the self-labeled pixels (green and yellow) are then used to generate static feature vectors, which are in turn used to train the classifier model. g. the blue outline is the resulting segmentation, which outlines all pixels classified by the optical flow trained model as ‘cell’ and is also overlaid on the image in b. this process is repeated at every time step, thereby using the most recent imagery to update the training data. scale bar: µm ( x objective, time increment: sec).

results

the fig imagery shows the generality of this approach and also demonstrates how the self-supervised algorithm additionally automates commonly required manual inputs such as size filtering and hole filling. the segmented cells were processed from imagery acquired from a range of cell types, imaging modalities, magnifications and time increments (s.i. table s ). the optical flow algorithm enabled a straightforward approach to automated size filtering, which is a common user-adjustable parameter in supervised machine learning approaches. to accomplish this, a stand-alone application of optical flow, lacking the added steps of self-tuning and model building described above, was applied to the imagery. while some cell features are missed, this simpler, faster approach was found to be more than precise enough to estimate average cell size and to exclude much smaller objects, thus automating the size filtering process. because extraneous debris often lacked the motion of the live cells, this debris was also automatically labeled as background by the optical flow algorithm.
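one way to picture the automated size filter is sketched below: a coarse motion mask (standing in for the stand-alone optical flow pass) supplies an average object area, and objects in the final segmentation that fall well below that area are discarded. the 4-connectivity, the `min_fraction` cut-off and all names are assumptions made for illustration; the text does not specify the rule used in the authors' code.

```python
from collections import deque

def components(mask):
    """Return the 4-connected components of a binary mask as lists of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    found = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                found.append(pixels)
    return found

def size_filter(segmentation, motion_mask, min_fraction=0.25):
    """Drop segmented objects much smaller than the mean object area in motion_mask."""
    areas = [len(p) for p in components(motion_mask)]
    if not areas:  # no motion detected: nothing to calibrate against
        return [row[:] for row in segmentation]
    mean_area = sum(areas) / len(areas)
    filtered = [[0] * len(segmentation[0]) for _ in segmentation]
    for pixels in components(segmentation):
        if len(pixels) >= min_fraction * mean_area:
            for y, x in pixels:
                filtered[y][x] = 1
    return filtered

motion_mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
# Same cell plus a single-pixel speck of debris at row 3, column 5.
segmentation = [row[:] for row in motion_mask]
segmentation[3][5] = 1
filtered = size_filter(segmentation, motion_mask)
print(filtered)
```

because the reference area comes from the imagery itself, the same cut-off fraction can serve for large and small cell types without manual retuning.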
fig a and b demonstrate the self-supervised code’s ability to size filter, while also adapting to cell types of differing sizes, by comparing the segmentation of human fibroblasts ( x, phase contrast) to those of the much smaller dictyostelium amoeboid cells ( x, transmitted light), respectively. extraneous debris features in the hs imagery (fig a, white arrows) are correctly identified as ‘background’, even though they are similar in size and intensity to the dictyostelium cells of fig b. the background inhomogeneities observed in fig a and b, which could potentially be mislabeled as ‘cell’, are correctly identified because they remain relatively constant from frame t-1 to frame t. the segmentation results of the mda-mb- cells ( x, phase contrast) in fig c illustrate the algorithm’s ability to adapt to a wide range of phenotypes, from rounded (fig c(i)) to spread (fig c(ii)), which is enabled without need for user input by continuously retraining the model on consecutive image pairs. the current instantiation of the software does not attempt to separate cells that are touching or close enough to be segmented as a single object. well-developed approaches such as watershed transforms and level set methods can be employed for such purposes. the algorithm works robustly for a range of optical modalities and magnifications, as shown in figs d-f. figs d and e are segmentation results from irm imagery ( x, hs cell) and dic imagery ( x, mda-mb- ). as a fluorescence imaging example, a self-supervised segmentation of a gfp-actin labeled a cell at x magnification is shown in fig f. as an additional option, optical flow can be applied not only as an algorithm labeling element, but also as a measurement tool, as shown in the fig f vector plot. the plotted optical flow vectors (blue) display the magnitude and direction of the measured gfp labeled actin flow between frames. such measurements have been shown to be useful for quantifying intracellular protein and calcium signaling dynamics.
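the magnitude and direction displayed for each flow vector follow directly from its components (u, v). the short sketch below shows that conversion; the function name and the example values are illustrative assumptions, and the angle is measured in the image's own axis convention (in most image coordinate systems y increases downwards).

```python
import math

def flow_polar(u, v):
    """Return (magnitude, direction in degrees) of a single flow vector (u, v)."""
    magnitude = math.hypot(u, v)
    direction = math.degrees(math.atan2(v, u))
    return magnitude, direction

# Example: a displacement of 3 pixels in x and 4 pixels in y between frames.
magnitude, direction = flow_polar(3.0, 4.0)
print(magnitude, direction)
```

dividing the magnitude by the frame interval converts the per-frame displacement into a speed.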
fig. self-supervised segmentation for a range of cell types, microscope modalities, time resolutions and magnifications. a. phase contrast of hs fibroblasts ( x objective, time increment: sec) b. transmitted light of dictyostelium ( x objective, time increment: sec) c. phase contrast of mda-mb- ( x objective, time increment: sec) d. irm image of a single hs cell ( x objective, time increment: sec) e. dic image of mda-mb- cells ( x objective, time increment: sec) f. fluorescence image of a single lifeact (gfp-actin conjugate) transfected a cell (pseudo-colored) with the associated optical flow vector plot ( x objective, time increment: sec). insets i, ii, iii highlight boxed image regions. white arrows point to examples of debris that was correctly labeled ‘background’ due either to lack of motion or automated size filtering. images have been contrast enhanced to highlight low contrast features and background inhomogeneities. dic image (e) was additionally enhanced with a sharpen filter to highlight interference induced shadowing of cell features. scale bars: a, b, c: µm; d, e: µm; f: µm.

hole filling, another often required manual input for model-based and machine learning algorithms, has also been automated by this approach.
common examples of when hole filling input is required include fluorescent labels that do not penetrate the nucleus or, for label-free microscopy modes such as phase contrast, large spread cells in which the algorithm has a difficult time associating the interference enhanced cell edges with the enclosed lamellipodia. we found that motion within cells was ubiquitously detected by optical flow, regardless of imaging modality or whether imaging the cell membrane, nucleus or cytoplasm. because motion detection was far more common than not for a given pixel within an area labeled ‘cell’, a fixed morphological blurring tool (circular with a radius of pixels) was found to robustly hole fill regardless of cell type or microscope configuration. the calculated cell area was found to be invariant for a range of blurring tool radii (fig s ). in all cases, the use of optical flow to identify motion, together with the pixel radius blurring tool, was sufficient to correctly fill in the cell.

by re-training on every pair of consecutive images, the self-supervised algorithm remains accurate throughout long-term imaging applications, despite changes in background or cell phenotypes. this allows rich dynamic morphology and migration behavior to be readily collected and analyzed – a key point given the known inter-relationship between cellular shape and function. furthermore, the emerging role that not just cell shape, but cell shape dynamics, play in fundamental biological processes is becoming increasingly clear. fig demonstrates how such quantitative morphological information is readily mined in a long-term imaging application. fig a-c shows the tracking of several mda-mb- cells segmented via the self-supervised approach under x phase contrast microscopy on crgd functionalized gold coverslips. fig a shows the labeled tracks of the cells’ centroids over the course of minutes, with the corresponding initial and final image shown in figs b,c.
the cell associated with track undergoes mitosis at approximately minutes, creating two new tracks ( and ) for the daughter cells. because the self-supervised approach automatically re-trains continuously on consecutive frame pairs, the morphological changes from fig b to fig c are quantified with high fidelity, as can be seen by plotting the segmented boundaries as a function of time (fig d).

fig. tracking of mda-mb- cells under x phase contrast microscopy and time evolution of cell morphology through mitosis. a. the resulting tracks of multiple segmented cells from a single field of view over the course of minutes b. corresponding images at times t = min and c. min. track undergoes mitosis resulting in tracks and of the daughter cells (blue line). d. (left) time evolution of segmented morphology of track (black) with the centroid of each shape denoted by an open circle until mitosis, after which the track splits into (green) and (blue), with the cell separation event denoted by a single red open circle. d. (right) selected images showing raw data overlaid with the self-supervised segmentation throughout the mitosis event. ( x objective, time increment: sec) scale bar: μm.

discussion & conclusions

there are numerous advantages to this self-supervised machine learning approach.
the most obvious is that because the training data is generated by tracking motion, the approach can be used with any live cell imaging microscopy technique, whether labeled or label-free. also unique is the use of the optical flow labeled pixels to self-supervise the building of a classifier model, which in turn is modular with regard to the incorporated feature vectors. while we have employed only two feature vectors in this current instantiation of the classification code (gradient and entropy), there are many additional image features that can be added based on the application. we have also shown that the incorporation of optical flow enables the straightforward automation of morphological operations such as size filtering and hole filling, eliminating the need for manually tuning these parameters. the automation described here is markedly different from machine learning approaches that require user assisted training. the most time consuming aspect of model-based tuning and machine learning approaches is the training process. the process is one of trial and error, requiring retraining if the model’s performance is not deemed adequate. the complete automation of both the training and segmentation algorithms not only saves time but also removes the chance of unconscious bias entering the training process. because the training is repeated with each new image, evolutions in phenotype and background structure over extended time periods are accounted for without the need for preprocessing. the sum of all these advantages is segmentation under a wide range of magnifications, time resolutions, cell types and optical modalities that is both automated and robust. this results in the ability to track cells for hours or days and quantify a range of morphological and phenotypic features without the need for user input, thus having broad applicability throughout live cell microscopy.
the crux of the introduced self-supervised approach relies upon using the dynamic information embedded in each pixel – motion characterized via optical flow – as an elegant means to self-label cells versus background in time-lapse imagery. while cellular dynamics has long been appreciated as information rich with regard to understanding cell function, our approach demonstrates that it also provides the means for robust segmentation – a foundational step for achieving quantitative and objective live cell analysis.

acknowledgements

the authors gratefully acknowledge the devreotes laboratory of johns hopkins university for the dictyostelium discoideum cell line. m.c.r. gratefully acknowledges support from the national research council research associateship program and the jerome and isabella karle distinguished scholar fellowship program. funding for this project was provided by the office of naval research through the naval research laboratory’s basic research program and by the biological technology office of the defense advanced research projects agency.

author contributions

michael c. robitaille: conceptualization, methodology, investigation, data curation, software, visualization, and writing. jeff m. byers: conceptualization, methodology, formal analysis, and software. joseph a. christodoulides: resources, validation, and writing. marc p. raphael: conceptualization, funding acquisition, methodology, investigation, software, visualization, and writing.

financial conflicts of interest

the authors do not have any conflicts of interest with this work.

references

caicedo, j. c., singh, s. & carpenter, a. e.
applications in image-based profiling of perturbations. current opinion in biotechnology , - , doi: . /j.copbio. . . ( ). cadart, c., zlotek-zlotkiewicz, e., le berre, m., piel, m. & matthews, h. k. exploring the function of cell shape and size during mitosis. developmental cell , - , doi: . /j.devcel. . . ( ). zhou, x. b. & wong, s. t. c. high content cellular imaging for drug development. ieee signal processing magazine , - , doi: . /msp. . ( ). zhong, j. et al. persistent hepatitis c virus infection in vitro: coevolution of virus and host. journal of virology , - , doi: . /jvi. - ( ). zhu, n. et al. morphogenesis and cytopathic effect of sars-cov- infection in human airway epithelial cells. nature communications , doi: . /s - - -z ( ). skylaki, s., hilsenbeck, o. & schroeder, t. challenges in long-term imaging and quantification of single-cell dynamics. nature biotechnology , - , doi: . /nbt. ( ). caicedo, j. c. et al. data-analysis strategies for image-based cell profiling. nature methods , - , doi: . /nmeth. ( ). deep learning gets scope time. nature methods , - , doi: . /s - - -x ( ). grys, b. t. et al. machine learning and computer vision approaches for phenotypic profiling. journal of cell biology , - , doi: . /jcb. ( ). moen, e. et al. deep learning for cellular image analysis. nature methods , - , doi: . /s - - - ( ). carpenter, a. e. et al. cellprofiler: image analysis software for identifying and quantifying cell phenotypes. genome biology , doi: . /gb- - - -r ( ). al-kofahi, y., zaltsman, a., graves, r., marshall, w. & rusu, m. a deep learning-based algorithm for -d cell segmentation in microscopy images. bmc bioinformatics , doi: . /s - - -z ( ). falk, t. et al. u-net: deep learning for cell counting, detection, and morphometry (vol , pg , ). nature methods , - , doi: . /s - - - ( ). sommer, c., straehle, c., kothe, u., hamprecht, f. a. & ieee. 
in th ieee international symposium on biomedical imaging: from nano to macro ieee international symposium on biomedical imaging - ( ). raphael, m. p., sheehan, p. e. & vora, g. j. a controlled trial for reproducibility. nature , - , doi: . /d - - - ( ). beauchemin, s. s. & barron, j. l. the computation of optical flow. acm comput. surv. , - , doi: . / . ( ). farneback, g. in image analysis, proceedings vol. lecture notes in computer science (eds j. bigun & t. gustavsson) - ( ). robitaille, m. c., byers, j. m., christodoulides, j. a. & raphael, m. p. robust optical flow algorithm for general, label-free cell segmentation. biorxiv, . . . , doi: . / . . . ( ). schroeder, t. long-term single-cell imaging of mammalian stem cells. nature methods , s -s , doi: . /nmeth. ( ). jaccard, n. et al. automated method for the rapid and precise estimation of adherent cell culture characteristics from phase contrast microscopy images. biotechnol. bioeng. , - , doi: . /bit. ( ). ounkomol, c., seshamani, s., maleckar, m. m., collman, f. & johnson, g. r. label-free prediction of three-dimensional fluorescence images from transmitted-light microscopy. nature methods , -+, doi: . /s - - - ( ). vicar, t. et al. cell segmentation methods for label-free contrast microscopy: review and comprehensive comparison. bmc bioinformatics , , doi: . /s - - - ( ). wang, m. et al. novel cell segmentation and online svm for cell cycle phase identification in automated microscopy. bioinformatics , - , doi: . /bioinformatics/btm ( ). nath, s. k., palaniappan, k. & bunyak, f. in medical image computing and computer-assisted intervention - miccai , pt vol.
lecture notes in computer science (eds r. larsen, m. nielsen, & j. sporring) - ( ). buibas, m., yu, d., nizar, k. & silva, g. a. mapping the spatiotemporal dynamics of calcium signaling in cellular neural networks using optical flow. annals of biomedical engineering , - , doi: . /s - - - ( ). delpiano, j. et al. performance of optical flow techniques for motion analysis of fluorescent point signals in confocal microscopy. machine vision and applications , - , doi: . /s - - - ( ). lee, r. m. et al. quantifying topography-guided actin dynamics across scales using optical flow. mol. biol. cell , - , doi: . /mbc.e - - ( ). meyers, j., craig, j. & odde, d. j. potential for control of signaling pathways via cell size and shape. current biology , - , doi: . /j.cub. . . ( ). rangamani, p. et al. decoding information in cell shape. cell , - , doi: . /j.cell. . . ( ). akanuma, t., chen, c., sato, t., merks, r. m. h. & sato, t. n. memory of cell shape biases stochastic fate decision-making despite mitotic rounding. nature communications , doi: . /ncomms ( ). robitaille, m. c. et al. problem of diminished crgd surface activity and what can be done about it. acs applied materials & interfaces , - , doi: . /acsami. c ( ).
dynugene: an r package for uncertainty-aware gene regulatory network inference, simulation, and visualization

tianyu lu and anjali silva

department of computer science, university of toronto, toronto, canada; department of cell and systems biology, university of toronto, toronto, canada; princess margaret cancer centre, university health network, toronto, canada; vector institute, toronto, canada

methods for gene regulatory network inference focus on network architecture identification but neglect model selection and simulation. we implement an extension to the dyngenie algorithm that accounts for model uncertainty as an r package, providing users with an easy to use interface for model selection and gene expression profile simulation. source code is available at https://github.com/tianyu-lu/dynugene with a detailed user guide. a webserver with interactive controls is available at https://tianyulu.shinyapps.io/dynugene/.

gene regulatory network | network inference

correspondence: tianyu.lu @mail.utoronto.ca

introduction

complex phenomena such as cell development and apoptosis emerge from coordinated dynamics of gene regulatory networks (grn). inferring network structure from data can be used for hypothesis generation, revealing mechanisms in cell development and disease (huang et al., ), and modelling network evolution (crombach and hogeweg, ). accurate dynamical models allow us to predict the effects of network perturbations on biological function, for example to push cells out of a disease state (karlebach and shamir, ), or to design synthetic grns given the desired dynamics of a network (hiscock, ). the ideal model should be flexible enough to capture highly nonlinear interactions while not sacrificing model interpretability and computation time.
we present dynugene (dynamical uncertainty-aware gene network inference), an r package that extends the functionality of dyngenie , a state-of-the-art method for grn inference (geurts et al., ). we build on dyngenie because it satisfies all three of our model desiderata. existing extensions include timeor and benin, which both incorporate heterogeneous data to improve network inference accuracy (wonkap and butler, ; conard et al., ). here, we take a different approach and instead account for uncertainty in dyngenie , allowing for stochastic gene expression simulations and parsimonious model selection. our extension is available as an easy-to-use r package and also as an interactive web server.
package design
dyngenie background. dyngenie poses grn inference as a feature selection problem. it first trains random forests to predict the change in concentration of each species given the current concentrations of all species. each interaction from species xi to species xj is associated with an importance score, calculated by the reduction in variance from using xi to predict the change in xj. the importance score for an interaction, when normalized, is interpreted as the probability that the interaction exists. for a detailed treatment, see the vignette and (geurts et al., ).
model selection. the inferred network can be visualized as a p×p matrix where the entry [xi,xj] is the importance score of xi for inferring xj (fig. ). however, real grns are often not fully connected and the presence of an interaction is binary (mangan et al., ). to address this, dynugene includes a function for model selection based on visualizing the pareto front (mangan et al., ). however, we note that the model at the sharp drop in the pareto front is not always the best model (supplementary fig. s ). we include an additional function on the web server where users can choose which interactions to mask.
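The masking idea in the model selection paragraph can be illustrated with a small sketch. This is not the package's own procedure: the function names `mask_weakest` and `stepwise_masks`, the keep-the-k-largest rule, the exclusion of self-interactions, and the toy scores are all assumptions made for illustration.

```python
def mask_weakest(scores, keep):
    """Return a copy of a p x p importance matrix in which all but the
    `keep` largest off-diagonal entries are set to zero (hypothetical rule)."""
    p = len(scores)
    # Rank every candidate interaction i -> j (i != j) by importance score.
    ranked = sorted(
        ((scores[i][j], i, j) for i in range(p) for j in range(p) if i != j),
        reverse=True,
    )
    kept = {(i, j) for _, i, j in ranked[:keep]}
    return [
        [scores[i][j] if (i, j) in kept else 0.0 for j in range(p)]
        for i in range(p)
    ]


def stepwise_masks(scores):
    """Yield (number_of_interactions, masked_matrix), densest network first."""
    p = len(scores)
    for keep in range(p * (p - 1), 0, -1):
        yield keep, mask_weakest(scores, keep)


# Toy 3-species importance matrix with a ring 0 -> 1 -> 2 -> 0 (made-up values).
toy = [
    [0.00, 0.45, 0.03],
    [0.02, 0.00, 0.40],
    [0.50, 0.05, 0.00],
]
sparse = mask_weakest(toy, keep=3)
```

Each masked matrix produced by `stepwise_masks` corresponds to one candidate model of decreasing complexity, which is the kind of sequence one would score when tracing out a Pareto front.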
the masked networks can then be simulated, allowing for application-specific tuning of model complexity.
model simulation. the inferred networks and masked networks can be used to simulate gene expression profiles by numerically solving the system of ordinary differential equations learned by the random forests. in addition to deterministic simulations, we provide an option that accounts for the uncertainty in the random forests predictions for stochastic simulations. for stochastic simulations, instead of only taking the mean of a random forest’s predictions, we sample from the gaussian n(µ, σ²) where µ is the mean and σ² is the variance of the random forest’s predictions.
biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by . international license (http://creativecommons.org/licenses/by/ . /).
fig. : bottom: inferred importance scores on the repressilator dataset for the th network in the step-wise column masks plot (supplementary fig. s ). top: simulated trajectory using the inferred network.
provided datasets. the dynugene package provides four example time-series datasets: repressilator, stochastic repressilator, hodgkin-huxley, and stochastic hodgkin-huxley (elowitz and leibler, ; hodgkin and huxley, ). these datasets were generated from systems of ordinary or stochastic differential equations. details are provided in the vignette. the package also includes one steady state dataset, syntren , taken from grndata (bellot et al., ). users can provide their own data as input following the format specified in ?infernetwork.
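The difference between the deterministic and stochastic simulations described in the model simulation paragraph can be sketched in a few lines. Everything named here is hypothetical: `euler_simulate` is not a function of the package, the "trees" are plain callables standing in for the members of a trained random forest, the spread across trees is assumed to be the variance that is sampled from, and forward Euler is used only because it is the shortest integrator to write down.

```python
import random
import statistics


def euler_simulate(ensemble, x0, dt, steps, stochastic=False, rng=None):
    """Integrate dx/dt = f(x) with forward Euler.

    ensemble[j] is a list of callables standing in for the trees of the
    forest that predicts the rate of change of species j.
    """
    rng = rng or random.Random(0)
    x = list(x0)
    trajectory = [list(x)]
    for _ in range(steps):
        dx = []
        for trees in ensemble:
            preds = [tree(x) for tree in trees]
            mu = statistics.fmean(preds)
            if stochastic:
                # Sample the rate of change from N(mu, sigma^2) across trees.
                sigma = statistics.pstdev(preds)
                dx.append(rng.gauss(mu, sigma))
            else:
                # Deterministic mode uses only the ensemble mean.
                dx.append(mu)
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
        trajectory.append(list(x))
    return trajectory


# One species decaying at a rate on which the three stand-in trees disagree.
decay_trees = [lambda x, k=k: -k * x[0] for k in (0.9, 1.0, 1.1)]
deterministic = euler_simulate([decay_trees], [1.0], dt=0.01, steps=100)
stochastic = euler_simulate([decay_trees], [1.0], dt=0.01, steps=100,
                            stochastic=True, rng=random.Random(1))
```

Passing an explicit `random.Random` instance keeps stochastic runs reproducible, which is convenient when comparing masked networks against each other.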
discussion
a requirement for dyngenie and dynugene is that all species must be tracked through time. this requirement is difficult to satisfy in practice as there are often unknown species in a biological process of interest. methods that can identify or approximate latent structure in partially-observed systems are more appropriate here (hiscock, ). an omics treatment such as rna-seq can cover breadth but current sequencing techniques require cells to be destroyed, thus making time series data collection difficult. non-destructive sequencing techniques could address this issue.
the implementation of an inferred network as a gene circuit will require more thought. even for networks with sparse interactions, the likelihood of finding a set of genes and proteins that satisfy the interaction strengths and activation or inhibitory effects is unknown. in fact, whether a species is an activator or inhibitor is not explicitly given in the interaction matrix. we can address this by posing dynugene as a constrained optimization problem where it is limited to using only a given set of parts (genes, promoters, ribosome binding sites, proteins, etc.) thus relating the importance scores with biological interaction strengths. we leave this for future work.
data and code availability
source code is available at https://github.com/tianyu-lu/dynugene with a detailed user guide. a webserver with interactive controls is available at https://tianyulu.shinyapps.io/dynugene/.
acknowledgements
the authors thank the authors of dyngenie for their work and alan moses for guidance.
funding
this work was supported by a postdoctoral fellowship from canadian institutes of health research.
bibliography
sui huang, ingemar ernberg, and stuart kauffman. cancer attractors: a systems view of tumors from a gene network dynamics and developmental perspective. in seminars in cell & developmental biology, volume , pages – . elsevier, .
anton crombach and paulien hogeweg.
evolution of evolvability in gene regulatory networks. plos computational biology, ( ):e , .
guy karlebach and ron shamir. minimally perturbing a gene regulatory network to avoid a disease phenotype: the glioma network as a test case. bmc systems biology, ( ): , .
tom w hiscock. adapting machine-learning algorithms to design gene circuits. bmc bioinformatics, ( ): – , .
pierre geurts et al. dyngenie : dynamical genie for the inference of gene networks from time series expression data. scientific reports, ( ): – , .
stephanie kamgnia wonkap and gregory butler. benin: biologically enhanced network inference. journal of bioinformatics and computational biology, ( ): , .
ashley mae conard, nathaniel goodman, yanhui hu, norbert perrimon, ritambhara singh, charles lawrence, and erica larschan. timeor: a web-based tool to uncover temporal regulatory mechanisms from multi-omics data. biorxiv, .
niall m mangan, steven l brunton, joshua l proctor, and j nathan kutz. inferring biological networks by sparse identification of nonlinear dynamics. ieee transactions on molecular, biological and multi-scale communications, ( ): – , .
michael b elowitz and stanislas leibler. a synthetic oscillatory network of transcriptional regulators. nature, ( ): – , .
alan l hodgkin and andrew f huxley. a quantitative description of membrane current and its application to conduction and excitation in nerve. the journal of physiology, ( ): , .
pau bellot, catharina olsen, and patrick e meyer. grndata: synthetic expression data for gene regulatory network inference, . r package version . . .
carl ganz. rintrojs: a wrapper for the intro.js library. journal of open source software, ( ): , .
gregory r. warnes, ben bolker, lodewijk bonebakker, robert gentleman, wolfgang huber, andy liaw, thomas lumley, martin maechler, arni magnusson, steffen moeller, marc schwartz, and bill venables. gplots: various r programming tools for plotting data, . r package version . . .
hadley wickham.
ggplot : elegant graphics for data analysis. springer, .
christopher rackauckas and qing nie. adaptive methods for stochastic differential equations via natural embeddings and rejection sampling with memory. discrete and continuous dynamical systems. series b, ( ): , a.
christopher rackauckas and qing nie. differentialequations.jl – a performant and feature-rich ecosystem for solving differential equations in julia. journal of open research software, ( ), b.
r core team. r: a language and environment for statistical computing. r foundation for statistical computing, vienna, austria, .
computing the riemannian curvature of image patch and single-cell rna sequencing data manifolds using extrinsic differential geometry
duluxan sritharan∗, shu wang∗, and sahand hormoz†
harvard graduate program in biophysics, harvard university, cambridge, ma, usa
department of data sciences, dana-farber cancer institute, boston, ma, usa
laboratory of systems pharmacology, harvard medical school, boston, ma, usa
department of systems biology, harvard medical school, boston, ma, usa
broad institute of mit and harvard, cambridge, ma, usa
abstract
most high-dimensional datasets are thought to be inherently low-dimensional, that is, datapoints are constrained to lie on a low-dimensional manifold embedded in a high-dimensional ambient space. here we study the viability of two approaches from differential geometry to estimate the riemannian curvature of these low-dimensional manifolds. the intrinsic approach relates curvature to the laplace-beltrami operator using the heat-trace expansion, and is agnostic to how a manifold is embedded in a high-dimensional space. the extrinsic approach relates the ambient coordinates of a manifold’s embedding to its curvature using the second fundamental form and the gauss-codazzi equation. keeping in mind practical constraints of real-world datasets, like small sample sizes and measurement noise, we found that estimating curvature is feasible, even for simple, low-dimensional toy manifolds, only when the extrinsic approach is used. to test the applicability of the extrinsic approach to real-world data, we computed the curvature of a well-studied manifold of image patches, and recapitulated its topological classification as a klein bottle.
lastly, we applied the approach to study single-cell transcriptomic sequencing (scrnaseq) datasets of blood, gastrulation, and brain cells, revealing for the first time the intrinsic curvature of scrnaseq manifolds.
∗equal contribution
†to whom correspondence should be addressed (sahand hormoz@hms.harvard.edu)
biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license (http://creativecommons.org/licenses/by-nc-nd/ . /).
introduction
high-dimensional biological datasets have become prevalent in recent decades because of new technologies such as high-throughput scrnaseq [ , , ], mass cytometry [ , ] and multiplex imaging [ , ]. interpretation and visualization of such high-dimensional datasets have been challenging however, prompting the development of tools for non-linear projection of datapoints onto or dimensions [ ]. these tools, such as isomap [ ], t-sne [ ] and umap [ ], appeal to the ansatz that datapoints in a high-dimensional ambient space are constrained to lie on a low-dimensional manifold. unfortunately, determining the geometry of a low-dimensional manifold from these visualizations is difficult, since many geometric properties are lost after projecting onto or dimensions. for example, the cartographic projections used in an atlas to flatten earth’s curved surface tear apart continuous neighborhoods and non-uniformly stretch distances. fortunately, topology and differential geometry provide a wealth of concepts to characterize a manifold’s shape directly without confounding projections.
in particular, homology [ , ] categorizes a manifold according to the number of holes it contains, and the dimensionality of each hole (whereas for example, the hole in a hollow sphere does not survive projection onto a -dimensional plane). similarly, metrics [ ] and geodesics [ ] determine shortest-distance paths between pairs of points on a manifold without any distortion from a projection (whereas for example, most atlases exaggerate distances at the poles). curvature [ ] is a local manifold property that quantifies the extent to which a manifold deviates from the tangent plane at each point p. projecting a manifold onto a plane for visualization destroys this property by definition. recent methods have emerged for estimating homology [ , ], metrics [ ] and geodesics [ ] from noisy, sampled data, with accompanying statistical guarantees [ , , ]. these methods have been applied to analyze images [ , ] and biological datasets [ , ]. however, estimating curvature has received less attention although it is fundamental to quantifying geometry. curvature arises from two sources. on the one hand, a manifold itself can be curved, resulting in riemannian or intrinsic curvature. a sphere has intrinsic curvature because it cannot be flattened so that all geodesics on its surface correspond to straight lines on a euclidean plane (see figure a). on the other hand, the embedding of a manifold in an ambient space can give rise to extrinsic curvature, a property that is not inherent to the manifold itself. for example, a scroll has extrinsic curvature because it is formed by rolling a piece of parchment, but the parchment itself is not inherently curved (see figure b). it is important to note that both types of curvature scale inversely with the global length scale (l) associated with a manifold. it is for this reason that a marble (l ≈ cm) is visibly round, but the earth (l ≈ , km) is still mistaken by some to be flat. 
since intrinsic curvature is an inherent property of a manifold, while extrinsic curvature is incidental to an embedding, we will restrict our attention to the former.
figure : riemannian curvature is an intrinsic property of a manifold while extrinsic curvature depends on the embedding. (a) (left) n = points uniformly sampled from the -dimensional hollow unit sphere, s , embedded in the -dimensional ambient space r , colored according to the z-coordinate. s has riemannian or intrinsic curvature because there is no projection onto -dimensional euclidean space that preserves geodesic (shortest-path) distances. (right) for example, a stereographic projection using the point z = ( , , ) and the plane z = introduces distortions since the geodesic distance between any pair of points in the lower hemisphere is (non-uniformly) larger than the euclidean distance in this projection. (b) (left) n = points uniformly sampled from a scroll, which is also a -dimensional manifold embedded in r . the scroll has extrinsic curvature because it curls away from the tangent plane at any point. (right) however, it does not have intrinsic curvature, because it can be projected onto -dimensional euclidean space in a way that preserves geodesic distances, by unfurling. (c) intrinsic differential geometry treats manifolds as self-contained objects that can be described using only intrinsic coordinates, which do not depend on any embedding or ambient space. one possible set of intrinsic coordinates for s is polar coordinates, where θ and θ are the azimuthal and elevation angles respectively. while this representation superficially resembles the unfurled scroll in (b), distances in this plane are non-euclidean. any line segment along θ = ±π has zero length for example. (d) extrinsic differential geometry defines manifolds in the coordinate system of the ambient space, which requires a privileged vantage point off the manifold itself. both intrinsic and extrinsic differential geometry can be used to compute intrinsic curvature, whereas only extrinsic differential geometry can be used to compute extrinsic curvature (as indicated by the black arrows).
a precise description of intrinsic curvature is provided by the riemannian curvature tensor, rlkij(p). for a given basis {v}, this tensor quantifies how much a vector initially pointing in direction vk is displaced in direction vl after parallel transport around an infinitesimal parallelogram defined by directions vi and vj. the simplest intrinsic curvature descriptor is scalar curvature, s(p), which is formed by contracting rlkij(p) to a scalar quantity, as its name suggests. when s(p) is greater (less) than , the sum of the angles of a triangle formed by connecting three points near p by geodesics is greater (less) than π. likewise, when s(p) is greater (less) than , a small ball centred at p has a smaller (larger) volume than a ball of the same radius in euclidean space.
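The volume comparison above can be checked numerically on the unit 2-sphere, whose scalar curvature is 2 (twice its Gaussian curvature of 1). The sketch below uses two standard facts that are not stated in the text: a geodesic disk of radius r on the unit sphere has area 2π(1 − cos r), and for a 2-dimensional manifold the small-radius expansion of disk area is πr²(1 − S r²/24 + ...). The function names are illustrative only.

```python
import math


def geodesic_disk_area_unit_sphere(r):
    """Exact area of a geodesic disk of radius r on the unit 2-sphere."""
    return 2.0 * math.pi * (1.0 - math.cos(r))


def scalar_curvature_from_disk(area, r):
    """Invert the small-r expansion area ~ pi r^2 (1 - S r^2 / 24)."""
    return 24.0 * (1.0 - area / (math.pi * r * r)) / (r * r)


r = 0.1
area = geodesic_disk_area_unit_sphere(r)
flat_area = math.pi * r * r          # disk of the same radius in the plane
s_est = scalar_curvature_from_disk(area, r)
```

Because the curvature is positive, `area` comes out smaller than `flat_area`, and the size of the deficit is what `scalar_curvature_from_disk` converts back into an estimate of S.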
we furnish toy examples in the main text to provide stronger intuition for this quantity.
in theory, intrinsic curvature can be equivalently computed using tools from either one of the two branches of differential geometry. intrinsic differential geometry has no recourse to an external vantage point off a manifold, just as the polygonal characters in edwin abbott’s classic flatland [ ] were confined to traversing in r , and found the notion of r unfathomable. in this branch, a manifold is therefore represented in intrinsic coordinates, which are agnostic to any ambient space or embedding. a hollow sphere represented in polar coordinates and k-nearest neighbor (knn) graph representations of a dataset, for instance, are in this spirit (see figure c). conversely, in extrinsic differential geometry, a manifold is treated as a surface embedded in an ambient space, and is represented in ambient coordinates (see figure d). the surface of an organ is parameterized this way, for example, in a surgical robot suturing an incision.
in this work, we explore two approaches for estimating intrinsic curvature based on these twin views, keeping in mind practical limitations of real-world datasets, which may comprise a relatively small number of noisy measurements. the first approach uses the laplace-beltrami operator, which is well-studied in previous applications of differential geometry to data analysis [ , , , , ], and is theoretically appealing as an intrinsic quantity that is embedding-invariant. however, we find that this approach cannot accurately estimate even average scalar curvature on the simplest of low-dimensional toy manifolds for small sample sizes, despite the history and ubiquity of the laplace-beltrami operator in geometric data analysis. meanwhile, the second approach uses the second fundamental form and the gauss-codazzi equation [ ], identities that rely on information from the ambient space.
we find that this extrinsic approach is not only more robust to small sample sizes and noise, but also permits computation of the full riemannian curvature tensor, though we focus on the scalar curvature for simplicity. using these insights, we developed a software package to compute the scalar curvature (and associated uncertainty) at each sampled point on a manifold, and applied this tool to investigate the curvature of image and scrnaseq datasets.
results
estimators of the laplace-beltrami operator yield inaccurate scalar curvatures
intrinsic differential geometry treats a d-dimensional manifold, m, as a self-contained object and is agnostic to how m may be represented in ambient coordinates due to any particular embedding (see figure c). conceptually, this is accomplished by only considering m as a collection of local, overlapping neighborhoods. the geometry of these neighborhoods is encoded using tools such as the laplace-beltrami operator, ∆m, which captures diffusion dynamics across neighborhoods. for most practical applications, we do not have direct access to m but instead to a finite number (n) of points sampled from m. for these cases, estimators of ∆m are used instead. these estimators are well-studied [ , , , , ], and the convergence rates of some have been characterized [ ].
the scalar curvature averaged across m has a well-known connection to ∆m via the heat-trace expansion [ , ], which relates the eigenvalues, λk, of ∆m to the geometry of m:
\[ z(t) \equiv \sum_{k=1}^{\infty} e^{-\lambda_k t} = (4\pi t)^{-d/2} \left( \sum_{i=0}^{n} c_i\, t^{i/2} + o\!\left(t^{(n+1)/2}\right) \right), \qquad \lambda_k \le \lambda_{k+1} \]
the first few coefficients, ci, are given by [ ]: c = ∫ m dm, c = − √ π ∫ ∂m d(∂m), c = ∫ m s dm − ∫ ∂m j d(∂m) ( ) where ∂m is the boundary of the manifold and j is the mean curvature on ∂m. recall that s is the point-wise scalar curvature. by inspection, c is the volume, c is proportional to the area, and c is directly related to the average scalar curvature.
we reasoned that if the average scalar curvature cannot be accurately computed for a manifold with constant scalar curvature using these relations, then computing the point-wise scalar curvature for more complex manifolds is intractable. to investigate this, we considered the -dimensional hollow unit sphere, s , for which the true scalar curvature is s(p) = ∀p ∈ m, and uniformly sampled n = points to mirror the typical size of current scrnaseq datasets (see figure a; methods section . . . ). since common estimators of ∆m only yield as many eigenvalues as datapoints (n), we cannot compute the infinite set of eigenvalues needed in equation .
therefore, we introduced a truncated series with m eigenvalues, zm(x), where we have substituted x = √t and divided through by the prefactor in the rhs of equation to isolate for ci, following the approach in [ ]:
\[ z_m(x) = (4\pi)^{d/2}\, x^{d} \sum_{k=1}^{m} e^{-\lambda_k x^{2}} \]
the scalar curvature can then be approximated by fitting the truncated series, zm(x), to a second-order polynomial, p(x), over intervals of small x:
\[ z_m(x) \approx p(x), \qquad \text{where } p(x) = c_0 + c_1 x + c_2 x^{2} \]
we estimated ∆m using the n sampled points (see methods section . . ), substituted the eigenvalues of the estimate into equation , and numerically fit zm(x) to p(x) (see figure s a-g; methods section . . ). we obtained the scalar curvature by inspecting the resulting c2 coefficient, and compared the result to the true value of . we found that the scalar curvature was always over-estimated (s > ) regardless of m, the number of eigenvalues used in the truncated series (see methods section . . ), or the choice of estimator for ∆m (see methods section . . ). we identified the poor convergence of the estimated eigenvalues of ∆m as the source of error (see methods section . . ) and found that at least n ≈ points are required to reduce the error to ± . , so that s ≈ . (see figure s h).
therefore, despite the prevalence of the laplace-beltrami operator in geometric data analysis, our example shows that an intrinsic approach relying on the operator is not practical for computing scalar curvatures. even for noise-free datapoints uniformly sampled from s , the sample size needed to compute average scalar curvature accurate to ± . is several orders of magnitude greater than what is typically feasible in current scrnaseq experiments. noise and non-uniform sampling would confound the issue further. most importantly, we would eventually like to compute local values of s(p) ∀p ∈ m, but this approach failed to correctly recover even average scalar curvature, which one might have expected to be feasible.
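The fitting step can be tried in the idealized case where the eigenvalues are known exactly rather than estimated from samples. The sketch below makes several assumptions that go beyond the text: it uses the exact Laplace-Beltrami spectrum of the unit 2-sphere (eigenvalue l(l+1) with multiplicity 2l+1), it takes d = 2, it reads the average scalar curvature off the fit through the standard closed-manifold relation c2 = (1/6)∫ s dm (the numerical constants of the coefficient formulas are not recoverable from this text), and the fit interval and truncation are arbitrary choices. All function names are illustrative.

```python
import math


def truncated_heat_trace_s2(x, l_max):
    """z_m(x) = 4*pi * x^2 * sum_k exp(-lambda_k x^2) for the unit 2-sphere,
    using eigenvalues l*(l+1) with multiplicity 2*l+1 for l = 0..l_max."""
    t = x * x
    total = sum((2 * l + 1) * math.exp(-l * (l + 1) * t) for l in range(l_max + 1))
    return 4.0 * math.pi * t * total


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    sol = [0.0] * n
    for r in reversed(range(n)):
        sol[r] = (m[r][n] - sum(m[r][c] * sol[c] for c in range(r + 1, n))) / m[r][r]
    return sol


def fit_quadratic(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x^2 via the normal equations."""
    a = [[sum(x ** (i + j) for x in xs) for j in range(3)] for i in range(3)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(3)]
    return solve3(a, b)


# Fit over a small-x window; l_max is large enough that truncation is negligible.
xs = [0.05 + 0.15 * i / 60 for i in range(61)]
ys = [truncated_heat_trace_s2(x, l_max=400) for x in xs]
c0, c1, c2 = fit_quadratic(xs, ys)
volume_est = c0                 # should be close to the sphere's area, 4*pi
s_avg_est = 6.0 * c2 / c0       # average scalar curvature under the assumed relation
```

With exact eigenvalues the fit lands close to the true curvature; the residual discrepancy comes from higher-order terms of the expansion leaking into the quadratic coefficient, which hints at how sensitive the estimate becomes once the eigenvalues themselves are noisy.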
to find an alternative approach, we next considered tools from extrinsic differential geometry.
curvature can be computed accurately using the second fundamental form
in extrinsic differential geometry, a manifold is described in the coordinates of the ambient space in which it is embedded, usually rn (see figure d). since the shape of the sphere in figure a is visually unambiguous to the eye (thanks to its extrinsic view from a vantage point off the manifold), we reasoned that an extrinsic approach would be more fruitful.
a d-dimensional manifold, m, embedded in rn can be described at each point p in terms of a d-dimensional tangent space, tm(p), and an (n − d)-dimensional normal space, nm(p), as shown in figure a. given orthonormal bases for tm(p) and nm(p), points in the neighborhood of p can be expressed as $y = [t_1, \ldots, t_d, n_1, \ldots, n_{n-d}]$ where ti is y’s coordinate along the ith basis vector of tm(p) and nk is y’s coordinate along the kth basis vector of nm(p). the nks can then be locally approximated as functions of the tis, i.e. $n_k \approx f_k(t_1, \ldots, t_d)$, as shown in figure b. the riemannian curvature of m is related to the quadratic terms in the taylor expansion of each fk with respect to the tis.
specifically, the second fundamental form of m, $h^{k}_{ij}$, gives the second-order coefficient relating each fk to the quadratic term titj [ ]:
\[ h^{k}_{ij}(p) = \left.\frac{\partial^{2} f_k}{\partial t_i\,\partial t_j}\right|_{p} \]
the riemannian curvature tensor is related to the second fundamental form according to the gauss-codazzi equation [ ]:
\[ r_{ijkl} = \left( h^{\alpha}_{jk} h^{\beta}_{il} - h^{\beta}_{ji} h^{\alpha}_{kl} \right) g_{\alpha\beta} \]
where gαβ is the metric of the ambient space, which we take to be the usual euclidean metric δα,β going forward. the scalar curvature can be obtained by contracting the riemannian curvature tensor:
\[ s = \sum_{i,j} r_{ijij} \]
this suggests a conceptually simple procedure to estimate the scalar curvature of a data manifold at each point p: (i) estimate tm(p) and nm(p), (ii) determine $h^{k}_{ij}(p)$ in local coordinates, (iii) compute s using equations and .
we developed a computational tool that provides an implementation of this procedure. briefly, given a set of datapoints {x} ∈ rn and manifold dimension d, a neighborhood around each point p is selected to be the n-dimensional ball centred on p of radius r encompassing np(r) points (see methods section . . ). for each point p, principal component analysis (pca) [ ] is performed on the np(r) points in its neighborhood, and the first d (last n−d) principal components (pcs) accounting for the most (least) variance are taken as an orthonormal basis for tm(p) (nm(p)). the normal coordinates, nk, of the np(r) points in each neighborhood are fit by regression to a quadratic model in terms of the tangent coordinates, ti, to obtain $h^{k}_{ij}(p)$ with associated uncertainties (see figure b; methods section . . ).
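Steps (ii) and (iii) of the procedure can be sketched for a 2-dimensional surface in 3-dimensional space. This is an illustrative toy, not the authors' tool: the tangent and normal bases are taken as known (the point p is the north pole of the unit sphere, so step (i), the PCA, is skipped), the data are noise-free, no uncertainties are computed, and the scalar curvature is obtained from the contraction S = Σ_k [(tr H^k)² − ‖H^k‖²], which is the form the Gauss equation takes for a Euclidean ambient metric under the usual sign convention. That convention, the neighborhood radius, and every function name are assumptions.

```python
import math
import random


def solve_linear(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def second_fundamental_form_2d(tangent, normal):
    """Fit n ~ a + b.t + (1/2) t^T H t by least squares; return the 2x2 matrix H."""
    rows = [[1.0, t1, t2, t1 * t1, t1 * t2, t2 * t2] for t1, t2 in tangent]
    k = 6
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * y for r, y in zip(rows, normal)) for i in range(k)]
    coef = solve_linear(ata, atb)
    # Second derivatives: twice the pure quadratic coefficients, once the mixed one.
    return [[2.0 * coef[3], coef[4]], [coef[4], 2.0 * coef[5]]]


def scalar_curvature(hessians):
    """Sum over normal directions of (trace^2 - squared Frobenius norm)."""
    total = 0.0
    for h in hessians:
        trace = sum(h[i][i] for i in range(len(h)))
        frob = sum(v * v for row in h for v in row)
        total += trace * trace - frob
    return total


# Neighborhood of the north pole of the unit sphere, in local coordinates:
# tangent coordinates (t1, t2) = (x, y), normal coordinate n = z - 1.
rng = random.Random(0)
tangent, normal = [], []
while len(tangent) < 200:
    t1, t2 = rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2)
    if t1 * t1 + t2 * t2 <= 0.04:
        tangent.append((t1, t2))
        normal.append(math.sqrt(1.0 - t1 * t1 - t2 * t2) - 1.0)

h = second_fundamental_form_2d(tangent, normal)
s_est = scalar_curvature([h])
```

For a surface with more than one normal direction, one Hessian per normal coordinate would be fitted and passed to `scalar_curvature` together; replacing the known bases with the leading and trailing principal components of the neighborhood would bring the sketch closer to the procedure described above.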
the choice of r(p) is an important one since it sets the length scale at which curvature is computed for point p (see methods section . . ). our tool allows interrogation of curvature at any length scale of interest by allowing the user to manually set r(p), a feature we use to inspect real-world datasets later in the paper. however, since the local geometry of the manifold may be non-trivial and unknown a priori, we also provide the ability to set r(p) according to statistical rather than geometric principles. specifically, our tool algorithmically chooses r at each p so that the uncertainty in hkij(p) from regression is less than a user-specified global parameter, σh (see methods section . . ). since a larger number of points reduces the uncertainty in regression, a smaller σh requires a larger r(p) ∀p ∈ m. this strategy of setting σh therefore allows neighborhood sizes to dynamically vary over the manifold based on the local density of the data, which means the algorithm can gracefully handle non-uniform sampling of the manifold. the choice of σh will depend on the global length scale, l, of the datapoints (see methods section . . ), the average density of sampled points, and of course, the desired uncertainty in the estimates of hkij. these uncertainties are in turn used to compute a standard error, σs, accompanying the scalar curvature estimate at each point, using standard error propagation formulas (see methods section . . ). we specify σh instead of σs as the global parameter for choosing neighborhood sizes, since the latter depends non-linearly on the values of hkij(p), which makes determining r(p) more difficult.
our algorithm also computes a goodness-of-fit (gof) p-value at each p by comparing residuals from regression against a normal distribution to quantify how well the normal coordinates are fit by a quadratic function (see methods section . . ). we tested this p-value at significance level α = . 
, declaring fits to be poor when the residuals are significantly non-gaussian. the p-value can be disregarded if the neighborhood size is manually specified to be larger than a length scale for which a quadratic fit is appropriate. however, when σh is specified instead, a uniform distribution of these p-values over m indicates that the desired uncertainty results in neighborhoods that are well-approximated using quadratic regression. we adopted this heuristic when choosing σh for the datasets studied in this paper (see methods section . . , . . and . . ). the software is available at https://gitlab.com/hormozlab/manifoldcurvature. we first applied our algorithm to compute scalar curvatures for the same n = points uniformly sampled from s for which the intrinsic approach failed (see figure c; methods section . . . ). the algorithm yielded scalar curvature estimates at each point with mean error − . (computed by averaging the difference between the point-wise scalar curvature estimates and the ground truth value of across all points) using neighborhoods that only contained np(r) ≈ points. this is already superior to the intrinsic approach, which failed to compute even average scalar accurate to ± for the same sample size. the non-zero value of the mean error indicates that our estimator is biased. the values of hkij are not biased because they .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://gitlab.com/hormozlab/manifoldcurvature https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / c d b e f i g h figure : scalar curvature is accurately estimated using the second fundamental form and the gauss-codazzi equation. 
(a) a hypothetical manifold (shown in grey) from which datapoints are sampled (shown as colored dots). the manifold at any given point p (shown in red) can be decomposed into a tangent space tm (p) (the cyan plane) and a normal space nm (p) (the cyan line). points in the neighborhood around p (shown in green) can be expressed in terms of orthonormal bases for tm (p) and nm (p) (see (b) below). (b) the set of points in the neighborhood of p (shown as green dots in (a)) is represented here in local tangent (t , t ) and normal (n ) coordinates, corresponding to orthonormal bases for tm (p) and nm (p) respectively. coloring corresponds to magnitude in the normal direction. the normal coordinates (n ) can be locally approximated as a quadratic function (the translucent surface) of the tangent coordinates (t , t ), according to the second fundamental form, hkij. (c) scalar curvatures computed using the extrinsic approach for n = points uniformly sampled from the -dimensional hollow unit sphere, s . the true value is at all points on the manifold. see methods section . . . . (d) scalar curvatures (s) computed in (c) are plotted against their associated standard errors (σs). points enclosed by the red lines have a % confidence interval (ci), computed as s ± σs, containing the true value of . (e) as in (c) but for n = points uniformly sampled from a one-sheet hyperboloid, h , which is also a -dimensional manifold. due to the radial symmetry of the manifold, scalar curvature varies only along the z-direction. see methods section . . . . (f) scalar curvatures (black) computed in (e) with their associated % cis (shown in grey) plotted as a function of the z-coordinates of the datapoints. the true value is shown as a dashed red line. (g) as in (c) but for n = points uniformly sampled from a -dimensional ring torus, t . t is constructed by revolving a circle parameterized by θ, oriented perpendicular to the xy-plane, through an angle φ around the z-axis.
the scalar curvature only depends on the value of θ. see methods section . . . . (h) scalar curvatures computed in (g) with their associated % cis plotted as a function of the θ values of the datapoints. colors as in (f). (i) distribution of computed scalar curvatures for n = points uniformly sampled from the d-dimensional unit hypersphere, sd, for d = , , , . as with s , these manifolds are isotropic and have constant scalar curvature. the true values are shown as dashed red lines. see methods section . . . .

are obtained using regression. even so, the components of the riemannian curvature tensor, rijkl, may still be biased because they are non-linear functions of hkij. note that for s , this bias is the same across all datapoints (because of the isotropic nature of the manifold) and therefore results in a systematic underestimation of scalar curvature (see figure c; methods section . . ). we also computed % confidence intervals (ci) for our estimates as s ± σs, and despite the mean error, % of points still reported a % ci containing the true value of (see figure d). we next tested our algorithm on a -dimensional manifold with negative scalar curvature, by uniformly sampling n = points from the one-sheet hyperboloid, h (see figure e; methods section . . . ). here, % of points reported a % ci containing the true scalar curvature (see figure f). lastly, we considered the -dimensional ring torus, t (see figure g; methods section . . . ).
as a manifold with regions of positive, zero, and negative scalar curvature, t is a useful toy model for understanding more complex -dimensional manifolds and gaining intuition for higher-dimensional manifolds. in dimensions, regions of a manifold with positive scalar curvature (θ = , π in figure h) are dome-shaped, regions with zero scalar curvature (θ = π , π in figure h) are planar, and regions with negative scalar curvature (θ = π in figure h) are saddle-shaped. we applied our tool to n = points uniformly sampled from t and found that % of points reported a % ci containing the true scalar curvature (see figure h). to test the applicability of our algorithm to higher-dimensional manifolds, we uniformly sampled n = points from unit hyperspheres, sd, and found that %, % and % of points reported a % ci containing the true scalar curvature for d = , and respectively (see figure i; methods section . . . ). the number of terms, hkij, in the second fundamental form grows as d . for larger d, a greater number of datapoints and hence larger neighborhoods are needed for regression, but these are no longer well-approximated by quadratic fits according to our gof measure. more generally, higher-dimensional manifolds require a higher density of data to estimate scalar curvatures accurately. we additionally characterized how our algorithm performed when datapoints were non-uniformly sampled (see figure s a; methods section . . . ) or convoluted by observational noise (see figure s b; methods section . . . ), when the dimension of the ambient space was large (see figure s c; methods section . . . ), and when the specified manifold dimension differed from the ground truth (see figure s d; methods section . . . ). we found that the algorithm is robust to non-uniform sampling, large ambient dimension and small observational noise, and provides signatures indicating when the manifold dimension may be misspecified.
however, when the noise scale is large, the resulting manifold is no longer trivially related to the noise-free manifold, consistent with existing literature [ , , , ], so that scalar curvature cannot be accurately computed. lastly, we note that since the full riemannian curvature tensor is computed as an intermediate step in our algorithm, more intricate geometric features in the data can also be analyzed using our tool, though we defer such investigation to future studies. taken together, these examples demonstrate the utility of the algorithm in recovering curvature with specified uncertainties for manifolds with positive and/or negative scalar curvature. next, we tested our algorithm on real-world data.

. curvature of image patch manifold is consistent with a noisy klein bottle

pixel intensity values in images of natural scenes are not independently or uniformly distributed. understanding the statistics of such images is important for designing compression algorithms [ ] and for addressing challenges in the field of computer vision such as segmentation [ ]. lee et al. discovered that x -pixel patches extracted from greyscale images of natural scenes, whose pixels have high contrast (i.e. the differences between the intensity values of adjacent pixels in a patch are large), are not uniformly distributed in r , but are instead concentrated on a low-dimensional manifold [ ]. this is because high-contrast regions in a natural scene usually correspond to the edges of objects in the scene.
high-contrast image patches consequently tend to be comprised of gradients and not simply random speckle. subsequent work using topological data analysis revealed that after appropriate normalization (which takes image patches from r to s ⊂ r , so that the global length scale is l = ; see methods section . . ), dense regions of high-contrast image patches have the same homology as a -dimensional manifold called a klein bottle [ ]. a klein bottle, k , is a canonical manifold typically introduced in the context of orientability, where it is often visualized in r (as shown in figure a) to highlight that it is non-orientable. from a topological perspective, k is a manifold parameterized by θ,φ ∈ [ , π] as shown in figure b in which vertical edges are defined to be θ = and θ = π, and horizontal edges are defined to be φ = and φ = π. to make a closed surface, the vertical (horizontal) edges are glued together according to the red (blue) arrows in figure b. k is therefore π-periodic in φ, since a point corresponding to θ on the bottom horizontal edge (φ = ) is the same as the point corresponding to θ on the top horizontal edge (φ = π). similarly, a point corresponding to φ on the left vertical edge (θ = ) is the same as the point corresponding to π −φ on the right vertical edge (θ = π). in short, points on k obey the similarity relation (θ,φ) ∼ (θ + π, π − φ). k captures the dominant features in high-contrast image patches because θ can be treated as a parameter controlling rotation and φ as a parameter controlling the relative contribution of linear vs. quadratic gradients (see figure b). an embedding of k into r with an analytical form, k , was proposed by carlsson et al. in [ ] to model image patches (see equation in methods section . . ). this embedding takes points from (θ,φ) into
image patches in r as shown in figure b. for example, θ = (θ = π ) corresponds to patches with vertical (horizontal) stripes and φ = π , π (φ = ,π) corresponds to patches with linear (quadratic) gradients. as θ increases, stripes in the image patches are rotated clockwise. as φ increases, image patches oscillate between having quadratic and linear gradients. importantly, the image patches constructed by this embedding obey the same similarity relation (θ,φ) ∼ (θ+π, π−φ) topologically required of a klein bottle. whereas carlsson et al. studied the global topology of image patches using this embedding, here we study their local geometry instead. first, we analytically calculated the scalar curvature of k as a function of (θ,φ) as shown in figure c (see methods section . ). next, we used our algorithm to compute the scalar curvature on a data manifold of n ≈ . × high-contrast x -pixel image patches randomly sampled from the same van hateren dataset used to propose k (see methods section . . ). we picked σh so that the distribution of gof p-values was flat, and fixed this value for all subsequent simulations (see methods section . . ). to visualize the results, we associated each image patch to its closest point on k (see methods section . . ), and plotted the scalar curvatures on the resulting (θ ,φ ) coordinates (see figure d). most image patches map to φ = π , π or θ = , π because linear gradients (of any orientation) and quadratic gradients that are vertically or horizontally oriented are the dominant features in the data as previously reported [ , ].
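The gluing rule described above can be checked numerically. The sketch below assumes a polynomial patch model of the form c(ax + by)^2 + d(ax + by) with (a, b) = (cos θ, sin θ) and (c, d) = (cos φ, sin φ); this form, the function name `klein_patch`, the grid size, and the omission of any normalization are illustrative assumptions rather than the exact embedding given in the methods. Under this assumed form, shifting θ by π while reflecting φ about 2π leaves the patch unchanged.

```python
import numpy as np

def klein_patch(theta, phi, size=3):
    """Evaluate c*(a*x + b*y)**2 + d*(a*x + b*y) on a size-by-size grid.

    (a, b) = (cos theta, sin theta) sets the stripe orientation and
    (c, d) = (cos phi, sin phi) mixes quadratic and linear gradients.
    Assumed polynomial form; grid size and scaling are illustrative only.
    """
    coords = np.linspace(-1.0, 1.0, size)
    x, y = np.meshgrid(coords, coords)  # x varies along columns
    u = np.cos(theta) * x + np.sin(theta) * y
    return np.cos(phi) * u**2 + np.sin(phi) * u

rng = np.random.default_rng(0)
theta, phi = rng.uniform(0.0, 2.0 * np.pi, size=2)

# The same patch is produced by (theta, phi) and (theta + pi, 2*pi - phi).
patch = klein_patch(theta, phi)
glued = klein_patch(theta + np.pi, 2.0 * np.pi - phi)
max_diff = float(np.max(np.abs(patch - glued)))

# theta = 0 with a purely linear gradient gives vertical stripes:
# every row of the patch is identical.
vertical = klein_patch(0.0, np.pi / 2.0)
```

The identification works because the shift in θ flips the sign of ax + by, which the quadratic term ignores and which the sign flip of sin φ cancels in the linear term.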
the scalar curvatures computed for the image patches did not match the analytical scalar curvature of k (cf. figures c and d). to identify the cause of this discrepancy, we first validated our algorithm by computing scalar curvatures on the set of n ≈ . × (θ ,φ ) points on k associated with the image patches (see figure e); we found close agreement with the analytical calculation ( % of points reported a % ci containing the true scalar curvature). next, observing that the neighborhood sizes used for computing the scalar curvature of image patches were larger than those used for computing the scalar curvature of the associated (θ ,φ ) points (cf. figures s a and s b), we recomputed the scalar curvatures of these (θ ,φ ) points, but now with the same neighborhood sizes used for the image patches. the results agreed with the analytical calculation, but still did not match the scalar curvatures computed for the image patches (see figure s c). having ruled out these two possibilities, we hypothesized that the discrepancy was caused by fluctuations in the positions of the image patches with respect to the (θ ,φ ) points on the k manifold (real image patches are noisy and the klein bottle embedding is only an idealization). we found that adding isotropic gaussian noise of increasing magnitude in r to the set of (θ ,φ ) points on k indeed resulted in scalar curvatures that resemble the data (see figure f; methods section . . ). the best agreement between the scalar curvatures of the image patches and the noisy (θ ,φ ) points was achieved when the magnitude of noise was
figure : scalar curvature computed for image patches is consistent with that of a klein bottle with added isotropic gaussian noise. (a) the klein bottle, k , is a -dimensional manifold shown here in r . (b) k is an analytical embedding given by carlsson et al. in [ ] relating parameter values θ,φ ∈ [ , π] to x -pixel patches of greyscale images (see equation in methods section . . ). θ controls the rotation of stripes in the image patches and φ determines the relative contribution of linear vs. quadratic gradients. importantly, as shown in the figure, this embedding has boundary conditions consistent with the topology of a klein bottle (depicted by the blue/red arrows). in particular, the embedding produces image patches that obey the similarity relation (θ,φ) ∼ (θ + π, π −φ). adapted from figure of [ ]. (c) the analytical scalar curvature of k (derived as described in methods section . ). (d) scalar curvatures computed for n ≈ . × high-contrast x -pixel patches sampled from the greyscale images in the van hateren dataset [ ] are plotted here as a function of (θ ,φ ), the parameter values of the closest point on k associated with each image patch (see methods section . . ). (e) scalar curvatures computed for the set of n ≈ . × closest points on k associated with the image patches. note the close correspondence with figure c, indicating that our algorithm correctly recapitulates the analytical scalar curvature. (f) as in (e), but after adding isotropic gaussian noise in r to the set of closest points on k (see methods section . . ). left to right corresponds to increasing levels of noise, σ = . , . , . . (g) the distribution of euclidean distances in r between each image patch and its closest point on k is shown in blue. the distribution of distances to k after adding gaussian noise to these closest points on k is also shown.
(h) k is the analytical embedding from θ,φ ∈ [ , π] to r that minimizes the sum of euclidean distances from the image patches to the closest point on the embedding (see methods section . . ). each of the n ≈ . × image patches was associated to its closest point on k , given by parameter values (θ ,φ ) (see methods section . . ). scalar curvatures computed on this set of n ≈ . × points on k are shown. (i) the same scalar curvatures computed for the image patches and visualized on (θ ,φ ) coordinates in (d), are shown here plotted on (θ ,φ ) coordinates. (j) scalar curvatures computed for a densely sampled manifold comprised of the full set of n ≈ . × high-contrast x -pixel image patches in the van hateren image dataset (see methods section . . ), visualized on (θ ,φ ) coordinates.

σ = . . notably, in this case, the median euclidean distance of the noisy (θ ,φ ) points to k was . , which is comparable to . , the median euclidean distance of the image patches to k (see figure g). furthermore, the neighborhood sizes chosen by our algorithm when σ = . (see figure s a) matched those chosen for the image patches (see figure s b). to find an embedding of the klein bottle that might better explain the scalar curvature of the image patches without needing to add noise, we incorporated higher-order terms to k (see methods section . . ). the coefficients for the higher-order terms were determined by fitting the data, resulting in a new embedding, which we refer to as k (see methods section . . ). the median euclidean distance of the image patches to k was . versus . to k .
as was done for k , we associated each image patch to its closest point (θ ,φ ) on k , and used our algorithm to compute the scalar curvature of these (θ ,φ ) points (see figure h). despite the reduction in the median euclidean distance of image patches to the embedding, the scalar curvature of k was even less similar to that of the image patches (visualized in figure i on these new (θ ,φ ) coordinates for k ) than was the scalar curvature of k ; the range of scalar curvature values for k was much larger than for either the image patches or k , and the scalar curvature fluctuated on smaller length scales. lastly, we reasoned that there might be fine-scale scalar curvature fluctuations in the image patches that are masked by the larger neighborhood sizes used to compute scalar curvature for the image patches (see figure s b) relative to k (see figure s d). to decrease the neighborhood sizes chosen by the algorithm for the same σh, we augmented the image patch dataset using the full set of n ≈ . × datapoints from the van hateren dataset (see methods section . . ). this resulted in neighborhood sizes comparable to those determined for k (cf. figures s d and s e), but failed to recapitulate the fine-scale scalar curvature fluctuations observed in k (see figure j). as a sanity check, we confirmed that the scalar curvature of the augmented image patch dataset matched that of the original image patch dataset, when computed using the same neighborhood sizes as the latter (see figure s f). therefore, including higher-order terms in the embedding does not yield scalar curvatures that better agree with the data. taken together, our analysis of curvature suggests that the image patch dataset can be best modelled by adding noise to the simplest embedding, k .
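The extrinsic estimate underlying these comparisons can be sketched in a few lines: express neighbors of a point in local tangent and normal coordinates, fit the normal coordinates as a quadratic function of the tangent coordinates to obtain the second fundamental form, and contract it with the Gauss equation. The sketch below is a simplified illustration, not the released software. It assumes a fixed neighborhood radius, uses local PCA to choose the tangent and normal frames, and omits the uncertainty propagation and goodness-of-fit test; the function name `scalar_curvature_at` and all parameter values are hypothetical.

```python
import numpy as np

def scalar_curvature_at(points, p, radius, d):
    """Estimate scalar curvature at p from neighbors within `radius`.

    points: (N, n) array in ambient coordinates; d: manifold dimension.
    """
    nbrs = points[np.linalg.norm(points - p, axis=1) < radius]
    # Local PCA (an assumption of this sketch): the top-d right singular
    # vectors span the tangent space, the remaining ones the normal space.
    _, _, vt = np.linalg.svd(nbrs - nbrs.mean(axis=0), full_matrices=False)
    tangent, normal = vt[:d], vt[d:]
    t = (nbrs - p) @ tangent.T   # tangent coordinates, shape (m, d)
    n = (nbrs - p) @ normal.T    # normal coordinates, shape (m, n - d)
    # Regress each normal coordinate on constant, linear and quadratic terms.
    iu, ju = np.triu_indices(d)
    design = np.hstack([np.ones((len(t), 1)), t, t[:, iu] * t[:, ju]])
    coef, *_ = np.linalg.lstsq(design, n, rcond=None)
    quad = coef[1 + d:]          # one row per pair (i <= j)
    # Model: n^k = 0.5 * sum_ij h^k_ij t_i t_j, so diagonal entries of h
    # are twice the fitted coefficient and off-diagonal entries equal it.
    h = np.zeros((n.shape[1], d, d))
    for row, (i, j) in enumerate(zip(iu, ju)):
        h[:, i, j] = h[:, j, i] = quad[row] * (2.0 if i == j else 1.0)
    # Gauss equation in an orthonormal tangent frame:
    # s = sum_k [ (tr h^k)^2 - ||h^k||_F^2 ]
    trace = np.trace(h, axis1=1, axis2=2)
    return float(np.sum(trace**2) - np.sum(h**2))

# Noise-free points sampled uniformly from the unit sphere in 3 dimensions.
rng = np.random.default_rng(0)
sphere = rng.normal(size=(5000, 3))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)
s_est = scalar_curvature_at(sphere, sphere[0], radius=0.3, d=2)
```

For the unit sphere the estimate should be close to the known scalar curvature of 2. With a fixed radius the quadratic fit absorbs some higher-order terms, which is one reason the text ties the neighborhood size to a target regression uncertainty and a goodness-of-fit check.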
having applied our algorithm to real-world manifold-valued data that is well-modelled by an analytical embedding, we next turned our attention to scrnaseq datasets, which are generally regarded as low-dimensional manifolds and have no known analytical form.

. scrnaseq datasets have non-trivial intrinsic curvature

in scrnaseq datasets, each datapoint corresponds to a cell, and each coordinate to the abundance of a different gene. here we consider the data manifold after basic preprocessing and linear dimensionality reduction using pca (see methods section . . ). since many common analyses in the field such as clustering, visualization, and inference of cell differentiation trajectories are performed in this reduced space, it is natural to compute curvature in this space as well. we set the ambient dimension, n, to be the number of pcs needed to explain % of the variance. the manifold dimension, d, for scrnaseq datasets is not well-defined and needs to be chosen heuristically. as a simple heuristic, we specified d as the number of pcs needed to explain % of the variance in the ambient space i.e. % of the original variance (we show later that our computations are relatively insensitive to the choice of d). we considered three datasets. the first consists of n ≈ peripheral blood mononuclear cells (pbmcs) collected from a healthy human donor [ ]. the second is a gastrulation dataset comprised of n ≈ . × cells pooled from embryonic mice sacrificed at -hour intervals from embryonic day . to . [ ]. the final dataset is a benchmark in the field consisting of n ≈ .
× brain cells pooled from embryonic mice sacrificed at embryonic day [ ]. refer to figures s a, s a and s a for cell type annotations for the three datasets. the pbmc dataset is characteristic of the sample size of current scrnaseq data. the other two are larger than most scrnaseq datasets, and we included these to verify whether geometric features seen in the first dataset can be reproduced for more densely sampled manifolds. for the pbmc, gastrulation and brain datasets, the ambient (manifold) dimensions were determined to be , and ( , and ) respectively, according to the aforementioned heuristic (see methods section . . ). for all three datasets, the global length scale happened to be l ≈ (see methods section . . ). as before, we picked σh for each dataset according to the distribution of gof p-values (see figures s b, s b and s b; methods section . . ). we visualized the computed scalar curvatures on standard plots employed in the field (umap and t-sne; shown in figure a,d,g) and observed non-trivial scalar curvature for all three datasets. we found statistically significant correlations between the scalar curvature reported by each point and its knn for k ≤ (ρpearson = . , . and . for the pbmc, gastrulation and brain datasets respectively at k = , p < − ; see figures s c, s c and s c), indicating that our algorithm yields scalar curvatures that vary continuously over the data manifolds. by plotting scalar curvatures against their standard errors, σs, we verified that regions with non-zero scalar curvature are statistically significant (see figure b,e,h). as a consistency check, we confirmed that the percentage of points with % cis containing the scalar curvatures reported by their respective knns (i) decayed with increasing k for k ≤ , and (ii) was significantly larger
than expected by chance ( %, % and % for the pbmc, gastrulation and brain datasets respectively at k = , p < . ; see figures s d, s d and s d; methods section . . . ). to rule out the possibility that localization of non-zero scalar curvature in certain regions of the umap/t-sne plots is an artifact caused by other features of the data that are also localized, we considered several factors. first, we plotted the gof p-value at each point on umap/t-sne coordinates and noted that poor gofs were not localized on the data manifolds, let alone to regions of non-zero scalar curvature (see figures s b, s b and s b). therefore, the computed scalar curvatures are not due to poor fits. next, we plotted the neighborhood size, r(p), used for fitting and observed that in some regions, non-zero scalar curvatures seemed to correspond to small r (see figures s e, s e and s e). since σh is fixed, these regions necessarily have a larger number of neighbors np(r) and are hence more dense (see figures s f, s f and s f). to rule out the possibility that the non-zero scalar curvatures were an artifact of smaller neighborhood size, we recomputed the scalar curvature at three fixed neighborhood sizes (see figure c,f,i), corresponding to the , , and %-ile values of r(p) which arose from setting σh (see figures s e, s e and s e). in general, the scalar curvatures decreased in magnitude when neighborhood sizes increased. however, regions which had statistically significant non-zero scalar curvatures (zero falls outside of the % ci) using variable neighborhood sizes also had non-zero scalar curvatures for all three fixed neighborhood sizes.
additionally, statistically significant non-zero scalar curvature also emerged on other parts of the manifolds when using small fixed neighborhood sizes. these regions are therefore curved at small length scales but do not have a sufficient density of points to resolve curvature to the desired uncertainty σh (see methods section . . ). this is analogous to the image patch dataset for which we could resolve scalar curvatures of larger magnitude at a smaller length scale when the dataset was augmented with enough points to attain smaller neighborhood sizes for a fixed σh. we also checked how computed scalar curvatures changed with density in a toy model with zero scalar curvature. importantly, we did not observe the artifactual appearance of statistically significant non-zero scalar curvature, for either variable neighborhood sizes chosen by the algorithm to achieve σh, or for fixed neighborhood sizes (see figure s a; methods section . . . ). taken together, although higher density allows us to resolve statistically significant non-zero scalar curvatures in scrnaseq data, these computed scalar curvatures are not an artifact of the smaller neighborhood sizes used in regions with higher density. to ensure that the computed scalar curvatures were not sensitively dependent on the heuristically chosen manifold dimension, d, we also recomputed scalar curvatures for d − and d + and observed similar qualitative results (see figures s g, s g and s g). lastly, we verified that the computed scalar curvatures were not correlated with the number of transcripts in each cell (see figures s h, s h and s h).
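The dimension heuristic used for these datasets (an ambient dimension chosen from the cumulative variance explained by the leading PCs, then a manifold dimension chosen by a second threshold applied within that ambient space) can be sketched as below. The function names and the two threshold values in the example are hypothetical placeholders, not the values used in the paper.

```python
import numpy as np

def pcs_for_variance(variances, fraction):
    """Smallest number of leading components whose cumulative share of
    the total variance reaches `fraction`."""
    cumulative = np.cumsum(variances) / np.sum(variances)
    count = int(np.searchsorted(cumulative, fraction)) + 1
    return min(count, len(variances))

def choose_dimensions(variances, ambient_fraction, manifold_fraction):
    """Return (n, d): ambient dimension from the full spectrum, manifold
    dimension from the spectrum restricted to the first n components."""
    variances = np.asarray(variances, dtype=float)
    n = pcs_for_variance(variances, ambient_fraction)
    d = pcs_for_variance(variances[:n], manifold_fraction)
    return n, d

# Toy PCA spectrum, sorted in decreasing order (illustrative values only).
spectrum = [4.0, 3.0, 2.0, 1.0]
n_dim, d_dim = choose_dimensions(spectrum, ambient_fraction=0.8,
                                 manifold_fraction=0.75)
```

Because the second threshold is applied to the variance retained in the ambient space, the fraction of the original variance captured by the first d components is roughly the product of the two thresholds, as the text notes.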
figure : scrnaseq datasets have localized regions of non-zero scalar curvature. (a) scalar curvatures were computed for a scrnaseq dataset with n ≈ peripheral blood mononuclear cells (pbmcs) collected from a healthy human donor. the ambient (n) and manifold (d) dimensions were specified to be and respectively and variable neighborhood sizes were chosen by setting σh (see methods section . . ). the scalar curvatures are shown here overlaid onto umap coordinates, after smoothing the values over k = nearest neighbors in the ambient space. (b) scatter plot of (unsmoothed) scalar curvatures, s, and associated standard errors, σs, for each datapoint in the pbmc dataset. points enclosed by the red lines reported a % ci (s ± σs) including . (c) as in (a) but with scalar curvatures computed using a fixed neighborhood size, r, for all datapoints. the value of r was set to be the , , and -%ile values (left to right) of the neighborhood sizes used in (a) (see figure s e). points for which a neighborhood of size r does not include enough neighbors for regression are not shown. (d-f) as in (a-c) for a mouse gastrulation dataset with n ≈ . × , d = and n = . (g-i) as in (a-c) for a mouse brain dataset with n ≈ . × , d = and n = , plotted on t-sne coordinates.

to confirm the robustness of our results to sampling, we randomly discarded f% of points in the ambient space determined for each dataset, and recomputed scalar curvatures using the same values of n, d and r(p) used for the original dataset.
we found that a statistically significant percentage of downsampled points ( % for the pbmc dataset with f = , % for the gastrulation dataset with f = , and % for the brain dataset with f = ; p < . ) had a % ci containing the scalar curvature reported by the same point for the original dataset (see figures s i, s i and s i; methods section . . . ). this suggests that if the datasets were more highly sampled, and scalar curvatures were recomputed using the same neighborhood sizes, they would be reliably contained within the currently reported % cis. unlike the two other datasets, the brain dataset could not be downsampled to f = while still having at least % of points report % cis containing the originally reported scalar curvatures, despite having the most points. this might be because the brain dataset has a larger manifold dimension according to our heuristic and therefore requires a greater number of terms, hkij, to be estimated in the second fundamental form. for the pbmc dataset, we additionally downsampled the single-cell count matrix by discarding f% of transcripts at random and preprocessing the same way. we recomputed scalar curvatures for this downsampled dataset with the same n, d and r(p) values used for the original dataset. here too, we found that when f = (f = ), % ( %) of the downsampled points had a % ci containing the originally reported scalar curvature (p < . , see figure s j; methods section . . . ). therefore, the computed scalar curvature is robust to changes in capture efficiency and sequencing depth. taken together, our computational analysis reveals non-trivial intrinsic geometry in scrnaseq data.

discussion

in this study, we explored two approaches to computing the curvature of data manifolds using tools from twin branches of differential geometry.
despite the prevalence of the laplace-beltrami operator in geometric data analysis [ , , , , ], an intrinsic approach to computing scalar curvature relying on this operator’s eigenvalues was determined to be infeasible for sample sizes of n ≈ typical of current scrnaseq datasets. although methods such as magic [ ] and diffusion pseudotime [ ] apply the laplace-beltrami operator to smooth scrnaseq data and infer cell differentiation trajectories respectively, using information intrinsic to the manifold, our results suggest that the embedding of the manifold in the ambient space provides valuable information necessary for estimating the intrinsic curvature. this observation is perhaps implicit in recent tools for estimating the laplace-beltrami operator, which first use moving local least-squares to approximate a surface, thereby incorporating information from the ambient space [ ]. certainly, we found that an extrinsic approach in which the embedding is retained, and curvature is determined by local quadratic fitting of datapoints in ambient coordinates, is feasible given the sample size and degree of noise in real-world datasets. to obtain the scalar curvature of data manifolds, our algorithm first computes the full riemannian curvature tensor. for other applications, this tensor can be used to compute other geometric quantities, such as ricci curvature, or may itself be of interest. more generally, we focused on intrinsic curvature because we were interested in geometric properties of the manifolds independent of their embeddings.
however, the second fundamental form used in our approach to compute the intrinsic curvature can be used to obtain all the information about the extrinsic curvature as well. indeed, hkij(p) exactly quantifies the extent to which the manifold deviates in the kth normal direction from the ij-tangent plane at point p. a key limitation of our algorithm is that the manifold dimension must be specified by the user. we also assumed that the manifold dimension is the same at every point in a dataset. extending the algorithm to determine the manifold dimension from the data itself, potentially in a position-dependent manner, may prove useful. in addition, there is no inherently correct length scale over which curvature should be computed for a data manifold. our algorithm chooses a length scale that varies from one part of the data manifold to another according to the density of points, and is tuned to achieve a user-specified level of uncertainty in the computed curvature. for some applications, it might be more sensible to fix a desired length scale for computing the curvature. as a demonstration of our algorithm, we computed the scalar curvature of image patches, and found that it was consistent with that of a klein bottle. this observation further validates the claim by carlsson et al. who showed that image patches have the topology of a klein bottle [ ]. unlike the klein bottle parameterization of image patches however, no definitive analytical form has been established for scrnaseq datasets. recent work has suggested the use of hyperbolic geometry to model branching cell differentiation trajectories [ ] and specific manifolds have been proposed to model reaction networks [ ], which may be applicable to scrnaseq data. these proposed manifolds can be validated or improved using knowledge of the intrinsic geometry of scrnaseq datasets. 
finally, incorporating information about curvature may provide a more principled approach for developing dimensionality reduction and visualization tools.

methods

differential geometry of theoretical manifolds

here we briefly discuss how to compute the scalar curvature of, and sample from, theoretical manifolds given a parameterization. for a d-dimensional manifold, m, with intrinsic coordinates $\{x_1, \ldots, x_d\}$ and embedding in $\mathbb{R}^n$ given by $f(x_1, \ldots, x_d)$, the metric is:

$$g_{ij} = \frac{\partial f^T}{\partial x_i} \frac{\partial f}{\partial x_j}$$

the scalar curvature of m can then be derived analytically in intrinsic coordinates in terms of the metric as

$$s = g^{ij} \left( \Gamma^k_{ij,k} - \Gamma^k_{ik,j} + \Gamma^l_{ij} \Gamma^k_{kl} - \Gamma^l_{ik} \Gamma^k_{jl} \right)$$

where the $\Gamma^i_{jk}$ are christoffel symbols given by

$$\Gamma^i_{jk} = \frac{1}{2} g^{il} \left( \frac{\partial g_{lj}}{\partial x_k} + \frac{\partial g_{lk}}{\partial x_j} - \frac{\partial g_{jk}}{\partial x_l} \right)$$

and $\Gamma^i_{jk,l} = \partial \Gamma^i_{jk} / \partial x_l$. to draw points from m with $a_i \le x_i \le b_i$ so that the embedded manifold is uniformly sampled in $\mathbb{R}^n$, we use rejection sampling. for paired random variables $x \sim \mathrm{uniform}(a, b)$ and $y \sim \mathrm{uniform}(0, \max \sqrt{\det g})$, we retain x as a sample point if $y \le \sqrt{\det g}\,|_x$, so that proposals are kept with probability proportional to the local volume element.

details of intrinsic approach to curvature estimation

here we explain how we used equations - on the simplest of toy manifolds, the noise-free -dimensional hollow unit sphere, s , to obtain an estimate of the average scalar curvature. the true scalar curvature is s(p) = ∀p ∈ m. for the remainder of this section, we adopt the convention that symbols with overbars are estimates of the corresponding unaccented quantities.
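The rejection-sampling step described in the section on theoretical manifolds can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the authors' code: it takes the unit 2-sphere with intrinsic coordinates θ1 ∈ [0, 2π] and θ2 ∈ [−π/2, π/2], for which √det g = |cos θ2|, and the function names are hypothetical.

```python
import math
import random


def sqrt_det_g(theta1, theta2):
    # Assumed chart on the unit 2-sphere: g = diag(cos^2(theta2), 1).
    return abs(math.cos(theta2))


def sample_uniform_on_sphere(n_samples, seed=0):
    """Rejection-sample intrinsic coordinates so the embedded points are uniform."""
    rng = random.Random(seed)
    max_weight = 1.0  # maximum of sqrt(det g) over the coordinate box
    points = []
    while len(points) < n_samples:
        theta1 = rng.uniform(0.0, 2.0 * math.pi)
        theta2 = rng.uniform(-math.pi / 2.0, math.pi / 2.0)
        y = rng.uniform(0.0, max_weight)
        # Keep the proposal with probability sqrt(det g) / max_weight.
        if y <= sqrt_det_g(theta1, theta2):
            points.append((
                math.cos(theta1) * math.cos(theta2),
                math.sin(theta1) * math.cos(theta2),
                math.sin(theta2),
            ))
    return points


points = sample_uniform_on_sphere(20000)
# For a uniform distribution on the unit sphere, the mean of z^2 is 1/3.
mean_z2 = sum(p[2] ** 2 for p in points) / len(points)
```

Checking a low-order moment such as the mean of z² against its uniform-sphere value is a cheap way to confirm that the accepted points are not biased toward the poles.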
approach for s

our approach mirrors the treatment in [ ], in which heat-traces are fit over various intervals [x ,x ] with x ≥ , to quadratic polynomials p (x) = c + c x + c x to estimate the geometric quantities in equation . here, we constrained the form of p (x) for fitting by assuming that (i) the manifold is boundary-less (so that c = c = and the second boundary term for c vanishes), (ii) the volume is known (so that c = c = π), and (iii) the scalar curvature is constant (so that c = π s), yielding p (x) = π + π sx . these are strong assumptions that will not hold for an arbitrary manifold, which already precludes this as a generic procedure. nonetheless, we proceeded for s to see if even with this privileged information, the scalar curvature could be estimated accurately. we declared an estimate to be accurate on the interval [x ,x ] if s has error within ± . i.e. s ∈ [ . , . ]. all quadratic fits were performed in matlab using the lsqnonlin function (‘steptolerance’= e- , ‘functiontolerance’= e- ). first, we evaluated zm(x) using analytical eigenvalues for s given by $\lambda_{(\ell-1)^2+1}, \ldots, \lambda_{\ell^2} = \ell(\ell-1)$, $\ell > 0$, and let dm be the collection of all intervals for which fits to p (x) yielded accurate s. dm corresponds to intervals where equation is accurate to our desired tolerance when the eigenvalues are known exactly. next, we uniformly sampled n = points from s (see figure a; methods section . . . ), estimated ∆m using the random walk graph laplacian with gaussian kernel (see equation in methods section . . ), and computed empirical eigenvalues, λk, from ∆m .
we selected n = as it is the same order of magnitude as the sample size of current scrnaseq experiments, and is sufficient to identify m as s by eye (see figure a). we verified if estimates zm(x), obtained by evaluating equation using λk, when fit as described above to p (x) over intervals in dm, recapitulated the accurate s obtained using zm(x). we restricted our attention to dm for calculations using empirical eigenvalues, since it is only over intervals in dm that it is even theoretically possible to compute scalar curvature to the desired accuracy. below, we report our findings for different m.

infinite series

we first applied this approach to the ideal case in equation , where infinite analytical eigenvalues are available. we computed z∞(x) (shown as a black line in figure s a) and obtained s by fitting p (x) over various intervals as described above. figure s b shows that d∞ is comprised of intervals with ≤ x < x ≲ . . for x ≳ . , errors from neglecting higher-order terms o(x ) in equation dominate. since zm(x) converges from ∞, x ≲ . necessarily holds for any interval in dm ∀m.

truncated series

we next considered zm(x) for m < n, since in practice, we will only have access to as many eigenvalues as datapoints (n). we computed z (x) using equation (shown as a solid blue line in figure s a), and obtained s by fitting p (x) (see figure s c). intervals in d roughly satisfy . ≲ x < x ≲ . . however, we found that z (x) (shown as a dashed blue line in figure s a) deviated markedly from z (x) in the rough interval [ . , . ], which has significant overlap with d . consequently, when we
fit p (x) to z (x) on d , the resulting s was not accurate for any interval in d (see figure s d). note that this inaccuracy was not a consequence of not using all n available eigenvalues. while picking m = n would reduce the lower bound on valid intervals in dm (since zm(x) converges from ∞), it is exactly for small x that s obtained from z (x) is already over-estimated as shown in figure s d. since zm (x) > zm (x) ∀x,m > m , using a truncated series with a larger m would simply exaggerate the difference between zm(x) and zm(x) for small x and cause scalar curvatures estimated using the latter to be further over-estimated. following this line of thought, we reasoned that picking fewer eigenvalues may ameliorate the issue. we selected m = (instead of a round number like m = so that all eigenvalues of a given multiplicity are included) and repeated this analysis for the same set of n = points. z (x) is shown as a solid red line in figure s a and the intervals over which fits to p (x) yield accurate s, d , are shown in figure s e. while z (x) (shown as a dashed red line in figure s a) has a much smaller deviation from z (x) than z (x) did from z (x), once again no estimate of s obtained from fits of z (x) to p (x) on d was sufficiently accurate (see figure s f).

eigenvalue convergence

we refrained from reducing m further to improve agreement between zm(x) and zm(x) after noting that the size of the intervals in dm shrinks with m. though we may have a better chance of computing accurate s with zm(x) on dm for smaller m, recall that in practice we will not have dm available to us since the analytical eigenvalues will be unknown. therefore, we simply shift the problem to one of choosing an interval that will yield an accurate s, from a shrinking pool of intervals that could even theoretically yield an accurate estimate.
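To make the effect of truncation concrete, the sketch below evaluates a truncated heat-trace series for the unit 2-sphere using analytical eigenvalues only, and reads a curvature estimate off its small-x behaviour. It rests on assumptions that are not copied from the text: the parameterization z_m(x) = 4π x² Σ_{k≤m} exp(−λ_k x²), the leading-order form 4π + (2π/3) s x² (the standard heat-trace asymptotics for a closed surface of area 4π), and a single evaluation point instead of a fit over an interval. The eigenvalues are indexed as ℓ(ℓ+1) with ℓ ≥ 0, which is the same spectrum as ℓ(ℓ−1) with ℓ > 0, and all function names are hypothetical.

```python
import math


def sphere_eigenvalues(m):
    """First m Laplace-Beltrami eigenvalues of the unit 2-sphere, with multiplicity."""
    eigenvalues = []
    ell = 0
    while len(eigenvalues) < m:
        eigenvalues.extend([ell * (ell + 1)] * (2 * ell + 1))
        ell += 1
    return eigenvalues[:m]


def heat_trace(x, eigenvalues):
    # Assumed form for a 2-dimensional manifold: 4*pi*x^2 * sum_k exp(-lambda_k * x^2).
    return 4.0 * math.pi * x * x * sum(math.exp(-lam * x * x) for lam in eigenvalues)


def curvature_from_heat_trace(x, eigenvalues):
    # Invert z(x) ~ 4*pi + (2*pi/3) * s * x^2, ignoring higher-order terms.
    return (heat_trace(x, eigenvalues) - 4.0 * math.pi) * 3.0 / (2.0 * math.pi * x * x)


x = 0.2
s_many = curvature_from_heat_trace(x, sphere_eigenvalues(10000))  # effectively untruncated
s_few = curvature_from_heat_trace(x, sphere_eigenvalues(100))     # strongly truncated
```

With many eigenvalues the estimate sits close to the true value of 2 at this x, whereas dropping the tail of the spectrum removes mass from the series at small x and pulls the estimate well below it, which is why only a limited window of x is usable for any finite m.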
instead, we compared the estimated $\bar{\lambda}_k$s with their true values, $\lambda_k$, and observed that the former consistently under-estimate the latter (see figure s h). furthermore, we found that the fractional error grows with k, exceeding % for k = , ..., . therefore, z (x) will only be accurate if n is large enough to limit the fractional error. to determine the required tolerance on the fractional error, we constructed a truncated series analogous to equation , but with eigenvalues interpolated between the analytical eigenvalues and the empirical eigenvalues determined for n = , according to a parameter f:

$$\tilde{z}_m(x; f) = (4\pi)^{d/2} x^{d} \sum_{k=1}^{m} e^{-\tilde{\lambda}_k(f)\, x^2}, \qquad \tilde{\lambda}_k(f) = \lambda_k + f(\bar{\lambda}_k - \lambda_k)$$

f signifies that the fractional error of the interpolated eigenvalues is reduced by $1 - f$ relative to the empirical eigenvalues determined for n = . we found that f ≤ . is needed so that z̃ (x; f) (shown as a green line in figure s a) fit to p (x) yields accurate s on half the intervals in d (see figure s g). given that the fractional error in estimating λ , ...,λ by λ , ...,λ is % when n = , how large does n have to be to reduce this fractional error to % × . ≈ %? a convergence rate for the fractional error is given in theorem of [ ]. for -dimensional manifolds: $|\bar{\lambda}_k - \lambda_k| / \lambda_k$ = o ( (log n) n ) ( ) assuming that the big-o bound is sharp at n = for k = , ..., (i.e. the prefactor is given by . log( ) ≈ . ), we extrapolated that at least n = datapoints are needed to reduce the fractional error to % (see figure s h).
equation also applies to empirical eigenvalues of ∆m constructed from weighted knn and r-neighborhood kernels instead of gaussian kernels (see methods section . . ). however, the prefactor in equation is actually worse for these estimators since their empirical eigenvalues have larger fractional errors at n = (see figure s h), so that even larger n would be required to attain the desired fractional error. lastly, note that while we had analytical eigenvalues available with which to ascertain m = as suitable, the naive approach of simply using all eigenvalues available (m = n), would require sample sizes that are even larger by several more orders of magnitude.

estimating the laplace-beltrami operator from data

for n points, {xi} ∈ rn, sampled from m, we estimated ∆m by normalizing the weight matrix w (see below) using the random walk normalization [ , ]. ∆m constructed using this normalization converges to ∆m when samples are drawn uniformly from the embedding of m in rn, as was done in our analysis. ∆m = � (in −d− w) d = diag{w } ( )
in is the n ×n identity matrix, ∈ rn is a vector of ones and the kernel width, �, is set to match that used in theorem of [ ]: � = (log n) n ( ) throughout our analysis, we used w = wg, the weight matrix with entries given by a gaussian kernel: [wg]i,j = exp(−‖xi −xj‖ /�) − δi,j ( ) to check whether other estimators had more benign prefactors for eigenvalue convergence (see figure s h), we also considered the weighted knn kernel, wknn , and the r-neighborhood kernel, wr, with r = � [ ]: [wknn ]i,j = [wg]i,j [ knn(j)(i) or knn(i)(j) ] [wr]i,j = bxi(r)(xj) − δi,j ( ) knn(i) is the set of indices of the k-nearest neighbors of point i in rn, bxi(r) is the n-dimensional ball of radius r centred at xi, and a(x) is the indicator function for x ∈ a.

details of extrinsic approach to curvature estimation

quadratic regression on local neighborhoods of data

here we describe the regression model for computing the coefficients of the second fundamental form, hkij, at a particular point p. as described in the main text, after performing pca on a neighborhood of np points around p in rn, each point in the neighborhood can be described in terms of d tangent coordinates, ti, and n−d normal coordinates, nk. we defer discussion of how the neighborhood is selected to methods section . . . the nks are treated as dependent variables that can be modelled as quadratic functions of the tis, which are taken to be independent variables. see equation below. linear terms are excluded since they ought to have zero coefficients in the tangent basis. constant terms, ck, are included to account for affine shifts. since hkij = hkji according to equation , in practice we only consider titj and hkij for j ≥ i so that t and h in equation have linearly independent columns, though we write the full form here for simplicity.
$$N = TH + E$$

$$N = \begin{bmatrix} n_1^{(1)} & \cdots & n_{n-d}^{(1)} \\ \vdots & \ddots & \vdots \\ n_1^{(N_p)} & \cdots & n_{n-d}^{(N_p)} \end{bmatrix}, \qquad E = \begin{bmatrix} \varepsilon_1^{(1)} & \cdots & \varepsilon_{n-d}^{(1)} \\ \vdots & \ddots & \vdots \\ \varepsilon_1^{(N_p)} & \cdots & \varepsilon_{n-d}^{(N_p)} \end{bmatrix} = \begin{bmatrix} \varepsilon^{(1)T} \\ \vdots \\ \varepsilon^{(N_p)T} \end{bmatrix}$$

$$T = \begin{bmatrix} 1 & t_1^{(1)} t_1^{(1)} & \cdots & t_1^{(1)} t_d^{(1)} & t_2^{(1)} t_1^{(1)} & \cdots & t_d^{(1)} t_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ 1 & t_1^{(N_p)} t_1^{(N_p)} & \cdots & t_1^{(N_p)} t_d^{(N_p)} & t_2^{(N_p)} t_1^{(N_p)} & \cdots & t_d^{(N_p)} t_d^{(N_p)} \end{bmatrix}$$

$$H = \begin{bmatrix} c_1 & h^1_{1,1} & \cdots & h^1_{1,d} & h^1_{2,1} & \cdots & h^1_{d,d} \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ c_{n-d} & h^{n-d}_{1,1} & \cdots & h^{n-d}_{1,d} & h^{n-d}_{2,1} & \cdots & h^{n-d}_{d,d} \end{bmatrix}^T$$

regression yields the following least-squares solution:

$$\hat{H} = (T^T T)^{-1} T^T N, \qquad \Sigma_\varepsilon = \frac{(N - T\hat{H})^T (N - T\hat{H})}{N_p}, \qquad \Sigma_h = \Sigma_\varepsilon \otimes (T^T T)^{-1}$$

where $\hat{H}$ is the matrix of estimates of the second fundamental form, $\Sigma_\varepsilon$ is the estimated covariance structure of the residuals so that $\varepsilon^{(i)} \sim \mathcal{N}(0, \Sigma_\varepsilon)$, and $\Sigma_h$ is the covariance matrix for $\hat{H}$. ⊗ denotes the kronecker product. we used the mvregress function in matlab to perform this regression in our code. when datapoints are sampled exactly from an analytical manifold, Σε measures the contribution of higher-order terms. in the limit of infinite sampling and infinitesimally small neighborhoods, Σε → 0. when observational noise is present (discussed in methods section . . . ), Σε also depends on the magnitude of the noise (σ in equation ).
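The regression step can be illustrated on the simplest possible case. The sketch below is a minimal example under several assumptions that are not taken from the authors' code: the manifold is the unit 2-sphere, the tangent frame at the north pole is known in advance (so the PCA step is skipped), there is a single normal direction, and a small pure-Python normal-equations solver stands in for `mvregress`. Only the independent columns (j ≥ i) are fitted, with the cross term entered as 2·t1·t2 so that its coefficient is h12. The normalization s = 8(h11·h22 − h12²) follows from the Gauss equation for this particular model without a factor of 1/2 and is likewise an assumption of the sketch.

```python
import math
import random


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_quadratic(neighborhood):
    """Least-squares fit of n = c + h11*t1^2 + 2*h12*t1*t2 + h22*t2^2."""
    rows = [[1.0, t1 * t1, 2.0 * t1 * t2, t2 * t2] for t1, t2, _ in neighborhood]
    targets = [normal for _, _, normal in neighborhood]
    k = 4
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear(gram, rhs)  # [c, h11, h12, h22]


# Points on the unit sphere near the north pole, written as (t1, t2, normal coordinate).
rng = random.Random(0)
radius = 0.2
neighborhood = []
for _ in range(500):
    rho = radius * math.sqrt(rng.random())
    phi = rng.uniform(0.0, 2.0 * math.pi)
    t1, t2 = rho * math.cos(phi), rho * math.sin(phi)
    neighborhood.append((t1, t2, math.sqrt(1.0 - rho * rho)))

c, h11, h12, h22 = fit_quadratic(neighborhood)
scalar_curvature = 8.0 * (h11 * h22 - h12 * h12)
```

Because the neighborhood has finite radius, the quartic term of the sphere leaks into the quadratic coefficients, so the recovered curvature lands slightly above the true value; shrinking `radius` reduces this bias at the cost of a noisier fit when noise is present.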
selecting local neighborhoods for regression

here we describe the procedure for selecting a neighborhood around each point p for computing the second fundamental form. we adopt the simplest approach of selecting the neighborhood to be a ball of radius r centred at p, bp(r). if r(p) is not specified, we set it according to statistical rather than geometric principles, since the geometry of the manifold may be non-trivial and unknown a priori. specifically, we set r(p) so that the elements in the covariance matrix, Σh, are upper-bounded by $\sigma_h^2$, the square of the specified target uncertainty. the largest elements in Σh are the variance terms on the main diagonal, corresponding to the squares of the standard errors, $\sigma_{h^k_{ij}}$, for the coefficients hkij. by inspection of equation :

$$\sigma^2_{h^k_{ij}} = [\mathrm{diag}\, \Sigma_\varepsilon]_k \, [\mathrm{diag}\, (T'T)^{-1}]_{(ij)}$$

where $[\mathrm{diag}\, \Sigma_\varepsilon]_k$ is the diagonal entry of Σε corresponding to the kth normal direction and $[\mathrm{diag}\, (T'T)^{-1}]_{(ij)}$ is the diagonal entry in $(T'T)^{-1}$ for which the corresponding entry in $T'T$ is $\sim \sum_l (t_i^{(l)} t_j^{(l)})^2$. increasing r(p) monotonically increases both np(r), the number of points in bp(r), and the average magnitude of elements in t, both of which reduce $\sigma_{h^k_{ij}}$. to avoid sweeping r(p) to find the minimum value such that $\max \sigma_{h^k_{ij}} < \sigma_h$, which is computationally expensive, for each point we instead model the dependence of np(r) on r as

$$N_p(r) \sim r^{d'}$$

so that

$$\frac{1}{\sigma^2_{h^k_{ij}}} \sim r^{d'+4}$$

to determine d′, np(r) is counted at log-spaced distances, ri, and a line is fit to the (log ri, log np(ri)) pairs for i ∈{ , ..., }. r is set to be the distance from p to the $\left(\frac{d(d+1)}{2} + 1\right)$-closest point to p (the minimum number of points needed for regression). r is set to be the distance from p to the furthest point from p.
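The estimate of d′ from neighbor counts amounts to a straight-line fit in log-log space. The sketch below is a minimal illustration, assuming a flat two-dimensional point cloud (uniform samples in the unit square), a query point at its centre, and ten log-spaced radii; these choices, and the function name, are hypothetical and chosen only so that the expected slope is known.

```python
import math
import random


def local_scaling_exponent(points, center, radii):
    """Slope of log N_p(r) against log r, used as the exponent d' in N_p(r) ~ r^d'."""
    distances = [math.dist(p, center) for p in points]
    log_r, log_n = [], []
    for r in radii:
        count = sum(1 for dist in distances if dist <= r)
        if count > 0:  # radii containing no points cannot be log-transformed
            log_r.append(math.log(r))
            log_n.append(math.log(count))
    mean_r = sum(log_r) / len(log_r)
    mean_n = sum(log_n) / len(log_n)
    cov = sum((a - mean_r) * (b - mean_n) for a, b in zip(log_r, log_n))
    var = sum((a - mean_r) ** 2 for a in log_r)
    return cov / var


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(20000)]
center = (0.5, 0.5)
# Ten log-spaced radii from 0.05 to 0.4, all inside the square to avoid edge effects.
radii = [0.05 * (8.0 ** (i / 9.0)) for i in range(10)]
d_prime = local_scaling_exponent(points, center, radii)
```

For uniformly sampled flat data the slope comes out near the manifold dimension; with non-uniform density it is a purely local exponent, which is presumably why it is fitted separately for each point.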
to solve for r, we first guess rg = r , perform regression on the set of points in bp(rg) and assign $\sigma_g^2$ to be the largest diagonal entry in Σh. if $\left| \sigma_g / \sigma_h - 1 \right|$ is within a desired tolerance, we set r = rg, or else we update rg as shown and iterate to convergence.

$$r_g \leftarrow r_g \left( \frac{\sigma_g}{\sigma_h} \right)^{\frac{2}{d'+4}}$$

for large datasets, we speed up computation by only selecting r in this manner for a subset of ncalib ≤ n randomly selected calibration points. all datapoints in the voronoi cell of each calibration point are then assigned the same r as the calibration point. unless otherwise specified ncalib = n.

goodness-of-fit test for quadratic regression

for a fixed density of points, there is a fundamental trade-off between reducing uncertainty in the hkijs and the validity of approximating local neighborhoods with quadratic fits. to reduce σh, more points must be included in the fit, but a larger neighborhood may not be well-modelled by only quadratic terms. conversely, $\frac{d(d+1)}{2} + 1$ points are sufficient to perform the regression, but there is then large uncertainty in the estimate of hkij. since our approach is to choose a neighborhood size to achieve a target σh, we include a companion goodness-of-fit (gof) statistic measuring how well the neighborhood is fit by a quadratic. namely, we use mardia’s test on the residuals from regression (ε(i) in equation ), which yields a p-value for the null hypothesis that the residuals are normally distributed [ ]. when the p-values are small, the quadratic regression model is unlikely to be valid.
in this case, curvatures computed using the resulting hkij may be suspect regardless of the tightness of the errorbars, and the user may want to consider increasing σh to reduce the neighborhood size. however, the poor gof may not be of concern if the length scale of interest is larger than the fluctuations in the manifold which give rise to the non-gaussian residuals (see methods section . . ). note that mardia’s test is relatively weak since it may yield false negatives for heteroskedastic residuals. this gof measure is therefore only provided as a computationally cheap consistency check. ideally, the density of sampled points is sufficiently high to (i) permit small σh and (ii) produce gof p-values that are uniformly distributed (consistent with the null model) and spatially uncorrelated.

standard error and bias of scalar curvature estimate

here we discuss how we compute the standard error, σs, of the estimate for s and note sources of estimator bias. since the riemannian curvature tensor in equation is a bilinear form and the tensor contraction in equation is a straightforward sum, σs can be computed using simple error propagation formulas in terms of the uncertainties from regression. specifically, the standard error we report is the first-order approximation to the second moment of a function of random variables:

$$\sigma_s = \sqrt{J^T \Sigma_h J}$$

where $J = \left. \frac{\partial s}{\partial h^k_{ij}} \right|_{\hat{H}}$. it is important to note that our estimate for s is biased and not normally distributed.
first, the hkijs are only normally distributed when the residuals (ε(i) in equation ) themselves are normally distributed. second, even when the hkijs are normally distributed, our estimate of s will not be due to its bilinear dependence on hkij. lastly, estimates for s can be biased in a manifold-dependent and even position-dependent way. for instance, the analytical scalar curvature of s embedded in r is given by s = (h h − h h ), with h = h = / and h = h = . numerically however, the symmetric off-diagonal terms will never be exactly zero, so s will be systematically under-estimated. this is apparent in the left tail of the blue histogram in figure i. in our experience, adding isotropic noise of small magnitude tends to remove the skew, presumably because then the residuals more closely match the regression assumptions (see for example figure s b, where the left tail disappears for σ = . ). furthermore, in our examples, we observed that computed scalar curvatures were less biased when the ambient and/or manifold dimensions were large. we speculate that this is because the increased number of terms (with alternating signs) in equations and leads to cancellation of errors, which is likely why the accuracy of computed scalar curvatures was higher for s , s and s than s , and the distribution of scalar curvatures less skewed (see figure i).

note on length scales

here we make three remarks regarding length scales relevant both for considering curvature theoretically and for applying our algorithm. first, note that scalar curvature has units of inverse length squared. therefore, scaling all the coordinates of the points on a manifold by a factor l changes the scalar curvature at all points by a factor of $l^{-2}$. thus, it is always important to contextualize the scalar curvature in terms of the global length scale associated with the manifold. for example, the scalar curvature of sd with radius r is $s_d(p) = d(d-1)/r^2 \;\; \forall p \in m$ (here l = r).
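Returning to the standard-error and bias discussion above, both effects can be reproduced numerically for a sphere-like surface. The sketch below is a minimal illustration under assumptions of its own: a single normal direction, the normalization s = 8(h11·h22 − h12²) with true coefficients h11 = h22 = 1/2 and h12 = 0, and independent Gaussian errors of equal size on the three coefficients. None of the names or numerical settings come from the authors' code.

```python
import math
import random


def scalar_curvature(h11, h12, h22):
    # Assumed normalization for a surface fitted as n = c + sum_ij h_ij * t_i * t_j.
    return 8.0 * (h11 * h22 - h12 * h12)


def delta_method_sigma(h11, h12, h22, cov):
    """First-order standard error sqrt(J^T cov J) for the parameters (h11, h12, h22)."""
    jac = [8.0 * h22, -16.0 * h12, 8.0 * h11]
    quad = sum(jac[i] * cov[i][j] * jac[j] for i in range(3) for j in range(3))
    return math.sqrt(quad)


sigma_h = 0.05
cov = [[sigma_h ** 2 if i == j else 0.0 for j in range(3)] for i in range(3)]
sigma_s = delta_method_sigma(0.5, 0.0, 0.5, cov)

# Monte Carlo check: perturb the coefficients and look at the resulting curvatures.
rng = random.Random(0)
draws = [
    scalar_curvature(rng.gauss(0.5, sigma_h), rng.gauss(0.0, sigma_h), rng.gauss(0.5, sigma_h))
    for _ in range(50000)
]
mean_s = sum(draws) / len(draws)
std_s = math.sqrt(sum((s - mean_s) ** 2 for s in draws) / (len(draws) - 1))
```

The first-order standard error agrees closely with the Monte Carlo spread, while the Monte Carlo mean falls below the true value of 2 because the squared off-diagonal term only ever subtracts from the estimate.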
in the case of the toy models shown in figure , the global length scale is l ≈ (see methods section . . ). for the image patch dataset, a normalization is applied which places all patches on s (see methods section . . ), so that the global scale is again l = . for scrnaseq data, we computed scalar curvature on the datapoints after preprocessing (see methods section . . ), without imposing any additional scaling correction to achieve a standardized global length scale. since other custom analyses also use these same boilerplate preprocessing steps, computing scalar curvatures in the context of the global length scale of the preprocessed data is sensible. for all three scrnaseq datasets, the global length scale happened to be l ≈ (see methods section . . ). second, since hkij is a dimension-ful quantity (which scales as $l^{-1}$), to keep the ratio of σs to s fixed when all coordinates are scaled by l, σh needs to be scaled by $l^{-1}$. lastly, we note that our choice of σh sets local length scales that are statistically rather than geometrically informed: neighborhoods are chosen to upper bound the uncertainty in estimates obtained from regression. this length scale can also be understood in terms of a bias-variance trade-off. large length scales reduce variance but may introduce a bias if the resulting neighborhoods are larger than features on the manifold. this manifests as poor gofs and can be corrected by finer sampling.
however, for manifolds with features at different length scales (such as a golf ball, which can be treated as dimples superimposed on s ), neighborhoods chosen by this heuristic can also be much smaller than the feature of interest, so that fine-scale curvature fluctuations are detected (dimples) while coarser features are neglected (s ). regardless, we default to this statistical approach because in general, the length scale of relevant features on a data manifold will not be uniform across the manifold or known a priori. however, we also provide the ability to manually set position-dependent r(p) in the software to facilitate ad hoc computation of curvatures at any length scale of interest.

details of toy manifold curvature computations

analytical forms

here we provide analytical forms for the toy manifolds shown in figures and s .

hypersphere

the d-dimensional unit hypersphere, sd, has intrinsic coordinates $\theta_1 \in [0, 2\pi]$, $\theta_2, \ldots, \theta_d \in [-\pi/2, \pi/2]$ and ambient coordinates in $\mathbb{R}^{d+1}$ given by:

$$x_i = \begin{cases} \prod_{j=1}^{d} \cos\theta_j, & i = 1 \\ \sin\theta_{i-1} \prod_{j=i}^{d} \cos\theta_j, & 1 < i \le d + 1 \end{cases}$$

using the relations in methods sections . , the scalar curvature is given by $s_d(p) = d(d-1) \;\; \forall p \in m$. to draw uniform samples from sd, instead of applying rejection sampling on these intrinsic coordinates as described in methods section . , it is more straightforward to let $x_i \sim \mathcal{N}(0, 1)$ and scale the resulting vector $(x_1, \ldots, x_{d+1})$ to have unit norm.
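The Gaussian-normalization route to uniform hypersphere samples is short enough to write out. The sketch below is a minimal illustration with hypothetical function names; the curvature helper simply evaluates the closed form d(d−1)/r² for a sphere of radius r.

```python
import math
import random


def sample_hypersphere(d, n_samples, seed=0):
    """Uniform samples from the unit d-sphere embedded in R^(d+1)."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n_samples:
        vec = [rng.gauss(0.0, 1.0) for _ in range(d + 1)]
        norm = math.sqrt(sum(v * v for v in vec))
        if norm > 0.0:  # guard against the measure-zero all-zero draw
            samples.append([v / norm for v in vec])
    return samples


def hypersphere_scalar_curvature(d, radius=1.0):
    # Closed form for a d-sphere of the given radius.
    return d * (d - 1) / radius ** 2


samples = sample_hypersphere(d=3, n_samples=1000)
norms = [math.sqrt(sum(v * v for v in s)) for s in samples]
```

This works because the standard multivariate normal is rotationally invariant, so the direction of each draw is uniform on the sphere and no proposals are rejected apart from the degenerate zero vector.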
one-sheet hyperboloid

the one-sheet hyperboloid, h , has intrinsic coordinates $\theta \in [0, 2\pi]$, $u \in \mathbb{R}$ and ambient coordinates in $\mathbb{R}^3$ given by:

$$x = a \cos\theta \sqrt{u^2 + 1}, \qquad y = b \sin\theta \sqrt{u^2 + 1}, \qquad z = cu$$

for figure e,f, we used a = b = and c = . using the relations in methods sections . , the scalar curvature is given by s(z) = − ( z + ) . to avoid edge effects in the z-direction, we constrained u ∈ [− , ], and sampled points as described in methods section . until a subset of n = had u ∈ [− , ]. scalar curvature was computed and visualized for these n = points.

ring torus

the -dimensional ring torus, t , has intrinsic coordinates $\theta, \phi \in [0, 2\pi]$ and ambient coordinates in $\mathbb{R}^3$ given by:

$$x = (R + r\cos\theta)\cos\phi, \qquad y = (R + r\cos\theta)\sin\phi, \qquad z = r\sin\theta$$

for figure g,h, we used R = . and r = . . using the relations in methods sections . , the scalar curvature is given by s(θ) = cos(θ) +cos(θ) .

hypercube

the m-dimensional cube of side length r, dmr , has intrinsic coordinates $z_1, \ldots, z_m \in [-r/2, r/2]$, and ambient coordinates in $\mathbb{R}^n$ for n ≥ m given by:

$$x_i = \begin{cases} z_i, & 1 \le i \le m \\ 0, & m < i \le n \end{cases}$$

using the relations in methods sections . , the scalar curvature is given by $s(p) = 0 \;\; \forall p \in m$.

practical issues for curvature estimation on real-world datasets

for real-world data, small sample size is only one of the potential confounders for accurately estimating curvature. here, we report how our algorithm fares when four other real-world confounders are applied to toy manifolds: non-uniform sampling, observational noise, large ambient dimension n, and uncertainty in the manifold dimension d.
non-uniform sampling

we expect our approach to handle non-uniform sampling of the manifold gracefully: smaller (larger) neighborhoods will be used on densely (sparsely) sampled portions of the manifold to encapsulate the number of points needed to achieve σh. to computationally verify the robustness of our tool to non-uniform sampling, we constructed a toy model to roughly match the (n, d, l) parameters for the scrnaseq datasets explored in the paper, for which non-zero scalar curvatures seemed to appear at smaller length scales/higher densities. specifically, we wanted to verify that non-zero scalar curvatures do not appear artifactually at specific length scales due to sharp changes in the local density of points sampled from a flat manifold. to this end, we formed a dataset with a sparse periphery and dense core by uniformly sampling n = points from d to establish a background density equal to points per unit volume, and n = points from d to create a core density roughly equal to points per unit volume (see methods section . . . ). we embedded these points in r by adding isotropic gaussian noise with σ = . to the eight normal directions, for all datapoints. we computed scalar curvature on this dataset for a fixed σh (see methods section . . ) and found no significant deviation from the true value of zero in either the sparse or dense regions (see figure s a). we next computed scalar curvatures at three fixed length scales corresponding to the , , and %-ile r values obtained using the specified σh (r = . , . and . respectively) and again saw no deviation from zero scalar curvature for points in either the sparse or dense region (see figure s a). we repeated this analysis for n = and again saw no deviation from zero scalar curvature, regardless of whether variable neighborhood sizes or fixed length scales (r = . , . and . corresponding to the same percentiles) were used (see figure s a).
Observational noise

Every ambient coordinate can be considered a measured observable with its own observational noise. Assuming each observable is distorted by independent, isotropic Gaussian noise with variance σ² (sometimes referred to as convolutional noise [ ]), datapoints x̃ ∈ ℝⁿ sampled from an embedded manifold M are modelled by:

$$\tilde{x} = x + \mathcal{N}(0, \sigma^{2} I), \qquad x \in M$$

To study the sensitivity of our algorithm to noise, we uniformly sampled n = datapoints from S ⊂ ℝ , added convolutional noise with σ ranging over several orders of magnitude, and estimated scalar curvatures using a fixed σh (see Methods section . . ). For small σ, the distribution of scalar curvatures was centred on the true value of , but once σ became large (≈ % of S 's radius), the estimated scalar curvatures approached (see Figure S b). Noise in the regression context does not change the expectation value of any estimated parameter. The apparent flattening that is observed therefore indicates that the noisy dataset (obtained from convolving M) has a geometry that is not trivially related to M. Certainly for σ ≈ , the noisy dataset does not even preserve the topology of M as S . From a practical perspective, it suffices to say that small convolutional noise can be handled by simple quadratic regression, while large convolutional noise obfuscates the original manifold. These observations are consistent with literature defining a manifold's reach [ , ], a noise scale beyond which noisy samples cannot be uniquely associated to a point on the noise-free manifold.
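The convolutional noise model above can be illustrated with a short sketch. It assumes a unit 2-sphere in ℝ³ and an arbitrary set of σ values; both are illustrative choices, and the helper names `sample_sphere` and `add_convolutional_noise` are hypothetical. The spread of radial distances gives a rough sense of when the noisy samples stop resembling the original sphere.

```python
import numpy as np

def sample_sphere(n, dim=3, seed=0):
    """Uniform samples on the unit sphere in R^dim (normalized Gaussian draws)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def add_convolutional_noise(x, sigma, seed=1):
    """Add independent isotropic Gaussian noise N(0, sigma^2 I) to every point."""
    rng = np.random.default_rng(seed)
    return x + rng.normal(0.0, sigma, size=x.shape)

clean = sample_sphere(5000)
for sigma in (0.001, 0.01, 0.1, 1.0):  # assumed range, several orders of magnitude
    noisy = add_convolutional_noise(clean, sigma)
    radii = np.linalg.norm(noisy, axis=1)
    print(f"sigma={sigma}: mean radius {radii.mean():.3f}, spread {radii.std():.3f}")
```

For small σ the radii stay tightly concentrated around 1; once σ is comparable to the radius, the samples fill a ball-like cloud and the spherical shell is no longer identifiable.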
When σ exceeds the manifold's reach, the relationship between the empirical density of sampled points and the original manifold is non-trivial even for a relatively forgiving model of manifold-orthogonal noise. The ridge manifold [ , ] of an empirical density has also been defined as an alternative to the unwieldy task of deconvolving noisy samples to recover a noise-free manifold. This definition avoids the notion of a noise-free manifold altogether and instead defines manifolds as ridges, contours along which the empirical density of points is maximized.

Large ambient dimension

A high-dimensional dataset may have an ambient space comprising tens of thousands of observables, i.e. n is very large. Meanwhile, the underlying manifold dimension, d, may be small. Since convolutional noise occurs in n dimensions, will a low-dimensional manifold still be discernible? To explore this, we uniformly sampled n = datapoints from S ⊂ ℝ , embedded these points in ℝⁿ for a range of n up to , and added convolutional noise of magnitude σ = . , . , and . in the n-dimensional ambient space. We computed curvatures for all combinations of n and σ using a fixed σh (see Methods section . . ). As n or σ increased, the algorithmically chosen neighborhood sizes, r(p), expanded to include enough datapoints to maintain the desired σh. The distribution of estimated scalar curvatures (shown in Figure S c) is centred on the true value of for n < and σ ≤ . . However, we observed that r was far less sensitive to changes in n than changes in σ. For example, increasing n from to at σ = . and tripling σ from . to . at n = required a comparable increase in r (see Figure S c). Therefore, consistent with the results of Methods section . . . , as long as the noise scale σ is small, a large ambient dimension n is not a confounder.
Practically however, to reduce computational overhead and avoid the large-n-and-σ case, it is still helpful to reduce the ambient dimension by projecting datapoints to an affine subspace containing the manifold (e.g. by PCA). Such a transformation does not change the intrinsic curvature.

Choice of manifold dimension

The last practical consideration is accurate selection of the manifold dimension, d, which we have so far assumed to be known. There is no consensus on the definition of d for a dataset, so various disciplines have devised different heuristics to determine d in a data-driven fashion [ ]. From the regression perspective, any d > corresponds to a well-defined regression problem. The choice of d merely determines how local coordinates are partitioned into independent (tangent) and dependent (normal) variables.

However, in our algorithm we noticed that some choices of d result in excessively large r(p) for a fixed σh. We explored this further using two toy manifolds and discovered a signature indicating that the specified manifold dimension may be incorrect. The manifolds considered were S ⊂ ℝ convolved with isotropic Gaussian noise with σ = . and S × S ⊂ ℝ , for which d∗, the true manifold dimension, is d∗ = and d∗ = respectively. We uniformly sampled n = points from each manifold and estimated scalar curvatures by holding σh fixed for different d (see Methods section . . ). For both manifolds, the average neighborhood size, r, was much larger for d > d∗ and d < d∗, than for d = d∗ (see Figure S d).
In the case of S , for d < d∗, the average neighborhood size was even larger than the global length scale, l, of the manifold. Since neighborhood sizes are chosen to achieve a target σh, manually decreasing r(p) is counter-productive and simply increases the uncertainty from regression above σh.

The large neighborhood sizes that emerged for both d > d∗ and d < d∗ can be understood in terms of the mis-assignment of normal vectors to the tangent space, or vice versa. According to equation , σ_{h^k_{ij}} increases with large variation in the normal direction ([diag Σε]_k), or with small variation in the tangent direction ([diag (T′T)⁻¹]_(ij)). When we choose d > d∗, we mis-attribute a normal direction with small variation [diag Σε]_k as an independent variable, whereas variation along the true tangent space is ≫ [diag Σε]_k. r must therefore be increased to compensate for the lack of variation along this direction mis-classified as tangent. When d < d∗, we have spuriously assigned a tangent direction with large variation to be a normal direction. Since this spurious normal coordinate cannot be well-approximated as a function of tangent coordinates from which it is linearly independent, the perceived noise scale ([diag Σε]_k) is exaggerated so that a larger neighborhood is needed to attain σh.

This suggests a crude, operational definition of what constitutes an incorrect choice of d. When σ_{h^k_{ij}} is large relative to the uncertainty in other coefficients, there is either too little variation along the ith and jth tangent directions, or too much variation along the kth normal direction. In the former case, the ith or jth tangent direction might be more appropriately classified as a normal direction (d is too large and should be decreased), while in the latter case, the kth normal direction might be more appropriately classified as a tangent direction (d is too small and should be increased). When this criterion is applied point-wise,
there may be a different acceptable choice of d for different parts of the manifold. When this criterion is generalized over the entire manifold, a σh yielding a flat distribution of GOF p-values when the manifold dimension is specified to be d will also yield a flat distribution for d + 1 but not necessarily for d − 1: if residuals in n − d dimensions are well-modelled by a multivariate Gaussian, so too will residuals in n − d − 1 dimensions, but not necessarily residuals in n − d + 1 dimensions (see Figure S d). Our observations are consistent with manifolds in literature with multiple possible manifold dimensions (like the helix manifold in [ ]), and which could generally arise from non-isotropic noise or non-uniform sampling.

Parameters for curvature estimation

For each manifold in Figure , we chose σh so that the fraction of points with GOF p-value ≤ α = . most closely matched the null model of normally distributed residuals consistent with neighborhood sizes well-approximated by quadratic regression (see section . . ). σh = ( . , . , . , . , . , . ) for (S , S , S , S , H , T ) resulted in ( . , . , . , . , . , . )% of points having GOF p-values ≤ α = . . Theoretically, max |h^k_{ij}| = ( . , , . ) for (S^d, H , T ) so our choices for σh result in small fractional errors in all cases. For Figure S a, we set σh = ( . , . ) for n = ( , ) respectively which resulted in ( . , . )% of points having GOF p-values ≤ α = . .
For all other panels in Figure S , where we were interested in ascertaining the sensitivity to different confounders, instead of minimizing uncertainty per se, we used a fixed value of σh = . . This choice resulted in neighborhoods small enough to be well-approximated by quadratic regression, manifesting as a roughly uniform distribution of GOF p-values in all cases.

Details of image patch dataset and Klein bottle manifolds

Notation and preliminaries

First we introduce some notation needed to describe the image patch dataset. We refer readers to [ , ] for a more detailed exposition. Let P be the space of all bivariate polynomials p : ℝ × ℝ → ℝ with p ∈ P, h : P → ℝ⁹ the vectorization operator given by

$$h(p) = [\,p(-1,1),\ p(-1,0),\ p(-1,-1),\ p(0,1),\ p(0,0),\ p(0,-1),\ p(1,1),\ p(1,0),\ p(1,-1)\,]^{T},$$

u : ℝᵐ → Sᵐ⁻¹ the normalization operator given by u(v) = v/‖v‖, and c : ℝ⁹ → ℝ⁸ the projection operator given by c(y) = ΛAᵀy, where A = [e₁ ... e₈], Λ = diag{1/‖e₁‖², ..., 1/‖e₈‖²}, and {e_i}
are vectorized basis vectors for the 2-dimensional discrete cosine transform (DCT) applied to 3 x 3 patches:

$$\begin{aligned}
e_1 &= [1, 0, -1, 1, 0, -1, 1, 0, -1]^{T}/\sqrt{6} \\
e_2 &= [1, 1, 1, 0, 0, 0, -1, -1, -1]^{T}/\sqrt{6} \\
e_3 &= [1, -2, 1, 1, -2, 1, 1, -2, 1]^{T}/\sqrt{54} \\
e_4 &= [1, 1, 1, -2, -2, -2, 1, 1, 1]^{T}/\sqrt{54} \\
e_5 &= [1, 0, -1, 0, 0, 0, -1, 0, 1]^{T}/\sqrt{8} \\
e_6 &= [1, 0, -1, -2, 0, 2, 1, 0, -1]^{T}/\sqrt{48} \\
e_7 &= [1, -2, 1, 0, 0, 0, -1, 2, -1]^{T}/\sqrt{48} \\
e_8 &= [1, -2, 1, -2, 4, -2, 1, -2, 1]^{T}/\sqrt{216}
\end{aligned}$$

By inspection, e₁ is the basis vector for patches with horizontal stripes and linear gradients, e₂ for patches with vertical stripes and linear gradients, e₃ for patches with horizontal stripes and quadratic gradients, e₄ for patches with vertical stripes and quadratic gradients, and e₅ for diagonally-oriented patches with quadratic gradients. All the patches produced by the embedding K in equation below and visualized in Figure b can be written as a linear combination of these basis vectors. Next, note that the components in each e_i sum to 0, so that the projection operator, c, additionally serves to remove the mean. Finally, observe that the vector norm formed under D = AΛ²Aᵀ (referred to hereafter as the D-norm following [ ]) measures the contrast in a 3 x 3 patch since

$$\|v\|_{D} = \sqrt{v^{T} D v} = \sqrt{\sum_{i} \sum_{j \sim i} (v_i - v_j)^{2}}$$

where j ∼ i refers to all vertical and horizontal neighbors, j, of a pixel i in the preimage of v under h. The e_i are normalized so that ‖e_i‖_D = 1.

Image dataset

We used the same van Hateren IML dataset [ ] consisting of greyscale images of size x pixels studied by Carlsson et al. in [ ] and followed the same preprocessing steps used there. In short, we applied a log p transformation to all pixel values and randomly sampled × (possibly overlapping) 3 x 3 patches from each image. We indexed the pixels in each patch using standard Cartesian coordinates with the middle pixel as the origin, so that log-transformed pixel values are given by p(x, y), x ∈ {−1, 0, 1}, y ∈ {−1, 0, 1}. We
then applied h to vectorize each patch p, and retained the high-contrast patches comprising the top quintile of D-norms for each image, resulting in n ≈ . × datapoints. Next, we normalized these high-contrast vectorized patches using the composition u ∘ c, resulting in a set of datapoints on S⁷ ⊂ ℝ⁸. We determined the density of these datapoints in ℝ⁸ using the kNN density estimator with k = , and retained the densest decile, which yielded n ≈ . × datapoints. This dense subset of high-contrast normalized patches was found using topological data analysis in [ ] to be a Klein bottle, K ⊂ S⁷, and is studied in Figures d,i and S b.

To generate the augmented image patch dataset used in Figures j and S e,f, we first considered all n ≈ . × vectorized high-contrast patches in the van Hateren IML dataset using the same procedure described above (each of the images yields × patches, of which the top 20% by D-norm are retained per image). These were normalized by u ∘ c as before to place them on S⁷ ⊂ ℝ⁸. We again wanted to retain the densest decile of points, since only these have the topology of a Klein bottle. Mirroring the approach in [ ] where the k used in the kNN estimator was scaled with sample size, k = used for n ≈ . × corresponds to k = × . × . × ≈ × for n ≈ . × . Computing k ≈ × neighbors for all n ≈ . × points is prohibitive however. To determine a reasonable smaller value of k, we randomly selected × points from the set of n ≈ . × on which to compare estimators and found that % of points in the densest decile as computed with k = × also appeared in the densest decile computed using k = × .
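The comparison of densest deciles under two choices of k can be sketched as below. This is an illustrative stand-in rather than the authors' implementation: it uses a brute-force distance matrix (only practical for small samples), ranks points by the distance to their k-th nearest neighbor (which orders them identically to the kNN density estimate), and uses a synthetic 2-dimensional dataset with hypothetical k values. The function names are assumptions.

```python
import numpy as np

def kth_neighbor_distance(points, k):
    """Distance from each point to its k-th nearest neighbor (brute force, O(n^2))."""
    sq = np.sum(points**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.maximum(d2, 0.0, out=d2)  # guard against tiny negative round-off
    # Position 0 in each sorted row is the point itself, so position k is the k-th neighbor.
    return np.sqrt(np.partition(d2, k, axis=1)[:, k])

def densest_fraction(points, k, fraction=0.1):
    """Indices of the densest `fraction` of points under the kNN density estimator."""
    r_k = kth_neighbor_distance(points, k)
    n_keep = int(np.ceil(fraction * len(points)))
    return np.argsort(r_k)[:n_keep]  # smaller k-th neighbor distance = higher density

def overlap(a, b):
    """Fraction of indices in `a` that also appear in `b`."""
    return len(np.intersect1d(a, b)) / len(a)

# Synthetic example: a sparse uniform background plus a tight cluster.
rng = np.random.default_rng(0)
pts = np.vstack([rng.uniform(-1, 1, size=(900, 2)),
                 rng.normal(0, 0.01, size=(100, 2))])
dense_small_k = densest_fraction(pts, k=10)
dense_large_k = densest_fraction(pts, k=20)
print(overlap(dense_large_k, dense_small_k))
```

A high overlap between the two deciles is the kind of evidence the text uses to justify substituting a smaller, cheaper k.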
We therefore used the latter value for density estimation and retained the n ≈ . × datapoints comprising the densest decile.

Parametric family of Klein bottle embeddings

Let θ, φ ∈ [0, 2π]. Bivariate polynomials parameterized by (θ, φ), k_{θ,φ} ∈ K_{θ,φ} ⊂ P, that satisfy k_{θ,φ} = k_{θ+π, 2π−φ} form a Klein bottle, K: the (θ, φ) ∼ (θ + π, 2π − φ) similarity relation results in edges being glued together in the manner definitional of a Klein bottle's topology (shown in Figure b). The candidate Klein bottle embedding supplied in [ ] to model image patch data satisfies the similarity relation ∀ x, y:

$$K \equiv k_{\theta,\phi}(x, y) = \cos\phi\,[x\cos\theta + y\sin\theta]^{2} + \sin\phi\,[x\cos\theta + y\sin\theta]$$

Note that any k_{θ,φ} ∈ K_{θ,φ} can be decomposed as:

$$k_{\theta,\phi} = c + \kappa_{\theta} + \kappa_{\phi} + \kappa_{\theta,\phi}$$

where κ_θ = κ_{θ+π}, κ_φ = κ_{2π−φ} and κ_{θ,φ} = κ_{θ+π, 2π−φ}. The first three terms can be understood as constant, θ-dependent and φ-dependent phases respectively.

We sought an embedding of the Klein bottle for which the sum of Euclidean distances from each image patch to its closest point on the embedding is minimized. To accomplish this, we constructed a parametric family of models for each of the four terms in equation . The first three of these are most conveniently expressed directly in the DCT basis.

$$\begin{aligned}
(c \circ h)(c) &= n_c \sum_{i=1}^{8} \mu_i e_i \\
(c \circ h)(\kappa_{\theta}) &= \sum_{i=1}^{8} \Bigg[ \sum_{\substack{j \le n_{\theta} \\ j\ \mathrm{even}}} \beta_{i,j}\cos(j\theta) + \gamma_{i,j}\sin(j\theta) \Bigg] e_i \\
(c \circ h)(\kappa_{\phi}) &= \sum_{i=1}^{8} \Bigg[ \sum_{j \le n_{\phi}} \zeta_{i,j}\cos(j\phi) \Bigg] e_i
\end{aligned}$$

n_c is a boolean variable, and n_θ and n_φ control the number of terms in the inner sum for (c ∘ h)(κ_θ) and (c ∘ h)(κ_φ) respectively.
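The gluing relation for the candidate embedding can be checked numerically on the 3 x 3 pixel grid. The sketch below assumes the candidate polynomial has the form cos φ · u² + sin φ · u with u = x cos θ + y sin θ; the placement of the square is inferred from the boundary conditions described in the text (quadratic gradients at φ = 0, π), and the function name `candidate_patch` is hypothetical.

```python
import numpy as np

# Pixel ordering of the vectorization operator h: x in (-1, 0, 1), and for each x, y in (1, 0, -1).
GRID = np.array([(x, y) for x in (-1, 0, 1) for y in (1, 0, -1)], dtype=float)

def candidate_patch(theta, phi):
    """Evaluate the assumed candidate polynomial on the 3 x 3 grid (vectorized as in h)."""
    u = GRID[:, 0] * np.cos(theta) + GRID[:, 1] * np.sin(theta)
    return np.cos(phi) * u**2 + np.sin(phi) * u

theta, phi = 0.7, 2.0
patch = candidate_patch(theta, phi)
glued = candidate_patch(theta + np.pi, 2 * np.pi - phi)
print(np.allclose(patch, glued))  # True: (theta, phi) and (theta + pi, 2*pi - phi) give the same patch
```

The identity holds because shifting θ by π flips the sign of u, which leaves u² unchanged, while replacing φ by 2π − φ flips the sign of sin φ and so compensates the sign change in the linear term.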
The expression for (c ∘ h)(κ_θ) only includes even coefficients for θ so that the similarity relation (θ) ∼ (θ + π) is satisfied. The expression for (c ∘ h)(κ_φ) only includes cosine terms so that the similarity relation (φ) ∼ (2π − φ) is satisfied. For κ_{θ,φ}, we refrained from writing a Fourier series-like expansion because we wanted to preserve the interpretation of θ and φ as parameters controlling the orientation and gradient respectively [ ]. Instead, we devised the following form, which we motivate further below:

κθ,φ(x,y) = mφ∑ l= cosl(φ) s+t≤mθ∑ ≤s,t≤mθ even and t odd − √ e − √ e , if t > even and s odd √ (e + e + e ) , if s > even and t > even

Note that the first inner sum in equation is a linear combination of basis vectors encoding purely quadratic gradients (e , e , e and e ), weighted by even trigonometric functions of θ. The prefactors on this inner sum are functions that are even in φ. This inner sum and its prefactor therefore jointly satisfy the similarity relation (θ, φ) ∼ (θ + π, 2π − φ) by independently satisfying (θ) ∼ (θ + π) and (φ) ∼ (2π − φ). Meanwhile, the second inner sum in equation is a linear combination of basis vectors containing linear gradients (e , e , e and e ), weighted by odd trigonometric functions of θ. The prefactors on this inner sum are functions that are odd in φ. This inner sum and its prefactor therefore jointly satisfy the similarity relation (θ, φ) ∼ (θ + π, 2π − φ), by independently satisfying (θ) ∼ −(θ + π) and (φ) ∼ −(2π − φ). Since the trigonometric functions of θ are coupled to (x, y), θ controls the rotation of stripes in the image patches, just as in K . Similarly, since the prefactors on the inner sums are functions of φ, φ controls the relative contribution of quadratic gradients (e , e , e and e in the first inner sum) and linear gradients (e , e , e and e in the second inner sum).
Lastly, the boundary conditions for θ and φ in this parameterization of κ_{θ,φ} yield patches with vertical (horizontal) stripes when θ = 0 (θ = π/2), and linear (quadratic) gradients when φ = π/2, 3π/2 (φ = 0, π) just as in K .

A Klein bottle embedding belonging to this parametric family, k^α_{θ,φ} ∈ K_{θ,φ}, can therefore be specified in terms of a vector f = [n_c, n_θ, n_φ, m_θ, m_φ] defining its functional form, and a corresponding coefficient vector α = [µ_i, ..., β_i, ..., γ_i, ..., ζ_i, ..., η_i, ..., ϑ_i]. In this parametric family of Klein bottle embeddings, K corresponds to f = [ , , , , ] with α = [η , , ,η , , ,η , , ,ϑ , , ,ϑ , , ] = [ , , , , ]. Note that since curvatures are only computed on the embedding after normalization, α is only meaningfully defined up to a multiplicative constant.

Associating image patches to a Klein bottle embedding

For a given Klein bottle embedding, k^α_{θ,φ} ∈ K_{θ,φ}, we associated each datapoint v_i (already vectorized and normalized by u ∘ c ∘ h) to the closest point on k^α_{θ,φ} by minimizing the Euclidean distance in ℝ⁸:

$$(\hat{\theta}_i, \hat{\phi}_i) = \operatorname{argmin}_{\theta,\phi} \left\| (u \circ c \circ h)\left(k^{\alpha}_{\theta,\phi}\right) - v_i \right\|$$

We solved this minimization using the lsqnonlin function ('StepTolerance' = e- , 'FunctionTolerance' = e- ) in MATLAB, supplying initial conditions corresponding to analytical values for a point on K :

θ̂i = arctan e tvi −e tvi ( vi∈(u◦c◦h)(k ) = arctan sin φ̂i sin θ̂i sin φ̂i cos θ̂i ) φ̂i = arctan √ (e tvi) + (e tvi) (e tvi) + e tvi  vi∈(u◦c◦h)(k )= arctan √ sin φ̂i cos φ̂i 

We constrained solutions to θ̂_i ∈ [0, π] and φ̂_i ∈ [0, 2π] according to the (θ, φ) similarity relation.
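The closest-point association can be illustrated with a coarse grid search over (θ, φ). This sketch is a deliberately simplified stand-in for the procedure in the text, under several assumptions: it uses an exhaustive grid instead of MATLAB's lsqnonlin, it replaces the u ∘ c ∘ h pipeline with mean-centring followed by Euclidean normalization in the 9-dimensional pixel space, and it uses the assumed candidate polynomial cos φ · u² + sin φ · u. The names `embed` and `associate` and the grid resolution are hypothetical.

```python
import numpy as np

# Pixel ordering of h: x in (-1, 0, 1), and for each x, y in (1, 0, -1).
GRID = np.array([(x, y) for x in (-1, 0, 1) for y in (1, 0, -1)], dtype=float)

def embed(theta, phi):
    """Assumed candidate embedding, mean-centred and scaled to unit Euclidean norm.

    theta and phi may be scalars or arrays of matching shape.
    """
    theta = np.asarray(theta, dtype=float)[..., None]
    phi = np.asarray(phi, dtype=float)[..., None]
    u = GRID[:, 0] * np.cos(theta) + GRID[:, 1] * np.sin(theta)
    v = np.cos(phi) * u**2 + np.sin(phi) * u
    v = v - v.mean(axis=-1, keepdims=True)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def associate(v, n_theta=181, n_phi=361):
    """Grid search for the (theta, phi) in [0, pi] x [0, 2*pi] closest to patch v."""
    thetas = np.linspace(0.0, np.pi, n_theta)
    phis = np.linspace(0.0, 2.0 * np.pi, n_phi)
    tt, pp = np.meshgrid(thetas, phis, indexing="ij")
    table = embed(tt, pp)                      # shape (n_theta, n_phi, 9)
    dist = np.linalg.norm(table - v, axis=-1)  # Euclidean distance to every grid point
    i, j = np.unravel_index(np.argmin(dist), dist.shape)
    return thetas[i], phis[j], dist[i, j]

# Recover the parameters of a patch that lies exactly on the embedding.
true_theta, true_phi = np.pi * 40 / 180, 2 * np.pi * 100 / 360
v = embed(true_theta, true_phi)
theta_hat, phi_hat, d_min = associate(v)
```

In practice a grid search of this kind would more naturally supply the starting point for a local least-squares refinement, which is the role the analytical initial conditions play in the text.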
Optimal Klein bottle embedding

Let k^α̂_{θ,φ} ∈ K_{θ,φ} be the Klein bottle embedding that minimizes the sum of Euclidean distances in ℝ⁸ between each image patch and the closest point on the embedding. To determine k^α̂_{θ,φ} given a functional form f, we initialized the coefficient vector α̂ to have zero entries everywhere except for the values used in K . We then iterated between optimizing for (θ̂_i, φ̂_i) according to equation and for α̂ as shown below using least-squares, until convergence:

$$\hat{\alpha} = \operatorname{argmin}_{\alpha} \sum_{i} \left\| (u \circ c \circ h)\left(k^{\alpha}_{\hat{\theta}_i,\hat{\phi}_i}\right) - v_i \right\|$$

K ≡ k^α̂_{θ,φ} is the optimized Klein bottle embedding corresponding to f = [ , , , , ], for which results are shown in Figures h and S d.

Noisy Klein bottle embeddings

The set of n ≈ . × image patches was associated to K according to the procedure described in Methods section . . , yielding (θ̂_i, φ̂_i) values. Isotropic Gaussian noise of magnitude sσ was added element-wise in ℝ⁹ (prior to normalization by u ∘ c) to h(k_{θ̂_i,φ̂_i}), where s = median_i{‖h(k_{θ̂_i,φ̂_i})‖ } ≈ . . Figures f,g and S a correspond to noise with σ = . , . and . .

Parameters for curvature estimation

For all scalar curvature computations on image patch datasets and Klein bottle embeddings, we set d = and ncalib = . Unless the neighborhoods were manually specified, we used σh = . , which yielded a flat distribution of GOF p-values ( . % of points reported GOF p-values ≤ α = . ) for the set of n ≈ . × points on K closest to the image patches (shown in Figure e).
Details of scRNAseq datasets

The PBMC dataset provided by 10x Genomics comprises n = PBMCs collected from a healthy donor [ ]. The mouse gastrulation dataset consists of n = cells collected at nine -hour intervals from embryonic day . to . [ ]. The mouse brain dataset is a benchmark from 10x Genomics consisting of n = cells collected from the cortex, hippocampus and ventricular zone of two embryonic mice sacrificed at embryonic day [ ].

Preprocessing

For the PBMC dataset, we applied standard preprocessing steps using Seurat v . . [ ] with default function arguments, to extract PC projections and UMAP coordinates ourselves. Specifically, we removed cells where the percentage of transcripts corresponding to mitochondrial genes exceeded %, or which had fewer than transcripts. This reduced the number of cells from to . On this filtered set, we normalized the data (NormalizeData(normalization.method='LogNormalize', scale.factor= )), retained the most variable genes (FindVariableFeatures(selection.method='vst', nfeatures= )), and scaled the data (ScaleData). Next, we performed linear dimensionality reduction using PCA down to dimensions (RunPCA(npcs= )) and generated UMAP coordinates for visualization (RunUMAP(dims = : )). For the gastrulation (brain) dataset, we did not preprocess the data ourselves but instead directly used the ( ) PC projections and UMAP (t-SNE) visualization coordinates provided with the dataset. Please refer to [ , ] for additional details.

Cell type annotations

For the PBMC dataset, the AddModuleScore(ctrl= ) function was used to compute the per-cell average expression of marker genes corresponding to seven different cell types [ ]. To prepare Figure S a, each cell was assigned the cell type for which its average marker gene expression was the highest. Cell type annotations for the gastrulation dataset (see Figure S a) were sourced from Figure c of [ ]. Cell type
annotations for the brain dataset (see Figure S a) are predicted labels sourced from [ ].

Statistical tests

Here we describe the statistical tests applied to scalar curvatures computed for the scRNAseq datasets.

Spatial precision of errorbars

Let m be the fraction of datapoints with % CIs containing the scalar curvatures reported by their respective kNNs. To check whether m was significantly larger than chance, we used a permutation test. We randomly assigned the kNN of each datapoint to be one of the n datapoints in the dataset and computed m. We repeated the procedure t = times to generate an empirical distribution of m for the null model of random neighbors. The reported p-value for each k is the fraction of the t trials for which m was greater than the value computed for data. See Figures S d, S d and S d.

Sensitivity to cell downsampling

To check the sensitivity of the computed scalar curvatures to the average density of cells, we discarded f% of cells at random from the ambient space computed using the original set of n datapoints, and recomputed scalar curvatures using the same ambient dimension, manifold dimension and neighborhood sizes as for the original dataset (see Methods section . . ). Let m be the fraction of downsampled datapoints with % CIs containing the scalar curvatures originally reported. Since the CIs grow as f increases, we checked whether m was significantly larger than chance by using a permutation test.
We randomly paired each of the % CIs computed after downsampling, to one of the scalar curvatures reported by the downsampled points for the original dataset, and computed m. We repeated the procedure t = times to generate an empirical distribution of m for the null model. The reported p-value for each f is the fraction of the t trials for which m was greater than the value computed for data. See Figures S i, S i and S i.

Sensitivity to transcript downsampling

To check the sensitivity of the computed scalar curvatures to the capture efficiency and sequencing depth of the data, we discarded f% of transcripts at random from the single-cell count matrix for the PBMC dataset, then performed the same preprocessing steps described in Methods section . . . We recomputed scalar curvatures using the same ambient dimension, manifold dimension and neighborhood sizes as for the original dataset (see Methods section . . ). Let m be the fraction of datapoints with % CIs containing the scalar curvatures originally reported. To check whether m was significantly larger than chance, we used a permutation test. We randomly paired each of the % CIs computed after downsampling transcripts, to one of the scalar curvatures computed for the original dataset, and computed m. We repeated the procedure t = times to generate an empirical distribution of m for the null model. The reported p-value for each f is the fraction of the t trials for which m was greater than the value computed for data. See Figure S j.
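The permutation tests described above share one structure: compute the fraction m of confidence intervals that contain a paired curvature value, then compare it to the same statistic under random pairings. The following sketch shows that structure for the nearest-neighbor version of the test. The function names, the number of trials, the synthetic curvature profile and the CI half-width are all illustrative assumptions, not values from the paper.

```python
import numpy as np

def coverage(ci_low, ci_high, values):
    """Fraction of confidence intervals [ci_low, ci_high] that contain the paired value."""
    return np.mean((ci_low <= values) & (values <= ci_high))

def permutation_pvalue(ci_low, ci_high, curvature, neighbor_idx, n_trials=1000, seed=0):
    """Compare neighbor coverage m against a null model of randomly assigned neighbors."""
    rng = np.random.default_rng(seed)
    n = len(curvature)
    m_data = coverage(ci_low, ci_high, curvature[neighbor_idx])
    m_null = np.empty(n_trials)
    for t in range(n_trials):
        # Null model: each datapoint's "neighbor" is drawn uniformly from the dataset.
        random_neighbors = rng.integers(0, n, size=n)
        m_null[t] = coverage(ci_low, ci_high, curvature[random_neighbors])
    # p-value: fraction of null trials whose coverage exceeds the observed coverage.
    return m_data, np.mean(m_null > m_data)

# Synthetic example: curvature varies smoothly along a 1-d ordering of the points.
n = 500
curvature = np.linspace(-1.0, 1.0, n)
ci_low, ci_high = curvature - 0.05, curvature + 0.05
neighbor_idx = (np.arange(n) + 1) % n  # each point's neighbor is the next point
m_data, p_value = permutation_pvalue(ci_low, ci_high, curvature, neighbor_idx)
```

The downsampling tests follow the same pattern, with the random pairing applied between the recomputed CIs and the originally reported curvatures instead of between neighbors.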
Parameters for curvature estimation

Let the variance explained by the ith PC be given by σ_i² and the cumulative fractional variance of the first m PCs by

$$c_m = \frac{\sum_{i=1}^{m} \sigma_i^{2}}{\sum_{i} \sigma_i^{2}}.$$

For each dataset, we selected the ambient dimension as n = argmax_m{c_m | c_m ≤ . }, the manifold dimension as d = argmax_m{c_m | c_m ≤ . }, and considered the global length scale to be l = σ_d. (n, d, l) = ( , , . ), ( , , . ) and ( , , . ) for the PBMC, gastrulation and brain datasets respectively. For the three datasets, we computed scalar curvatures for manifold dimensions d − 1, d and d + 1. It was not always possible to select σh for each dataset and manifold dimension so that the distribution of GOF p-values was flat, according to our usual heuristic. For consistency, we therefore picked σh so that / of points had GOF p-values ≤ α = . . For manifold dimension (d − 1, d, d + 1), σh = ( . , . , . ), ( . , . , . ) and ( . , . , . ) for the PBMC, gastrulation and brain datasets respectively.

Acknowledgements

DS was funded in part by the Natural Sciences and Engineering Research Council of Canada (NSERC PGSD - - ). SW was supported by NCI U -CA and NIH NIGMS T GM . DS and SH acknowledge funding from NIH NIGMS R GM , U Systems Immunology Pilot Project Grant at Harvard University, and the Harvard University William F. Milton Fund. The authors would like to thank Peter Kharchenko and Allon Klein for helpful discussions. Portions of this research were conducted on the O2 High Performance Compute Cluster, supported by the Research Computing Group, at Harvard Medical School. See http://rc.hms.harvard.edu for more information.

Data and code availability

The van Hateren IML dataset is available at http://bethgelab.org/datasets/vanhateren and was loaded according to the instructions there. The PBMC dataset is available at https://support.10xgenomics.com/single-cell-gene-expression/datasets/ . . /parent_ngsc _di_pbmc.
The gastrulation dataset can be retrieved using instructions found at https://github.com/marionilab/embryotimecourse . The brain dataset is available at https://support.10xgenomics.com/single-cell-gene-expression/datasets/ . . / m_neurons. The software package described here to compute scalar curvature is available at https://gitlab.com/hormozlab/manifoldcurvature. All code and instructions to reproduce the numerics and figures in this study will be made available upon publication.

Supplementary figures
Figure S : The scalar curvature of S is poorly estimated using the Laplace-Beltrami operator. (a) The heat-trace with m terms (zm(x) in equation ) is shown for m = ∞ (black), m = (solid blue) and m = (solid red), when evaluated with analytical eigenvalues for S . Empirical eigenvalues were obtained by uniformly sampling n = points from S (see Figure a; Methods section . . . ) and estimating the Laplace-Beltrami (LB) operator using equations - . The heat-trace evaluated using these empirical eigenvalues, ẑm, is shown for m = (dashed blue) and m = (dashed red). The heat-trace evaluated using eigenvalues obtained by interpolating between the analytical and empirical values (z̃m(x; f) in equation ) is shown for m = and f = . (solid green). f signifies that the fractional error of the interpolated eigenvalues is reduced by 1 − f relative to the empirical eigenvalues. f = 0 corresponds to the analytical eigenvalues while f = 1 corresponds to the empirical eigenvalues. The white region bounded by [x , x ] indicates a candidate interval over which to fit a heat-trace to a quadratic in order to extract an estimate for the scalar curvature (see equations - ; Methods section . . ). On the one hand, since the knee of zm(x) shifts to the left as m increases (i.e. zm(x) converges from ∞), larger m results in more intervals for which zm(x) well-approximates z∞(x) and will therefore yield accurate scalar curvature estimates. On the other hand, ẑm(x) becomes a worse estimator for zm(x) as m increases.
(b) scalar curvatures estimated by fitting z∞(x) to a quadratic over different intervals [x ,x ] as defined in (a). scalar curvatures are shown in color for intervals yielding accurate estimates (s ∈ [ . , . ]). this colored region corresponds to d∞. (c) as in (b) but with estimates obtained by fitting a quadratic to z (x). the colored region corresponds to d . by inspection, d ⊂ d∞. (d) scalar curvatures estimated by fitting z (x) to a quadratic over each interval in d . though d was constructed using only intervals which yielded an accurate scalar curvature estimate when analytical eigenvalues were used in the heat-trace, no interval in d yields an accurate scalar curvature estimate when the same number of empirical eigenvalues is used in the heat-trace instead. (e) as in (b) but with estimates obtained by fitting a quadratic to z (x). the colored region corresponds to d . by inspection, d ⊂ d . (f) as in (d) but with estimates obtained by fitting z (x) to a quadratic over each interval in d . no estimate is accurate just as in (d). (g) as in (f) but with estimates obtained by fitting z̃ (x; f = . ) to a quadratic over each interval in d . f = . was chosen so that half the intervals in d yield an accurate scalar curvature estimate. (h) (left) the fractional error in the first empirical eigenvalues of the lb estimator from (a) is shown in red. this operator was computed using the gaussian kernel (wg in equation ). eigenvalues - have a fractional error of %. the fractional error of the eigenvalues of lb estimators computed on the same n = points but using the weighted knn and r-neighborhood kernels (wknn and wr respectively in equation ) is also plotted. positive error indicates under-estimation. (right) projected fractional error for eigenvalues - of the lb estimator with gaussian kernel computed using a larger sample size (n). the projection is based on the convergence rate given in theorem of [ ], assuming that the big-o bound is sharp at n = for eigenvalues - .
the dashed green line corresponds to the % fractional error needed for scalar curvatures to be accurately estimated for half the intervals in d . this corresponds to f = . in (g) since % × f = %. for the lb estimator computed using the gaussian kernel, achieving this fractional error requires n ≈ . since lb estimators computed using the other kernels have the same convergence rate but larger fractional error at n = , these estimators would require even larger n to achieve the desired % fractional error.
figure s : sensitivity of algorithm to real-world confounders. (a) (left) a dataset with a sparse periphery and a dense core was formed by uniformly sampling n = points from the -dimensional cube of side-length , d , and n = points from the -dimensional cube of side-length , d (see methods section . . . ). these points were embedded in r and padded with isotropic gaussian noise of magnitude σ = . in the normal directions. scalar curvatures (s) were computed on this dataset of n + n points by setting σh and are plotted against their standard errors (σs) in the leftmost panel. curvature computations were also performed at fixed length scales corresponding to the , and %-ile values for neighborhood size (left to right) used in the leftmost panel (r = . , . and . respectively). here, points for which the chosen r led to neighborhoods with insufficient points for regression are not shown. for large length scales, all points in the dense region are able to report curvatures but are crowded into the apex of the plots.
the n (n ) sparse (dense) points are shown in blue (green). points enclosed by the red lines have % cis including the true value of zero. the right four panels show analogous results when n = . here the , and %-ile values for neighborhood size are r = . , . and . respectively. see methods section . . . . (b) distribution of scalar curvatures computed for n = points uniformly sampled from s ⊂ r and convoluted with isotropic gaussian noise of magnitude σ in r . noise confounds accurate scalar curvature computation when σ is roughly % of the sphere’s radius. the deviation of the estimated scalar curvatures from the true value of (shown as a dashed red line) for σ ≥ . reflects the nontrivial geometry of a manifold convoluted by noise. see methods section . . . . (c) (left) n = points were uniformly sampled from s and embedded in rn. isotropic gaussian noise of magnitude σ was applied to each of the n ambient dimensions. scalar curvatures computed by keeping σh fixed for all n and σ recapitulated the true value of (shown as dashed red lines) for n ≤ and σ ≤ . . (right) the neighborhood size (r) necessary to attain σh is less sensitive to changes in n than changes in σ. see methods section . . . . (d) n = points were uniformly sampled from (left) s ⊂ r convoluted with isotropic gaussian noise in the ambient space with σ = . and (right) s ×s ⊂ r . to investigate the effects of choosing the manifold dimension, d, differently than the true value, d∗, σh was kept fixed, and scalar curvatures were computed for d = d∗ − (cyan), d = d∗ + (magenta) and d = d∗ (green). the panels show the distribution of (left to right) scalar curvatures (s), standard errors (σs) and gof p-values. the true value of the scalar curvature (at d = d∗) is constant across both manifolds and shown as a dashed red line. the average neighborhood size (r averaged over all points) is much larger for both d = d∗ − and d = d∗ + than for d = d∗ as shown in the legend.
for the same σh, d = d∗ − also leads to a more skewed distribution of gof p-values relative to d = d∗, while the distribution for d = d∗ + is still flat. see methods section . . . .
figure s : additional details of the image patch dataset and klein bottle embeddings (related to figure ). (a) to compute scalar curvatures for figure e, each image patch was associated to the (θ ,φ ) coordinates of the closest point on k . here we select a handful of these associated points on k (shown in black) and visualize how neighborhoods chosen in r to compute scalar curvatures for figure e appear in (θ ,φ ) coordinates (shown in red). when noise of increasing magnitude, σ, is added to the set of closest points on k (see methods section . . ), the neighborhood size at each point grows until σh is attained. (b) as in (a), but showing neighborhoods used in computing the scalar curvatures in figure d for the image patch dataset. note the close correspondence in neighborhood size with σ = . in (a). (c) scalar curvatures computed for the set of closest points (θ ,φ ) on k as in figure e, but using the same neighborhood sizes determined for the image patch dataset shown in figure d, some of which are visualized in (b). (d) as in (a) but showing neighborhoods used in computing the scalar curvatures in figure h for the set of closest points on k . neighborhoods are visualized on (θ ,φ ) coordinates instead of (θ ,φ ) coordinates for ease of comparison. (e) as in (b) but showing neighborhoods used in computing the scalar curvatures in figure j for the augmented image patch dataset.
(f) scalar curvatures computed for the augmented image patch dataset with n ≈ . × points as in figure j, but using the same neighborhood sizes determined for the original image patch dataset with n ≈ . × shown in figure d and (b). note the close correspondence with figure d.
figure s : additional details of the pbmc scrnaseq dataset (related to figure ). (a) cell types overlaid onto umap coordinates and sorted in decreasing order of abundance in the legend. cells were annotated as described in methods section . . . (b) a goodness-of-fit p-value was computed for each point by applying mardia’s test to the residuals obtained from fitting the neighborhood around the point to a quadratic function (see methods section . . ). these p-values are visualized on umap coordinates corresponding to each point (left) and their empirical distribution is shown using a histogram (right). small p-values suggest that the residuals are non-normal so that approximating local neighborhoods as quadratic may not be valid. (c) pearson correlation between the scalar curvature reported by each point and its kth-nearest neighbor (knn) for different k (shown in blue).
the red bar shows the mean and standard deviation of the pearson correlation when neighbors are chosen randomly over trials (*p < − ). (d) the percentage of points with % cis containing the scalar curvatures reported by their respective knns (shown in blue). the red bar shows the mean and standard deviation of this percentage when neighbors are chosen randomly over trials (*p < . ; see methods section . . . ). (e) the neighborhood size (r) used for computing scalar curvature at each point, overlaid onto umap coordinates (left) and a corresponding histogram of the empirical distribution (right). the dashed red lines correspond to the , , and %-ile values of r(p) used for computing scalar curvatures at fixed neighborhood sizes for figure c. see methods section . . . (f) the number of points in each neighborhood (corresponding to the neighborhood sizes in (e)) overlaid onto umap coordinates (left) and a corresponding histogram of the empirical distribution (middle). (right) the set of neighbors used for computing scalar curvature (purple) is visualized on umap coordinates for a handful of points (black). (g) scalar curvatures were computed for manifold dimension d− (left) and d + (right). they are plotted here on umap coordinates after smoothing over the same set of k = neighbors used in figure a. see methods section . . . (h) the total number of transcripts observed in each cell overlaid onto umap coordinates. (i) scalar curvatures were computed after downsampling the number of cells in the ambient space by a factor of (left) and (middle), using the same ambient dimension, manifold dimension and neighborhood sizes determined for the original dataset. they are plotted here on umap coordinates after smoothing over the same set of neighbors (which survive downsampling) used in figure a.
(right) the percentage of points in the downsampled datasets with a % ci containing the originally reported scalar curvature (blue), and likewise for a negative control obtained by randomly pairing % cis and originally reported scalar curvatures for points in the downsampled dataset (red). errorbars for the negative control are the standard deviation of this percentage over trials with different random pairings (*p < . ; see methods section . . . ). (j) scalar curvatures were computed after downsampling the number of transcripts by a factor of (left) and (middle), using the same ambient dimension, manifold dimension and neighborhood sizes determined for the original dataset. they are plotted here on umap coordinates after smoothing over the same set of k = neighbors used in figure a. (right) the percentage of points in the downsampled datasets with a % ci containing the originally reported scalar curvature (blue), and likewise for a negative control obtained by randomly pairing % cis and originally reported scalar curvatures for points in the downsampled dataset (red). errorbars for the negative control are the standard deviation of this percentage over trials with different random pairings (*p < . ; see methods section . . . ).
figure s : additional details of the gastrulation scrnaseq dataset (related to figure ). panels as in figure s .
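the downsampling check described in the (right) sub-panels of (i) and (j) for the pbmc dataset can be sketched in a few lines of python. this is an illustrative sketch under stated assumptions, not the authors' code: the function names `coverage` and `shuffled_coverage`, the toy numbers, the number of trials and the use of the population standard deviation are all hypothetical choices, and the significance test behind the reported p-value is not reproduced here.

```python
import random
import statistics


def coverage(intervals, values):
    """Percentage of (low, high) intervals that contain their paired value."""
    hits = sum(1 for (low, high), v in zip(intervals, values) if low <= v <= high)
    return 100.0 * hits / len(values)


def shuffled_coverage(intervals, values, trials, seed=0):
    """Negative control: pair the intervals with randomly permuted values.

    Returns the mean and (population) standard deviation of the coverage
    over `trials` random pairings. The choice of pstdev is an assumption.
    """
    rng = random.Random(seed)
    shuffled = list(values)
    results = []
    for _ in range(trials):
        rng.shuffle(shuffled)
        results.append(coverage(intervals, shuffled))
    return statistics.mean(results), statistics.pstdev(results)


# Toy data (made up): curvatures reported on the full dataset, and confidence
# intervals recomputed for the same points after downsampling.
original = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
intervals = [(v - 0.4, v + 0.4) for v in original]
intervals[-1] = (10.0, 11.0)  # one interval deliberately misses its point

observed = coverage(intervals, original)
control_mean, control_sd = shuffled_coverage(intervals, original, trials=200)
```

if the curvature estimates are stable under downsampling, `observed` should sit well above `control_mean`, with `control_sd` playing the role of the error bar on the negative control.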
figure s : additional details of the brain scrnaseq dataset (related to figure ). panels as in figure s but with t-sne instead of umap plots.
references
[ ] a. m. klein, l. mazutis, i. akartuna, n. tallapragada, a. veres, v. li, l. peshkin, d. a. weitz, and m. w. kirschner. droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. cell, ( ): – , .
[ ] e. z. macosko, a. basu, r. satija, j. nemesh, k. shekhar, m. goldman, i. tirosh, a. r. bialas, n. kamitaki, e. m. martersteck, j. j. trombetta, d. a. weitz, j. r. sanes, a. k. shalek, a. regev, and s. a. mccarroll. highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. cell, ( ): – , .
[ ] g. x. y. zheng, j. m. terry, p. belgrader, p. ryvkin, z. w. bent, r. wilson, s. b. ziraldo, t. d. wheeler, g. p. mcdermott, j. zhu, m. t. gregory, j. shuga, l. montesclaros, j. g. underwood, d. a. masquelier, s. y. nishimura, m. schnall-levin, p. w. wyatt, c. m. hindson, r. bharadwaj, a. wong, k. d. ness, l. w. beppu, h. j. deeg, c. mcfarland, k. r. loeb, w. j. valente, n. g. ericson, e. a. stevens, j. p. radich, t. s. mikkelsen, b. j. hindson, and j. h. bielas. massively parallel digital transcriptional profiling of single cells. nature communications, ( ): – , .
[ ] d. r. bandura, v. i. baranov, o. i. ornatsky, a. antonov, r. kinach, x. lou, s.
pavlov, s. vorobiev, j. e. dick, and s. d. tanner. mass cytometry: technique for real time single cell multitarget immunoassay based on inductively coupled plasma time-of-flight mass spectrometry. analytical chemistry, ( ): – , .
[ ] c. giesen, h. a. o. wang, d. schapiro, n. zivanovic, a. jacobs, b. hattendorf, p. j. schüffler, d. grolimund, j. m. buhmann, s. brandt, z. varga, p. j. wild, d. günther, and b. bodenmiller. highly multiplexed imaging of tumor tissues with subcellular resolution by mass cytometry. nature methods, ( ): – , .
[ ] j-r. lin, m. fallahi-sichani, j-y. chen, and p. k. sorger. cyclic immunofluorescence (cycif), a highly multiplexed method for single-cell imaging. current protocols in chemical biology, ( ): – , .
[ ] j-r. lin, b. izar, s. wang, c. yapp, s. mei, p. m. shah, s. santagata, and p. k. sorger. highly multiplexed immunofluorescence imaging of human tissues and tumors using t-cycif and conventional optical microscopes. elife, , .
[ ] l. h. nguyen and s. holmes. ten quick tips for effective dimensionality reduction. plos computational biology, ( ):e , .
[ ] j. b. tenenbaum. a global geometric framework for nonlinear dimensionality reduction. science, ( ): – , .
[ ] l. van der maaten and g. hinton. visualizing data using t-sne. journal of machine learning research, (nov): – , .
[ ] e. becht, l. mcinnes, j. healy, c-a. dutertre, i. w. h. kwok, l. g. ng, f. ginhoux, and e. w. newell. dimensionality reduction for visualizing single-cell data using umap. nature biotechnology, ( ): – , .
[ ] a. hatcher. algebraic topology.
cambridge university press, .
[ ] r. ghrist. barcodes: the persistent topology of data. bulletin of the american mathematical society, ( ): – , .
[ ] d. perrault-joncas and m. meilâ. non-linear dimensionality reduction: riemannian metric estimation and the problem of geometric discovery. arxiv, .
[ ] j. m. lee. riemannian manifolds: an introduction to curvature (graduate texts in mathematics). springer, .
[ ] a. zomorodian and g. carlsson. computing persistent homology. discrete & computational geometry, ( ): – , .
[ ] g. carlsson. topology and data. bulletin of the american mathematical society, ( ): – , .
[ ] m. bernstein, v. de silva, j. c. langford, and j. b. tenenbaum. graph approximations to geodesics on embedded manifolds. technical report, department of psychology, stanford university, .
[ ] f. chazal, m. glisse, c. labruère, and b. michel. convergence rates for persistence diagram estimation in topological data analysis. journal of machine learning research, ( ): – , .
[ ] c. r. genovese, m. perone-pacifico, i. verdinelli, and l. wasserman. minimax manifold estimation. journal of machine learning research, ( ): – , .
[ ] g. carlsson, t. ishkhanov, v. de silva, and a. zomorodian. on the local behavior of spaces of natural images. international journal of computer vision, ( ): – , .
[ ] p. lawson, a. b. sholl, j. q. brown, b. t. fasy, and c. wenk. persistent homology for the quantitative evaluation of architectural features in prostate cancer histology. scientific reports, ( ): – , .
[ ] j. m. chan, g. carlsson, and r. rabadan. topology of viral evolution.
proceedings of the national academy of sciences, ( ): – , .
[ ] p. g. cámara, a. j. levine, and r. rabadán. inference of ancestral recombination graphs through topological data analysis. plos computational biology, ( ):e , .
[ ] e. abbott. flatland: a romance of many dimensions. princeton university press, .
[ ] m. belkin and p. niyogi. laplacian eigenmaps and spectral techniques for embedding and clustering. advances in neural information processing systems, : – , .
[ ] m. reuter, f-e. wolter, and n. peinecke. laplace–beltrami spectra as ‘shape-dna’ of surfaces and solids. computer-aided design, ( ): – , .
[ ] m. belkin, j. sun, and y. wang. constructing laplace operator from point clouds in rd. in proceedings of the twentieth annual acm-siam symposium on discrete algorithms, pages – , .
[ ] j. liang, r. lai, t. w. wong, and h. zhao. geometric understanding of point clouds using laplace-beltrami operator. in ieee conference on computer vision and pattern recognition, pages – , .
[ ] n. g. trillos, m. gerlach, m. hein, and d. slepčev. error estimates for spectral convergence of the graph laplacian on random geometric graphs toward the laplace–beltrami operator. foundations of computational mathematics, ( ): – , .
[ ] h. p. mckean jr. and i. m. singer. curvature and the eigenvalues of the laplacian. journal of differential geometry, ( - ): – , .
[ ] b. andrews. lectures on differential geometry. https://maths-people.anu.edu.au/~andrews/dg. australian national university.
[ ] i. t. jolliffe and j. cadima. principal component analysis: a review and recent developments. philosophical transactions of the royal society a: mathematical, physical and engineering sciences, ( ): , .
[ ] h. federer. curvature measures. transactions of the american mathematical society, ( ): – , .
[ ] p. niyogi, s. smale, and s. weinberger. finding the homology of submanifolds with high confidence from random samples. discrete & computational geometry, ( - ): – , .
[ ] u. ozertem and d. erdogmus. locally defined principal curves and surfaces. journal of machine learning research, : – , .
[ ] c. r. genovese, m. perone-pacifico, i. verdinelli, and l. wasserman. nonparametric ridge estimation. the annals of statistics, ( ): – , .
[ ] r. w. buccigrossi and e. p. simoncelli. image compression via joint statistical characterization in the wavelet domain. ieee transactions on image processing, ( ): – , .
[ ] j. malik, s. belongie, t. leung, and j. shi. contour and texture analysis for image segmentation. international journal of computer vision, ( ): – , .
[ ] a. b. lee, k. s. pedersen, and d. mumford. the nonlinear statistics of high-contrast patches in natural images. international journal of computer vision, ( - ): – , .
[ ] j. h. van hateren and a. van der schaaf. independent component filters of natural images compared with simple cells in primary visual cortex. proceedings: biological sciences, ( ): – , .
[ ] x genomics. pbmcs from a healthy donor: whole transcriptome analysis. https://support. xgenomics.com/single-cell-gene-expression/datasets/ . . /parent_ngsc _di_pbmc, .
[ ] b. pijuan-sala, j. a. griffiths, c. guibentif, t. w. hiscock, w. jawaid, f. j. calero-nieto, c. mulas, x. ibarra-soria, r. c. v. tyser, d. l. l. ho, w. reik, s. srinivas, b. d. simons, j. nichols, j. c. marioni, and b. göttgens. a single-cell molecular map of mouse gastrulation and early organogenesis. nature, ( ): – , .
[ ] x genomics. . million brain cells from e mice. https://support.
xgenomics.com/single-cell-gene-expression/datasets/ . . / m_neurons, .
[ ] d. van dijk, r. sharma, j. nainys, k. yim, p. kathail, a. j. carr, c. burdziak, k. r. moon, c. l. chaffer, d. pattabiraman, b. bierie, l. mazutis, g. wolf, s. krishnaswamy, and d. pe’er. recovering gene interactions from single-cell data using data diffusion. cell, ( ): – , .
[ ] l. haghverdi, m. büttner, f. a. wolf, f. buettner, and f. j. theis. diffusion pseudotime robustly reconstructs lineage branching. nature methods, ( ): – , .
[ ] a. klimovskaia, d. lopez-paz, l. bottou, and m. nickel. poincaré maps for analyzing complex hierarchies in single-cell data. nature communications, ( ): – , .
[ ] s. wang, j-r. lin, e. d. sontag, and p. k. sorger. inferring reaction network structure from single-cell, multiplex data, using toric systems theory. plos computational biology, ( ):e , .
[ ] m. hein, j-y. audibert, and u. von luxburg. graph laplacians and their convergence on random neighborhood graphs. journal of machine learning research, ( ): – , .
[ ] d. ting, l. huang, and m. jordan. an analysis of the convergence of graph laplacians. arxiv, .
[ ] k. v. mardia. measures of multivariate skewness and kurtosis with applications.
biometrika, ( ): – , .
[ ] p. campadelli, e. casiraghi, c. ceruti, and a. rozza. intrinsic dimension estimation: relevant techniques and a benchmark framework. mathematical problems in engineering, : – , .
[ ] a. butler, p. hoffman, p. smibert, e. papalexi, and r. satija. integrating single-cell transcriptomic data across different conditions, technologies, and species. nature biotechnology, ( ): – , .
[ ] y. hu, m. ranganathan, c. shu, x. liang, s. ganesh, a. osafo-addo, c. yan, x. zhang, b. e. aouizerat, j. h. krystal, d. c. d’souza, and k. xu. single-cell transcriptome mapping identifies common and cell-type specific genes affected by acute delta -tetrahydrocannabinol in humans. scientific reports, ( ): – , .
[ ] k. xie, y. huang, f. zeng, z. liu, and t. chen. scaide: clustering of large-scale single-cell rna-seq data reveals putative and rare cell types. nar genomics and bioinformatics, ( ), .
deephbv: a deep learning model to predict hepatitis b virus (hbv) integration sites.
canbiao wu ¶, xiaofang guo ¶, mengyuan li ¶, xiayu fu , zeliang hou , manman zhai , , jingxian shen , xiaofan qiu , zifeng cui , hongxian xie , pengmin qin , xuchu weng , zheng hu , *, jiuxing liang *
key laboratory of brain, cognition and education sciences, ministry of education, china; institute for brain research and rehabilitation, south china normal university, guangzhou, china.
department of medical oncology of the eastern hospital, the first affiliated hospital, sun yat-sen university, guangzhou, guangdong, china
department of gynecological oncology, the first affiliated hospital, sun yat-sen university, guangzhou, guangdong, china
department of thoracic surgery, the first affiliated hospital, sun yat-sen university, guangzhou, guangdong, china
school of psychology, south china normal university, guangzhou, guangdong, china
generulor company bio-x lab, guangzhou, guangdong, china
department of obstetrics and gynecology, tongji hospital, tongji medical college, huazhong university of science and technology, wuhan, hubei, china
*corresponding author email: huzheng @ .com (zh), liangjiuxing@m.scnu.edu.cn (jl)
¶these authors contributed equally to this work.
biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by . international license (http://creativecommons.org/licenses/by/ . /).
abstract
hepatitis b virus (hbv) is one of the main causes of viral hepatitis and liver cancer. previous studies showed that hbv can integrate into the host genome and further promote malignant transformation. in this study, we developed an attention-based deep learning model, deephbv, to predict hbv integration sites by learning local genomic features automatically.
we trained and tested deephbv using the hbv integration site data from the dsvis database. initially, deephbv showed an auroc of . and an aupr of . on the dataset. adding repeat peaks and tcga pan cancer peaks significantly improved the model performance, with an auroc of . and . and an aupr of . and . , respectively. on an independent validation dataset of hbv integration sites from visdb, deephbv with hbv integration sequences plus tcga pan cancer peaks (auroc of . and aupr of . ) performed better than with hbv integration sequences plus repeat peaks (auroc of . and aupr of . ). next, we found that transcription factor binding sites (tfbs) were significantly enriched near the genomic positions that received attention from the convolutional neural network. the binding sites of ar-halfsite, arnt, atf , bhlhe , bhlhe , bmal , clock, c-myc, coup-tfii, e a, ebf , erra and foxo were highlighted by the deephbv attention mechanism in both the dsvis dataset and the visdb dataset, revealing the hbv integration preference. in summary, deephbv is a robust and explainable deep learning model not only for the prediction of hbv integration sites but also for further mechanistic study of hbv-induced cancer.
author summary
hepatitis b virus (hbv) is one of the main causes of viral hepatitis and liver cancer. previous studies showed that hbv can integrate into the host genome and further promote malignant transformation. in this study, we developed an attention-based deep learning model, deephbv, to predict hbv integration sites by learning local genomic features automatically.
the performance of the deephbv model improved significantly after adding genomic features, with an auroc of . and an aupr of . . furthermore, we identified the transcription factor binding sites of proteins that were enriched in the regions highlighted by the convolutional neural network. in summary, deephbv is a robust and explainable deep learning model, useful not only for predicting hbv integration sites but also for further study of the hbv integration mechanism.

introduction

hbv is the main cause of viral hepatitis and liver cancer (hepatocellular carcinoma, hcc) [ ]. it is a small dna virus that can integrate into the host genome via an rna intermediate [ ]. first, hbv attaches to and enters hepatocytes, then transports its nucleocapsid, which contains a relaxed circular dna (rcdna), to the host nucleus. in the host nucleus, rcdna is converted into covalently closed circular dna (cccdna), which produces messenger rnas (mrna) and pregenomic rna (pgrna) by transcription. via reverse transcription in the host nucleus, pgrna produces new rcdna and double-stranded linear dna (dsldna), which tend to integrate into the host cell genome [ ]. a previous study showed that hbv integration breakpoints are distributed randomly across the whole genome with a handful of hotspots [ ]. for instance, hbv was reported to recurrently integrate into the telomerase reverse transcriptase (tert) and myeloid/lymphoid or mixed-lineage leukemia (mll , also known as kmt b) genes. the insertional events were also accompanied by altered expression of the integrated gene [ , , ], indicating important biological impacts on the local genome.
further analysis revealed an association between hbv integration and genomic instability in these insertional events [ ]. moreover, significant enrichment of hbv integration was found near the following genomic features in tumours compared with non-tumour tissue: repetitive regions, fragile sites, cpg islands and telomeres [ ]. however, the pattern and the mechanism of hbv integration remain to be explored. many of the hbv integration sites are distributed throughout the human genome and seem completely random [ , , ]. whether the features and patterns of these “random” viral integration events can be learned and extracted remains an open question; answering it would greatly improve the understanding of carcinogenesis induced by hbv integration.

deep learning performs well in computational biology research, for example in medical image identification [ ] and motif discovery in protein sequences [ ]. the convolutional neural network (cnn) is a central architecture in deep learning, which enables a computer to learn from training data [ ]. although deep learning performs excellently in a variety of fields, how it reaches a decision is hard to explain because of its black box effect. therefore, an approach named the attention mechanism, which highlights the most influential parts of the input, was introduced to open the “black box” [ , ]. in this study, we developed deephbv, an attention-based model that predicts hbv integration sites using deep learning.
the attention mechanism calculates an attention weight for each position and, at the same time, connects the encoder and the decoder. it highlights the regions that deephbv concentrates on and helps to identify the patterns that receive attention. deephbv can predict hbv integration sites accurately and specifically, and the attention mechanism identified positions with potentially important biological meaning.

results

deephbv effectively predicts hbv integration sites by adding genomic features

the deephbv model structure and the scheme for encoding a kb sample into a binary matrix are described in fig . the deephbv model was tested with our hbv integration sites database (http://dsvis.wuhansoftware.com). hbv integration sequences were prepared from hbv integration sites as positive and negative samples following the steps in methods. the number of negative samples was set to twice the number of positive samples to keep the data balanced and to improve the confidence level. the positive samples were divided into and as the positive training and testing datasets. correspondingly, we extracted and negative samples as the negative training and testing datasets. deephint, an existing deep learning model that predicts hiv integration sites from their surroundings [ ], was also evaluated using hbv integration sequences for training and testing. both models were trained on the same hbv integration training dataset and evaluated on the same testing dataset. deephbv with hbv integration sequences showed an auroc of . and an aupr of . 
while deephint with hbv integration sequences demonstrated an auroc of . and an aupr of . (fig ). the comparison of deephbv and deephint is described in the discussion. several previous studies showed that hbv integration has a preference for surrounding genomic features such as repeats, histone markers and cpg islands [ , ]. thus, we tried to add these genomic features to deephbv by mixing genomic feature samples with hbv integration sequences as new datasets, and then trained and tested the updated deephbv models. we downloaded the following genomic features from different datasets [ - ] and divided them into four subgroups: (1) dnase clusters, fragile site, repeatmasker; (2) cpg islands, genehancer; (3) cons mammals, tcga pan-cancer; (4) h k me chip-seq, h k ac chip-seq (s fig). after obtaining the positions of the genomic features (sources are listed in s table), we extended the positions to bp and extracted the corresponding sequences from the hg reference genome. we defined these sequences as positive genomic feature samples. then we mixed hbv integration sequences, positive genomic feature samples and randomly picked negative genomic feature samples (see methods) and trained the deephbv model. once a subgroup performed well, we re-tested each genomic feature in that subgroup to determine which specific genomic feature significantly affected model performance (s fig) (auroc and aupr values are recorded in s table). from the roc and pr curves, we found that deephbv with hbv integration sites plus the genomic features repeat (auroc: . and aupr: . ) and tcga pan cancer (auroc: . and aupr: . 
) can significantly improve hbv integration site prediction performance compared with deephbv with hbv integration sequences alone (fig ). we also performed the same test on deephint, but did not find a subgroup that substantially improved model performance (results are recorded in s table). together, adding repeat or tcga pan cancer peaks to the hbv integration sequences significantly improved the performance of deephbv.

validation of deephbv using independent dataset visdb

because deephbv needs to be applicable to general datasets, we tested the pre-trained deephbv models (deephbv with hbv integration sequences + repeat peaks and deephbv with hbv integration sequences + tcga pan cancer peaks) on the hbv integration sites dataset of another virus integration site (vis) database, visdb [ ]. the model trained with hbv integration sequences + repeat sequences showed an auroc of . and an aupr of . , while the model trained with hbv integration sequences + tcga pan cancer showed an auroc of . and an aupr of . . the deephbv model with hbv integration sequences + tcga pan cancer performed better than the deephbv model with hbv integration sequences + repeat and was more robust on both the testing dataset from dsvis (auroc: . and aupr: . ) and the independent testing dataset from visdb (auroc: . and aupr: . ). thus, we decided to use this model for further study of hbv integration sites.
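auroc and aupr are used throughout the results to compare models. as a minimal sketch of what these two numbers measure, the functions below compute them from binary labels and prediction scores with numpy. the toy labels and scores are invented for illustration, and aupr is computed here as average precision, which may differ slightly from the interpolation used to produce the reported values.

```python
import numpy as np

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    labels = np.asarray(labels)
    # rank samples from highest to lowest score
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # average the precision values at the ranks where a positive is found
    return float((precision_at_k * hits).sum() / hits.sum())

# hypothetical toy data: 1 = integration sequence, 0 = background sequence
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
print(auroc(labels, scores), average_precision(labels, scores))
```

an auroc of 0.5 corresponds to random ranking, whereas the aupr baseline equals the fraction of positive samples, so the two metrics should be read against different reference levels when negatives outnumber positives.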
study the preference pattern of hbv integration by conserved sequence elements

deephbv can extract features with translation invariance through the pooling operation, which enables it to recognise certain patterns even if the features are slightly shifted. adding an attention mechanism to the deephbv framework may partly open the deep learning black box by giving an attention weight to each position. each attention weight represents the computational importance of that position in the deephbv decision. the attention weights in the attention layer were extracted after two de-convolution operations and one de-pooling operation, and the output shape is × . each score represents the attention weight of a bp region. positions with higher attention weight scores may have a more important impact on the pattern recognition of deephbv, meaning that these positions may be the critical points for identifying hbv integration positive samples. we first averaged the fractions of attention scores over all hbv integration sequences and normalized them to the mean of all positions. we then visualised the fractions of attention scores and found that peak-valley-peak patterns appeared only in positive samples (fig ). we were interested in the positions with higher attention weights in the convolutional neural network. in the attention weight distribution of deephbv with hbv integration sites + tcga pan cancer, a cluster of attention weights much higher than the other weights often occurred in the positive samples.
in contrast, the model of deephbv with hbv integration sites + repeat did not show this pattern (fig ). to further explore the pattern behind these positions with higher attention weights, we defined the sites with the top % highest attention weights as attention intensive sites, and the regions of bp near them as attention intensive regions. we mapped these attention intensive sites to the hg reference genome together with genomic features (fig ), but found that the positional relationship between attention intensive sites and genomic features was not clear. the results indicated that there may be other specific patterns that are closely related to hbv integration preference and that could be recognised by the deephbv model.

in deep learning, the convolution and pooling module learns patterns with translation invariance: the network tends to learn domains that occur recurrently among different samples in the same pooling matrix, even if the learned feature is not at the same position in these samples [ , ]. attention intensive regions are therefore more likely to be conserved, and may give hints about the selection preference of hbv integration sites. transcription factor binding site (tfbs) motifs are conserved genomic elements that can be critical to the regulation of downstream genes. therefore, we tested whether tfbs play important roles in hbv integration preference. we used all hbv integration samples whose prediction scores were higher than . 
from dsvis and visdb separately to identify enriched local tfbs motifs in attention intensive regions using homer v . . [ ] with its vertebrate transcription factor database (table ). for deephbv with hbv integration sequences + tcga pan cancer, the binding sites of ar-halfsite, arnt, atf , bhlhe , bhlhe , bmal , clock, c-myc, coup-tfii, e a, ebf , erra, foxo , heb, hic , hif- b, lrf, meis , mitf, mnt, myog, n-myc, npas , npas, nr a , ptf a, snail , tbx , tbx , tcf , tead , tead , tead , tead, tgif , tgif , thrb, usf , usf , zac , zeb , zfx, znf , znf were enriched in the attention intensive regions of both dsvis and visdb sequences. we selected two representative samples for a more intuitive display. genomic features, hbv integration sites from dsvis and visdb, attention intensive sites and tfbs were aligned and shown on the hg reference genome (fig ). most attention intensive sites could be mapped to enriched tf motifs. the clusters of high attention weights from the output of deephbv with hbv integration sites plus tcga pan cancer coincided with the binding site of the tumour suppressor gene hic and of the circadian clock related elements bmal , clock, c-myc and npas (fig ). the data provide novel insights into hbv integration site selection preference and reveal biological importance that warrants future experimental confirmation.

table . enriched tfbs from attention intensive regions of deephbv with hbv integration sites + tcga pan cancer peaks.

homer known results homer de novo results rank name p-value rank best match/details p-value bmal e- tead e- npas . e- ebf e- clock . e- tcf e- c-myc . 
e- grhl e- zfx . e- dux e- tgif . e- ptf a e- mnt . e- tead e- lrf . e- ahr::arnt . e- tbx . e- sox . e- znf . e- tead . e- n-myc . e- zic . e- znf . e- nr e . e- usf . e- sox . e- bhlhe . e- zbtb . e- rbpj . e- usf . e- zac . e- isl . e- tgif . e- znf . e- zeb . e- ascl . e- thrb . e- znf . e- ptf a . e- lrf . e- bhlhe . e- znf . e- tead . e- pknox . e- stat . e- bcl b . e- meis . e- arnt . e- c-myc . e- osr . e- usf . e- tfap a . e- npas . e- hic . e- tead . e- tead . e- ar-halfsite . e- stat . e- tcf . e- mitf . e- tead . e- atf . e- hif- b . e- foxo . e- e a . e- tead . e- mef a . e- znf . e- nkx . . e- coup-tfii . e- myog . e- nkx . . e- snail . e- heb . e- tbx . e- scrt . e- nr a . e- nanog . e- oct . e- elk . e- erra . e- gata . e- bhlha . e- amyb . e- nr a . e- nfkb-p -rel . e- zic . e- trps . e- hoxa . e- hif a . e- isl . e- cebp:ap . e- ews:fli -fusion . e- foxk . e- ets . e-

discussion

in this study, we developed an explainable attention-based deep learning model, deephbv, to predict hbv integration sites.
in the comparison of deephbv and deephint for predicting hbv integration sites (s table), deephbv outperformed deephint after adding genomic features because its model structure and parameters are better suited to recognising the surroundings of hbv integration sites. deephbv uses two convolution layers (first layer: convolution kernels with a kernel size of ; second layer: convolution kernels with a kernel size of ) and one pooling layer (with a pooling size of ), whereas deephint has only one convolution layer ( convolution kernels with a kernel size of ) and one pooling layer (with a pool size of ). increasing the number of convolution layers allows information from higher dimensions to be extracted, and increasing the number of convolution kernels allows more feature information to be extracted [ ].

we trained the deephbv model using three strategies: (1) dna sequences near hbv integration sites (hbv integration sequences), (2) hbv integration sequences + tcga pan cancer peaks, (3) hbv integration sequences + repeat peaks. adding either tcga pan cancer or repeat peaks to the hbv integration sequences significantly improved model performance, and deephbv with hbv integration sequences + tcga pan cancer peaks performed better on the independent test dataset visdb. however, the attention intensive regions could not be well aligned to these genomic features. thus, we further inferred that other features, such as tfbs motifs, may influence deephbv in the prediction process, and homer was applied to recognise the tfbs that might be related to hbv-related diseases or cancer development. we noticed that the attention intensive regions identified by the attention mechanism of deephbv with hbv integration sequences + tcga pan cancer showed strong concentration on the binding sites of the tumour suppressor gene hic , the circadian clock-related elements bmal , clock, c-myc and npas , and the transcription factors tead and nr a . these dna binding proteins are closely related to tumour development [ - ]. for instance, hic is a tumour suppressor gene in hepatocarcinogenesis [ , ]. bmal , clock, c-myc and npas all participate in the regulation of the circadian clock [ ], which is reported to promote hbv-related diseases [ , ]. in accordance with this, the binding motifs of circadian clock-related elements were also enriched in the attention intensive regions of deephbv with hbv integration sequences + repeats, further confirming the results (s table). in addition, the other transcription factors identified by deephbv are tead and nr a . tead deregulation affects well-established cancer genes such as braf, kras, myc, nf and lkb , and shows high correlation with clinicopathological parameters in human malignancies [ ]. nr a (also known as liver receptor homolog- , lrh- ) binds to the enhancer ii (enii) of hbv genes and serves as a critical regulator of their expression [ ].

in summary, deephbv is a robust deep learning model that uses a convolutional neural network to predict hbv integration sites. our data provide new insight into the preference of hbv integration and into mechanistic research on hbv-induced cancer.

methods

data preparation

detailed step-by-step instructions for deephbv are provided in s and s notes. to obtain positive training and testing samples for deephbv, we extracted bp of dna sequence upstream and bp of dna sequence downstream of each hbv integration site as the positive dataset. each sample was denoted as $S = (n_1, n_2, \ldots, n_n)$, where $n_i$ represents the nucleotide at position $i$ and $n$ is the sequence length. as a deep learning network, deephbv also requires negative samples that do not contain hbv integration sites as background. the existence of hbv integration hot spots, which contain several integration events within a range of ~ kb [ ], indicated that the background regions should be selected at a sufficient distance from known hbv integration sites. thus, we discarded regions of kb around known hbv integration sites on the hg reference genome and randomly selected dna sequences of kb from the remaining regions as negative samples.

we encoded the extracted dna sequences using one-hot code to make the calculation of distances between features during training and the calculation of similarity more accurate. the original dna sequences were converted to binary matrices in which each dimension corresponds to one nucleotide type. finally, we converted a bp dna sequence into a × binary matrix.

feature extraction

the deephbv model first applies a convolution and pooling module to learn and obtain sequence features around hbv integration sites (s fig). each binary matrix representing a dna sequence enters the convolution and pooling module, where the convolution is calculated.
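the one-hot encoding described under data preparation can be illustrated with a minimal python sketch. the channel order a, c, g, t, the (length × 4) orientation of the matrix and the all-zero handling of ambiguous bases such as n are assumptions made for illustration; they are not taken from the deephbv source code.

```python
import numpy as np

# order of the four nucleotide channels; this ordering is an assumption
NUCLEOTIDES = "ACGT"

def one_hot_encode(sequence: str) -> np.ndarray:
    """Return a (len(sequence), 4) binary matrix; unknown bases (e.g. N) stay all-zero."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    matrix = np.zeros((len(sequence), len(NUCLEOTIDES)), dtype=np.int8)
    for position, base in enumerate(sequence.upper()):
        column = index.get(base)
        if column is not None:
            matrix[position, column] = 1
    return matrix

encoded = one_hot_encode("ACGTN")
print(encoded)
```

each row contains at most a single 1, so every pair of distinct nucleotides is equally far apart, which is the property the one-hot code is chosen for.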
we employed multiple different convolution kernels in order to obtain different features. let $S = (n_1, n_2, \ldots, n_n)$ denote a specific dna sequence and $E$ the binary matrix encoded from $S$. the convolution layer computes $X = conv(E)$, which can be described as:

$$X_{k,i} = \sum_{j=0}^{p-1} \sum_{l=1}^{L} W_{k,j,l}\, E_{l,i+j} \tag{1}$$

where $1 \le k \le d$ and $d$ is the number of kernels, $1 \le i \le n - p + 1$ and $i$ is the position index, $p$ is the kernel size, $n$ is the input sequence length, $l$ indexes the $L$ rows of $E$, and $W$ holds the kernel weights. after extracting the feature vectors, the convolutional layer activates them using the rectified linear unit (relu). relu is an activation function in artificial neural networks that can be described as $f(x) = \max(0, x)$. we applied relu to the output matrix of each convolution layer, which maps each element onto a sparse matrix. relu imitates real neuron activation, which allows the data to be fitted better by the model. we then applied a max-pooling strategy to reduce the dimension while retaining as much predictive information as possible. at this point, we have obtained the final feature vector $F_c$ from the binary matrix representing the dna sequence, after feature extraction in the convolution and pooling module.

attention mechanism in deephbv model

deephbv includes an attention mechanism in order to capture and understand the contribution of each position in the extracted feature vector $F_c$. the feature vector enters the attention layer, which calculates a weight value for each dimension in $F_c$.
the attention weight represents the contribution level of the convolutional neural network (cnn) at that position. the output of the attention layer, $t_j$, is the contribution score; a larger $t_j$ means a bigger contribution of that position to the prediction of hbv integration sites. all contribution scores were normalized to obtain the dense feature vector, denoted $F_a$:

$$F_a = \sum_{j=1}^{q} a_j v_j \tag{2}$$

where

$$a_j = \frac{\exp(t_j)}{\sum_{i=1}^{q} \exp(t_i)} \tag{3}$$

here $a_j$ represents the normalised score and $v_j$ represents the feature vector at position $j$ of the input feature matrix. each position represents a feature vector extracted by the convolution kernels. the convolution-pooling module and the attention mechanism module need to be combined during prediction; in other words, the feature vector $F_c$ and the attention-weighted feature vector $F_a$ work together in hbv integration site prediction. we concatenated the values in $F_c$ and linearly mapped them to a new vector $F_v$:

$$F_v = dense(flatten(F_c)) \tag{4}$$

in this step, the flatten layer performs the function $flatten()$ to reduce the dimension and concatenate the data; the function $dense()$ is executed by the dense layer, which maps the dimension-reduced data to a single value. the concatenation of $F_v$ and $F_a$ then enters the linear classifier, which calculates the probability that hbv integration occurred within the current sequence:

$$P = sigmoid(concat(F_a, F_v)) \tag{5}$$

where $P$ is the predicted score, $sigmoid()$ is the activation function that acts as the classifier in the final output, and $concat()$ is the concatenation operation.
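as a rough illustration of how the convolution, relu, max-pooling, attention and sigmoid steps described above fit together, the following numpy sketch runs one forward pass on a toy sequence. all sizes and weights are hypothetical random values, the linear scoring vector `u` used to obtain the scores t_j is an assumption (the text does not specify how t_j is computed), and `w_out` stands in for the linear classifier; this is not the keras implementation used by deephbv.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(E, W):
    """E: (L, n) one-hot matrix, W: (d, p, L) kernels -> (d, n - p + 1) after relu."""
    d, p, L = W.shape
    n = E.shape[1]
    X = np.zeros((d, n - p + 1))
    for i in range(n - p + 1):
        window = E[:, i:i + p]  # shape (L, p)
        # X[k, i] = sum_j sum_l W[k, j, l] * E[l, i + j]
        X[:, i] = np.einsum("kjl,lj->k", W, window)
    return np.maximum(0.0, X)

def max_pool(X, size):
    """Non-overlapping max-pooling along the position axis."""
    d, m = X.shape
    q = m // size
    return X[:, :q * size].reshape(d, q, size).max(axis=2)

def softmax(t):
    e = np.exp(t - t.max())  # subtract the maximum for numerical stability
    return e / e.sum()

def forward(E, params):
    F_c = max_pool(conv1d_relu(E, params["W"]), params["pool"])  # (d, q)
    t = params["u"] @ F_c            # one score t_j per position (assumed linear scoring)
    a = softmax(t)                   # normalised attention weights a_j
    F_a = F_c @ a                    # attention-weighted sum of the columns v_j
    F_v = params["w_dense"] @ F_c.ravel()   # dense(flatten(F_c)) -> single value
    z = np.concatenate([F_a, [F_v]])        # concat(F_a, F_v)
    P = 1.0 / (1.0 + np.exp(-(params["w_out"] @ z + params["b_out"])))
    return P, a

# hypothetical toy sizes: 4 nucleotide channels, 20 positions, 3 kernels of width 5
L, n, d, p, pool = 4, 20, 3, 5, 4
q = (n - p + 1) // pool
E = np.eye(L)[rng.integers(0, L, size=n)].T  # random one-hot sequence, shape (L, n)
params = {
    "W": rng.normal(size=(d, p, L)),
    "pool": pool,
    "u": rng.normal(size=d),
    "w_dense": rng.normal(size=d * q),
    "w_out": rng.normal(size=d + 1),
    "b_out": 0.0,
}
P, a = forward(E, params)
print(P, a)
```

the vector `a` returned alongside the prediction is what an attention analysis would inspect: it sums to one, and larger entries mark pooled positions that contribute more to the prediction.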
in the meantime, if the output feature vector $F_c$ from the convolution-and-pooling module is given as input to the attention mechanism, the weight vector $W$ can be obtained:

$$W = att(a_1, a_2, \ldots, a_q) \tag{6}$$

where $att()$ refers to the attention mechanism, $a_i$ denotes the feature vector in the $i$-th dimension of the feature matrix, and $W$ contains the contribution scores of each position in the feature matrix extracted by the convolution-and-pooling module.

deephbv model training

after confirming each parameter in deephbv (s table), we trained the deep learning neural network model deephbv with the binary cross-entropy loss. the loss function of deephbv is defined as:

$$Loss = -\sum_{i} \left[ y_i \log(P_i) + (1 - y_i) \log(1 - P_i) \right] \tag{7}$$

where $y_i$ is the binary label of sequence $i$ (in this dataset, positive samples were labelled as 1 and negative samples were labelled as 0) and $P_i$ is the predicted score $P$ for that sequence. the back propagation algorithm was adopted for training, and the nesterov-accelerated adaptive moment estimation (nadam) gradient descent algorithm was applied to optimise the parameters. the deep learning neural network model was implemented in python . with the keras library . . [ ], using three nvidia® tesla v -pcie- g (nvidia corporation, california, usa) for training and testing. deephbv takes around min and s for model training and testing, respectively, on this computational platform with these software and hardware settings.

data availability

deephbv is available as open-source software and can be downloaded from https://github.com/jiuxingliang/deephbv.git

references

liang tj. hepatitis b: the virus and disease. 
hepatology ; ( suppl):s - .
tu t, budzinska ma, shackel na et al. hbv dna integration: molecular mechanisms and clinical implications. viruses ; ( ).
sung wk, zheng h, li s et al. genome-wide survey of recurrent hbv integration in hepatocellular carcinoma. nat genet ; ( ): - .
zhao lh, liu x, yan hx et al. genomic and oncogenic preference of hbv integration in hepatocellular carcinoma. nat commun ; : .
ding d, lou x, hua d et al. recurrent targeted genes of hepatitis b virus in the liver cancer genomes identified by a next-generation sequencing-based approach. plos genet ; ( ):e .
tu t, budzinska ma, vondran fwr et al. hepatitis b virus dna integration occurs early in the viral life cycle in an in vitro infection model via sodium taurocholate cotransporting polypeptide-dependent uptake of enveloped virus particles. j virol ; ( ).
mason ws, gill us, litwin s et al. hbv dna integration and clonal hepatocyte expansion in chronic hepatitis b patients considered immune tolerant. gastroenterology ; ( ): - e .
litjens g, kooi t, bejnordi be et al. a survey on deep learning in medical image analysis. med image anal ; : - .
bailey tl, baker me, elkan cp. an artificial intelligence approach to motif discovery in protein sequences: application to steroid dehydrogenases. the journal of steroid biochemistry and molecular biology ; ( ): - .
yamashita r, nishio m, do rkg et al. convolutional neural networks: an overview and application in radiology. insights into imaging ; ( ): - .
bahdanau d, cho k, bengio y. neural machine translation by jointly learning to align and translate. 
computer science .
guidotti r, monreale a, ruggieri s et al. a survey of methods for explaining black box models. acm comput. surv. ; ( ):article .
hu z, zhu d, wang w et al. genome-wide profiling of hpv integration in cervical cancer identifies clustered genomic hot spots and a potential microhomology-mediated integration mechanism. nat genet ; ( ): - .
chollet fao. keras. .
hu h, xiao a, zhang s et al. deephint: understanding hiv- integration via deep learning with attention. bioinformatics ; ( ): - .
haeussler m, zweig as, tyner c et al. the ucsc genome browser database: update. nucleic acids res ; (d ):d -d .
inoue f, kircher m, martin b et al. a systematic comparison reveals substantial differences in chromosomal versus episomal encoding of enhancer activity. genome res ; ( ): - .
robinson jt, thorvaldsdottir h, winckler w et al. integrative genomics viewer. nature biotechnology ; ( ): - .
tang d, li b, xu t et al. visdb: a manually curated database of viral integration sites in the human genome. nucleic acids res .
zhang w, itoh k, tanida j et al. parallel distributed processing model with local space-invariant interconnections and its optical architecture. appl opt ; ( ): - .
bruna j, zaremba w, szlam a et al. spectral networks and locally connected networks on graphs. computer science .
heinz s, benner c, spann n et al. simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and b cell identities. molecular cell ; ( ): - .
seide f, gang l, dong y. 
conversational speech transcription using context-dependent deep neural networks. . . taniguchi k, roberts lr, aderca in et al. mutational spectrum of beta-catenin, axin , and axin in hepatocellular carcinomas and hepatoblastomas. oncogene ; ( ): - . . zheng j, xiong d, sun x et al. signification of hypermethylated in cancer (hic ) as tumor suppressor gene in tumor progression. cancer microenviron ; ( ): - . . paibomesai mi, moghadam hk, ferguson mm et al. clock genes and their genomic distributions in three species of salmonid fishes: associations with genes regulating sexual maturation and cell cycling. bmc res notes ; : . . fekry b, ribas-latre a, baumgartner c et al. incompatibility of the circadian protein bmal and hnf alpha in hepatocellular carcinoma. nat commun ; ( ): . . mukherji a, bailey sm, staels b et al. the circadian clock and liver function in health and disease. j hepatol ; ( ): - . . huh hd, kim dh, jeong hs et al. regulation of tead transcription factors in cancer biology. cells ; ( ). . cai yn, zhou q, kong yy et al. lrh- /hb f and hnf synergistically up-regulate hepatitis b virus gene transcription and dna replication. cell research ; ( ): - .
figure legends
figure .
the deep learning framework applied in deephbv. (a) scheme of encoding a kb dna sequence into a binary matrix using one-hot code; (b) a brief flowchart of the deephbv structure; the matrix shape is given in brackets, and a detailed flowchart is in s fig.
figure . evaluation of deephbv and deephint model prediction performance on the test dataset. (a) receiver-operating characteristic (roc) curves and (b) precision recall (pr) curves, respectively. “deephbv with hbv integration sequences” refers to the deephbv model with only hbv integration sequences as input; “deephint with hbv integration sequences” refers to the deephint model with only hbv integration sequences as input; “deephbv with hbv integration sequences + repeat” refers to deephbv with hbv integration sequences and repeat sequences as input; “deephbv with hbv integration sequences + tcga pan cancer” refers to deephbv with hbv integration sequences and tcga pan cancer sequences as input; “deephbv with hbv integration sequences + repeat + (test) visdb” refers to deephbv using hbv integration sequences and repeat sequences for training and using visdb as independent test dataset; “deephbv with hbv integration sequences + tcga pan cancer + (test) visdb” refers to deephbv using hbv integration sequences and tcga pan cancer sequences for training and using visdb as independent test dataset.
figure . the attention weight distribution analysed by deephbv with hbv integration sequences + genomic features. (a) deephbv with hbv integration sequences + tcga pan cancer peaks; (b) deephbv with hbv integration sequences + repeat peaks.
the left graph shows the fractions of attention weight, which were averaged among all samples and normalized to the average of all positions; each index represents a bp region due to the multiple convolution and pooling operations. the graphs on the right are representative samples of attention weight distribution of positive samples and negative samples.
figure . attention intensive regions highlighted essential local genomic features for predicting hbv integration sites. representative examples showed the positional relationship between the attention intensive sites and several genomic features using the deephbv with hbv integration sequences + tcga pan cancer model on (a) chr : , , - , , (hg ), (b) chr : - (hg ). each of these two sequences contains hbv integration sites from both dsvis and visdb. enriched dna binding proteins were detected by homer from the attention intensive regions using the output of deephbv; we then applied fimo [ ] to find the enriched motif positions and label the motifs on attention intensive regions. ucsc genome browser [ ] and matplotlib [ ] were used for visualisation. “hpv integration site” refers to the sites selected from our unpublished database used as testing samples. “attention intensive sites” denotes the sites with top % attention weight. “repeatmasker”, “tcga pan cancer”, “dnase clusters”, “con mammals”, “genehancer”, “layered h k ac”, “layered h k me ” are genomic features.
references
. grant ce, bailey tl, noble ws. fimo: scanning for occurrences of a given motif. bioinformatics ; ( ): - . . haeussler m, zweig as, tyner c et al.
the ucsc genome browser database: update. nucleic acids res ; (d ):d -d . . hunter jd. matplotlib: a d graphics environment. computing in science & engineering ; ( ): - .
supporting information
s fig. deephbv framework. each part represents a layer in the neural network and 𝑛 × 𝑛 stands for the output dimension, which is explained in s note. two consecutive convolution layers are used to extract features; max-pooling layers reduce the dimension while preserving the predictive information in the feature matrix; the dropout layer randomly drops some results to prevent over-fitting; the flatten layer reduces the dimensions and connects them; the dense layer maps the output from the last layer to a specific value; the attention layer and attention flatten layer give a weight score to each dimension in the feature matrix; the concatenate layer concatenates the captured features and the importance scores of those features from the convolution module and the attention mechanism model. the prediction output is the final output and gives the probability of hbv infection.
s fig. prediction performance on the hbv integration dataset with different types of genomic features added in.
we found that character and character outperformed the deephbv model with a significant increase in aupr and auroc score on character and character , indicating that deephbv can capture genomic features from character and character effectively. we therefore did further analysis on each single item in character group and , and found that repeats and tcga pan cancer are the genomic features that can be captured by deephbv and that significantly improved model performance. deephbv with hbv integration sequences + repeats reached the auroc of . and the aupr of . , while deephbv with hbv integration sequences + tcga pan cancer reached the auroc of . and the aupr of . .
s table. the parameters for the deep neural network used in deephbv.
s table. genomic features and sources. (access date: november th, )
s table. comparison of deephbv and deephint result record.
s table. enriched tfbs from attention intensive regions of deephbv with hbv integration sites + repeat peaks.
s note. deephbv framework. deephbv neural network structure design and hyperparameters involved in deephbv are noted.
s note. mathematical matters of the deephbv. this part explains the mathematical matters (i.e. encoding dna sequences, convolution layers, the max pooling layer, dropout layer, attention layer, concatenate layer, linear classifier and optimisation algorithm) of deephbv.
a validated generally applicable approach using the systematic assessment of disease modules by gwas reveals a multi-omic module strongly associated with risk factors in multiple sclerosis
tejaswi v.s. badam , †, hendrik a. de weerd , †, david martínez-enguita , tomas olsson , lars alfredsson , , ingrid kockum , maja jagodic , zelmina lubovac-pilav *, mika gustafsson *
school of bioscience, systems biology research center, university of skövde, sweden
bioinformatics, department of physics, chemistry and biology, linköping university, linköping, sweden
department of clinical neuroscience, karolinska institutet, center for molecular medicine, karolinska university hospital, se- , stockholm, sweden
institute of environmental medicine, karolinska institutet, center for molecular medicine, karolinska university hospital, se- , stockholm, sweden
†these authors contributed equally to the work. *these authors share senior authorship.
corresponding author: mika gustafsson (mika.gustafsson@liu.se)
running title: multi-omic modules in multiple sclerosis
keywords: benchmark, multi-omics, network modules, multiple sclerosis, risk factors
summary: our benchmark of multi-omic modules and validated translational systems medicine workflow for dissecting complex diseases resulted in a multi-omic module of genes highly enriched for risk factors associated with multiple sclerosis.
biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission.
abstract
background: there are few (if any) practical guidelines for predictive and falsifiable multi-omics data integration that systematically integrate existing knowledge. disease modules are popular concepts for interpreting genome-wide studies in medicine but have so far not been systematically evaluated and may lead to corroborating multi-omic modules.
methods: we assessed eight module identification methods in previously published expression and methylation studies of diseases using gwas enrichment analysis. next, we applied the same strategy for multi-omics integration of datasets of multiple sclerosis (ms), and further validated the resulting module using both gwas and risk-factor associated genes from several independent cohorts.
results: our benchmark of modules showed that in immune-associated diseases modules inferred from clique-based methods were the most enriched for gwas-genes. the multi-omics case study using ms revealed the robust identification of a module of genes. strikingly, most genes of the module were differentially methylated upon the action of one or several environmental risk factors in ms (n = , p = - ) and were also independently validated for association with five different risk factors of ms, which further stressed the high genetic and epigenetic relevance of the module for ms.
conclusion: we believe our analysis provides a workflow for selecting modules and our benchmark study may help further improvement of disease module methods. moreover, we also stress that our methodology is generally applicable for combining and assessing the performance of multi-omics approaches for complex diseases.
introduction complex diseases are the result of disruptions of many interconnected multimolecular pathways, reflected in multiple omics layers of regulation of cellular function, rather than perturbations of a single gene or protein[ ]. systems and network medicine aim to translate observed omics differences in patients using networks, in order to personalize medicine[ ]. importantly, genes that are associated with diseases are more likely to interact with each other rather than with non-disease associated genes, forming multi-omics network disease modules[ , ]. owing to the incompleteness of the underlying multi-omics interactions, the networks are often modeled as effective gene-gene interactions, using for example string database[ ]. thus, network modules might be ideal tools for multi-omics analysis. however, the evaluation of performance of different module inference methods remains a poorly understood topic, which creates the need for transparent evaluation of these methods based on objective benchmarks across various diseases and omics. genomic concordance has been suggested as a multi-omics validation principle[ , ], i.e., modules derived from one omic, such as gene expression or dna methylation should be enriched for disease- associated single nucleotide polymorphisms (snps). the variety of algorithms that have been proposed and applied for identification of disease modules can be categorized into two main groups. on the one hand, there are methods which rely purely on clustering of the genes in relevant disease networks[ ]. on the other hand, there are algorithms which make use of disease-associated molecules or genetic loci to reveal disease modules that correlate with disease function, such as the disease module detection (diamond) algorithm[ ], clique-based methods[ ],[ ] and weighted gene co-expression network analysis (wgcna)[ ]. 
the data-derived information can either be differentially expressed genes or differentially correlated or co-expressed genes. methods following the former approach were recently benchmarked by a metric utilizing genomic concordance within the dream consortia[ ]. however, so far, algorithms from the latter group have not been benchmarked.
in this study we analyzed, assessed, and compared the performance of eight of the most popular methods for disease module analysis using the r package modifier[ ] on different diseases including expression and ten methylation datasets. we assessed the performance of the methods using genome-wide association (gwas) enrichment analysis from the summary statistics of all assayed snps similarly as in dream[ ]. the resulting workflow provided a systematic procedure for selecting the best method for each disease and set the stage for method development in the disease module area. moreover, it allowed the predictive assessment of combining multiple datasets across several omics using gwas, which we tested in multiple sclerosis (ms), a heterogeneous complex disease. briefly, we derived multi-omic modules in a stepwise optimization of gwas enrichment from transcriptomic and methylomic analyses of ms. we further evaluated the identified multi-omic ms module of genes for its enrichment across dna methylation studies of eight known lifestyle-associated risk factors of ms. additionally, we validated the identified significant enrichment risk factors in an independent dna methylation ms study which indeed showed a very strong and significant ms enrichment for both module genes and risk factor associations.
in summary, we provide a robust multi-omics strategy that can be used to disentangle networks of affected genes in complex diseases from both genetic and environmental levels.
materials and methods
benchmark data
a total of publicly available datasets for the transcriptomic benchmark and ten publicly available datasets for the methylomic benchmark were used. to avoid bias due to subtypes of diseases and drug treatments, we searched for datasets that have only patient and control samples, and that are available for download from the geo database. we categorized the datasets into seven distinct disease types based on the disease-trait type associations used in choobdar et al.[ ], i.e. autoimmune, cardiovascular, glycemic, inflammatory, neurodegenerative, and psychiatric and social disorders. a total of complex diseases were used in the transcriptomic benchmark analysis, while six complex diseases were used in the methylation benchmark analysis. the methylation benchmark diseases belong to inflammatory, autoimmune, and glycemic disease types.
ms use case data
a total of publicly available and one non-publicly available transcriptomic and methylomic ms-related datasets were used in the ms multi-omics integration use case. in general, every dataset in the modifier benchmark was also used in the ms use case, with exceptions according to certain criteria. the inclusion of transcriptomic ms datasets followed the criteria: ) the largest dataset by sample number, per tissue, is shown in the modifier benchmark; ) replication cohorts are not included in the ms use case.
criteria for inclusion of methylomic ms datasets were the following: ) the largest dataset by sample number, per tissue or cell type, is included in the modifier benchmark; ) a single dataset for every cell-specific tissue was included in the benchmark; ) methylation studies that reported using whole blood as sample tissue were excluded from the ms use case, due to the high heterogeneity of this type of data. for the additional independent validation, we utilized the methylation microarray analysis of blood samples from kular et al. for each of these ms patients (nms= ) and healthy controls (nhc= ), we also collected their lifestyle-associated risk factors from questionnaires that were part of the epidemiological investigation of multiple sclerosis (eims) study. those factors were smoking status, prior ebv infection, sunbathing, nightshift work, alcohol consumption, as well as phenotypic features (age, sex, bmi at age of ).
pre-processing and quality control of risk factor methylation data
dna methylation datasets were downloaded from geo as raw idat files, when available, or matrices of beta values. pre-processing of the data was performed using the chip analysis methylation pipeline (champ) r package[ ], version . . . default parameters were used for probe and sample filtering. probes with a detection p-value above . , probes with a fraction of failed (bead count less than ) samples over . , non-cpg probes, snp-related probes, multi-hit probes, and probes located on chromosomes x and y were removed. samples with a proportion of failed (na) probe p-values over . were also removed from the analysis.
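The probe filters listed above can be sketched as a single predicate. This is only an illustration in Python, not the ChAMP implementation: the record keys (`detection_p`, `failed_fraction`, `is_cpg`, `snp_related`, `multi_hit`, `chromosome`) and the function name `keep_probe` are hypothetical, and because the thresholds are not recoverable from the text they are passed in by the caller; the values in the toy call are made up.

```python
def keep_probe(probe, max_detection_p, max_failed_fraction):
    """Return True if a probe passes every filter named in the text.

    `probe` is a dict with hypothetical keys; both thresholds are supplied
    by the caller because the source values are not available here.
    """
    return (
        probe["detection_p"] <= max_detection_p
        and probe["failed_fraction"] <= max_failed_fraction
        and probe["is_cpg"]
        and not probe["snp_related"]
        and not probe["multi_hit"]
        and probe["chromosome"] not in {"X", "Y"}
    )


# Toy records with made-up values (not real array data).
toy_probes = [
    {"id": "p1", "detection_p": 0.001, "failed_fraction": 0.0, "is_cpg": True,
     "snp_related": False, "multi_hit": False, "chromosome": "1"},
    {"id": "p2", "detection_p": 0.001, "failed_fraction": 0.0, "is_cpg": True,
     "snp_related": False, "multi_hit": False, "chromosome": "X"},
    {"id": "p3", "detection_p": 0.5, "failed_fraction": 0.0, "is_cpg": True,
     "snp_related": False, "multi_hit": False, "chromosome": "2"},
]

# Illustrative thresholds only.
kept_ids = [
    p["id"]
    for p in toy_probes
    if keep_probe(p, max_detection_p=0.01, max_failed_fraction=0.05)
]
```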
post-filtering imputation of na values was conducted on the beta matrices, with default parameters (“combine” method, k = , probe cutoff = . , sample cutoff = . ). filtered imputed matrices were normalized applying the beta-mixture quantile dilation (bmiq) normalization method[ ], including correction of type-i and type-ii probe effects. data quality was assessed by producing multi-dimensional scaling (mds) plots of the top , most variable positions per sample, density plots for the distribution of beta values, and hierarchical clustering of samples, before and after normalization. singular value decomposition (svd) was used to detect the most significant components of variation in the data. unwanted sources of variation in the normalized data were corrected using combat batch effect correction[ ].
module identification
the modifier r package offers nine different methods for producing disease modules, of which we included all but clique sum exact, as it is highly similar to clique sum. the included methods produce modules based on the provided omics input and background network and do not include prioritization of pathway association. the modifier methods used for module identification throughout this study are listed in the supplementary table . for the methods that require a network, we used the human ppi network from string database version , consisting of , , interactions among , unique genes/proteins. we filtered the network to retain high confidence interactions by using the cutoff > to reduce the number of false positives, resulting in a subset of , interactions between , unique genes/proteins.
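The confidence filter described above amounts to keeping edges whose score exceeds a cutoff. A minimal sketch in Python, assuming the network is available as `(gene_a, gene_b, combined_score)` tuples; the function name `filter_network` is hypothetical, and since the cutoff value is not recoverable from the text it is a required argument, with a made-up value in the toy call.

```python
def filter_network(edges, cutoff):
    """Keep only interactions whose combined score exceeds `cutoff`.

    edges: iterable of (gene_a, gene_b, combined_score) tuples.
    Returns the retained edges and the set of genes they connect.
    """
    kept = [(a, b, s) for a, b, s in edges if s > cutoff]
    genes = {g for a, b, _ in kept for g in (a, b)}
    return kept, genes


# Toy example with made-up scores (not STRING data).
toy_edges = [("A", "B", 950), ("B", "C", 400), ("C", "D", 720)]
kept, genes = filter_network(toy_edges, cutoff=700)
```

Returning the gene set alongside the edges makes it easy to report both counts, as the text does for the filtered network.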
for co-expression methods, the network is computed within the method algorithm from the gene expression matrix. in the case of the benchmark analysis, we used a stringent cutoff of score > , so that the runs were not computationally intensive. for the ms use case benchmark, we used the network combined score cutoff > . the processed matrix for each dataset and their respective phenotypic information were downloaded from geo. the input object is prepared using the create_input_microarray function from the modifier package, which is then used for creating the modules. the input function applies a linear model using limma for the comparison of patients vs. controls to obtain the differentially methylated or expressed genes. a dynamic cutoff of % in the differentially methylated or expressed genes is applied to select input seed genes for the methods that require seed genes.
differential methylation analysis of risk factor data
differentially methylated probes (dmps) were found by fitting a linear model to the data using the limma r package[ ], version . . implemented in the champ function champ.dmp. p-values were adjusted for multiple testing using benjamini-hochberg false discovery rate (fdr) correction. differentially methylated genes (dmgs) were obtained and annotated using the org.hs.eg.db r package, version . . . dmg lists were cross-checked against the string database version ppi network used for module identification in the ms multi-omics approach (high confidence interactions, combined score > ). dmgs that were not present in the ppi network were removed. in the case of the additional ms validation dataset, a linear mixed effect model with risk factors (age, sex, bmi at age of , smoking, alcohol consumption, sun exposure, night shift work, contact with organic solvents) as categorical covariates was implemented to find the differentially methylated genes after the preprocessing step, as described in the preprocessing section of the methods.
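The benjamini-hochberg adjustment mentioned above can be illustrated with a short, self-contained Python sketch. It is not the limma or champ code path; the function name `benjamini_hochberg` is chosen here for illustration, and the input p-values are toy numbers.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # stay monotone in the rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy p-values, for illustration only.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A probe would then be called significant when its adjusted value falls below the chosen fdr threshold.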
since all the patients were ebv positive, we did not include ebv status in the linear mixed effect model.
validation of modules
the final modules produced from each single algorithm and the consensus were evaluated using pascal[ ] (pathway scoring algorithm). pascal implements a fast and rigorous gene scoring and pathway enrichment pipeline that can be run on a local machine. the snp values are converted to gene scores by computing pairwise snp-by-snp correlations and obtaining z-scores from their distribution. these gene scores are fused with the pathway enrichment analysis to recompute a chi-square p-value for the given set of module genes. thus, the obtained chi-square p-value serves as the significance of the module in its enrichment of the disease-associated pathway gene loci. a combined p-value was computed using fisher’s method[ ] for each of the methods, diseases, and datasets, for ranking the performance of the modules in each criterion.
integration of ms single-omic modules
clique sum was ranked as the best performing method on average for both transcriptomic and methylomic data, according to the ms gwas enrichment of the modules calculated by pascal. therefore, significant clique sum modules (p < . ) were selected for further analysis (nine transcriptomic and four methylomic modules). consensus modules were generated across each omic by applying a module count-based method, where the criterion for gene inclusion in the consensus is its presence in a certain number of single-method modules. to balance the weight of each omic in the multi-omics integration, the top four significant modules per omic were used to create each consensus (fig. a, b).
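The module count-based consensus described above can be sketched in a few lines of Python. This is an illustration under the stated rule only (a gene enters the consensus if it occurs in at least a given number of modules); the function name `consensus_module`, the `min_count` parameter and the toy gene labels are hypothetical and not taken from modifier.

```python
from collections import Counter


def consensus_module(modules, min_count):
    """Genes present in at least `min_count` of the given modules.

    modules: iterable of gene collections, one per module.
    """
    # set(module) guards against a gene being counted twice in one module.
    counts = Counter(gene for module in modules for gene in set(module))
    return {gene for gene, c in counts.items() if c >= min_count}


# Toy modules with made-up gene labels.
toy_modules = [{"A", "B", "C"}, {"B", "C", "D"}, {"C", "E"}]
consensus_two = consensus_module(toy_modules, min_count=2)
consensus_three = consensus_module(toy_modules, min_count=3)
```

Raising `min_count` trades module size for agreement between modules, which is why the resulting consensus modules are ranked again by enrichment.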
single-omic clique sum consensus modules were ranked again by gwas enrichment, and the best performing consensus per omic was selected for integration into the multi-omics module.
enrichment analyses of the ms multi-omics module
disease enrichment analysis of the multi-omics module was performed by fisher’s exact test, with a significance threshold of p < . . ms-associated genes were obtained from the gene-disease association summary provided by disgenet database . [ ]. all genes with a known association to the disease “multiple sclerosis” (unified medical language system unique identifier c ) were considered ms-associated genes (n = , ). pathway enrichment analysis was carried out using the function enrichkegg from the clusterprofiler r package[ ], version . . . p-values were adjusted for multiple testing using benjamini-hochberg fdr correction, with a significance threshold of adj. p < . . enrichment of the multi-omics module in ms risk-factor-associated genes was performed by fisher’s exact test, with a significance threshold of p < . . to provide a uniform comparison of ms risk factor-associated genes across datasets, the module was tested for enrichment in the top , dmgs (with at least p < . ) obtained from the differential methylation analysis with champ for each risk factor dataset.
representation of the ms multi-omics module
experimentally validated interactions for the multi-omics module genes were obtained from string database version (experimental score > ) and imported into cytoscape[ ] version . . . to determine representative functional clusters of module genes, overrepresented gene ontology (go) biological process (bp) terms in the module were found using bingo[ ] version . .
, with benjamini-hochberg fdr for multiple testing correction, and a significance threshold of adj. p < . . then, enriched go terms with adj. p < x - were summarized using the revigo[ ] server tool (medium allowed similarity = . ) and categories of interest were selected by uniqueness (>= %), dispensability (>= %), and frequency (<= %) criteria. further manual assessment was performed to group similar terms with an adequate number of genes in the network.
results
a benchmark comparing transcriptionally derived disease modules from different diseases.
we compiled a benchmark source of disease modules and summary statistics of gwas datasets from well-powered case-control studies (supplementary table ), some of which were previously used in the dream topological disease module challenge[ ]. for these datasets we assessed modules using the same metric as in the recent dream study[ ], based on the pathway scoring algorithm (pascal)[ ]. for each disease we compiled one to five publicly available transcriptomic datasets considering both easily accessible tissues (e.g. blood) and target tissues, thereby covering transcriptomic datasets in total (fig. a). modules were created using eight different methods from modifier[ ]. in addition, we also tested if genes detected by several methods, hereafter called consensus module genes, had higher enrichment scores than single-method module genes. enrichment scores for the non-empty modules (n = ) from this analysis were summarized for each method and dataset (fig. a). in total, we found significantly gwas-enriched modules in . % ( / ) of the single-method modules and . % ( / ) of the non-empty consensus modules that combined at least three methods as a criterion.
these numbers seemed higher than expected, which might have been a consequence of the same gwas being used to evaluate multiple transcriptomic datasets of the same disease. hence, we aggregated scores of the same disease and method as meta p-values (see methods). out of the possible disease-method combinations, % of the pairs showed a significant gwas pascal enrichment, which is more than expected by chance (n = , p = . x - ). the most enriched method was clique sum, which showed significant enrichment in seven out of diseases (binomial test p = . x - ). many methods exhibited strong enrichments in coronary artery disease (cad), type diabetes, multiple sclerosis (ms), rheumatoid arthritis (ra), and the inflammatory bowel diseases (ibd) ulcerative colitis (uc) and crohn’s disease (cd), while no significant enrichments were found for asthma, hepatitis c, type diabetes, narcolepsy, parkinson’s disease, or for any psychiatric and social diseases. if we instead ranked methods based on their respective module gwas enrichment, clique sum showed significant association in % ( / ) of the modules, corresponding to seven different diseases, followed by consensus modules identified by two out of three methods. lastly, diamond and co-expression-based methods all achieved significant results, although worse than clique sum. next, we tested the impact of network centrality and module size as potential confounding factors of the applied performance metric. we found a significant but very modest correlation for module size (fig. c, spearman rho = . , p = . x - ), and a non-significant correlation for interactome centrality (fig. b, rho = . , p = . ).
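the confounder check above relies on the spearman rank correlation between module score and a module property such as size or centrality. as a minimal sketch of that calculation (not the authors' code; the function names and toy values are hypothetical, and a statistics library would normally also supply the p-value), the coefficient can be computed as the pearson correlation of average ranks:

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: module sizes and -log10 enrichment p-values.
module_sizes = [25, 60, 140, 310]
enrichment_scores = [0.4, 1.1, 1.3, 2.9]
rho = spearman_rho(module_sizes, enrichment_scores)
```

because only ranks enter the calculation, the coefficient is insensitive to the scale of module size or of the enrichment score, which is convenient when both span several orders of magnitude.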
thus, it is meaningful to compare results between modules that differ in these properties. in summary, we found that the clique sum method resulted in the highest disease enrichment for most diseases, while not producing significant modules for others, such as type diabetes, where co-expression-based methods and diamond scored best. in general, we observed stronger enrichments for inflammatory diseases and weaker results for psychiatric and social diseases. considering that the transcriptomic modules showed that clique sum was the best-performing method and that the cardiovascular and inflammatory diseases were the most enriched within the clique sum modules, we wanted to test whether this was true for methylomic data as well.

a benchmark comparing methylation-based disease modules from six different diseases using gwas.

following the same logic as in the transcriptomic benchmark, we performed a similar benchmark study for methylation modules. we collected ten datasets from three different disease categories, including six complex diseases, and ran the eight modifier methods on them (fig. a). in addition, we constructed consensus modules for each of the datasets. modules were then tested for gwas enrichment using pascal. inspecting the overall performance, we found nine single-method modules with a significant gwas enrichment ( / , . %). though this might be due to disease and cell type heterogeneity, the enrichment is more than expected by chance (p= . x - ). interestingly, the inflammatory diseases such as ms and uc showed a more significant gwas enrichment. considering that the evaluation of module performance by gwas enrichment may be biased due to differences in module sizes and interactome centrality, we again assessed the correlation between these values. we found a significant correlation between gwas enrichment and module size (fig. c, rho = . , p = . ) and a non-significant correlation between gwas enrichment and interactome centrality (fig. b, rho = . , p = . ).
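both benchmarks aggregate module p-values into meta p-values per disease and method, and the figure legends name fisher's method for this step. the sketch below illustrates that combination rule; it assumes independent p-values, which is an assumption made for this illustration rather than a statement from the text, and the function name and example values are hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis (k = number of tests).
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    # For even degrees of freedom 2k the chi-square survival function is
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!, so no external library is needed.
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


# Hypothetical p-values of one method across three datasets of one disease.
meta_p = fisher_combined_p([0.04, 0.20, 0.01])
```

when the same gwas is used to score several datasets of one disease, the inputs are not strictly independent, so a combined value of this kind is best read as a ranking score rather than an exact probability.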
we found that . % of the disease-method combinations yielded significant gwas enrichment, which is more than expected from an independent random selection of modules (fisher’s exact test p = . , n = ). the highly enriched disease modules belong to ms, uc and cd. two out of the six diseases showed significant gwas enrichment by using the clique sum modules (p = . ). in summary, the clique sum method also resulted in a more significant gwas enrichment for most diseases in the methylomic benchmark.

multi-omics approach revealed a module enriched for ms-associated genes.

considering genomic concordance as the guiding principle for the modules that show enrichment for gwas snps, differentially methylated genes and differentially expressed genes, we further wanted to evaluate multiple datasets of one specific disease, i.e., ms. we compiled ms transcriptomic datasets and nine methylation comparisons (supplementary table ) from geo that satisfy the pre-defined dataset criteria (see methods). for each dataset we implemented the pipeline for module identification and scoring shown in fig. b. we evaluated each module using ms snp enrichment analysis and selected the most enriched modules per omic from this metric. this analysis again showed that clique sum yielded by far the highest average enrichment score (meta p = . x - ) and was significantly enriched (p < . ) in / transcriptomic datasets (fig. a) and / of the methylation datasets (fig. b). from the significant modules generated by clique sum, we chose the top four modules from each of the gene transcription and methylation sets, and prioritized genes detected in modules from multiple datasets in each omic.
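the prioritization described above (keep genes that recur across the top modules of one omic, then intersect the two omics) can be sketched with plain set operations. this is an illustration only: the function names, the support thresholds passed in, and the toy gene labels are hypothetical and do not come from the modifier package.

```python
from collections import Counter


def consensus_genes(modules, min_support):
    """Genes that occur in at least `min_support` of the given modules."""
    # set(module) guards against a gene being counted twice within one module.
    counts = Counter(gene for module in modules for gene in set(module))
    return {gene for gene, n in counts.items() if n >= min_support}


def multi_omics_module(transcriptomic_modules, methylomic_modules,
                       min_transcriptomic, min_methylomic):
    """Intersection of the per-omic consensus gene sets."""
    return (consensus_genes(transcriptomic_modules, min_transcriptomic)
            & consensus_genes(methylomic_modules, min_methylomic))


# Toy example with made-up gene labels: four modules per omic.
transcriptomic = [{"A", "B", "C"}, {"A", "B"}, {"A", "C", "D"}, {"A", "B", "D"}]
methylomic = [{"A", "X"}, {"B", "X"}, {"A", "Y"}, {"Z"}]
module = multi_omics_module(transcriptomic, methylomic,
                            min_transcriptomic=3, min_methylomic=2)
```

raising a support threshold shrinks the consensus toward genes that replicate across datasets, which trades module size against robustness.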
this analysis showed that the strongest ms snp enrichment was found for genes in at least three out of four transcriptomic modules (n= , ; p= . x - ) and two out of four methylomic modules (n= , p= . x - ). next, we used the same principle to combine these two and found that the intersection between the gene transcription and methylation consensus resulted in a module (n = genes, fig. ) enriched for ms-associated genes ( / , p < . x - , or = . ) and with the highest gwas enrichment (p = . x - ), which we hereafter refer to as the multi-omics ms module.

the multi-omics ms module was enriched in genes associated with major ms pathways.

as we used gwas enrichment as a selection criterion, the high gwas enrichment of the final module was partly expected, which led us to analyze its biological functions and their potential epigenetic associations with ms. first, pathway enrichment analysis showed that the multi-omics module genes are significantly involved in several inter-linked immune-related pathways, most of which have been previously associated with ms, including the t cell receptor[ ] (adjusted p = . x - ), pi k/akt[ ] (p = . x - ), erbb[ ] (p = . x - ), fc epsilon ri[ ] (p = . x - ), chemokine[ , ] (p = . x - ), mapk[ , ] (p = . x - ), and b cell receptor[ ] (p = . x - ) signaling pathways; th (p = . x - ), and th and th (p = . x - ) cell differentiation[ ]; natural killer cell mediated cytotoxicity (p = . x - ); and leukocyte transendothelial migration (p = . x - ), which indeed supports their relevance in ms. interestingly, the module was also highly enriched in morphogenetic and neurogenetic signaling pathways, such as the neurotrophin (adjusted p = . x - ), ras (p = . x - ), rap (p = . 
x - ), vascular endothelial growth factor (vegf, p = . x - ), foxo (p = . x - ), and mtor (p = . x - ) signaling pathways; and in growth hormone synthesis, secretion and action (p = . x - ).

the multi-omics ms module was enriched in genes associated with five known environmental ms risk factors validated in an independent cohort.

second, from a literature study[ , ] we found nine environmental ms risk factors of varying evidence for which we could identify methylation studies in healthy controls. for each of these risk factors we derived the top differentially methylated genes (dmgs) and tested their enrichment with the module. intriguingly, the module was significantly enriched for genes associated with five risk factors (fig. b), which included the top associated risk factors, i.e., epstein-barr virus (ebv) infection (fisher’s exact test p = . x - , or = . ) and smoking (p = . x - , or = . ), as well as low sun exposure (p = . x - , or = . ), high bmi (p = . , or = . ) and alcohol consumption (p = . x - , or = . ). then, we asked whether these putative gene-risk factor associations could be validated using an independent omics dataset with paired risk factor associations. for this purpose, we utilized methylation arrays of peripheral blood from ms patients and controls, which have been described previously[ ]. in this analysis we also considered risk factor associations for each individual, including age, sex, bmi at age of , smoking, alcohol consumption, sun exposure, night shift work, and contact with organic solvents. this enabled analysis of dmgs for the ms and risk factor status as covariates in linear mixed effect analysis.
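the risk factor enrichments above are fisher's exact tests on the overlap between the module and a set of differentially methylated genes, reported with an odds ratio. a minimal sketch of such a test follows. it computes a one-sided (over-representation) p-value from the hypergeometric distribution; whether the original analysis was one- or two-sided is not stated here, so that choice, the background definition, and all names below are assumptions for illustration.

```python
from math import comb


def fisher_enrichment(module, gene_set, background):
    """One-sided Fisher's exact test for over-representation of `gene_set` in `module`.

    Returns (odds_ratio, p_value). Genes outside `background` are ignored.
    """
    background = set(background)
    module = set(module) & background
    gene_set = set(gene_set) & background

    # 2x2 contingency table: in/out of module versus in/out of gene set.
    a = len(module & gene_set)            # in module, in gene set
    b = len(module) - a                   # in module, not in gene set
    c = len(gene_set) - a                 # not in module, in gene set
    d = len(background) - a - b - c       # in neither

    if b * c > 0:
        odds_ratio = (a * d) / (b * c)
    else:
        # Sample odds ratio is unbounded (or undefined when a * d is also 0).
        odds_ratio = float("inf") if a * d > 0 else float("nan")

    # P(overlap >= a) under the hypergeometric null.
    n_total, n_module, n_set = len(background), len(module), len(gene_set)
    tail = sum(comb(n_set, k) * comb(n_total - n_set, n_module - k)
               for k in range(a, min(n_module, n_set) + 1))
    return odds_ratio, tail / comb(n_total, n_module)


# Toy example with made-up gene labels.
background = {f"g{i}" for i in range(10)}
odds, p = fisher_enrichment({"g0", "g1", "g2"}, {"g0", "g1", "g3"}, background)
```

the choice of background matters: restricting it to genes that could have entered the module (for example, genes present in the interaction network) usually gives a more conservative result than using all annotated genes.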
indeed, the module genes were highly significantly enriched for ms (n = ; permutation test p = . x - ), but also for all the tested risk factors (ebv was not included; see methods), and were non-significantly associated with age and sex, having - of the genes in each factor ( . x - < p < . ; fig. b). combining all these results, we found of the module genes to be associated with a risk factor in both risk factor studies; genes were associated with two risk factors, and seven genes were associated with three risk factors (csk, prkca, prkcz, runx , runx , stat a, and synj ) (fig. c). these associations suggest that the multi-omics module is capturing a key disease network with both genetically and epigenetically driven alterations, thereby providing the possibility to use it to identify potential novel biomarkers or therapeutic targets for ms.

discussion

the analysis of case-control data in the context of networks has gained increased interest as a way to detect consistent, robust gene signatures of individual diseases. the application of disease modules might vary for different researchers, but here we systematically aimed at the detection of disease genes supported by genetic association. for this purpose, our study of the transcriptome and methylome profiles of diseases showed significant gwas enrichments for several inflammatory and heart diseases, while psychiatric disorders showed no enrichments and might not be suitable for gwas validation of modules, potentially due to differences in affected tissue types and sampling points.
however, analysis of the significant results showed that methods based on differentially expressed cliques in the protein-protein interaction network demonstrated the strongest enrichments (highest scoring for clique sum), while those based primarily on correlations, like wgcna, showed weak enrichments. a potential reason for this could be that gwas has been shown to be mostly associated with the central genes of the protein-protein interaction (ppi) network, but our analysis demonstrated that the correlation between gwas enrichment and centrality was non-significant. we also tested whether there was an improvement using consensus approaches that counted the frequency of the result of multiple methods, but found this not to increase performance. moreover, we tested the same strategy on a set of inflammatory, glycemic, and autoimmune methylation datasets and found similar results. we would like to emphasize that, rather than scoring a single best working method, our result is a pipeline for evaluating modules using independent high-throughput enrichments. the work on transcription and methylation datasets suggested that ms is a disease highly enriched for gwas, and we therefore tested if increased enrichments could be derived by their integration. we found publicly available datasets and ran the assessment for both omics independently, which again showed clique sum to score highest. we then tested if improved results could be obtained using modules from multiple datasets of these two omics using consensus modules from clique sum. this resulted in a module of genes highly enriched for gwas (p = . x - ). the multi-omic module was highly enriched in immune-associated pathways, such as t cell and b cell receptor
signaling, th /th differentiation, or leukocyte transendothelial migration. these results are consistent with the current hypothesis that ms is mediated by an autoreactive response of cd + t cells against myelin surrounding neuronal axons, preceded by their migration across the blood-brain barrier (bbb)[ ]. this autoproliferation of brain-targeting th cells has been shown to be driven by memory b cells, in a process mediated by hla-dr [ ]. in addition, another enriched pathway was vegf signaling. ms patients present high serum vegf levels, which are related to pro-inflammatory functions and can alter the permeability of the bbb[ ]. as gwas was used for method prioritization, we asked if modules could instead be validated using epigenetics and lifestyle risk factor genes that we identified to associate with ms. with this aim, we compiled a set of publicly available data from omics studies of these risk factors in healthy individuals. this analysis demonstrated that five out of eight risk factors were enriched in our module. to validate the use of an environmental assessment based on public-domain risk factor associations, we used an independent methylome study of ms comprising environmental data for each ms patient and healthy individual. this analysis showed a remarkable enrichment of the module genes by to differentially methylated genes for ms (p = . x - ), and a majority to be associated with the tested risk factors. in contrast to previously known community challenges, in our study we not only used the topological property of the network, but we also combined the methods to use an omics-based input to uncover the disease modules that might be dysregulated at each omics level, contributing to the diverse causative mechanisms behind complex diseases. although using the ppi network as background may lead to certain knowledge bias, this kind of benchmark allowed us to look at the relevant risk factors.
in our assessment of the disease modules, methods such as clique sum and diamond did perform better than the community-based consensus predictions. in summary, our study provides a practical integrative workflow that enables system-level analysis of heterogeneous diseases, in terms of multi-omics disease modules, as well as the validation of these by using both disease-specific gwas and risk factor enrichment. we believe that this analysis validates our integrated use of datasets and suggests a pipeline that could readily be tested in at least other autoimmune and cardiovascular diseases. lastly, our study did not aim to optimize hyperparameters for individual disease modules, and instead used default values when possible, and was restricted to the methods implemented in the modifier r package[ ]. however, this might be an important task for specific diseases, and our code and processed datasets are available at gitlab (https://gitlab.com/gustafsson-lab/modifier-benchmark). in future work, this approach can be expanded to include diverse and context-specific networks to determine whether our multi-omics modules are able to capture various other levels of granularity.

declarations

ethics approval and consent to participate

not applicable

availability of data and materials

the data used for the transcriptomic and methylation benchmarks were downloaded from geo.
the disease-specific gwas files were downloaded from the latest pascal version. the processed data for analysis are available at https://gitlab.com/gustafsson-lab/modifier-benchmark. the risk factor (eims) data will be made available on request. the r package modifier is available on gitlab: https://gitlab.com/gustafsson-lab/modifier; the code used for benchmark analysis and risk factor analysis is available on gitlab: https://gitlab.com/gustafsson-lab/modifier-benchmark; the latest pascal version: https://www .unil.ch/cbg/index.php?title=pascal.

competing interests

the authors declare no competing interests.

funding

this work was supported by the swedish research council (grant - (m.g.), grant - (m.j.)), the swedish foundation for strategic research (grant sb - (m.g.)), the center for industrial it (ceniit) (m.g.), european union horizon /european research council consolidator grant (epi ms, grant (m.j.)), knut and alice wallenberg foundation (grant . (m.j.)) and the knowledge foundation (grant (z.l.)). computational resources were granted by swedish national infrastructure for computing (snic; snic / - , liu- - and liu- - ).

author contributions

t.v.s.b. compiled the necessary data for the benchmark analysis. h.a.w. performed the transcriptomic benchmark analysis. t.v.s.b. performed the methylation benchmark analysis. d.m.e. and h.a.w. performed the ms use case analysis. d.m.e. performed the risk factor analysis. m.j., i.k., t.o., and l.a. provided the raw data and collected the associated risk factor data for the independent methylation dataset. t.v.s.b. performed the independent validation dataset analysis. t.v.s.b. and d.m.e. collectively made the plots and figures for the manuscript. m.g. and z.l. 
designed the study. t.v.s.b. and d.m.e. prepared the manuscript. all authors discussed the results and commented on the manuscript at all stages.

references

naylor s, chen jy. nih public access. natl institutes heal. ; : – .
santiago ja, bottero v, potashkin ja. dissecting the molecular mechanisms of neurodegenerative diseases through network biology. front aging neurosci [internet]. ; : – . available from: http://journal.frontiersin.org/article/ . /fnagi. . /full
barabási al, gulbahce n, loscalzo j. network medicine: a network-based approach to human disease. nat rev genet [internet]. nature publishing group; ; : – . available from: http://dx.doi.org/ . /nrg
gustafsson m, nestor ce, zhang h, barabási a-l, baranzini s, brunak s, et al. modules, networks and systems medicine for understanding disease and aiding diagnosis. genome med [internet]. ; : . available from: http://genomemedicine.biomedcentral.com/articles/ . /s - - -
szklarczyk d, gable al, lyon d, junge a, wyder s, huerta-cepas j, et al. string v : protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. oxford university press; ; : – .
lamparter d, lin j, kutalik z, choobdar s, hescott b, tomasoni m, et al. open community challenge reveals molecular network modules with key roles in diseases. ssrn electron j. ; – .
schadt ee. molecular networks as sensors and drivers of common human diseases. nature [internet]. ; : – . available from: http://www.nature.com/doifinder/ . /nature
ghiassian sd, menche j, barabási al. 
a disease module detection (diamond) algorithm derived from a systematic analysis of connectivity patterns of disease proteins in the human interactome. rzhetsky a, editor. plos comput biol [internet]. ; :e . available from: https://dx.plos.org/ . /journal.pcbi.
hellberg s, eklund d, gawel dr, köpsén m, zhang h, nestor ce, et al. dynamic response genes in cd + t cells reveal a network of interactive proteins that classifies disease activity in multiple sclerosis. cell rep. ; : – .
wang h, rogers g, benson m, jarvelin m-r, chavali s, ramasamy a, et al. highly interconnected genes in disease-specific networks are enriched for disease-associated polymorphisms. genome biol. ; :r .
langfelder p, horvath s. wgcna: an r package for weighted correlation network analysis. bmc bioinformatics. ; .
choobdar s, ahsen me, crawford j, tomasoni m, fang t, lamparter d, et al. assessment of network module identification across complex diseases. nat methods. ; : – .
de weerd ha, badam tvs, martínez-enguita d, Åkesson j, muthas d, gustafsson m, et al. modifier: an ensemble r package for inference of disease modules from transcriptomics networks. bioinformatics. ; – .
tian y, morris tj, webster ap, yang z, beck s, feber a, et al. champ: updated methylation analysis pipeline for illumina beadchips. ; : – .
teschendorff ae, marabita f, lechner m, bartlett t, tegner j, gomez-cabrero d, et al. a beta-mixture quantile normalization method for correcting probe design bias in illumina infinium k dna methylation data. ; : – .
johnson we, li c. adjusting batch effects in microarray expression data using empirical bayes methods. ; – .
ritchie me, phipson b, wu d, hu y, law cw, shi w, et al. limma powers differential expression analyses for rna-sequencing and microarray studies. ; .
lamparter d, marbach d, rueedi r, kutalik z, bergmann s. fast and rigorous computation of gene and pathway scores from snp-based summary statistics. plos comput biol. ; : – .
mosteller f, fisher ra. questions and answers. taylor & francis, ltd. on behalf of the american statistical association; ; : – . available from: http://www.jstor.org/stable/
piñero j, ramírez-anguita jm, saüch-pitarch j, ronzano f, centeno e, sanz f, et al. the disgenet knowledge platform for disease genomics: update. nucleic acids res. ; :d – .
yu g, wang lg, han y, he qy. clusterprofiler: an r package for comparing biological themes among gene clusters. omi a j integr biol. ; : – .
shannon p, markiel a, ozier o, baliga ns, wang jt, ramage d, amin n, schwikowski b, ideker t. cytoscape: a software environment for integrated models. genome res [internet]. ; : . available from: http://ci.nii.ac.jp/naid/ /
maere s, heymans k, kuiper m. bingo: a cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. ; : – .
supek f, bošnjak m, Škunca n, Šmuc t. revigo summarizes and visualizes long lists of gene ontology terms. plos one. ; .
carbone f, de rosa v, carrieri pb, montella s, bruzzese d, porcellini a, et al. regulatory t cell proliferative potential is impaired in human autoimmune disease. nat med. ; : – .
mammana s, bramanti p, mazzon e, cavalli e, basile ms, fagone p, et al. preclinical evaluation of the pi k/akt/mtor pathway in animal models of multiple sclerosis. oncotarget. ; : – .
holley je, gveric d, newcombe j, cuzner ml, gutowski nj. astrocyte characterization in the multiple sclerosis glial scar. neuropathol appl neurobiol. ; : – .
pedotti r, devoss jj, youssef s, mitchell d, wedemeyer j, madanat r, et al. multiple elements of the allergic arm of the immune response modulate autoimmune demyelination. proc natl acad sci u s a. ; : – .
cui ly, chu sf, chen nh. the role of chemokines and chemokine receptors in multiple sclerosis. int immunopharmacol [internet]. elsevier; ; : . available from: https://doi.org/ . /j.intimp. .
krumbholz m, theil d, cepok s, hemmer b, kivisäkk p, ransohoff rm, et al. chemokines in multiple sclerosis: cxcl and cxcl up-regulation is differentially linked to cns immune cell recruitment. brain. ; : – .
krementsov dn, thornton tm, teuscher c, rincon m. the emerging role of p mitogen-activated protein kinase in multiple sclerosis and its models. mol cell biol. ; : – .
kotelnikova e, kiani na, messinis d, pertsovskaya i, pliaka v, bernardo-faura m, et al. mapk pathway and b cells overactivation in multiple sclerosis revealed by phosphoproteomics and genomic analysis. proc natl acad sci u s a. ; : – .
kunkl m, frascolla s, amormino c, volpe e, tuosto l. t helper cells: the modulators of inflammation in multiple sclerosis. cells. ; : .
waubant e, lucas r, mowry e, graves j, olsson t, alfredsson l, et al. environmental and genetic risk factors for ms: an integrated review. ann clin transl neurol. ; : – .
olsson t, barcellos lf, alfredsson l. 
interactions between genetic, lifestyle and environmental risk factors for multiple sclerosis. nat rev neurol. nature publishing group; ; : – .
kular l, liu y, ruhrmann s, zheleznyakova g, marabita f, gomez-cabrero d, et al. dna methylation as a mediator of hla-drb : and a protective variant in multiple sclerosis. nat commun. ; .
compston a, coles a. multiple sclerosis. lancet [internet]. elsevier ltd; ; : – . available from: http://dx.doi.org/ . /s - ( ) -
jelcic i, al nimer f, wang j, lentsch v, planas r, jelcic i, et al. memory b cells activate brain-homing, autoreactive cd + t cells in multiple sclerosis. cell. ; : - .e .
lange c, storkebaum e, de almodóvar cr, dewerchin m, carmeliet p. vascular endothelial growth factor: a neurovascular target in neurological diseases. nat rev neurol [internet]. nature publishing group; ; : – . available from: http://dx.doi.org/ . /nrneurol.

figures

figure . overview of the benchmark assessment of disease modules and the integration workflow for ms. (a) transcriptomic and methylomic datasets from different diseases were used as inputs for eight modifier module identification methods. the resulting single-omic disease modules (n = ) were independently assessed by gwas enrichment analysis of the same disease using pascal module scoring. modifier methods were evaluated by the combined enrichment score of their respective disease modules. (b) multi-omic integrative workflow for multiple sclerosis (ms)-associated modules. data from case-control comparisons were used as input for module detection with modifier methods. clique sum modules presented the highest gwas enrichment score and were therefore used to generate single-omic consensus modules. the intersection of the best transcriptomic and methylomic consensus modules resulted in an ms multi-omic module (n = genes) with the highest gwas enrichment, which was independently found to be enriched for genes associated with five known lifestyle ms risk factors using public omics data from healthy individuals.

figure . genomic concordance of modifier modules on transcriptomic datasets. (a) heatmap of pascal p-values for eight single-method and eight consensus modifier modules, identified for publicly available transcriptomic datasets. module performance p-values are shown in a white to blue scale, where any shade of blue represents a significant module (p < . ; the darker, the more significant), white represents a non-significant module, and grey represents a module of size zero. datasets are classified into six disease types: cardiovascular (red), glycemic (golden), inflammatory (green), neurodegenerative (fuchsia), psychiatric and social (pink), autoimmune (dark purple), and others (light purple); and two cell types: blood (maroon), and others (light yellow).
datasets are ranked by meta p-values using fisher’s method of the single-method module p-values across and within their disease types (dataset score, bottom boxplot). modifier methods are organized by algorithm type: seed-based (green), co-expression-based (yellow), and clique-based (red), plus the consensus modules (blue). single-methods and consensus were scored by meta p-values across datasets (method score, right boxplot). consensus x/ indicates that the module genes are found in at least x methods out of eight. (b) scatter plot showing spearman correlation between module score and betweenness centrality. modules are represented with a different shape depending on their method and colored based on the disease type. (c) scatter plot showing spearman correlation between module score and module size. modules are represented with a different shape depending on their method and colored based on the disease type.

figure . genomic concordance of modifier modules on methylomic datasets. (a) heatmap of pascal p-values for eight single-method and eight consensus modifier modules, identified for ten publicly available methylomic datasets. module performance p-values are shown in a white to blue scale, where any shade of blue represents a significant module (p < . ; the darker, the more significant), white represents a non-significant module, and grey represents a module of size zero.
datasets are classified into two disease types: glycemic (golden), and inflammatory (green); and two cell types: blood (maroon), and others (light yellow). datasets are ranked by fisher’s combined p of the single-method module p-values across and within their disease types (dataset score, bottom boxplot). modifier methods are organized by algorithm type: seed-based (green), co-expression-based (yellow), and clique-based (red), plus the consensus modules (blue). single methods and consensus modules are scored by meta p-values across datasets (method score, right boxplot). consensus x/ indicates that the module genes are found in at least x methods out of eight. (b) scatter plot showing spearman correlation between module score and betweenness centrality. modules are represented with a different shape depending on their method and colored based on the disease type. (c) scatter plot showing spearman correlation between module score and module size. modules are represented with a different shape depending on their method and colored based on the disease type.
figure . genomic concordance of modifier modules on ms use case data. (a) heatmap of pascal p-values for eight single-method modifier modules, identified for ten ms-related transcriptomic datasets. module performance p-values are shown in a white to blue scale, where any shade of blue represents a significant module (p < .
), white represents a non-significant module, and grey represents a module of size zero. datasets are classified into the reported ms type: ms (blue), rrms (red), ppms (green), spms (orange), and cis (yellow); and four cell types: whole blood (maroon), pbmcs (light brown), white matter (light yellow), and cd + t cells (purple). datasets are ranked by meta p-values of the single-method enrichments (dataset score, bottom boxplot). modifier methods are organized by algorithm type: seed-based (green), co-expression-based (yellow), and clique-based (red). single methods are scored by p of the significant modules across datasets (method score, right boxplot). (b) heatmap of pascal p-values for four single-method modifier modules, identified for nine ms-related transcriptomic datasets. (c-d) bar plots of pascal p-values for the ms consensus modules generated with clique sum from transcriptomic (c) and methylomic (d) datasets. (e) union and intersection of the top performing modules, shown as a venn diagram.
[figure panels a-e. heatmap rows: mod. discov., mcode, correl. clique, clique sum, wgcna, moda, diffcoex, diamond. annotation bars: disease type (ms, rrms, ppms, spms, cis) and cell type (wb, pbmcs, wm, cd + t cells, cd + monocytes, cd + b cells). color scale: module performance as -log p, from worst to best, with threshold α = . . bar plots: transcriptomic and methylomic clique sum consensus modules, -log p. venn diagram: best transcriptomic consensus, best methylomic consensus, intersection and union, with the number of genes.]
figure . risk factor enrichment and network visualization of the ms multi-omic module. (a) evidence levels and effect on ms of the risk factor. (b) enrichment overlap of multi-omic ms module genes in the top , dmgs in risk factor datasets and independent risk factor methylation dataset (see methods) shown as fisher exact test p-values (threshold α= . ). (c) visualization of the module. nodes (module genes) are arranged in functional clusters according to their overrepresented go terms. genes with a known association to ms are marked with a blue circle. node colors display the associations to an ms risk factor for which the module is significantly enriched (red, alcohol use; green, high bmi; yellow, smoking; purple, low sun exposure; light blue, ebv infection; grey, no association). edges were extracted from the stringdb v human ppi network of experimentally validated interactions (confidence score > ).
[figure panels a-c. panel a, risk factor (evidence): ebv infection (+++), smoking (+++), low sun exposure (++), adolescent obesity (++), high bmi (++), night shift work (++), organic solvent exposure (+), alcohol consumption (+), oral tobacco (+); a further column gives the effect on ms risk. panel b, module enrichments in the risk factor datasets and in the validation dataset, -log p with threshold α = . . panel c, functional clusters: cell death and apoptosis; morphogenesis and neurogenesis; cell cycle and proliferation; chemotaxis and cell migration; response to hormone stimulus; leukocyte activation and differentiation. node labels are gene symbols.]
supplementary materials
supplementary table : all case-control comparisons used in the transcriptomic and methylomic benchmarks.
supplementary table : all case-control comparisons used in the ms use case benchmark.
supplementary table : all methods implemented in the benchmark.
complex systems analysis informs on the spread of covid-19
xia wang , dorcas washington , georg f. weber *
university of cincinnati department of mathematical sciences, cincinnati, oh, usa
university of cincinnati health science library, cincinnati, oh, usa
university of cincinnati academic health center, cincinnati, oh, usa
* send correspondence to: georg f. weber, james l. winkle college of pharmacy, university of cincinnati, albert sabin way, oh - . e-mail: georg.weber@uc.edu, phone - - .
the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by . international license (http://creativecommons.org/licenses/by/ . /). biorxiv preprint, this version posted january , . ; doi: https://doi.org/ . / . . .
abstract
the non-linear progression of new infection numbers in a pandemic poses challenges to the evaluation of its management. the tools of complex systems research may aid in attaining information that would be difficult to extract by other means. to study the covid-19 pandemic, we utilize the reported new cases per day for the globe, nine countries and six us states through october .
fourier and univariate wavelet analyses inform on periodicity and extent of change. evaluating time-lagged data sets of various lag lengths, we find that the autocorrelation function, average mutual information and box counting dimension represent good quantitative readouts for the progression of new infections. bivariate wavelet analysis and return plots give indications of containment versus exacerbation. homogeneity or heterogeneity in the population response, uptick versus suppression, and worsening or improving trends are discernible, in part by plotting various time lags in three dimensions. the analysis of epidemic or pandemic progression with the techniques available for observed (noisy) complex data can aid decision making in the public health response.
keywords: covid-19, epidemiology, new infections, complex systems, autocorrelation, fractal dimension, average mutual information, wavelet analysis
introduction
the spread of infectious diseases depends on pathogen factors (virulence), host factors (immunity), and, on the population level, on countermeasures taken by the community. such measures cover a broad spectrum of possible engagements, and they may be highly consequential for the course of an epidemic or a pandemic [ ]. the analysis of acute infectious progression in a society is critical for gauging the effectiveness of public health responses, but it is made difficult by the non-linear nature of the underlying process. conventional approaches of reductionist research or common linearization techniques are not meaningfully applicable.
various strategies have been employed to account for the complexity of infectious propagation. the spread of covid-19 has been modeled with machine learning [ ], networks of compartments [ ] and cellular automata [ ]. power laws have been inferred [ ]. such investigations are of value, even though they are inevitably based on idealizing assumptions. in addition to modeling approaches, the analysis of actually observed data is of critical importance. the numbers in such data sets are noisy, and they are eminently non-linear (also described as “complex data” or “observed chaotic data” [ ]). complex systems research has made techniques and algorithms available to extract information from observed non-linear data series.
the manifestations of the covid-19 pandemic have varied widely among geographic areas, when compared across countries [ , , ] as well as across us states [ ], depending on when the virus reached them, what the population characteristics were at the time of onset, and what actions were taken in response to the infectious spread. here, we set out to investigate underlying patterns. we apply basic tools of complex systems research to compare the spread of covid-19 in distinct countries, characterized by their varying approaches to the pandemic, from its beginning stages through early or late october . further, we compare various regions within the usa, which has left major decisions to the individual states. patterns are discernible in fourier and wavelet analyses. order can be detected in time-lagged plots. from these, quantitative measurements are obtainable, including autocorrelation, average mutual information, fractal dimension, and embedding dimension, which inform on the pandemic progression.
methods
source data: here we analyze the new infections per day, either as absolute numbers or as rates per , inhabitants. the source data utilized for the present analysis came from the bing covid-19 tracker (www.bing.com/covid).
fourier spectrum and univariate wavelet analysis: fourier analysis evaluates the spectral density of the relative numbers of new infections (case rates per , inhabitants) versus frequency or versus period. wavelet analysis does not assume stationarity in the time series. thus, it allows the study of localized periodic behavior. in particular, we look for regions of high power in the frequency-time plot. the calculations for wavelet analyses of new infections were done in r. in waveletcomp, the null hypothesis that there is no periodicity in the series is tested via p-values obtained from simulation, where the model to be simulated can be chosen from a range of options [ ]. the algorithm analyzes the frequency structure of uni- or bivariate time series using the morlet wavelet. the time series to be analyzed was standardized, after detrending, in order to obtain a measure of the wavelet power that is relative to unit-variance white noise and directly comparable to results of other time series. detrending is accomplished using polynomial regression. where indicated, all graphs are normalized to the same y-axis scale.
bivariate wavelet analysis: we conducted bivariate analysis of lagged data (t versus t+ or t+ or t+ ) for joint periodicity. the concepts of cross-wavelet analysis provide tools for comparing the frequency contents of two time series as well as for drawing conclusions about their synchronicity at certain periods and across certain ranges of time. while cross-wavelet power corresponds to covariance in the time domain, wavelet coherence is a time-series measure similar to correlation.
two waves are coherent if they have a constant relative phase. the bivariate analysis results include the cross-wavelet power plot, the wavelet coherence plot, the average power plot and the phase difference image. the cross-wavelet power and coherence plots contain arrows showing the areas of significant joint periods (significance level = . ). the direction of these arrows indicates the phase difference. up-right pointing arrows indicate that the two series are in phase and the x(t) series leads, while down-right pointing arrows indicate that the two series are in phase and the x(t+n) series leads. similarly, up-left pointing arrows express that the two series are out of phase and the x(t+n) series leads, while down-left pointing arrows express that the two series are out of phase and the x(t) series leads. the arrows are only plotted within white contour lines indicating significance at the % level. a more explicit global view of the phase difference can be produced with (π/2, π) and (-π, -π/2) for out of phase and (-π/2, π/2) for in-phase. the time-averaged cross-wavelet power provides a summarized view on the shared periods, the corresponding power and the statistical significance. cross-wavelet plots may mark areas as significant due to one series swinging widely, rather than two series sharing a joint period. to avoid this false positive readout, it is more appropriate to examine wavelet coherence plots, which are analogous to the coefficient of correlation.
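as an illustration of the arrow convention described above, the following python sketch maps a phase difference to the verbal categories used in the text. it is a minimal sketch and not part of the original analysis: the function name classify_phase is hypothetical, and the sketch assumes that the arrow angle equals the phase difference, so that right-pointing arrows correspond to angles within (-π/2, π/2) and upward-pointing arrows to positive angles.

```python
import math


def classify_phase(phi):
    """Map a phase difference phi (radians, within [-pi, pi]) to (relation, leader).

    Assumed convention (an assumption of this sketch): the arrow angle equals
    phi, so |phi| < pi/2 points right (in phase) and phi > 0 points up.
    """
    if not -math.pi <= phi <= math.pi:
        raise ValueError("phi must lie in [-pi, pi]")
    # The intervals in the text are open; exact boundary values are left unclassified.
    if phi == 0 or abs(phi) in (math.pi / 2, math.pi):
        return ("boundary", None)
    in_phase = abs(phi) < math.pi / 2
    arrow_up = phi > 0
    if in_phase:
        # up-right: x(t) leads; down-right: x(t+n) leads
        leader = "x(t)" if arrow_up else "x(t+n)"
        return ("in phase", leader)
    # up-left: x(t+n) leads; down-left: x(t) leads
    leader = "x(t+n)" if arrow_up else "x(t)"
    return ("out of phase", leader)


if __name__ == "__main__":
    for angle in (math.pi / 4, -math.pi / 4, 3 * math.pi / 4, -3 * math.pi / 4):
        print(round(angle, 3), classify_phase(angle))
```

the four printed cases correspond, in order, to up-right, down-right, up-left and down-left arrows.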
wavelet coherence has a value range between 0 and 1, and it shows statistical significance only in areas where the two series actually share jointly significant periods.
return plots: from the total numbers of new infections, we generated return plots with increasing lags, plotting daily changes x(t+ ), …, x(t+ ) versus x(t) and weekly changes x(t+ ), …, x(t+ ) versus x(t). short time lags tend to cluster around the 45° angle, whereas increasing time delays reveal the structure of the oscillations. when graphed in three dimensions, these diagrams can aid in reconstructing the underlying attractor.
autocorrelation: a time series sometimes repeats patterns or has other properties, whereby earlier values display some relation to later values. the autocorrelation statistic (serial correlation statistic) measures the degree of that affiliation as it refers to linear dependence. the magnitude of its dimensionless number reflects the extent of similarity. the formula for the autocorrelation R_m at lag m is composed of terms for autocovariance and variance
$$\text{autocorrelation} = \frac{\text{autocovariance}}{\text{variance}}, \qquad R_m = \frac{\frac{1}{N}\sum_{t=1}^{N-m}(x_t-\bar{x})(x_{t+m}-\bar{x})}{\frac{1}{N}\sum_{t=1}^{N}(x_t-\bar{x})^2}$$
autocorrelation coefficients range from -1 to +1, with +1 indicating perfect synchrony and -1 reflecting exact mirror images. an absence of any correlation yields R_m = 0.
box counting dimension: the dimension of a fractal is best described as a non-integer. the dimension is a quantitative measure for the evaluation of the geometric complexity of objects. a general relationship assumes
$$\text{dimension} \propto \frac{\log(\text{number of increments})}{\log(1/\text{scale size})}$$
here, the characteristic of dimension is that it specifies the rate at which the number of increments varies with scale size. we calculated the box counting dimension after binning into x squares of two-dimensional return plots with various lags.
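the return plot, autocorrelation and box counting definitions above can be sketched in a few lines of python. this is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the autocorrelation uses a 1/N prefactor in both numerator and denominator, and the box counting dimension is estimated as the least-squares slope of log(occupied boxes) against log(boxes per side) over several grid sizes, which is one common way to apply the proportionality above and may differ from the binning used in the study.

```python
import math


def return_points(x, lag):
    """Pairs (x(t), x(t+lag)) that make up a two-dimensional return plot."""
    return [(x[t], x[t + lag]) for t in range(len(x) - lag)]


def autocorrelation(x, m):
    """R_m = autocovariance at lag m divided by the variance (1/N in both terms)."""
    n = len(x)
    if not 0 <= m < n:
        raise ValueError("lag m must satisfy 0 <= m < len(x)")
    mean = sum(x) / n
    variance = sum((v - mean) ** 2 for v in x) / n
    if variance == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    autocov = sum((x[t] - mean) * (x[t + m] - mean) for t in range(n - m)) / n
    return autocov / variance


def box_count(points, boxes_per_side):
    """Number of occupied boxes when the bounding square is cut into a k x k grid."""
    values = [c for p in points for c in p]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for degenerate input
    k = boxes_per_side
    occupied = set()
    for px, py in points:
        i = min(int((px - lo) / span * k), k - 1)
        j = min(int((py - lo) / span * k), k - 1)
        occupied.add((i, j))
    return len(occupied)


def box_counting_dimension(points, sizes=(2, 4, 8, 16, 32)):
    """Slope of log(count) versus log(boxes per side); boxes per side = 1 / scale size."""
    log_k = [math.log(k) for k in sizes]
    log_n = [math.log(box_count(points, k)) for k in sizes]
    mean_k = sum(log_k) / len(log_k)
    mean_n = sum(log_n) / len(log_n)
    num = sum((a - mean_k) * (b - mean_n) for a, b in zip(log_k, log_n))
    den = sum((a - mean_k) ** 2 for a in log_k)
    return num / den


if __name__ == "__main__":
    series = [math.sin(0.3 * t) for t in range(200)]  # synthetic stand-in for case counts
    print(autocorrelation(series, 1))
    print(box_counting_dimension(return_points(series, 5)))
```

points that fall on a straight line give a slope near 1, whereas points that fill the plane give a slope near 2, which is the sense in which the dimension measures geometric complexity.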
average mutual information: the average mutual information (ami) represents a non-linear correlation function, which indicates how much common information is shared by the measurements of x(t) and x(t+n). the average mutual information was calculated with the mutual function of the r package tserieschaos. it estimates the mutual information index for a specified number of lags. the joint probability distribution function is estimated with a simple bi-dimensional density histogram.
embedding dimension: using the r package nonlineartseries, we first apply the timelag function to determine the optimal time lag τ based on the average mutual information and then the estimateembeddingdim function to assess the optimal embedding dimension m. the optimal set of regressors related to x(t) is then x(t-τ), …, x(t-(m-1)τ), x(t-mτ).
results
1. comparison across countries
across countries, a wide spectrum of measures was taken to curb the spread of sars-cov-2. this resulted in a range of very different progression curves when graphing the numbers of new infections over time (figure ). india, brazil, sweden, italy and the united states have been considered as hard-hit for their own internal reasons. france, germany, poland (over a long period), and south korea had tighter control and a less aggressive spread. all curves display close to linear ramp-up phases, followed by more or less irregular oscillations.
the levels of success at suppressing the new infection rates diverged among countries, and several are experiencing a second peak. wavelet methodology aids in studying periodic phenomena in time series, particularly in the presence of potential frequency changes over time. for cross-country evaluations, all graphs were plotted on the same scale (figure a). each country was also plotted on its own scale (figure b). the univariate analysis of the time course for the countries under study shows prominence of the recent upswing in france (heat intensity on the right margin of the graph). by contrast, there is comparatively more successful management by italy, germany, poland and south korea through october . india, brazil, sweden, and the united states display cyclical fluctuations of various durations, none of which have been contained. a period of days is prominent in the fluctuations of most countries, which may reflect real cyclicity or weekly reporting habits. the worldwide data are displayed in figure s . for cross-country comparisons, we converted the new infection total numbers to new infection rates by relating them to , members of the population (figure a). similarly, complex systems can be analyzed with fourier analysis. we first plotted fourier power spectra versus frequency for the rates of new infections (figure b). spectral density range (high in brazil, low in south korea) and frequency distribution provide a readout for infectious spread. the spectral density of the normalized rates (identically scaled y-axes) (figure c) confirmed good management of the pandemic spread in germany, poland, and south korea (and to some degree in italy). despite the progressive increase in the numbers of infections in india, on a population basis, control has apparently not been lost through october . by contrast, the power spectra for brazil, sweden, and france are reflective of potentially adverse developments. 
the united states display an anomaly with a periodic behavior that has a prominent cycle around days.
to gain a better understanding of the dynamics with which disease spread occurs, we investigated progressive numbers of new infections in comparison to their increasing time lags. this approach may reveal periodicities or aid in the visualization of attractors. expectedly, short time delays were associated with little change. with a lag time of about days onward, distinct patterns emerged among countries. according to bivariate wavelet analysis for time-delayed data series (including the cross-wavelet power plot, wavelet coherence plot, average power plot and phase difference image), italy, germany and south korea shared significant joint periods of - months in the comparison x(t) versus x(t+ ). south korea has comparatively high power and significant shared periods around weeks at the early stage, and later the significant shared periods are also - months. the remaining countries all have segments of shorter periods (around days) and longer periods shared. for india, brazil, france, usa and poland, the shared -day period only appears significant in the later part of the series. similar results are observed in the analyses for x(t) versus x(t+ ) and x(t) versus x(t+ ). the phase difference plots show that in the shared longer periods, x(t) is mostly in phase with x(t+ ), while the series gradually become out of phase in x(t) versus x(t+ ) and x(t) versus x(t+ ), thus making longer lags more discriminating and informative (figure a and figure s a,b).
a reduction in cross-wavelet power levels is apparent in italy, germany and south korea. poland and france are experiencing recent increases. india, brazil and the usa have had protracted periods of high cross-wavelet power levels. containment is associated with longer periodicity in the distribution of cross-wavelet power. this is the case for south korea, germany and italy. high cross-wavelet power around a periodicity of days is reflective of poor control.
to generate informative return plots, we utilized three dimensions, which allows for the visualization of two lags from x(t) (or from a later start point) and may reveal the pattern of an attractor. in this depiction, a rapid increase or decrease in new infections is reflected in a close-to-straight line, oscillations generate a near-toroid attractor, while successful management shrinks the torus and moves it closer to the origin. initially, we evaluated multiple time delays. most discriminating were x(t)/x(t+ )/x(t+ ), x(t+ )/x(t+ )/x(t+ ), and x(t+ )/x(t+ )/x(t+ ) (figure b). the progressive increase in new cases over the time period in india is reflected in a predominantly linear curve on each scale. the wide fluctuations in brazil generate a largely disordered appearance. disorder is also apparent in sweden. france initially managed the pandemic well, but is experiencing a dramatic upswing, which obscures order. cyclical patterns, suggesting the outlines of attractors, are apparent in usa, italy, germany, and south korea (where most data points are concentrated near the origin). poland initially displayed a well-contained attractor, but the recent substantial upswing in new infections is reflected in a linear progression from there (for separate analyses of the two phases, see figure s ).
we also calculated the embedding dimensions for the lagged data (figure c). germany has the highest embedding dimension of , followed by poland with . several countries have an embedding dimension of , including brazil, sweden, usa and south korea. italy and france have an embedding dimension equal to . india is unusual due to its longer lag period of days. when the lag period is set at days, the embedding dimension of india is also equal to . for the worldwide data, the calculated embedding dimension is with a time lag of (not shown).
the autocorrelation of two data strings with short time lags is expected to be high (approaching . ) because there is little opportunity for dramatic change (high infection rates on day t likely produce similarly high numbers on the consecutive day t+ , while low numbers are followed by few new infections on the next day). autocorrelation may remain high for extended lags in the initial ramp-up and at the oscillatory stage, depending on the regularity of the fluctuations. a society that succeeds in curbing the disease spread will leave the highly correlated initial ramp-up and consecutive oscillatory phases, thus displaying a gradual decrease in values at the longer lags. the decline in the autocorrelation numbers of progressively lagged data by country appeared to be reflective of the stringency with which the pandemic was addressed (figure a). from a lag of onward, poland and south korea have substantially declining values (although due to the recent steep upswing in new infections, poland deviates from the trend at very long lags); germany shows a dramatic lowering at a lag of and above. by contrast, india and brazil stay uniformly high. so do the global numbers, which are inherently heterogeneous.
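the average mutual information described in the methods can be sketched with a simple two-dimensional histogram, in the spirit of the density histogram mentioned there. the following python sketch is illustrative only: the function name, the number of bins and the equal-width binning are assumptions, and the code is not claimed to reproduce the output of the r function used in the study.

```python
import math
from collections import Counter


def average_mutual_information(x, lag, bins=8):
    """Histogram estimate (in nats) of the information shared by x(t) and x(t+lag).

    Assumptions of this sketch: equal-width bins over the range of x and
    plug-in probabilities from the bin counts.
    """
    if lag < 1 or lag >= len(x):
        raise ValueError("lag must satisfy 1 <= lag < len(x)")
    lo, hi = min(x), max(x)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant series

    def bin_of(v):
        return min(int((v - lo) / span * bins), bins - 1)

    pairs = [(bin_of(x[t]), bin_of(x[t + lag])) for t in range(len(x) - lag)]
    n = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    ami = 0.0
    for (a, b), count in joint.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) )
        ami += (count / n) * math.log(count * n / (first[a] * second[b]))
    return ami


if __name__ == "__main__":
    series = [math.sin(0.3 * t) for t in range(300)]  # synthetic stand-in for case counts
    for lag in (1, 5, 10):
        print(lag, average_mutual_information(series, lag))
```

a series in which x(t+n) is fully determined by x(t) yields a value close to the entropy of the binned series, while a series in which the two are unrelated yields a value close to zero.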
the average mutual information reflects information shared by the measurements of x(t) and x(t+n). expectedly, it declines with increasing lag. poland starts with a relatively low value ( . at t versus t+ ) and shows a rapid decrease with longer lag. it then stays at a low level of . from lags of to days. france displays a gradually decreasing trend with the average mutual information starting at . and ending at . at the lag of days. india shows a similar pattern as france but with much higher average mutual information (due to the constant uptick in numbers), ranging between . and . . four other countries, germany, usa, sweden and brazil, all express relatively flat average mutual information values, staying around levels of . for the usa and brazil, . for germany, and . for sweden. reflecting progressively improved control, italy and south korea also have decreasing trends, but much flatter, at . - . for italy and . to . for south korea, respectively (figure b).
a rapid increase in new infections is reflected in a small fractal dimension (practically approximated by the box counting dimension with values between and ) of the two-dimensional return plots with progressive lags. intermediate phases are characterized by higher fractal dimensions (approaching ), depending on the nature of the oscillations. conversely, successful management through the reduction in new infections should be reflected in a contraction of the attractor on the return plot, which is assessable through the box counting dimension.
a trend is displayed in the comparisons from shorter to longer lag periods. distinct management strategies across different countries generate a heterogeneous pattern worldwide, rendering the fractal dimension high regardless of the lag in x(t+n) versus x(t) plots. steep increases in new infections (poland, india) have dimensions close to . intermediate phases are characterized by higher numbers. successful fights against the pandemic (south korea) lead to declining dimensions with increasing lag (figure c).

comparison across us states

within the usa, individual states have encountered a rather wide range of progression phenotypes in the spread of new covid-19 infections (figure ). this is due to variations in international connectedness and population density (reflected in the early peaks in the northeastern states new york and massachusetts), holiday travel (florida), policy decisions and other factors. wavelet analysis of new infections (one scale across all states) shows good control (right side of the graph) after initial affliction (left area) for massachusetts and new york, which, having had early spikes in new infections, achieved good success in containment. through the observation period, control has not been maintained in ohio. the periodicity in individual states (each on their own scales) is poorly defined, except for florida and ohio, where days yield a prominent signal (figure a,b).

we normalized the new infection numbers to rates by relating them per , inhabitants (figure a). figure b shows the periodogram for the states under investigation with frequencies between and . (the graph is almost flat for the higher frequencies). the patterns are clearly heterogeneous among these states. new york and massachusetts display steadily decreasing spectral density values from the longest period to around - weeks (corresponding to a frequency range around . - . ). florida and texas share similar patterns with a few low spikes in their periodograms after the first highest ones. the graph for california flattens out after the lowest three frequencies, with the longest period (the whole series) having the highest value. ohio’s pattern is distinctive, with fluctuating values from the longest periods through around - weeks. the fourier power spectrum for the infection rates (figure c) indicates periodic patterns similar to those in the periodograms of figure b. these patterns are less prominent due to the adjustment to the same y-axis scale (the scale reflects the magnitude of the positive rates, the shape shows the evolution of the disease).

we conducted bivariate wavelet analysis on the time-lagged data (figure a and figure s ). the shared synchronicity segments between x(t) and x(t+n) can be grouped into shorter periods (around days) and longer periods (approximately weeks, month, months). new york does not display substantial joint short periods. ohio and texas mainly have correlation at the end of the series around the -day period. massachusetts experiences joint periodicity around the -day period at the early stage of the series. florida and california have joint periods in the middle of the observation time frame. the levels of average cross-wavelet power are higher in states with poor control (x-axes scales for florida, ohio). the peak power shifts toward higher periodicity with improved control (y-axes scales for new york, massachusetts).
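the periodograms discussed above can be illustrated with a short sketch. this is a hedged example using a direct discrete fourier transform in plain python, not the software used for the figures. the function names and the synthetic series, including its weekly rhythm, are assumptions made purely for illustration.

```python
import cmath
import math


def periodogram(series):
    """Return (frequency, power) pairs for the positive Fourier frequencies.

    Uses a direct O(N^2) discrete Fourier transform of the mean-centred
    series; frequency is in cycles per sampling interval (per day here).
    """
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]
    result = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centred)
        )
        result.append((k / n, abs(coeff) ** 2 / n))
    return result


def dominant_period(series):
    """Period (in sampling intervals) of the largest periodogram peak."""
    freq, _ = max(periodogram(series), key=lambda pair: pair[1])
    return 1.0 / freq


# Synthetic daily rates with an assumed weekly reporting rhythm
# (hypothetical values, not data from the study).
rates = [10.0 + 3.0 * math.cos(2 * math.pi * t / 7) for t in range(70)]
period = dominant_period(rates)
```

in this sketch the frequency axis is in cycles per day, so a peak at frequency f corresponds to a period of 1/f days. the series is mean-centred first because the zero-frequency term would otherwise dominate the spectrum.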
the return plots in dimensions, utilizing the same time lags as for the countries, seemed to reflect contraction of the attractor in massachusetts, cyclicity in new york, florida and california, no containment in texas, and an ejecting diagonal in ohio, which may reflect exacerbation (figure b). the embedding dimension varies among states, such that the most contained states (new york, massachusetts) have the lowest embedding dimension (table ).

the autocorrelation for return plots of increasing lags shows a progressive decline in the numbers of new york and massachusetts, which implemented strong containment measures after having been afflicted early. the values decline less steeply for texas and california. ohio displays an anomaly with increasing values for very long lags. the state, while not heavily afflicted on a per capita basis, never achieved containment, only a stationary level, and has since experienced another wave (figure a). up to a maximum lag of days, the average mutual information for the us states under study ranges between . and . . overall, all states show a slightly decreasing pattern except for california, which is relatively level at a value of . (figure b). unexpectedly, the box counting dimension (figure c) is less discerning than it was for the evaluation across countries. this may be due to the much lower power conveyed by smaller population sizes.

discussion

in the present investigation we find that the analysis tools for observed complex data can aid in the interpretation of pandemic spread across communities.
difficulties in analyzing the non-linear patterns of infectious disease spread may be tamed by applying the tools of complex systems research. the approach can reveal patterns where a simple time course of new cases does not. further, non-linear analysis allows the study of various facets of the process, depending on whether the starting data are new cases, hospitalizations, deaths or other readouts. maps can be generated and evaluated for their fractal dimensions [ ]. the operational approximation of lyapunov exponents may be meaningful, although they were largely uninformative for the present study (supplemental figure s ).

among the countries analyzed, south korea has had the most successful control of the pandemic spread according to low intensity in univariate wavelet analysis, low spectral density range in fourier analysis, low spectral density of the normalized rates, a reduction in cross-wavelet power levels according to bivariate wavelet analysis and longer periodicity in the distribution of cross-wavelet power. further, box counting dimensions and autocorrelation values that decline with increasing time lag, and decreasing trends (at a low slope) in average mutual information, confirm containment. cyclical patterns in return plots, suggesting the outlines of attractors, are apparent and most data points are concentrated near the origin of the graph.

germany exhibited good management through october according to univariate wavelet analysis, spectral density in the power spectrum of the normalized rates, a reduction of cross-wavelet power levels in bivariate wavelet analysis, longer periodicity in the distribution of cross-wavelet power, a dramatic lowering of autocorrelation values at a lag of and above, and relatively flat average mutual information values, staying around levels of . . cyclical patterns in return plots suggest the outlines of an attractor.
good control by italy following the early impact and through october is reflected in low intensity and fluctuation when applying univariate wavelet analysis, in a reduction of cross-wavelet power levels for bivariate wavelet analysis of time-delayed data, longer periodicity in the distribution of cross-wavelet power, and decreasing trends (at a low slope) in average mutual information. cyclical patterns in return plots, suggesting the outlines of an attractor, are apparent.

poland had two distinct phases. by univariate wavelet analysis and density in the power spectrum of normalized rates, there was indication of good management through october . the recent substantial upswing in new infections is reflected in bivariate wavelet analysis for time-delayed data series and in return plots, and also results in box counting dimensions close to . from a lag of onward, poland has substantially declining autocorrelation values, although due to the recent steep upswing in new infections, the trend reverses at very long lags. the average mutual information starts with a relatively low value ( . at t versus t+ ) and shows a rapid decrease with longer lag, staying level from lags of to days.

in the united states, univariate wavelet analysis displays cyclical fluctuations of various durations, none of which have been contained. according to bivariate wavelet analysis for time-delayed data series, there have been protracted periods of high cross-wavelet power levels. cyclical patterns in return plots, suggesting the outlines of attractors, are apparent.
the usa expresses relatively flat average mutual information values, staying around levels of . .

in france, univariate wavelet analysis of the time course shows prominence of the recent upswing (heat intensity on the right margin of the graph), and the power spectrum is reflective of potentially adverse developments. the second wave of infections is apparent in bivariate wavelet analysis and in the obscured order in return plots. france displays a gradually decreasing trend of average mutual information.

india expresses cyclical fluctuations of various durations in univariate wavelet analysis, none of which have been contained. on a population basis, the spectral density suggests that control has not been lost through october . bivariate wavelet analysis shows protracted periods of high cross-wavelet power levels; return plots reflect the progressive increase in new cases over the time period in a predominantly linear curve on each scale; box counting dimensions are close to ; and autocorrelation values stay uniformly high with increasing time lag. india displays a gradually decreasing trend of average mutual information.

brazil experiences cyclical fluctuations of various durations in univariate wavelet analysis, none of which have been contained. by fourier analysis, the spectral density range is high. the power spectrum is indicative of potentially adverse developments. according to bivariate wavelet analysis, there have been protracted periods of high cross-wavelet power levels. in return plots, the wide fluctuations generate a largely disordered appearance. the autocorrelation values stay uniformly high. brazil expresses relatively flat average mutual information values, staying around levels of . .

sweden shows cyclical fluctuations of various durations in univariate wavelet analysis, none of which have been contained. the power spectrum is reflective of potentially adverse developments. in return plots, disorder is apparent.
sweden expresses relatively flat average mutual information values.

prima facie, the curves of new infections versus time for three western european countries, france, italy, and germany, appear similar. complex systems analysis reveals the upswing in france to be much more perilous than the increases in the curves of new infections for the other two countries. the management of infectious spread also requires improvements in the united states, sweden and brazil.

the selection of the observation period can dramatically influence the results. poland was initially very successful in containing the pandemic, but then experienced a substantial upswing. analyzing these two phases individually or in conjunction yields very different data sets, which inform about distinct aspects of the infectious progression.

the fluctuations of new infections in an epidemic or a pandemic pose challenges to evaluating whether a decline reflects true containment (“rounding the corner”) or just the calm before another wave. the readouts of non-linear systems analysis can aid in making such a distinction. a complex occurrence that experiences containment will strive toward a point attractor in phase space and move toward the origin. such a progression is represented in a declining fractal dimension, and the transition from fluctuations (often associated with a torus attractor) toward limitation of new cases is expected to reduce the autocorrelation.

one constraint of complex systems analysis is the need for large data sets.
in this regard, the availability of about data points (daily new cases march through october ) for each geographic area in this study is somewhat low. the robustness of pertinent studies increases with larger data sets over time. reporting errors could have a non-trivial impact, and may be reflected in the frequent occurrence of a peak at days in the spectral analysis (possibly indicating weekly totals). this problem can be addressed by utilizing moving averages. the homogeneity or heterogeneity in management by the community under study determines the noise level. the worldwide numbers of new infections have a lot of background noise due to varying patterns across countries.

acknowledgements

gfw is supported by nih grant ca .

references

[ ] christakis na. apollo’s arrow: the profound and enduring impact of coronavirus on the way we live. new york (hachette book group) .
[ ] mehta m, julaiti j, griffin p, kumara s. early stage machine learning–based prediction of us county vulnerability to the covid-19 pandemic: a machine learning approach. jmir public health and surveillance ; : e .
[ ] wang k, ding l, yan y, dai c, qu m, jiayi d, hao x. modelling the initial epidemic trends of covid-19 in italy, spain, germany, and france. plos one ; :e .
[ ] bin s, sun g, chen c-c. spread of infectious disease modeling and analysis of different factors on spread of infectious disease based on cellular automata. int j environ res public health ; : .
[ ] blasius b. power-law distribution in the number of confirmed covid-19 cases. chaos ; : .
[ ] abarbanel hdi. analysis of observed chaotic data. switzerland (springer nature) .
[ ] chakraborty i, maity p. covid-19 outbreak: migration, effects on society, global environment and prevention. science of the total environment ; : .
[ ] bertacchini f, bilotta e, pantano ps. on the temporal spreading of the sars-cov-2. plos one ; :e .
[ ] white er, hébert-dufresne l. state-level variation of initial covid-19 dynamics in the united states. plos one ; :e .
[ ] roesch a, schmidbauer h. waveletcomp: computational wavelet analysis. r package version . . . https://cran.r-project.org/package=waveletcomp
[ ] păcurar c-m, necula b-r. an analysis of covid-19 spread based on fractal interpolation and fractal dimension. chaos, solitons & fractals ; , .

tables and figures

figure : time-course of disease spread by country. numbers of new cases, x(t), per day versus time (t, indicating the date). shown are the curves for (top to bottom, left to right) the globe, india, brazil, sweden, italy, usa, france, germany, poland, and south korea. note the different scales of the y-axes.

figure : univariate wavelet analysis. cross-wavelet power spectrum in the time-period domain.
the x-axis (index) displays the time progression, whereas the y-axis depicts the length of the period. white contour lines indicate significance of periodicity on the . level for probability of error. lines represent the ridge of cross-wavelet power. the color bar reveals the power gradient. a) all countries on the same scale. b) each country on its own scale.

figure : fourier analysis. a) new infection rates. daily reported new numbers of infections divided by , inhabitants. the x-axis shows the calendar date. b) power spectrum. fourier power spectra versus frequency for new infections per , inhabitants per day in each of countries. c) normalized power spectrum. spectral density (y-axis) versus period in days (x-axis) for infection rates per , inhabitants. the curve shows the smoothed spectral density estimates. all y-axes have the same scale.

figure : time-lagged data analysis. a) bivariate wavelet analysis. shown are cross-wavelet power plot, wavelet coherence plot, average power plot and phase difference image (from left to right in each row). time-lagged data were used for x(t)/x(t+ ) (for the lags x(t)/x(t+ ) and x(t)/x(t ) see figure s ). white contour lines indicate significance for joint periodicity, black arrows depict the phase difference in the areas with significant joint periods. the solid red dots on the average power plot (the third from the left) depict significant joint periods at a probability of error of . . where shown, the color bars reveal the ranges of cross-wavelet power levels. b) return plots in dimensions. time-lagged return plots in dimensions are shown, from left to right, for x(t)/x(t+ )/x(t+ ), x(t+ )/x(t+ )/x(t+ ), and x(t+ )/x(t+ )/x(t+ ). each country of interest has its own row. c) embedding dimension. the plots show how cao’s algorithm uses functions in order to estimate the embedding dimension from the time series (the e1(d) and e2(d) functions), where d denotes the dimension.
figure : readouts of complexity for lagged data on covid-19 spread by country. a) autocorrelation. bar graph of the autocorrelation in covid-19 spread with each bar color representing a different country. the selected time lags are indicated on the x-axis, all are calculated versus x(t). b) average mutual information. bar graph of average mutual information in covid-19 spread with each bar color representing a different country. the selected time lags as indicated on the x-axis are all calculated versus x(t). c) fractal dimensions. box counting dimensions are calculated for -dimensional return plots of increasing lags, x(t+ ) versus x(t) through x(t+ ) versus x(t). countries are evaluated, and the worldwide numbers are shown on the left. poland is represented twice, over the entire evaluation period through october (which contains a steep incline) and over the shorter phase of containment through september (cont. = contained period).

figure : time-course of disease spread for individual us states. numbers of new cases, x(t), per day versus time (t, indicating the date). shown are the curves for (top to bottom, left to right) massachusetts, new york, florida, texas, california, and ohio.

figure : univariate wavelet analysis. wavelet power spectrum in the time-period domain. contour lines indicate significance of periodicity with . significance level. black lines indicate the ridge of wavelet power. the color bar reveals the power gradient. a) all states on the same scale. b) each state on its own scale.

figure : fourier analysis. a) new infection rates.
daily reported new numbers of infections divided by , inhabitants (infection rates). the x-axis shows the calendar date. b) power spectrum. periodogram plot on the series of the new infection rates. the x-axis is the frequency (per day) and the y-axis represents the spectral density. the y-axis ranges vary among graphs. c) normalized power spectrum. spectral density versus period (in days) for infection rates. all y-axes have the same scale.

figure : time-lagged data analysis by us state. a) bivariate wavelet analysis. shown are cross-wavelet power plot, wavelet coherence plot, average power plot and phase difference image (from left to right on each row). time-lagged data were used for x(t)/x(t+ ) (for the lags x(t)/x(t+ ) and x(t)/x(t ) see figure s ). white contour lines indicate significance of joint periodicity, black arrows indicate the phase difference in the areas with significant joint periods. the solid red dots on the average power plot (the third from the left) reflect significant joint periods at a significance level of . . b) return plots in dimensions. time-lagged return plots in dimensions are shown, from left to right, for x(t)/x(t+ )/x(t+ ), x(t+ )/x(t+ )/x(t+ ), and x(t+ )/x(t+ )/x(t+ ). each state under investigation has its own row.

figure : readouts of complexity for time-lagged data by u.s. state. us states have been evaluated. a) autocorrelation. bar graph of the autocorrelation in covid-19 spread with each bar color representing a different us state. the selected time lags are indicated on the x-axis, all are calculated versus x(t).
b) average mutual information. bar graph of average mutual information in covid-19 spread with each bar color representing a different state. the selected time lags are indicated on the x-axis, all are calculated versus x(t). c) fractal dimensions. box counting dimensions are calculated for -dimensional return plots of increasing lags, x(t+ ) versus x(t) through x(t+ ) versus x(t).

table : embedding dimension for time-lagged data by u.s. state. embedding dimensions were calculated according to cao’s algorithm, which uses functions in order to estimate the embedding dimension from the time series. the table shows the calculated time lags and embedding dimensions for each u.s. state under study.

supplement

figure s : power spectrum and univariate wavelet analysis for worldwide new cases. a) wavelet analysis and model fit (minimum power level: , significance level: . , only coi: false, only ridge: false). b) fourier analysis.

figure s : bivariate wavelet analysis by country. the graphs represent cross-wavelet power plot, wavelet coherence plot, average power plot and phase difference image (from left to right on each row). time-lagged data were used for x(t)/x(t+ ) (a) and x(t)/x(t ) (b). white contour lines depict joint significance of periodicity. black arrows reflect the phase difference in the areas with significant joint periods. the solid red dots on the average power plot (the third from the left) indicate significant joint periods at a probability of error . . the color bars reveal the cross-wavelet power levels.

figure s : return plots in dimensions for poland.
new infections per day. top) entire observation period. th march through th november . middle) contained phase. partial time frame through th september . bottom) exacerbating phase. partial time frame from st september .

figure s : bivariate wavelet analysis by us state. the graphs display cross-wavelet power plot, wavelet coherence plot, average power plot and phase difference image (from left to right on each row). time-lagged data were used for x(t)/x(t+ ) (a) and x(t)/x(t ) (b). white contour lines indicate significance of joint periodicity. black arrows indicate the phase difference in the areas with significant joint periods. the solid red dots on the average power plot (the third from the left) indicate significance at a level of . .

figure s : evolution of lyapunov exponents over time. for a discrete mapping x(t+1) = f(x(t)) we calculate the local expansion of the flow by considering the difference of trajectories. the lyapunov characteristic exponent can be approximated as $\lambda \approx \ln\left(|x_{n+1} - y_{n+1}| / |x_n - y_n|\right)$ for points $x_n$, $y_n$ close to each other on the trajectory [https://www.math.tamu.edu/~mpilant/math /matlab/lyapunov/lorenzspectrum.pdf]. the changes of lyapunov exponents are presented for the return plots of lags x(t+ ) versus x(t), x(t+ ) versus x(t), x(t+ ) versus x(t), and x(t+ ) versus x(t). a) countries. shown are ranges over days. b) us states. shown are ranges over days. mass. = massachusetts.
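the approximation of the lyapunov exponent given in the caption of figure s can be tried out on a map whose exponent is known. the sketch below is an assumption-laden illustration, not the computation behind the figure: the logistic map stands in for the unknown mapping f, and the function names, step count and separation eps are hypothetical choices.

```python
import math


def logistic(x, r=4.0):
    """Logistic map, used here only as a stand-in for x(t+1) = f(x(t))."""
    return r * x * (1.0 - x)


def local_lyapunov(f, x0, steps=20000, eps=1e-8, burn_in=100):
    """Average ln(|x_{n+1} - y_{n+1}| / |x_n - y_n|) along a trajectory.

    At each step a companion point y_n = x_n + eps is placed next to x_n,
    both are mapped forward once, and the log expansion ratio is recorded.
    """
    x = x0
    for _ in range(burn_in):  # discard the initial transient
        x = f(x)

    total, count = 0.0, 0
    for _ in range(steps):
        y = x + eps
        x_next, y_next = f(x), f(y)
        separation = abs(x_next - y_next)
        if separation > 0.0:  # skip the rare step where log(0) would occur
            total += math.log(separation / eps)
            count += 1
        x = x_next
    return total / count


lam = local_lyapunov(logistic, x0=0.3)
```

for the logistic map with r = 4 the exact exponent is ln 2, so the estimate should land in that neighborhood. for observed case counts the mapping f is not known, and nearby points have to be found on the reconstructed trajectory itself, which makes the estimate much noisier with short series.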
metabolite discovery through global annotation of untargeted metabolomics data

li chen, wenyun lu, lin wang, xi xing, xin teng, xianfeng zeng, antonio d. muscarella, yihui shen, alexis cowan, melanie r. mcreynolds, brandon kennedy, ashley m. lato, shawn r. campagna, mona singh, joshua rabinowitz #

institute of metabolism and integrative biology, fudan university, shanghai, , china.
lewis-sigler institute for integrative genomics, princeton university, princeton, nj, , usa.
department of chemistry, princeton university, princeton, nj, , usa.
department of molecular biology, princeton university, princeton, nj, , usa.
lotus separation llc, department of chemistry, princeton university, princeton, nj, , usa.
department of chemistry, the university of tennessee at knoxville, knoxville, tn, , usa.
department of computer science, princeton university, princeton, nj, , usa.
# corresponding author, e-mail: joshr@princeton.edu

abstract

a primary goal of metabolomics is to identify all biologically important metabolites. one powerful approach is liquid chromatography-high resolution mass spectrometry (lc-ms), yet most lc-ms peaks remain unidentified. here, we present a global network optimization approach, netid, to annotate untargeted lc-ms metabolomics data. we consider all experimentally observed ion peaks together, and assign annotations to all of them simultaneously so as to maximize a score that considers properties of peaks (known masses, retention times, ms/ms fragmentation patterns) as well as network constraints that arise based on mass difference between peaks. global optimization results in accurate peak assignment and trackable peak-peak relationships. applying this approach to yeast and mouse data, we identify a half-dozen novel metabolites, including thiamine and taurine derivatives. isotope tracer studies indicate active flux through these metabolites. thus, netid applies existing metabolomic knowledge and global optimization to annotate untargeted metabolomics data, revealing novel metabolites.

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license.
http://creativecommons.org/licenses/by-nc-nd/ . / introduction metabolomics provides a snapshot of small-molecule concentrations in a biological system. in so doing, it reflects the integrated impact of genetics and the environment on metabolism. one important role of metabolomics is annotating previously unknown or underappreciated metabolites. for example, metabolomics facilitated identification of -hydroxyglutarate as an oncometabolite, eventually leading to the development of inhibitors of -hydroxyglutarate synthesis as anticancer agents , . metabolomics also contributed to identification of a diversity of natural products , and disease biomarkers . a common experimental strategy in metabolomics is liquid chromatography-high resolution mass spectrometry (lc-ms). lc-ms metabolomics measures thousands of ion peaks, of which hundreds are associated with known metabolites. a much greater number of peaks, however, still remain unannotated. the standard approach to peak annotation is to compare exact mass and either retention time or ms/ms fragmentation pattern to authenticated standards. to facilitate such comparisons, extensive chemical databases have been developed (e.g. metlin , hmdb , mona , kegg , pubchem , chebi and nist ), with software tools available for automated peak picking and database comparison. modern software also includes features for annotating peaks arising from isotopes and adducts of known metabolites, based on co-elution and characteristic mass differences (e.g. xcms , , gnps , ms-dial , mzmine , and camera ). such peaks seem to account for at least half of non-background lc-ms features , . despite this progress, a great number of unknown peaks remain, and figuring out their identities is a primary challenge in the field. one promising approach is network analysis, capitalizing on peak-peak relationships to increase annotation scope and accuracy. connections can be drawn based on similar responses across experiments and/or ms similarity. 
such connections can arise either through biochemical activities or mass spectrometry phenomena, such as isotopes, adducts, or in-source fragments. while distinct metabolites typically separate chromatographically, ions connected through mass spectrometry phenomena co-elute. workflows employing the concept of molecular connectivity have been used to build networks (e.g., gnps , , cliquems , metdna , biocan , and ipa ), and are showing increasing utility for annotating metabolomics data in diverse contexts. for example, gnps has been used broadly in identifying natural products. existing algorithms generally focus on metabolite peaks with ms spectra available, using ms spectral data as the main annotation driver. this is an effective strategy for annotating high abundance peaks with informative ms spectra, such as major secondary metabolites. it is less effective, however, for many low abundance metabolomics peaks, due to poor quality or less informative ms spectra. we accordingly set out to develop a network algorithm for annotating the breadth of metabolomics peaks, capitalizing on available ms spectra but also including low abundance peaks lacking ms spectra. effective incorporation of peaks without ms spectra required making yet better use of peak-peak relationships to enhance annotation accuracy, which we achieved through the computational approach of global optimization: annotating peaks not one-by-one, but all at once, to take full advantage of all available information.
this global optimization strategy had not previously been applied in the context of molecular networking analysis. to this end, we present the algorithm “netid”. similar to existing network analysis approaches, nodes are experimentally observed non-background ion peaks and connections are mass differences between peaks. we explicitly distinguish connections due to biotransformations (“biochemical connections” linking two metabolites) from those due to mass spectrometry phenomena (“abiotic connections” linking isotopes, adducts, and fragments to the metabolites from which they are derived). peak annotation occurs in a single global optimization step, based on linear programming, that enforces a single formula assignment for each experimentally observed ion peak. using this approach, we can annotate roughly % of untargeted metabolomics peaks, with a majority being isotopes and adducts of known metabolites. through these efforts, we provide likely formulae for several hundred novel metabolites, and confirm the identities of a half-dozen species not currently in metabolomics databases.

results

netid algorithm

netid involves three computational steps: initial annotation, scoring, and optimization (figure ). the workflow starts with a peak table that lists, for each peak, its m/z, rt, intensity, and (when available) associated ms spectra, with background peaks removed by comparison to a process blank sample. each peak defines a node in the network. in the initial annotation phase, we match every experimentally measured node m/z to formulae in the hmdb database. peaks matching an hmdb formula within ppm are annotated as seed nodes, from which we extend edges to build the network. edges connect two nodes via gain or loss of specific chemical moieties (atoms). the atom differences can occur either due to metabolism (biochemical connections) or due to mass spectrometry phenomena (abiotic connections).
for example, a difference of h suggests an oxidation/reduction relationship and defines a biochemical edge. a difference of na-h suggests sodium adducting and is a type of abiotic edge (adduct edge). other atom differences define other types of abiotic connections (isotope or fragment edges). most atom differences are specific to biochemical, adduct, isotope, or fragment edges, but a few occur in multiple categories. for example, h o loss can be either biochemical (enzymatic dehydration) or abiotic (in-source water loss). by integrating literature and in-house data, we assembled a list of biochemical atom differences and abiotic atom differences which together define all connections in the network (supplementary table , ). using these lists, starting from the seed nodes, we draw all feasible edges such that (i) Δm/z between the connected nodes matches the atom mass difference and (ii) only co-eluting peaks are connected by abiotic edges. through the edge extension process, possible formulae are assigned to nodes outside the initial seeds. a few rounds of edge extension suffice to give thorough coverage. due to finite mass measurement precision, a single node may be assigned multiple contradictory formulae, which are resolved at the optimization step (see methods). netid then scores every node and edge annotation. node annotations are scored based on precision of m/z match to the molecular formula, precision of retention time match to known metabolite retention time and (when the relevant information is available) quality of ms spectra match to database structure.
in addition, there is a bonus for matching a formula in hmdb and a penalty for breaking basic chemical rules (seven golden rules for filtering molecular formulae ). biochemical edges receive a positive score for ms spectra similarity between the connected nodes, and are otherwise unscored. abiotic edges are scored based on precision of co-elution with the parent metabolite, connection type (adduct, isotope, etc.), and features specific to the connection type, such as expected natural abundance for isotope peaks (see methods). the overall effect is to assign high scores to annotations that align the experimentally observed ion peaks with prior metabolomics knowledge. with a score assigned to each potential node and edge annotation, we formulate global network optimization as the problem of maximizing the total network score subject to linear constraints: each node and edge has a single unique annotation, and these annotations are mutually consistent (e.g. peaks connected by h edge must have formula differing by h). such optimization is readily performed by linear programming with a typical runtime of hours in r on a personal computer, and results in an optimal and consistent network annotation.

global network optimization

to illustrate the utility of global network optimization, where all peaks and connections are considered simultaneously to enhance annotation accuracy, we present an example network containing five peaks (figure a). we first match experimental measurements to the database, annotating node a and node b as seed nodes adenosine monophosphate (amp, c h n o p) and adenosine (c h n o ), respectively. we also identify five possible connections between the five nodes. two alternative networks are generated by extending annotations. in the left network, node c is annotated as adenosine hcl adduct (c clh n o ), whereas in the right network, node c is annotated as a putative metabolite (c h n o p) resulting from co loss from amp.
node d is c isotope of node c in both networks. node e is annotated as cl isotope of node c in the left network, and is unannotated in the right network because there is no cl atom in the parent molecule. the left network has higher total node and edge annotation scores than the right network, and thus is selected by netid. this selection makes sense to an experienced mass spectrometrist: the cl isotope signature in node e indicates that node c should contain cl. the strength of netid is that it captures such logic automatically, and uses global computational optimization to extend such inferences across the whole network. to test the netid workflow, we applied it to both yeast and liver datasets, in both positive and negative ionization mode (figure b, c). considering the example of negative mode yeast data with a total of , non-background peaks, in the initial annotation step, roughly , potential formulae were assigned to , peaks, with about peaks receiving multiple formula annotations. these nodes were connected by just over , potential edges. edge extension expanded coverage to over , nodes with an average of twelve potential formulae each, highlighting the importance of scoring and network optimization to assign proper formulae. after scoring node and edge annotations, global network optimization settled on about , unique node annotations. about % of the annotated peaks were metabolites, % were putative novel metabolites, and the rest were mass spectrometry phenomena, such as adducts, fragments, and isotopes.
nodes were connected by about , edges, roughly evenly split between biochemical and abiotic connections (figure c, supplementary fig. a). more than % of annotated nodes fell into a single dominant connected network (supplementary fig. b), reflecting most peaks being connected to core metabolism. about % of peaks, however, remained unannotated. these unannotated peaks likely reflect deficiencies in our lists of allowed atom differences, including additional forms of mass spectrometry phenomena. for example, manual examination of the unconnected peaks revealed a dozen nickel adducts of known compounds (supplementary table ). importantly, the annotated peaks included several hundred novel metabolite formulae (supplementary fig. , supplementary data ). collectively, these provide a wealth of opportunities for metabolite discovery.

thiamine-derived metabolites

netid optimization provided not only a list of putative metabolites, but also connections linking these putative metabolites to known metabolites. in the yeast metabolomics dataset, we found three putative metabolites that have total ion current > , connected in a subnetwork around thiamine. their formulae are c h n o s (thiamine+o), c h n o s (thiamine+c h o) and c h n o s (thiamine+c h o) (figure a). while not found in hmdb, thiamine+o is documented in metlin as a thiamine oxidation product, so we focused on the other two potential thiamine derivatives. ms/ms spectra of the putative thiamine+c h o and thiamine+c h o contained characteristic thiamine fragments. both contained a classical pyrimidine fragment, with thiamine+c h o also containing an acetylated pyrimidine fragment, leading to a probable structure (figure a,b). the structural assignment is further supported by the presence of an unmodified thiazole fragment. in contrast, thiamine+c h o lacked a classical unmodified thiazole fragment, instead showing a thiazole+c h o fragment (and a fragment with further water loss) (figure a,b). isotope tracing experiments further confirm that these two peaks contain thiamine. when fed [u- c]glucose as sole carbon source, yeast synthesize thiamine de novo, resulting in fully labeled thiamine species, with carbon counts matching the netid formula assignments (figure c). when unlabeled thiamine is added to the [u- c]glucose culture media, yeast take up the unlabeled thiamine, resulting in unlabeled thiamine and m+ labeled thiamine+c h o and thiamine+c h o species. although discovered in yeast, these are conserved metabolites, found also in mammalian samples (figure d). acetylation is one of the biochemical atom transformations allowed in netid. the addition of c h o is much less common biochemically, and was captured in netid as two steps, acetylation followed by reduction. accordingly, we looked into thiamine metabolism to explore how thiamine+c h o might be produced. thiamine pyrophosphate is an important cofactor in pyruvate dehydrogenase (pdh, the entry step to the tca cycle) (figure e). de-pyrophosphorylation of the thiamine intermediate in the pdh reaction yields a product that matches the proposed thiamine+c h o structure (figure f). based on this biochemical route, we realized that analogous products could be formed by α-ketoglutarate dehydrogenase (thiamine+c h o ) and branched-chain keto acid dehydrogenase (thiamine+c h o) (figure f).
peaks at both of these exact masses were also experimentally observed, with isotope labeling results supporting their being thiamine-derived metabolites (supplementary fig. ). thus, netid enabled the discovery of four novel thiamine-derived metabolites.

n-glucosyl-taurine

we similarly carried out netid annotation of a mouse liver dataset. we observed multiple putative metabolite peaks linked to taurine, by apparent glucosylation (+c h o ), palmitylation (+c h o) and transamination (+o-nh ) (figure a). the latter two, while missing in hmdb, were found in metlin: n-palmitoyl taurine (c h no s) and sulfoacetaldehyde (c h o s). to elucidate the structure of the putative taurine glucosylation product (c h no s), we chemically synthesized n-glucosyl-taurine. synthetic n-glucosyl-taurine matched the retention time and ms/ms fragmentation pattern of the observed c h no s peak (figure b,c). in liver samples of mice infused with [u- c]glucose, c h no s appeared in m+ form, suggesting active synthesis of n-glucosyl-taurine from circulating glucose (figure d). n-glucosyl-taurine was not observed in yeast extract but was detected in multiple mouse tissues. quantitation using the synthetic standard shows that liver has the highest level of glucosyl-taurine at ~ μm (figure e, supplementary fig. ). this ranks among the few dozen most abundant liver metabolites.

discussion

the advent of lc-ms metabolomics revealed tens of thousands of metabolite peaks not matching known formulae, raising the possibility that the majority of metabolites remained to be discovered.
while the biosphere likely contains many novel metabolites, it has been increasingly recognized that most peaks in typical untargeted metabolomics studies do not arise from novel metabolites, but rather from mass spectrometry phenomena. the goal of comprehensively annotating untargeted metabolomics peaks with molecular formulae has, however, remained elusive. one promising strategy for peak annotation involves building molecular networks where nodes are lc-ms peaks (with associated molecular formulae) and edges are atom transformations linking the peaks. here we advance this strategy by combining metabolomics knowledge with computational global optimization. we explicitly differentiate biochemical connections reflecting metabolic activity from abiotic connections arising from mass spectrometry phenomena. by formulating the peak annotation challenge as a linear program, we identify an optimal network in light of all observed peaks. rather than weeding out peaks from mass spectrometry phenomena like adducts and natural isotopes, this approach takes advantage of the information embedded in them. it further provides traceable peak-peak relationships, which illuminate the basis for assigned formulae and suggest candidate structures. applying this approach to untargeted lc-ms data from yeast and liver samples, we assign formulae to roughly three-quarters of all non-background peaks. in each of positive and negative mode, the annotated peaks cover about known metabolites, with on average more than four mass peaks for every metabolite (e.g. m+h plus three adduct or isotope peaks). this leaves a couple thousand unannotated peaks from each lc-ms run. based on the observed ratio between peaks and metabolites, this likely corresponds to hundreds (but not thousands) of unidentified metabolites. the true number may be lower still, due to novel adducts (e.g. nickel adducts, which we discovered via careful examination of the unannotated peaks) or other mass spectrometry phenomena.
importantly, this approach has already generated likely formulae for many hundreds of putative novel metabolites (supplementary fig. , supplementary data ), including a half-dozen for which we assign structures (figure , ). a key benefit of molecular network-based annotation is the ability to steadily assimilate new information , . each newly identified metabolite provides an additional anchor point for optimizing the network. other data types can be seamlessly added. for example, compound class categorization based on ms/ms data or retention time prediction can be added to score nodes. labeling similarity upon feeding different isotope-labeled nutrients could potentially be added to score edges. global optimization, integrating all new information comprehensively with prior knowledge to arrive at optimal annotations, is novel and potentially transformative for the field more broadly. the cycle of careful experimentation and focused computational method development holds the potential to identify most unknown metabolites over the coming decade, providing a robust blueprint of the metabolome (figure ).

methods

yeast metabolomics sample preparation and isotope labeling

s. cerevisiae strain fy was grown for at least generations in minimal essential media containing . % [u- c] or [u- c] glucose and mm ammonium sulfate with or without . mg/l thiamine hydrochloride . then, in mid-exponential phase, ml culture broth (od = . ) was filtered and metabolites were extracted using ml extraction buffer ( : : : . acetonitrile:methanol:water:formic acid), followed by adding μl neutralization buffer ( % nh hco ). the extracts were kept at - ℃ for at least min to precipitate protein before centrifuging at , g for min. the supernatant was used for lc–ms analysis.

murine metabolomics sample preparation and intravenous infusion experiment

animal studies followed protocols approved by the princeton university institutional animal care and use committee. twelve-month-old female wild-type c bl/ mice (the jackson laboratory, bar harbor, me) on normal diet were sacrificed by cervical dislocation and tissues were quickly dissected and snap frozen in liquid nitrogen with a precooled wollenberger clamp. frozen samples were then transferred from liquid nitrogen to a − °c freezer for storage. to extract metabolites, frozen liver tissue samples were first weighed (~ mg each) and transferred to ml round-bottom eppendorf safe-lock tubes on dry ice. samples were then ground into powder with a cryomill machine (retsch, newtown, pa) for seconds at hz, and maintained at cold temperature using liquid nitrogen. for every mg of tissue, μl extraction buffer (as above) was added to the tube, vortexed for seconds, and allowed to sit on ice for minutes. then μl neutralization buffer was added and the samples were vortexed. the samples were allowed to sit on ice for minutes and then centrifuged at , g for min at °c. the supernatants were transferred to another eppendorf tube and centrifuged at , g for another min at °c. the supernatants were transferred to glass vials for lc-ms analysis.
a procedure blank was generated identically without tissue, and was used later to remove the background ions. detailed methods for intravenous infusion of mice have been described previously . briefly, in vivo infusions were performed on – -week-old c bl/ mice pre-catheterized in the right jugular vein (charles river laboratories). mice were kept fasted for h and then infused for . h with [u- c]glucose ( mm, . μl/min/g). the mouse infusion setup (instech laboratories) included a tether and swivel system so that the animal had free movement in the cage. venous samples were taken from tail bleeds. at the end of the infusion, the mouse was euthanized by cervical dislocation and tissues were collected and extracted as above. serum metabolites were extracted by adding μl methanol to μl of serum and centrifuging for min. the supernatant was used for lc–ms analysis.

lc-ms and lc-ms/ms

lc separation was achieved using a vanquish uhplc system (thermo fisher scientific) with an xbridge beh amide column ( × mm, . µm particle size; waters). solvent a is : water: acetonitrile with mm ammonium acetate and mm ammonium hydroxide at ph . , and solvent b is acetonitrile. the gradient is min, % b; min, % b; min, %; min, % b; min, %, min, % b; min, % b; min, % b; min, % b; min, % b; min, % b, . min, % b; min, % b; min, % b. total running time is min at a flow rate of µl/min. lc-ms data were collected on a q-exactive plus mass spectrometer (thermo fisher) operating in full scan mode with a ms scan range of m/z - , and resolving power of , at m/z .
other ms parameters are as follows: sheath gas flow rate, (arbitrary units); aux gas flow rate, (arbitrary units); sweep gas flow rate, (arbitrary units); spray voltage, . kv; capillary temperature, °c; s-lens rf level, ; agc target, e and maximum injection time, ms. to demonstrate the utility of including ms data in netid analysis, and ms spectra were obtained for selected peaks with intensity > in positive and negative ionization mode respectively from a previous liver dataset . targeted ms spectra were collected using the prm function at ev hcd energy, with the other instrument settings being resolution , agc target , maximum it ms, isolation window . m/z.

glucosyl-taurine synthesis

glucosyl-taurine synthesis was carried out following previous literature reports with slight modifications . in brief, dry methanol was obtained by distillation of hplc-grade methanol (fisher; hplc grade . micron filtered) over cah (acros organics; ca. % extra pure, - mm grain size). a flame-dried round-bottom flask equipped with a reflux condenser and stir bar was charged with . g taurine (alfa aesar; %), . g d-glucose (acros organics; acs reagent), and ml of dry methanol. this mixture was sonicated under an inert atmosphere for minutes before being returned to the manifold for the reaction. to the fine suspension of taurine and glucose in dry methanol at room temperature, . ml . m sodium methoxide in methanol (acros organics) was added via glass syringe. at this point, the suspension began to dissolve and, after minutes, gave a clear and colorless solution. the solution was stirred vigorously under an inert atmosphere for hours, which resulted in a faint peach-colored solution. this solution was chilled to ˚c, ~ ml of absolute ethanol ( proof) was added, and precipitation was allowed to occur at this temperature for minutes.
solvent was then removed by filtration over a glass filter (medium porosity), and the solid was washed with ~ ml of absolute ethanol, affording a fine pale-yellow powder ( . g; crude material). an nmr experiment was carried out to validate the structure of the synthesized n-glucosyl-taurine. selective tocsy experiments using dipsi spin-lock and with added chemical shift filter were run on a bruker avance iii hd nmr spectrometer equipped with a custom-made qci-f cryoprobe (bruker, billerica, ma) at mhz and at . k controlled temperature. the sample was dissolved in dmso-d . the spectra shown on the plots are results of ms sl mixing, scans each. data processing (mnova v. , mestrelab research s.l., santiago de compostela, spain) included zero filling, hz gaussian apodization, phase- and baseline correction. nmr analysis suggests that the final crude material contains . % n-glucosyl-taurine and unreacted substrates (supplementary figure ).

netid algorithm

i. data preparation and input

lc-ms raw data files (.raw) were converted to mzxml format using proteowizard (version . . ). el-maven (version . ) was used to generate a peak table containing m/z, retention time, and intensity for peaks. parameters for peak picking were the defaults except for the following: mass domain resolution is ppm; time domain resolution is scans; minimum intensity is ; minimum peak width is scans. the resulting peak table was exported to a .csv file. redundant peak entries caused by imperfect peak picking are removed: two peaks are treated as duplicates if their retention times are within . min and their m/z values are within ppm.
a peak is removed as background if its intensity in the procedure blank sample is > . -fold of that in the biological sample. the m/z of the remaining peaks are recalibrated by applying an absolute m/z adjustment factor $\varepsilon_{\mathrm{absolute}}$ (independent of measured m/z) and a relative m/z adjustment factor $\varepsilon_{\mathrm{relative}}$ (linearly dependent on measured m/z). for each peak i, the recalibrated value $i_{m/z,\,\mathrm{adjusted}}$ is computed as

$$i_{m/z,\,\mathrm{adjusted}} = i_{m/z,\,\mathrm{measured}} \times (1 + \varepsilon_{\mathrm{relative}}) + \varepsilon_{\mathrm{absolute}} \qquad (1)$$

the $\varepsilon_{\mathrm{relative}}$ and $\varepsilon_{\mathrm{absolute}}$ values are fit via linear regression using measured m/z values of selected known metabolite ion peaks and their calculated m/z. that is, for each of these known metabolites k, we have the equation

$$k_{m/z,\,\mathrm{calculated}} = k_{m/z,\,\mathrm{measured}} \times (1 + \varepsilon_{\mathrm{relative}}) + \varepsilon_{\mathrm{absolute}} \qquad (2)$$

lc-ms/ms data were extracted from the mzxml files using lab-developed matlab code. ms spectra may contain interfering product ions from co-eluting isobaric parent ions. these interfering product ions were removed by examining the extracted ion chromatogram (eic) similarity between the product ions in ms data and the parent ion in ms data. a pearson correlation coefficient of . was used as a cutoff to retain those product ions that have an eic similar to that of the parent ion. the cleaned ms data were exported to excel files for further processing. structures, formulae, m/z and ms spectra of metabolites were obtained from the human metabolome database (hmdb, version . ), and retention times of selected metabolites were determined through running authentic standards using the above-mentioned lc-ms method.
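the m/z recalibration described in this section is an ordinary least-squares fit of calculated m/z against measured m/z for a set of reference ions: the fitted slope equals one plus the relative adjustment factor and the fitted intercept equals the absolute adjustment factor. the python sketch below illustrates that idea only. the function names, the reference m/z values and the simulated drift are hypothetical, and the closed-form fit stands in for whatever regression routine the authors' r code actually uses.

```python
def fit_recalibration(measured, calculated):
    """Least-squares fit of calculated = measured * (1 + eps_rel) + eps_abs.

    Returns (eps_rel, eps_abs). Closed-form simple linear regression.
    """
    n = len(measured)
    mean_x = sum(measured) / n
    mean_y = sum(calculated) / n
    sxx = sum((x - mean_x) ** 2 for x in measured)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(measured, calculated))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope - 1.0, intercept


def recalibrate(mz, eps_rel, eps_abs):
    """Apply the relative and absolute adjustment factors to one measured m/z."""
    return mz * (1.0 + eps_rel) + eps_abs


# Hypothetical reference ions (calculated m/z), not taken from the paper.
calculated_mz = [89.0244, 179.0561, 346.0558, 808.1185]

# Simulate measurements carrying a known drift so the fit can be checked.
true_rel, true_abs = 2e-6, -1e-4
measured_mz = [(c - true_abs) / (1.0 + true_rel) for c in calculated_mz]

eps_rel, eps_abs = fit_recalibration(measured_mz, calculated_mz)
adjusted_mz = [recalibrate(m, eps_rel, eps_abs) for m in measured_mz]
```

in practice the fitted factors would be applied to every peak in the table, not only to the reference ions used for the fit.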
the netid algorithm requires three types of input files: a peak table (in .csv format) recording m/z, retention time, and intensity for peaks; an atom difference rule table (in .csv format) containing a list of biochemical atom differences and abiotic atom differences which together define all connections in the network (supplementary table , ); and metabolite information files containing structure, formula, m/z and ms spectra of hmdb metabolites and retention times of selected metabolites under different lc conditions. an exemplary peak table from the yeast dataset, atom difference rule table and hmdb metabolite information file are provided in supplementary data .

ii. initial annotation of nodes and edges in the network

the first step of the netid algorithm is to make an initial annotation for seed nodes, determine possible annotations for other nodes, and determine edges in the network. each peak is a node in the network. we compare the experimentally measured m/z for each node to those of all metabolite formulae in the hmdb database. when the m/z difference is within ppm, candidate formulae and hmdb ids are assigned to the node, and this node is defined as a primary seed node. a primary seed node can carry more than one candidate formula and hmdb id if all are within the m/z difference range. edges connect two nodes via gain or loss of specific atoms. we assembled a list of biochemical atom differences and abiotic atom differences which together define all connections in the network (supplementary table , ). let each of these differences be denoted by $D_i$. for each node u, if there is a node v such that the difference in the measured m/z of the nodes matches one of those in the list of atom mass differences, we add an edge between u and v.
that is, if $u_{m/z}$ and $v_{m/z}$ are the experimentally measured m/z for the peaks corresponding to nodes u and v respectively (assuming $v_{m/z} > u_{m/z}$ for simplicity), then there is an edge between these nodes if there is some difference $D_i$ such that

\[ |v_{m/z} - u_{m/z} - D_i| < v_{m/z} \times \; \mathrm{ppm} \]

if $D_i$ is an abiotic difference, adding an edge additionally requires that the retention times of the two nodes be within . min. that is, if $u_{RT}$ and $v_{RT}$ are the retention times for u and v respectively, then it is required that

\[ |v_{RT} - u_{RT}| < \; . \; \mathrm{min} \]

for each node, the candidate formula set expands as formulae propagate from its neighboring nodes through edge atom differences. for example, when applying the atom difference of edge (u, v) to the formula assigned to primary seed node u, we can derive a new candidate formula for the connected node v. if the derived formula’s calculated m/z is within ppm of node v’s measured m/z, then a new candidate formula is added for node v. iterating the process over all candidate formulae of node u through edge (u, v) further expands the candidate formulae for node v.

we apply the above extension process to the formulae of all primary seed nodes through atom difference edges, and these new candidate formulae can themselves be used for another round of extension. note that a primary seed node is treated like any other node during the subsequent rounds of extension, and may likewise be assigned new formulae.
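the edge rule above (an m/z difference matching a listed atom difference, plus a retention-time check for abiotic differences) can be sketched in python as follows. this is an illustration under stated assumptions: the function name, the peak values, the three atom differences and the tolerances passed in the example call (10 ppm, 0.2 min) are placeholders chosen for the demonstration, since the default values are not legible in this text.

```python
from itertools import combinations


def find_edges(peaks, differences, ppm_tol, rt_window):
    """Return (u, v, name) for every peak pair whose m/z gap matches an atom difference.

    peaks: list of dicts with 'mz' and 'rt'.
    differences: list of (name, mass, is_abiotic) tuples.
    """
    edges = []
    for i, j in combinations(range(len(peaks)), 2):
        # Order the pair so that v is the heavier node, as assumed in the text.
        u, v = (i, j) if peaks[i]["mz"] <= peaks[j]["mz"] else (j, i)
        mz_u, mz_v = peaks[u]["mz"], peaks[v]["mz"]
        for name, mass, is_abiotic in differences:
            if abs(mz_v - mz_u - mass) >= mz_v * ppm_tol * 1e-6:
                continue  # mass difference outside the ppm tolerance
            if is_abiotic and abs(peaks[v]["rt"] - peaks[u]["rt"]) >= rt_window:
                continue  # abiotic edges additionally require co-elution
            edges.append((u, v, name))
    return edges


# Illustrative inputs (hypothetical peaks; monoisotopic mass differences).
differences = [
    ("H2O", 18.010565, False),    # biochemical difference
    ("HPO3", 79.966331, False),   # biochemical difference
    ("Na-H", 21.981944, True),    # abiotic (adduct) difference
]
peaks = [
    {"mz": 146.0459, "rt": 5.00},
    {"mz": 168.0278, "rt": 5.02},  # co-eluting, Na-H heavier than peak 0
    {"mz": 164.0565, "rt": 7.50},  # H2O heavier than peak 0
    {"mz": 168.0278, "rt": 9.00},  # Na-H heavier than peak 0, but not co-eluting
]
edges = find_edges(peaks, differences, ppm_tol=10.0, rt_window=0.2)
```

in this toy input, peak 3 has the right mass for a sodium adduct of peak 0 but elutes far away, so only the co-eluting peak 1 receives the abiotic edge.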
to avoid duplicated effort in the extension process, we allow formulae of primary seed nodes and biotransformed formulae thereof to be extended through both biotransformation and abiotic atom difference edges, and do not allow abiotic candidate formulae to be further extended through biotransformation atom difference edges. the default extension process includes two rounds of biotransformation edge extensions and three rounds of abiotic edge extensions.

iii. scoring node annotations

netid then scores every candidate node and edge annotation assigned in the initial annotation step. the node scoring system aims to assign high scores to annotations that align observed ion peaks with known metabolites based on m/z, retention time, ms/ms, and/or isotope abundances. let the set of candidate annotations for node u be denoted as $\{a_1, \ldots, a_i, \ldots, a_n\}$. for each node u and each of its candidate annotations $a_i$, let $S(u, a_i)$ denote the score of candidate annotation $a_i$ for node u. the scoring components for candidate node annotations are defined as follows:

(a) $S_{m/z}(u, a_i)$ is negative when the measured m/z differs from the calculated m/z of the assigned molecular formula. a larger ppm difference between the calculated formula m/z and the measured m/z results in a lower score. the default scale factor is - . . let $a_{i,m/z}$ be the calculated formula m/z of annotation $a_i$, and $u_{m/z}$ be the measured m/z of node u, then

\[ S_{m/z}(u, a_i) = -\; . \; \times \frac{|u_{m/z} - a_{i,m/z}|}{u_{m/z}} \times 10^{6} \]

(b) $S_{RT}(u, a_i)$ is positive if the measured rt for the peak corresponding to node u matches a known standard. a smaller difference between known and measured rt results in a higher score. let $a_{i,RT}$ be the known rt of annotation $a_i$, and $u_{RT}$ be the measured rt of node u, then

\[ S_{RT}(u, a_i) = \; - |u_{RT} - a_{i,RT}|, \ \text{if } |u_{RT} - a_{i,RT}| < \; . \; \mathrm{min}; \quad \text{otherwise, } S_{RT}(u, a_i) = 0 \]

(c) $S_{MS2}(u, a_i)$ is positive if the measured ms2 spectrum of node u matches the database ms2 spectrum of annotation $a_i$. a dot product scoring function is used to score the ms2 spectra similarity.
the intensities of the fragment ions in the ms2 spectra are rescaled so that the highest fragment ion is set to . ms2 spectra are represented as w = [relative intensity of ms2 ions]^n [m/z value]^m, with n = , m = . the dot product (dp) and the score for the ms2 match ($S_{MS2}(u, a_i)$) are defined as below.

𝐷𝑃 = ∑ ∑ × ∑

\[ S_{MS2}(u, a_i) = DP, \ \text{if } DP > \; . \; ; \quad \text{otherwise, } S_{MS2}(u, a_i) = 0 \]

(d) $S_{\mathrm{database}}(u, a_i)$ is positive if the annotated formula $a_i$ exists in hmdb. we give a positive score to a primary seed node annotation if that annotated formula exists in hmdb.

\[ S_{\mathrm{database}}(u, a_i) = \; . \; , \ \text{if } a_i \text{ in hmdb}; \quad \text{otherwise, } S_{\mathrm{database}}(u, a_i) = 0 \]

(e) $S_{\mathrm{missing\_isotope}}(u, a_i)$ is negative if an isotopic peak is missing. we penalize a formula annotation if the peak passes the intensity threshold (default at x ) but does not have isotopic peaks of specified elements. the default isotope being evaluated is cl. any other elements, such as c or o, can be included by users.

\[ S_{\mathrm{missing\_isotope}}(u, a_i) = -\; , \ \text{if the isotopic peak is missing}; \quad \text{otherwise, } S_{\mathrm{missing\_isotope}}(u, a_i) = 0 \]

(f) $S_{\mathrm{rule}}(u, a_i)$ is negative if annotation $a_i$ violates basic chemical rules. we strongly penalize formulae that violate basic chemical rules, including a negative rdbe (ring and double bond equivalents) and unlikely element ratios in metabolites (o/p < , o/si < ).

\[ S_{\mathrm{rule}}(u, a_i) = -\; , \ \text{if chemical rules are violated}; \quad \text{otherwise, } S_{\mathrm{rule}}(u, a_i) = 0 \]

(g) $S_{\mathrm{derivative}}(u, a_i)$ is positive if the annotation $a_i$ is derived from a parent peak p with an annotation h that has a high score $S_{\mathrm{parent}}(p, h)$, which is calculated by summing the scores in (a)-(f) for $S(p, h)$.

\[ S_{\mathrm{derivative}}(u, a_i) = S_{\mathrm{parent}}(p, h) - \; . \]
\[ S_{\mathrm{parent}}(p, h) = S_{m/z}(p, h) + S_{RT}(p, h) + S_{MS2}(p, h) + S_{\mathrm{database}}(p, h) + S_{\mathrm{missing\_isotope}}(p, h) + S_{\mathrm{rule}}(p, h) \]

this is particularly helpful in annotating abiotic peaks. for example, the annotation of a glutamate sodium adduct will be given a positive $S_{\mathrm{derivative}}$ when its parent node is annotated as glutamate with a high $S_{\mathrm{parent}}$ score.

a final score $S(u, a_i)$ for each candidate annotation $a_i$ of node u is calculated by summing the scores in (a)-(g).

\[ S(u, a_i) = S_{m/z}(u, a_i) + S_{RT}(u, a_i) + S_{MS2}(u, a_i) + S_{\mathrm{database}}(u, a_i) + S_{\mathrm{missing\_isotope}}(u, a_i) + S_{\mathrm{rule}}(u, a_i) + S_{\mathrm{derivative}}(u, a_i) \]

note that for each node u, one of the candidate “annotations” corresponds to no annotation being chosen for that node. the node score for this null annotation is at default, and can be set to a negative value to promote choosing actual annotations.

iv. scoring edge annotations (biological, adduct, isotope)

the edge scoring system aims to assign high scores to edge annotations that correctly capture biochemical connections between metabolites (based on ms2 spectra similarity) and abiotic connections between metabolites and their mass spectrometry phenomena derivatives, such as isotopes and adducts. biochemical, isotope, and adduct edge annotations are the most common types; other less common abiotic connection types are described in the subsequent section. suppose we consider two nodes u and v that are connected by an edge (u, v).
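looking back at the node scoring components of section iii, the python sketch below shows how a few of them could be computed and summed into a single node score. it covers only components (a), (b), (d) and (f); the ms2, isotope and derivative terms are left out for brevity. every numeric value in `PARAMS` is a placeholder assumption, because the default constants are not legible in this text, and the function and key names are hypothetical.

```python
# Placeholder parameters (assumptions for illustration, not the published defaults).
PARAMS = {
    "mz_scale": 0.5,        # penalty per ppm of m/z error
    "rt_window": 0.2,       # minutes within which an RT match counts
    "rt_max_bonus": 1.0,    # bonus for a perfect RT match
    "database_bonus": 0.5,  # bonus when the formula exists in the database
    "rule_penalty": 10.0,   # penalty for violating basic chemical rules
}


def score_mz(measured_mz, formula_mz, scale):
    """Component (a): negative score proportional to the ppm error."""
    ppm_error = abs(measured_mz - formula_mz) / measured_mz * 1e6
    return -scale * ppm_error


def score_rt(measured_rt, known_rt, window, max_bonus):
    """Component (b): bonus that shrinks with RT difference, zero otherwise."""
    if known_rt is None:
        return 0.0
    diff = abs(measured_rt - known_rt)
    return max_bonus - diff if diff < window else 0.0


def node_score(measured_mz, measured_rt, annotation, params=PARAMS):
    """Sum the implemented components for one candidate annotation."""
    components = {
        "mz": score_mz(measured_mz, annotation["formula_mz"], params["mz_scale"]),
        "rt": score_rt(measured_rt, annotation["known_rt"],
                       params["rt_window"], params["rt_max_bonus"]),
        "database": params["database_bonus"] if annotation["in_database"] else 0.0,
        "rule": -params["rule_penalty"] if annotation["violates_rules"] else 0.0,
    }
    return sum(components.values()), components


good = {"formula_mz": 146.045884, "known_rt": 5.05,
        "in_database": True, "violates_rules": False}
bad = {"formula_mz": 146.0470, "known_rt": None,
       "in_database": False, "violates_rules": True}

good_total, good_parts = node_score(146.0459, 5.00, good)
bad_total, bad_parts = node_score(146.0459, 5.00, bad)
```

keeping the components in a dictionary before summing makes it easy to see which term drives the ranking of two competing annotations.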
for each pair of nodes u and v such that there is an edge (u, v), let the sets of candidate formulae for nodes u and v be denoted as $\{a_1, \ldots, a_i, \ldots, a_n\}$ and $\{b_1, \ldots, b_j, \ldots, b_m\}$, respectively, and let the set of candidate atom differences for edge (u, v) be $\{D_1, \ldots, D_k, \ldots, D_l\}$. let $S(u, v, a_i, b_j, D_k)$ be the score of choosing candidate formula $a_i$ for node u, candidate formula $b_j$ for node v and candidate atom difference $D_k$ for edge (u, v). note that $S(u, v, a_i, b_j, D_k)$ is set to 0 if atom difference $D_k$ does not represent the formula difference of $a_i$ and $b_j$.

\[ S(u, v, a_i, b_j, D_k) = 0, \ \text{if } a_i - b_j \neq D_k \]

the scoring components for candidate edge annotations are defined as follows:

(h) when nodes u and v have experimentally measured ms2 spectra, $S_{\mathrm{MS2\_similarity}}(u, v, a_i, b_j, D_k)$ is defined for a biochemical edge, and is a positive score if the two connected nodes u and v have ms2 similarity, given that the formula difference of $a_i$ and $b_j$ matches the atom difference defined by $D_k$. $S_{\mathrm{MS2\_similarity}}$ is determined using the dot product (dp), as described in the previous section, and the reverse dot product (dp_r), which evaluates the neutral ion loss similarity in the ms2 spectra. a reverse ms2 spectrum is represented as r = [relative intensity of ms2 ions]^n [parent m/z – measured m/z value]^m, with n = , m = .

dp = ∑ ∑ × ∑

dp_r = ∑ ∑ × ∑

\[ S_{\mathrm{MS2\_similarity}}(u, v, a_i, b_j, D_k) = \max(DP, DP_R), \ \text{if } \max(DP, DP_R) > \; . \; ; \quad \text{otherwise, } S_{\mathrm{MS2\_similarity}}(u, v, a_i, b_j, D_k) = 0 \]

(i) $S_{\mathrm{co\_elution}}(u, v, a_i, b_j, D_k)$ is defined for an abiotic edge, and is a negative score if the rts of the two connected nodes differ by more than a threshold ( . min), given that the formula difference of $a_i$ and $b_j$ matches the atom difference defined by $D_k$.

\[ S_{\mathrm{co\_elution}}(u, v, a_i, b_j, D_k) = -\; \times |u_{RT} - v_{RT}|, \ \text{if } |u_{RT} - v_{RT}| \geq \; . \; \mathrm{min}; \quad \text{otherwise, } S_{\mathrm{co\_elution}}(u, v, a_i, b_j, D_k) = 0 \]

(j) $S_{\mathrm{type}}(u, v, a_i, b_j, D_k)$ is defined for all edges, given that the formula difference of $a_i$ and $b_j$ matches the atom difference defined by $D_k$, and is a non-negative score depending on the connection type of the edge, which is defined by $D_k$, including biotransformation, adduct, isotope and fragment (supplementary table , ). the magnitude of the scores reflects the empirical confidence in the annotation type when certain atom differences occur, and can be adjusted by the user.

\[ S_{\mathrm{type}}(u, v, a_i, b_j, D_k) = \; , \ \text{if } D_k \in \text{biotransformation} \]
\[ S_{\mathrm{type}}(u, v, a_i, b_j, D_k) = \; . \; , \ \text{if } D_k \in \text{adduct} \]
\[ S_{\mathrm{type}}(u, v, a_i, b_j, D_k) = \; , \ \text{if } D_k \in \text{isotope} \]
\[ S_{\mathrm{type}}(u, v, a_i, b_j, D_k) = \; . \; , \ \text{if } D_k \in \text{fragment} \]

(k) for each $D_k \in$ isotope, $S_{\mathrm{isotope\_intensity}}(u, v, a_i, b_j, D_k)$ is defined for isotope edge (u, v) where $b_j$ is the isotopic derivative of $a_i$ with atom difference $D_k$, and is a negative score if the measured isotope peaks deviate from the expected natural abundance. the score for an isotope edge depends on how likely the ratio of measured and expected isotopic intensity (ratioisotope) is to be observed in an empirical normal distribution n , σ . isotopes of all elements included in the atom difference table are evaluated.

ratio = / ( , , )

s (𝑢, 𝑣, 𝑎 , 𝑏 , 𝐷 ) = 𝑙𝑜𝑔 𝜇 = ratio n , σ 𝜇 = n , σ
σisotope is empirically defined as below, so that when the measured isotope intensity is close to the detection limit, a larger σisotope (a widened distribution, which is more tolerant of discrepancy) is used.

σ = . + ( )

a final edge annotation score $S(u, v, a_i, b_j, D_k)$ for choosing candidate formula $a_i$ for node u, candidate formula $b_j$ for node v and candidate atom difference $D_k$ for edge (u, v) is calculated by summing the scores in (h)-(k), if other less common abiotic connection types are not considered (see next section).

\[ S(u, v, a_i, b_j, D_k) = S_{\mathrm{MS2\_similarity}}(u, v, a_i, b_j, D_k) + S_{\mathrm{co\_elution}}(u, v, a_i, b_j, D_k) + S_{\mathrm{type}}(u, v, a_i, b_j, D_k) + S_{\mathrm{isotope\_intensity}}(u, v, a_i, b_j, D_k) \]

v. additional abiotic edge types

lc-ms metabolomics may include additional abiotic relationships. in orbitrap data, these include oligomers, multi-charge species, heterodimers, in-source fragments of known or unknown metabolites, and ringing artifact peaks surrounding high intensity ions. these relationships were included in netid as additional edge types, which are evaluated for all m/z pairs within a predefined rt range ( . min).

(l) oligomer and multi-charge species. an oligomer/multi-charge edge is assigned between two nodes u and v if their m/z satisfy

\[ |v_{m/z} - n \times u_{m/z}| < u_{m/z} \times \; \mathrm{ppm}, \quad n \in \{\text{positive integers}\} \]

(m) heterodimer. a heterodimer peak (node v) may be observed when one abundant metabolite (node u) forms an ion cluster with another ion species (node t). we examine nodes that have intensity above , and assign a heterodimer edge between two nodes u and v if their m/z difference satisfies

\[ |(v_{m/z} - u_{m/z}) - t_{m/z}| < u_{m/z} \times \; \mathrm{ppm} \]

(n) in-source fragments. fragmentation peaks may be observed when one abundant metabolite breaks up into fragments during the ionization process. database ms2 spectra of known metabolites can be used to identify known ion fragmentation peaks.
if candidate annotation $b_j$ of node v is annotated with an hmdb id associated with a database ms2 spectrum, and the m/z of node u matches a fragment m/z in $b_j$’s ms2 spectrum, then a database fragment edge connects the two nodes. that is,

\[ u_{m/z} \in \text{database ms2 spectrum of candidate annotation } b_j \text{ of node } v \]

measured ms2 spectra can be used to identify unknown ion fragmentation peaks. if node v is associated with a measured ms2 spectrum, and the m/z of another node u matches a fragment m/z in that spectrum, then an experiment fragment edge connects the two nodes. that is,

\[ u_{m/z} \in \text{measured ms2 spectrum of node } v \]

(o) ringing artifacts. ringing peaks are artifact peaks (node v) often observed on both sides of the m/z of an intense ion peak (node u) in fourier-transform ms instruments, including orbitrap. we examine nodes that have intensity above , and assign a ringing artifact edge between two nodes if they satisfy

\[ \; \mathrm{ppm} < \frac{|v_{m/z} - u_{m/z}|}{u_{m/z}} < \; \mathrm{ppm}, \qquad \frac{u_{\mathrm{intensity}}}{v_{\mathrm{intensity}}} > \; \]

scoring of these additional abiotic edges follows the same rules described in the “scoring edge annotations” section, with additional $S_{\mathrm{type}}$ values defined as below.

\[ S_{\mathrm{type}}(u, v, a_i, b_j, D_k) = \; . \; , \ \text{if } D_k \in \text{oligomer or multi-charge} \]
\[ S_{\mathrm{type}}(u, v, a_i, b_j, D_k) = \; , \ \text{if } D_k \in \text{heterodimer} \]
\[ S_{\mathrm{type}}(u, v, a_i, b_j, D_k) = \; . \; , \ \text{if } D_k \in \text{database ms2 fragment} \]
\[ S_{\mathrm{type}}(u, v, a_i, b_j, D_k) = \; , \ \text{if } D_k \in \text{measured ms2 fragment} \]
\[ S_{\mathrm{type}}(u, v, a_i, b_j, D_k) = \; , \ \text{if } D_k \in \text{ringing artifacts} \]

a final edge annotation score $S(u, v, a_i, b_j, D_k)$ for choosing candidate formula $a_i$ for node u, candidate formula $b_j$ for node v and candidate atom difference $D_k$ for edge (u, v) is calculated by summing the scores in (h)-(o).
\[ S(u, v, a_i, b_j, D_k) = S_{\mathrm{MS2\_similarity}}(u, v, a_i, b_j, D_k) + S_{\mathrm{co\_elution}}(u, v, a_i, b_j, D_k) + S_{\mathrm{type}}(u, v, a_i, b_j, D_k) + S_{\mathrm{isotope\_intensity}}(u, v, a_i, b_j, D_k) \]

vi. global network optimization using linear programming

using the scores assigned to each candidate node and edge annotation, our goal is to find annotations for each node so as to maximize the sum of the scores across the network, under the constraints that each node is assigned a single annotation and that the network annotation is consistent. we use linear programming to solve this optimization problem optimally, as described next.

for each node u and each of its candidate formulae $a_i$, we define a binary node decision variable $x_{u,a_i}$ to denote whether candidate formula $a_i$ is selected as the annotation for node u. that is,

\[ x_{u,a_i} = 1, \ \text{if node } u \text{ is annotated with formula } a_i; \quad \text{otherwise, } x_{u,a_i} = 0 \]

we define a binary decision variable $c_{u,v,a_i,b_j,D_k}$ to denote whether candidate formulae $a_i$ and $b_j$ are chosen for nodes u and v, and the candidate atom difference $D_k$ corresponds to the formula difference of candidate formulae $a_i$ and $b_j$ of the connected nodes u and v. that is,

\[ c_{u,v,a_i,b_j,D_k} = 1, \ \text{if } a_i \text{ and } b_j \text{ are chosen for nodes } u \text{ and } v \text{ respectively, and } a_i - b_j = D_k; \quad \text{otherwise, } c_{u,v,a_i,b_j,D_k} = 0 \]

we constrain the optimization so that each node has a single annotation, and an edge annotation is selected if and only if the atom difference of that edge annotation matches the formula difference of the selected node annotations. as a result, the node and edge binary variables should satisfy

\[ \sum_{i} x_{u,a_i} = 1 \]
\[ c_{u,v,a_i,b_j,D_k} \leq x_{u,a_i}, \qquad c_{u,v,a_i,b_j,D_k} \leq x_{v,b_j} \]
\[ c_{u,v,a_i,b_j,D_k} \geq x_{u,a_i} + x_{v,b_j} - 1 \]

for all variables defined above, we add the constraint that they are either 0 or 1.
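to make the role of these constraints concrete, the python sketch below solves a two-node toy instance by exhaustive enumeration rather than with a linear programming solver. it is an illustration only: enumeration is feasible just for tiny networks, the scores and the "alt" candidate are made up, and the function name `optimize` is hypothetical. choosing exactly one annotation per node plays the role of the sum constraint on x, and adding an edge score only when both endpoint formulae are selected plays the role of the three inequalities on c.

```python
from itertools import product


def optimize(node_scores, edge_scores):
    """Exhaustively pick one annotation per node to maximize node + edge scores.

    node_scores: {node: {annotation: score}}
    edge_scores: {(u, a, v, b): score}, earned only if u gets a AND v gets b.
    """
    nodes = list(node_scores)
    best_total, best_assignment = float("-inf"), None
    for choice in product(*(list(node_scores[n]) for n in nodes)):
        # One annotation per node: x[u, a] = 1 for exactly one a.
        assignment = dict(zip(nodes, choice))
        total = sum(node_scores[n][a] for n, a in assignment.items())
        for (u, a, v, b), score in edge_scores.items():
            # c = 1 exactly when both endpoint variables are 1.
            if assignment[u] == a and assignment[v] == b:
                total += score
        if total > best_total:
            best_total, best_assignment = total, assignment
    return best_assignment, best_total


# Toy network: u is a glutamate candidate, v is either its sodium adduct
# or an unrelated formula ("alt") with a better stand-alone node score.
node_scores = {
    "u": {"C5H9NO4": 1.0, "null": 0.0},
    "v": {"C5H8NNaO4": 0.1, "alt": 0.4, "null": 0.0},
}
edge_scores = {("u", "C5H9NO4", "v", "C5H8NNaO4"): 0.5}

assignment, total = optimize(node_scores, edge_scores)
```

scored in isolation, node v would prefer "alt"; once the consistent adduct edge is counted, the network-wide optimum selects the sodium adduct instead, which is the behaviour the global optimization is designed to produce.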
with each candidate node and edge annotation being scored, the objective of the optimization is to find values for all variables $x_{u,a}$ and $c_{u,v,a,b,D}$ so as to maximize the sum of all node scores and edge scores in the network while satisfying the constraints.

\[ \text{maximize: } \sum x_{u,a} \times S(u, a) + \sum c_{u,v,a,b,D} \times S(u, v, a, b, D) \]

the optimization result provides a string of binary numbers that denote whether each candidate node or edge annotation is selected for the globally optimal network. ibm ilog cplex optimization studio (version . . or later) is used to solve the linear programming problem. the cplexapi package for r is used to call the cplex optimization function in an r environment. for the yeast datasets and using the above scoring parameters, optimization finishes within an hour on a standard laptop. depending on the number of peaks in the data tables, the entries in the atom difference tables, and the parameters involved in scoring, runtimes during internal testing ranged from minutes to h.

code availability

netid was developed mainly in r, and used a mixture of ibm ilog cplex optimization studio, matlab and python. netid code is available for non-commercial use on github at https://github.com/lichenpu/netid, under the gnu general public license v . . a shinyr app is provided to visualize the network results from netid in a local environment, along with a detailed user guide and example files (supplementary note , supplementary data ).

acknowledgement

this work was supported by a department of energy (doe) grant (no.
de-sc to j.d.r.), the center for advanced bioenergy and bioproducts innovation (grant no. de-sc , subcontract to j.d.r.) and nih grant r ca to w.l. m.r.m. is funded by the howard hughes medical institute and burroughs wellcome fund via the pdep and hanna h. gray fellows programs. we thank istvan pelczer at the nmr facility of the department of chemistry, princeton university, for the nmr analysis, and x. su for scientific discussion and help. the center for advanced bioenergy and bioproducts innovation and the center for bioenergy innovation are both u.s. department of energy bioenergy research centers supported by the office of biological and environmental research in the doe office of science. any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the u.s. department of energy.

competing interests

the authors declare no competing interests.

author contributions

l.c., m.s. and j.d.r. conceived the project. l.c. developed the netid algorithm. w.l., l.w., x.z., a.c. m.m. performed experiments on mouse. l.w., w.l. and l.c. performed experiments on yeast. l.c., w.l., l.w. and x.x. analyzed lc-ms and lc-ms/ms data. x.t., a.m. and y.s. contributed to coding development. b.k., a.m.l., and s.r.c. provided chemical synthesis of taurine-related compounds. l.c. and j.d.r. wrote the manuscript. all authors discussed the results and commented on the manuscript.

figure legends

figure . a global network optimization approach for untargeted metabolomics data annotation (netid).
the input data are lc-ms peaks with m/z, retention times, intensities and optional ms spectra. the output is a molecular network with peaks (nodes) assigned unique formulae and connected by edges reflecting atom differences arising either through enzymatic reaction (biochemical connection) or mass spectrometry phenomenon (abiotic connection). peaks are classified as “metabolite” (m+h or m-h peak of formula found in hmdb), “putative metabolite” (formula not found in hmdb but with biochemical connection to a metabolite), or “artifact” (only abiotic connection to a metabolite). netid algorithm involves three steps. initial annotation first matches peaks to hmdb formulae. these seed annotations are then extended through edges to cover most nodes, with the majority of nodes receiving multiple formula annotations. each node and edge annotation are then scored based on match to known masses, retention times, and ms/ms fragmentation patterns. global network optimization maximizes sum of node scores and edge scores, while enforcing a unique formula for each node and unique transformation relationship for each edge. figure . utility of global network optimization. (a) an example network demonstrating the value of the global optimization step in netid. node a and node b match hmdb formulae and are connected by an edge of phosphate (hpo ). node c can be connected to either node a or node b through mutually incompatible annotations, resulting in two different candidate networks. the table below the two candidate networks shows the annotations and scoring criteria for each, with the left network preferred for more good node and edge annotations. (b) visualization of the optimal network obtained from negative mode lc-ms analysis of baker’s yeast, containing nodes and connections. metabolite and putative metabolite peaks are in green and artifact peaks in purple. (c) summary table of netid annotations of negative and positive mode lc-ms data from baker's yeast and mouse liver. 
figure . netid reveals thiamine-derived metabolites in yeast. (a) subnetwork surrounding thiamine. nodes, connections, and formulae are direct output of netid. boxes with structures were manually added. (b) ms spectra of thiamine, thiamine+c h o, and thiamine+c h o, with proposed structures of the major fragments. (c) labeling fraction of thiamine and its derivatives, in [u- c]glucose with and without unlabeled thiamine in the medium. (d) the thiamine derivatives are also found in mouse tissues and urine. (e) proposed mechanism for formation of thiamine+c h o. pyruvate dehydrogenase (pdh) decarboxylates pyruvate, and adds the resulting [c h o] unit (in red) to thiamine. (f) the same enzymatic mechanism occurs in oxoglutarate dehydrogenase (ogdh) and branched-chain α-ketoacid dehydrogenase complex (bckdc), and generates thiamine+c h o and thiamine+c h o respectively. figure . netid discovers mammalian taurine derivatives. (a) subnetwork surrounding taurine from mouse liver extract data. nodes, connections, and formulae are direct output of netid. boxes with structures were manually added. (b) lc-ms chromatogram of n-glucosyl-taurine standard and the putative glucosyl-taurine from liver extract. (c) ms spectrum of glucosyl-taurine peak from liver extract (top), and synthetic n-glucosyl-taurine standard (bottom). (d) isotope labeling pattern of putative glucosyl-taurine in mice, infused via jugular vein catheter for h with [u- c]glucose. (e) absolute n-glucosyl-taurine concentration in murine serum and tissues. figure .
netid applies global optimization for metabolomics data annotation and metabolite discovery. reference . dinardo, c. d. et al. durable remissions with ivosidenib in idh -mutated relapsed or refractory aml. n. engl. j. med. , – ( ). . dang, l. et al. cancer-associated idh mutations produce -hydroxyglutarate. nature , ( ). . doroghazi, j. r. et al. a roadmap for natural product discovery based on large-scale genomics and metabolomics. nature chemical biology , – ( ). . aron, a. t. et al. reproducible molecular networking of untargeted mass spectrometry data using gnps. nature protocols , – ( ). . johnson, c. h., ivanisevic, j. & siuzdak, g. metabolomics: beyond biomarkers and towards mechanisms. nature reviews molecular cell biology , – ( ). . guijas, c. et al. metlin: a technology platform for identifying knowns and unknowns. anal. chem. , – ( ). . wishart, d. s. et al. hmdb . : the human metabolome database for . nucleic acids res , d –d ( ). . tsugawa, h. et al. hydrogen rearrangement rules: computational ms/ms fragmentation and structure elucidation using ms-finder software. anal. chem. , – ( ). . kanehisa, m., sato, y., kawashima, m., furumichi, m. & tanabe, m. kegg as a reference resource for gene and protein annotation. nucleic acids res , d –d ( ). . kim, s. et al. pubchem update: improved access to chemical data. nucleic acids res , d –d ( ).
. hastings, j. et al. chebi in : improved services and an expanding collection of metabolites. nucleic acids res , d –d ( ). . sherena.johnson@nist.gov. nist standard reference database a. nist https://www.nist.gov/srd/nist-standard-reference-database- a ( ). . tautenhahn, r., patti, g. j., rinehart, d. & siuzdak, g. xcms online: a web-based platform to process untargeted metabolomic data. anal. chem. , – ( ). . forsberg, e. m. et al. data processing, multi-omic pathway mapping, and metabolite activity analysis using xcms online. nature protocols , – ( ). . wang, m. et al. sharing and community curation of mass spectrometry data with global natural products social molecular networking. nature biotechnology , – ( ). . tsugawa, h. et al. a cheminformatics approach to characterize metabolomes in stable-isotope-labeled organisms. nature methods , ( ). . pluskal, t., castillo, s., villar-briones, a. & orešič, m. mzmine : modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. bmc bioinformatics , ( ). . kuhl, c., tautenhahn, r., böttcher, c., larson, t. r. & neumann, s. camera: an integrated strategy for compound spectra extraction and annotation of liquid chromatography/mass spectrometry data sets. anal. chem. , – ( ). . sindelar, m. & patti, g. j. chemical discovery in the era of metabolomics. j. am. chem. soc. , – ( ). .
wang, l. et al. peak annotation and verification engine for untargeted lc–ms metabolomics. anal. chem. , – ( ). . schmid, r. et al. ion identity molecular networking in the gnps environment. http://biorxiv.org/lookup/doi/ . / . . . ( ) doi: . / . . . . . nothias, l.-f. et al. feature-based molecular networking in the gnps analysis environment. nat methods , – ( ). . senan, o. et al. cliquems: a computational tool for annotating in-source metabolite ions from lc-ms untargeted metabolomics data based on a coelution similarity network. . . shen, x. et al. metabolic reaction network-based recursive metabolite annotation for untargeted metabolomics. nature communications , ( ). . alden, n. et al. biologically consistent annotation of metabolomics data. anal. chem. , – ( ). . del carratore, f. et al. integrated probabilistic annotation: a bayesian-based annotation method for metabolomic profiles integrating biochemical connections, isotope patterns, and adduct relationships. anal. chem. ( ) doi: . /acs.analchem. b . . kind, t. & fiehn, o. seven golden rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. bmc bioinformatics , ( ). . dührkop, k. et al. systematic classification of unknown metabolites using high-resolution fragmentation mass spectra. nat biotechnol ( ) doi: . /s - - - . . bonini, p., kind, t., tsugawa, h., barupal, d. k. & fiehn, o. retip: retention time prediction for compound annotation in untargeted metabolomics. anal. chem. , – ( ). . xu, y.-f. et al.
discovery and functional characterization of a yeast sugar alcohol phosphatase. acs chem. biol. , – ( ). . hui, s. et al. glucose feeds the tca cycle via circulating lactate. nature , – ( ). . lu, w. et al. improved annotation of untargeted metabolomics data through buffer modifications that shift adduct mass and intensity. anal. chem. , – ( ). . cho, h. j., you, j. s., chang, k. j., kim, k. s. & kim, s. h. anti-adipogenic effect of taurine- carbohydrate derivatives. bulletin of the korean chemical society , – ( ). . robinson, p. t., pham, t. n. & uhrıń, d. in phase selective excitation of overlapping multiplets by gradient-enhanced chemical shift selective filters. journal of magnetic resonance , – ( ). . chambers, m. c. et al. a cross-platform toolkit for mass spectrometry and proteomics. nat biotechnol , – ( ). . xue, j. et al. enhanced in-source fragmentation annotation enables novel data independent acquisition and autonomous metlin molecular identification. anal. chem. , – ( ). . mitchell, j. m. et al. new methods to identify high peak density artifacts in fourier transform mass spectra and to mitigate their effects on high-throughput metabolomic data analysis. metabolomics , ( ). .cc-by-nc-nd . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / figure . a global network optimization approach for untargeted metabolomics data annotation (netid). the input data are lc-ms peaks with m/z, retention times, intensities and optional ms spectra. 
the output is a molecular network with peaks (nodes) assigned unique formulae and connected by edges reflecting atom differences arising either through enzymatic reaction (biochemical connection) or mass spectrometry phenomenon (abiotic connection). peaks are classified as “metabolite” (m+h or m-h peak of formula found in hmdb), “putative metabolite” (formula not found in hmdb but with biochemical connection to a metabolite), or “artifact” (only abiotic connection to a metabolite). netid algorithm involves three steps. initial annotation first matches peaks to hmdb formulae. these seed annotations are then extended through edges to cover most nodes, with the majority of nodes receiving multiple formula annotations. each node and edge annotation are then scored based on match to known masses, retention times, and ms/ms fragmentation patterns. global network optimization maximizes sum of node scores and edge scores, while enforcing a unique formula for each node and unique transformation relationship for each edge.
figure . utility of global network optimization. (a) an example network demonstrating the value of the global optimization step in netid. node a and node b match hmdb formulae and are connected by an edge of phosphate (hpo ). node c can be connected to either node a or node b through mutually incompatible annotations, resulting in two different candidate networks. the table below the two candidate networks shows the annotations and scoring criteria for each, with the left network preferred for more good node and edge annotations.
(b) visualization of the optimal network obtained from negative mode lc-ms analysis of baker’s yeast, containing nodes and connections. metabolite and putative metabolite peaks are in green and artifact peaks in purple. (c) summary table of netid annotations of negative and positive mode lc-ms data from baker's yeast and mouse liver.
figure . netid reveals thiamine-derived metabolites in yeast. (a) subnetwork surrounding thiamine. nodes, connections, and formulae are direct output of netid. boxes with structures were manually added. (b) ms spectra of thiamine, thiamine+c h o, and thiamine+c h o, with proposed structures of the major fragments. (c) labeling fraction of thiamine and its derivatives, in [u- c]glucose with and without unlabeled thiamine in the medium. (d) the thiamine derivatives are also found in mouse tissues and urine. (e) proposed mechanism for formation of thiamine+c h o. pyruvate dehydrogenase (pdh) decarboxylates pyruvate, and adds the resulting [c h o] unit (in red) to thiamine. (f) the same enzymatic mechanism occurs in oxoglutarate dehydrogenase (ogdh) and branched-chain α-ketoacid dehydrogenase complex (bckdc), and generates thiamine+c h o and thiamine+c h o respectively.
figure . netid discovers mammalian taurine derivatives. (a) subnetwork surrounding taurine from mouse liver extract data. nodes, connections, and formulae are direct output of netid. boxes with structures were manually added. (b) lc-ms chromatogram of n-glucosyl-taurine standard and the putative glucosyl-taurine from liver extract. (c) ms spectrum of glucosyl-taurine peak from liver extract (top), and synthetic n-glucosyl-taurine standard (bottom). (d) isotope labeling pattern of putative glucosyl-taurine in mice, infused via jugular vein catheter for h with [u- c]glucose. (e) absolute n-glucosyl-taurine concentration in murine serum and tissues.
figure . netid applies global optimization for metabolomics data annotation and metabolite discovery.
deepstrain: a deep learning workflow for the automated characterization of cardiac mechanics
manuel a. morales, maaike van den boomen, christopher nguyen, jayashree kalpathy-cramer, bruce r. rosen, collin m.
stultz, david izquierdo-garcia*, and ciprian catana* abstract—myocardial strain analysis from cinematic magnetic resonance imaging (cine-mri) data could provide a more thorough characterization of cardiac mechanics than volumetric parameters such as left-ventricular ejection fraction, but sources of variation including segmentation and motion estimation have limited its wide clinical use. we designed and validated a deep learning (dl) workflow to generate both volumetric parameters and strain measures from cine-mri data, including strain rate (sr) and regional strain polar maps, consisting of segmentation and motion estimation convolutional neural networks developed and trained using healthy and cardiovascular disease (cvd) subjects (n= ). dl-based volumetric parameters were correlated (> . ) and without significant bias relative to parameters derived from manual segmentations in healthy and cvd subjects. compared to landmarks manually-tracked on tagging-mri images from healthy subjects, landmark deformation using dl-based motion estimates from paired cine-mri data resulted in an end- point-error of . ± . mm. measures of end-systolic global strain from these cine-mri data showed no significant biases relative to a tagging-mri reference method. on healthy subjects, intraclass correlation coefficient for intra- scanner repeatability was excellent (> . ) for strain, moderate to excellent for sr ( . - . ), and good to excellent ( . - . ) in most polar map segments. absolute relative change was within ~ % for strain, within ~ % for sr, and < % in half of polar map segments. in conclusion, we developed and evaluated a dl-based, end- to-end fully-automatic workflow for global and regional myocardial strain analysis to quantitatively characterize cardiac mechanics of healthy and cvd subjects based on ubiquitously acquired cine-mri data. index terms—cardiac cine-mri, deep learning, motion estimation, myocardial strain, segmentation. submitted for review on dec , . 
this work was supported in part by the u.s. national cancer institute under grant r ca - a . (asterisk indicates d. izquierdo-garcia and c. catana contributed equally to this work). (corresponding authors: d. izquierdo-garcia; c. catana). m.a. morales, d. izquierdo-garcia and b.r. rosen, with athinoula a. martinos center for biomedical imaging, mgh, hms, th st, boston, ma (email: moralesq@mit.edu; davidizq@nmr.mgh.harvard.edu; brrosen@mgh.harvard.edu) and with harvard-mit health science and technology, massachusetts ave, cambridge, ma, . m.v.d. boomen and c. nguyen, with cardiovascular research center and martinos center for biomedical imaging, mgh, hms, th st, boston, ma , with department of radiology, and m.v.d. boomen also with university medical center groningen, gz groningen (email: mvandenboomen@mgh.harvard.edu; christopher.nguyen@mgh.havard.edu). c.m. stultz, with electrical engineering and computer science, with harvard-mit health science and technology, massachusetts ave, cambridge, ma, , and with division of cardiology, mgh, fruit st, boston, ma, (cmstultz@mit.edu). j. kalpathy-cramer, and ciprian catana, with athinoula a. martinos center for biomedical imaging, mgh, hms, th st, boston, ma (jkalpathy-cramer@mgh.harvard.edu; ccatana@mgh.harvard.edu).
i. introduction
cardiac mechanics reflects the precise interplay between myocardial architecture and loading conditions that is essential for sustaining the blood pumping function of the heart. the ejection fraction (ef) is often used as a left-ventricular (lv) functional index, but its value is limited when mechanical impairment occurs without an ef reduction [ ]. alternatively, tissue tracking approaches for strain analysis provide a more thorough characterization through non-invasive evaluation of myocardial deformation from echocardiography or cinematic magnetic resonance imaging (cine-mri) data [ ], and could be used to identify dysfunction before ef is reduced [ ].
unfortunately, various sources of discrepancies have limited the wide clinical applicability of these techniques, including factors related to imaging modality, algorithm, and operator [ ]. more accurate measures could be obtained from tagging-mri data, widely regarded as the reference standard for strain quantification [ ], [ ], but collection of these data requires highly specialized and complex sequences that have mainly remained research tools, whereas echocardiography and cine-mri data are ubiquitously acquired in clinical practice. irrespective of algorithm or modality, e.g., speckle tracking for echocardiography or feature tracking for cine-mri, the main challenge is to estimate motion within regions along the myocardial wall [ ]. operator-related discrepancies are introduced when the myocardial wall borders are delineated manually, a time-consuming process that requires considerable expertise and results in significant inter- and intra-observer variability [ ], [ ]. automatic delineation approaches have been implemented within computational pipelines [ ], but other factors related to motion tracking algorithms also influence strain assessment, including the appropriate selection of tuneable parameters whose optimal values can differ between patient cohorts and acquisition protocols (e.g., the size of the search region in block-matching methods [ ]). further, these algorithms often make assumptions about the properties of the myocardial tissue (e.g., incompressible and elastic [ ], [ ]), or use registration methods to drive the solution towards an expected geometry.
biorxiv preprint, this version posted january , . doi: https://doi.org/ . / . . . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license (http://creativecommons.org/licenses/by-nc-nd/ . /).
however, recent evidence has shown the validity of these assumptions varies between healthy and diseased myocardium [ ], [ ], suggesting these approaches may not accurately reflect the underlying biomechanical motion [ ]. lastly, modality-related image quality could complicate interpretation of abnormal strain values since these could reflect either real dysfunction or artifact-related inaccuracies, leading to some degree of subjectivity or non-conclusive results [ ].
deep learning (dl) methods have demonstrated the advantage of allowing real-world data to guide the learning of abstract representations that can be used to accomplish pre-specified tasks, and have been shown to be more robust to image artifacts than non-learning techniques for some applications [ ], [ ]. dl segmentation methods have been proposed [ ]–[ ] and implemented within strain computational pipelines [ ], [ ], and recent studies have shown that cardiac motion estimation can also be recast as a learnable problem [ ]–[ ]. these methods usually consist of an intensity-based loss function and a constraint term [ ], [ ], the latter using common machine learning techniques (e.g., l regularization of all learnable parameters [ ]) or direct regularization of the motion estimates (e.g., smoothness penalty [ ], anatomy-aware [ ]). however, because ground-truth cardiac motion is challenging to acquire, whether these constraints improve the accuracy of motion or strain estimates is not yet clear. further, the added value of dl-based regional strain estimation has not been demonstrated. we have recently developed a learning method for cardiac motion estimation that produces more accurate estimates than various techniques, including b-spline, diffeomorphic, and mass-preserving algorithms [ ], and showed these estimates could potentially be used to detect regional dysfunction.
thus, incorporating our method within a strain analysis framework could potentially enable accurate, user-independent, and quantitative characterization of cardiac mechanics at both global and regional levels. once trained, such a method would not necessitate further parameter tuning or optimization, which is time-consuming for larger datasets and daily clinical practice. while this framework could be based on echocardiography images [ ], these data remain limited for strain mapping tasks by their low reproducibility of acquisition planes [ ] and temporal stability of tracking patterns [ ]. in contrast, cine-mri offers the most accurate and reproducible assessment of cardiac anatomy and function, thus providing a more thorough set of data for learning-based motion models. we propose deepstrain, an automated workflow that derives global and regional strain measures from cine-mri data by decoupling motion estimation and segmentation tasks. after verifying the effects of smoothing and anatomical regularizers on motion and strain, convolutional neural networks for pre-processing (i.e., centering and cropping), segmentation, and motion estimation were implemented, trained, validated, and compared to state-of-the-art methods. finally, accuracy of strain values was assessed using a tagging-mri algorithm as reference standard, intra-scanner repeatability was measured using subjects with repeated scans, and potential clinical applications of global and regional myocardial strain measures were demonstrated on patient populations.
ii. method
a. datasets for development
we used the automated cardiac diagnosis challenge (acdc) dataset [ ], consisting of cine-mri data from subjects evenly divided into five groups: healthy and patients with hypertrophic cardiomyopathy (hcm), abnormal right ventricle (arv), myocardial infarction with reduced ejection fraction (mi), and dilated cardiomyopathy (dcm).
these data were publicly available as train (n= ) and test (n= ) sets, with manual segmentations included for the train set only. for validation of motion and strain measures we used the cardiac motion analysis challenge (cmac) dataset [ ], consisting of paired tagging- and cine-mri data from healthy subjects. to assess intra-scanner repeatability, four healthy volunteers were recruited to undergo repeated scans on a t mri scanner. all cine-mri frames and corresponding segmentations were resampled to a × × volume grid with . mm × . mm in-plane resolution and variable slice thickness ( - mm). see supplementary section s for acquisition protocol.
b. myocardial strain definitions
strain represents percent change in myocardial length per unit length. the three-dimensional (3d) analog for mri is given by the lagrange strain tensor
$$\boldsymbol{\epsilon}_t = \frac{1}{2}\left(\nabla \boldsymbol{u}_t + \nabla \boldsymbol{u}_t^{T} + \nabla \boldsymbol{u}_t^{T}\,\nabla \boldsymbol{u}_t\right), \quad ( )$$
where $\boldsymbol{u}_t$ denotes myocardial displacement from a fully-relaxed end-diastolic phase at $t=0$ to a contracted frame at $t>0$. radial and circumferential strain are the diagonal components of the tensor $\boldsymbol{\epsilon}$ evaluated in cylindrical coordinates. strain rate (sr) is the time derivative of ( ). global strain is defined as the average of $\boldsymbol{\epsilon}$ over the whole lv myocardium (lvm) volume. regional strain is defined as the average of $\boldsymbol{\epsilon}$ over the volume of specific lvm segments defined by the american heart association (aha) polar map [ ], which requires labels of the right ventricle to construct. specific parameters based on timing and magnitude are extracted from the measures evaluated over a whole cardiac cycle: end-systolic strain (ess), defined as the global strain value at end-systole; systolic strain rate (srs), defined as the peak (i.e., maximum) absolute value of global sr during systole; early-diastolic strain rate (sre), defined as the peak absolute value of global sr during diastole.
c. centering, segmentation, and motion estimation
deepstrain (fig. ) consists of a series of convolutional neural networks that perform three tasks: a ventricular centering network (vcn) for automated centering and cropping, a cardiac motion estimation network (carmen) to generate $\boldsymbol{u}$, and a cardiac segmentation network (carson) to generate tissue labels. estimates of $\boldsymbol{u}$ are used to calculate myocardial strain, and segmentations are used to derive volumetric parameters, identify a cardiac coordinate system for strain analysis, and generate tissue labels used for anatomical regularization of the motion estimates at training time. let $V_t$ be a cine-mri frame at time t defined over a 3d spatial domain $\Omega \subset \mathbb{R}^3$. using a pair of frames $V_0, V_t$ as an input, vcn centers and crops the images around the center of mass of the lv, carson generates segmentations $M_0, M_t$ of the lv, rv, and lvm, and carmen estimates the motion $\boldsymbol{u}_t$ of the heart from $V_0$ to $V_t$. thus, for each voxel $p \in \Omega$, $\boldsymbol{u}_t(p)$ is an approximation of the myocardial displacement during contraction such that $V_0(p)$ and $(\boldsymbol{u}_t \circ V_t)(p)$ correspond to similar cardiac regions. the operator $\circ$ refers to application of a spatial transform to $V_t$ using $\boldsymbol{u}_t$ via trilinear interpolation [ ].
) architectures
all networks have a common encoder-decoder architecture consisting primarily of convolution, batch normalization [ ], and prelu [ ] layers with residual connections [ ] (see supplementary section s ). briefly, vcn is a 3d architecture that uses a single-channel array $V$ with size × × to generate a single-channel array $G_{pred}$ of equal size, where $G_{pred}$
corresponds to a gaussian distribution with mean defined as the lvm center of mass. $V$ is centered and cropped around the voxel with the highest value in $G_{pred}$ to generate a new cropped array of size × × , which is then the input to segmentation and motion estimation networks. carson is a two-dimensional (2d) architecture that uses images of size × to generate a -channel segmentation $M_{pred}$ of equal size, each channel corresponding to a label. carmen uses a 2-channel input volume, consisting of two concatenated arrays with size × × , to generate a 3-channel array $u$ of equal size. each channel in $u$ represents the x, y and z components of motion.
) loss functions
vcn was evaluated using the mean square error
$$\mathcal{L}_{MSE}(G_{gt}, G_{pred}) = \frac{1}{|\Omega|}\sum_{p\in\Omega}\left(G_{gt}(p) - G_{pred}(p)\right)^{2}. \quad ( )$$
for carson, we implemented a multi-class dice coefficient function
$$\mathcal{L}_{seg}(M_{gt}, M_{pred}) = -\frac{1}{\ldots}\,\ldots$$
the first term of the carmen loss is a function $\mathcal{L}_{intensity}$ that evaluates carmen using the input volumes and generated motion estimates
$$\mathcal{L}_{intensity}(V_0, V_t, u_t) = \frac{1}{|\Omega|}\sum_{p\in\Omega}\left|V_0(p) - (u_t \circ V_t)(p)\right|. \quad ( )$$
second, we used a supervised function $\mathcal{L}_{anatomical}$ that leverages segmentations of the input volumes at training time to impose an anatomical constraint on the estimates
$$\mathcal{L}_{anatomical}(M_0, M_t, u_t) = \mathcal{L}_{seg}(M_0, u_t \circ M_t). \quad ( )$$
third, smooth estimates were encouraged by using a diffusion regularizer
$$\mathcal{L}_{smoothness}(u_t) = \sum_{p\in\Omega}\left\lVert \nabla u_t(p) \cdot dr \right\rVert^{2}, \quad ( )$$
where $dr$ is the spatial resolution of $V$ and accounts for differences between in-plane and slice resolution. thus, the loss function for carmen is a linear combination of ( ), ( ), and ( ), weighted by $\lambda_i$, $\lambda_a$, $\lambda_s$, respectively. we conducted optimization experiments using synthetic data [ ], [ ] to assess the impact of smoothing and anatomical regularization on motion and strain estimates (supplementary section s ). these experiments showed smoothness improves the accuracy of the direction of the motion vectors, and anatomical regularization improves the magnitude of the vectors relative to the ground-truth motion (see supplementary fig. s and s ). the optimal values $\lambda_i$ = . , $\lambda_a$ = . , $\lambda_s$ = . were used to train carmen.
fig. . overview of proposed deepstrain workflow. vcn centers and crops the input pair of cine-mri frames. tissue labels generated by carson are used to build an anatomical model. motion estimates derived from carmen are used to calculate strain measures, and these estimates are combined with the anatomical model to enable global and regional strain analyses.
) training and testing
networks were trained in tensorflow ver. . with adam optimizer parameters beta , = . , . , batchsize = ( for carmen), and epochs = ( for carmen). ground-truth distributions for vcn were created using the manual segmentations. vcn and carson were trained using the end-diastolic and end-systolic frames of the train set, as only these included ground-truth segmentations. this provided training samples for vcn and for carson, the latter having more samples since it is a 2d architecture and all frames were resampled to a volume with slices. vcn was tested by five-fold cross-validation, whereas the accuracy of carson was assessed by submitting the results to the challenge website. once carson was trained, we generated segmentations of the test set to train carmen using the entire acdc dataset. only the [end-diastolic, end-diastolic] and [end-diastolic, end-systolic] pairs were used. the former is essential for the network to adequately learn how to scale the motion vectors, i.e., motion should be exactly zero if the frames are equal.
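as a hedged illustration of how the three carmen loss terms combine, the sketch below evaluates an intensity term, a dice-based anatomical term, and a smoothness term on short 1d lists and sums them with weights. it is not the authors' tensorflow implementation: the function names, the 1d simplification, the absolute intensity difference, the negative mean dice, and the weight values are all assumptions made for illustration, and the warping step is taken as already applied to the moving frame and mask.

```python
# Minimal sketch of a weighted three-term registration loss (pure Python, 1D toy data).
# All names and weight values are hypothetical; warping is assumed to be done already.

def intensity_loss(v0, v_warped):
    """Mean absolute difference between the reference frame and the warped frame."""
    return sum(abs(a - b) for a, b in zip(v0, v_warped)) / len(v0)


def dice_loss(m0, m_warped, labels):
    """Negative mean Dice coefficient over the listed tissue labels."""
    scores = []
    for label in labels:
        in_ref = [x == label for x in m0]
        in_warp = [x == label for x in m_warped]
        overlap = sum(1 for a, b in zip(in_ref, in_warp) if a and b)
        total = sum(in_ref) + sum(in_warp)
        # Treat a label absent from both masks as a perfect match.
        scores.append(2.0 * overlap / total if total else 1.0)
    return -sum(scores) / len(scores)


def smoothness_loss(u, dr):
    """Sum of squared forward differences of the displacement, scaled by spacing dr."""
    return sum(((u[k + 1] - u[k]) * dr) ** 2 for k in range(len(u) - 1))


def total_loss(v0, v_warped, m0, m_warped, u, dr, labels, w_int, w_anat, w_smooth):
    """Linear combination of the intensity, anatomical, and smoothness terms."""
    return (w_int * intensity_loss(v0, v_warped)
            + w_anat * dice_loss(m0, m_warped, labels)
            + w_smooth * smoothness_loss(u, dr))
```

with identical frames and zero displacement, the intensity and smoothness terms vanish and the dice term reaches its minimum of -1 in this sketch, which is consistent with the role of the [end-diastolic, end-diastolic] training pair.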
the entire cycle is analyzed at testing time by using sequential input pairs $[V_0, \cdot\,]$ that kept the end-diastolic frame constant while we varied $V_t$ for all time frames $t > 0$. using this approach $u_t$ was derived for all times. data augmentation included random rotations and translations, random mirroring along the x and y axes, and gamma contrast correction. all data augmentation was performed only in the x-y plane.
d. evaluation
) segmentation and motion estimation
carson and manual segmentations were compared using the hausdorff distance (hd) and dice similarity coefficient (dsc) metrics at both end-diastole and end-systole. accuracy of lv volumetric measures derived from segmentations, including end-diastolic volume (edv), ef, and lvm, was assessed using the correlation, bias, and standard deviation metrics. the mean absolute error (mae) for the lv edv and lvm were also computed for comparison against the intra- and inter-observer variability reported by [ ]. we compared our results to top-ranked methods published for the acdc test set as these appear in the leader-board of the challenge [ ]–[ ]. the cmac organizers defined landmarks at the intersection of gridded tagging lines at end-diastole on tagging images, one landmark $p$ per wall per ventricular level. these landmarks were manually-tracked by two observers over the cardiac cycle. conversion from tagging to cine coordinates was done using dicom header information. we used the carmen motion estimates $u_t$ to automatically deform the landmarks at end-diastole, and the accuracy was assessed using the in-plane end-point error (epe) between deformed $p'_t = u_t \circ p$ and manually-tracked $p_t$ landmarks, defined by
$$EPE(p, p') = \sqrt{(p_x - p'_x)^{2} + (p_y - p'_y)^{2}}. \quad ( )$$
due to temporal misalignment between the tagging and cine acquisitions, epe was evaluated only at end-systole ($t = t_{ES}$). specifically, let $p_{ij}(t)$ denote the manually-tracked landmarks of subject $i$ at frame $t$ by observer $j$.
the accuracy of carmen was assessed using the average epe
$$\mathrm{aEPE} = \frac{1}{2n}\sum_{i=1}^{n}\sum_{j=1}^{2} EPE\left(p_{ij}(t_{ES}),\; u_i(t_{ES}) \circ p\right). \quad ( )$$
our results were compared to those reported by the four groups that responded to the challenge [ ], mevis [ ], iucl [ ], upf [ ], and inria [ ], [ ]. all groups submitted tagging-based motion estimates, but only upf and inria provided estimates based on cine-mri.
) strain validation and intra-scanner repeatability
the tagging-mri method with the lowest aepe was used as the reference for strain analysis. the tagging-mri-based motion estimates were registered and resampled to the cine-mri space. global strain and sr values throughout the entire cardiac cycle were derived from the resampled estimates as described in [ ]. global- and regional-based analyses were performed to assess the repeatability of measures from two acquisitions. relative changes (rc) and absolute relative changes (arc) were calculated, taking the first acquisition as the reference. ess and sr were calculated for the global-based analysis, and for region-based analyses, ess values were normalized using the aha polar map, and both rc and arc were evaluated for each of the segments in the polar map.
) statistics
bland-altman analysis was used to quantify agreement between predicted and tagging strain measures. we used the term bias to denote the mean difference and the term precision to denote the standard deviation of the differences. differences were also assessed using a paired t-test with bonferroni correction for multiple comparisons. for global- and regional-based analyses of intra-scanner repeatability, icc estimates and their % confidence intervals (ci) were calculated based on a single-rating, absolute agreement, -way mixed-effects model. analyses were performed on python v . [ ].
iii. results
a. segmentation and motion estimation
centering, segmentation, and motion estimation for an entire cardiac cycle (~ frames) was accomplished in < s on a gb gpu and < . min on a gb ram cpu. vcn located the lv center of mass with a median error of . mm. correlation of carson and manual lv volumetric measures was > . across all measures (table ), and biases in ef (+ . ± . %), end-diastolic (+ . ± . ml) and end-systolic (+ . ± . ml) volumes, and mass (+ . ± . g) were not significant. further, these biases were smaller than those obtained with other methods, which were positive for lv edv ( . to . ml), negative for lvm (- . to - . g), and close to zero (± . %) for ef. simantiris et al. [ ] obtained the best precision for lv ef ( . vs. . % variance with carson), edv ( . vs. . mm), and lvm ( . vs. . g). isensee et al. [ ] obtained the best results on geometric metrics, i.e., lower hd for the lv (end-diastole . vs. . mm; end-systole . vs. . mm) and lvm ( . vs. . mm; . vs. . mm), and higher dsc for the lvm ( . vs. . ; . vs. . ). the dsc for the lv was similar for all methods (~ . , ~ . ). mae for the lv edv and lvm were . ± . ml and . ± . g. fig. a illustrates a representative example of the tagging and cine images from a cmac subject. landmarks defined at end-diastole were deformed to end-systole using the carmen estimates and compared to manual tracking. banding artifacts on cine images showed no clear effect on derived motion estimates or landmark deformation, as shown in end-systole (fig. a, yellow arrow) or throughout the whole cardiac cycle (see supplementary video). the manual tracking inter-observer variability was . mm (fig. b, dotted line).
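the average end-point error used for the landmark comparison can be sketched in a few lines. this is an illustrative example only: the function names and the data layout are hypothetical, and for simplicity each subject has a single landmark, marked independently by each observer and deformed once by the motion estimate.

```python
import math


def epe(p, q):
    """In-plane Euclidean distance between two (x, y) landmarks."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def average_epe(tracked, deformed):
    """Average EPE over subjects and observers.

    tracked[i][j] is the (x, y) landmark of subject i marked by observer j;
    deformed[i] is the automatically deformed landmark for subject i.
    """
    errors = [epe(p_obs, deformed[i])
              for i, observers in enumerate(tracked)
              for p_obs in observers]
    return sum(errors) / len(errors)
```

for two subjects and two observers, `average_epe([[(3.0, 4.0), (0.0, 0.0)], [(1.0, 1.0), (1.0, 1.0)]], [(0.0, 0.0), (1.0, 1.0)])` averages the four distances 5, 0, 0 and 0.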
table i state-of-the-art methods for left-ventricular segmentation shown at end-diastole (ed) and end-systole (es) on the acdc test set compared to proposed approach. red are the best results for each metric.
fig. . validation of motion and strain. (a) landmarks at end-diastole (unfilled green) are manually-tracked (green) and deformed with carmen to end-systole (red). yellow arrow indicates a banding artifact. (b) average end-point-error (aepe) was assessed and compared to other methods. (c) mevis- and deepstrain-based strain (top) and strain rate (sr, bottom) measures are compared. black arrow shows strain inaccuracies with mevis.
within cine-based techniques, carmen ( . ± . mm) and upf ( . ± . mm) had lower (p< . ) aepe relative to inria ( . ± . mm), but there was no significant difference between carmen and upf. all tagging-based methods had lower aepe compared to cine approaches, particularly mevis ( . ± . mm) and upf ( . ± . mm).
b. strain analysis
table shows the normal ranges (mean [ % ci]) of strain derived from cine-mri data for all healthy subjects, including subjects from the training, validation, and repeatability cohorts. deepstrain generated values with narrow ci for circumferential (~ %) and radial (~ %) ess, and circumferential (~ . s- ) and radial (~ . s- ) sr. specifically, circumferential and radial values across datasets were: - . % [- . - . ] and . % [ . . ] for ess, - . s- [- . - . ] and . s- [ . . ] for srs, and . s- [ . . ] and - . s- [- . - . ] for sre, respectively.
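for reference, the lagrange strain tensor from which these strain values are derived can be computed directly from a displacement gradient. the sketch below is a minimal pure-python illustration, not the authors' code: the function name is hypothetical, and the gradient is assumed to be given as a square nested list with entry [i][j] equal to the derivative of displacement component i with respect to coordinate j.

```python
def lagrange_strain(grad_u):
    """Return 0.5 * (G + G^T + G^T G) for a square displacement-gradient matrix G."""
    n = len(grad_u)
    strain = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # (G^T G)[i][j] = sum over k of G[k][i] * G[k][j]
            quadratic = sum(grad_u[k][i] * grad_u[k][j] for k in range(n))
            strain[i][j] = 0.5 * (grad_u[i][j] + grad_u[j][i] + quadratic)
    return strain
```

a pure rigid rotation gives zero strain in this formulation because the quadratic term cancels the linear terms, which is why the tensor measures deformation rather than bulk motion.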
these values were similar to those from tagging-based ones, although circumferential sre from cine-mri data was lower, mostly in the train set ( . ± . s- ). comparison of tagging- and cine-based strain measures with matched subjects showed an overall agreement in timing and magnitude of strain and sr throughout the cardiac cycle, although tagging-based measures of radial ess diverge after early diastole (fig. c, black arrow), and there were visual differences in peak sr parameters. visual inspection of image artifacts on cine data showed no clear evidence that these artifacts affected strain values derived with deepstrain (see supplementary fig. s ). quantitative comparisons of tagging- and cine-based measures showed biases in circumferential ess (- . ± . vs. - . ± . %; bias - . ± . %), radial ess ( . ± . vs. . ± . %; + . ± . %) and sre (- . ± . vs. - . ± . ; - . ± . s- ) were small and not significantly different from zero (see supplementary fig. s ). however, there were larger differences (p< . ) in radial srs ( . ± . vs. . ± . s- ; . ± . s- ), and circumferential srs (- . ± . vs. - . ± . s- ; . ± . s- ) and sre ( . ± . vs. . ± . s- ; . ± . s- ). representative strain measures of a single subject derived from two acquisitions are shown in fig. . the aha polar maps from both acquisitions showed comparable regional variations in ess, particularly for circumferential ess in the inferoseptal wall (fig. a, orange arrows). global curves throughout the entire cardiac cycle also showed visual agreement in both timing and magnitude (fig. b). from these data, circumferential (- . vs. - . %) and radial ( . vs. %) ess (fig. b, purple), circumferential srs ( . vs. . s- ) and sre (- . vs. - . s- ), and radial srs ( . vs. . s- ) and sre (- . vs. - . s- ) global parameters were also found to be similar (fig. b, yellow). in addition, while not quantified in this study, the late-diastolic filling peaks were also comparable (fig. b, blue). 
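the repeatability measures can be illustrated with a short sketch. the text states only that the first acquisition is taken as the reference, so the exact formulas below (percent difference divided by the first value, and its absolute value) are assumptions, as are the function names.

```python
def relative_change(first, second):
    """Percent change of the second acquisition relative to the first (the reference)."""
    return 100.0 * (second - first) / first


def absolute_relative_change(first, second):
    """Magnitude of the relative change, ignoring its sign."""
    return abs(relative_change(first, second))
```

under this assumed definition, a negative reference value such as a circumferential strain flips the sign of the relative change, whereas the absolute relative change is unaffected.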
table shows the rc, arc, icc, and loa across subjects for the global parameters. the average arc was below % for ess (circumferential: . ± . %; radial: . ± . %), below % for srs ( . ± . %; . ± . %), and below % for sre parameters ( . ± . %; . ± . %). icc results showed repeatability was excellent for ess ( . ; . ), good for srs ( . ; . ), moderate for circumferential sre ( . ), and excellent for radial sre ( . ) values. the loa, which defines the interval where to find the expected differences in % of the cases assuming normally distributed data, were ~ % and ~ % for circumferential and radial ess, and < . s- for all sr measures. the ess, rc, and arc maps averaged across subjects are shown in fig. . visually, these maps (fig. b) showed the average rc and arc were marginal ( ~ %) in more than half of the polar map segments. specifically, values were marginal for circumferential ess (~ %) in the anterior, anteroseptal, and anterolateral walls, but were larger in the inferior region, most notably in the basal- and mid-inferoseptal segments ( %). for radial ess the largest changes were found in the mid- anterolateral segment ( %), whereas changes in the anteroseptal, inferior and inferolateral walls were very small (~ %). the rc and arc per subject are provided in boxplot form in supplementary fig s . these results showed that, in most of the segments, the rc and arc were less than ~ %, although larger differences were noted in the inferoseptal wall for radial ess, and anterolateral wall for circumferential. supplementary table s shows the icc and loa per segment, including the whole-map average. for radial ess, the icc results showed excellent repeatability across all segments. circumferentially, all segments showed good to excellent repeatability, except for the basal- and mid-inferolateral segments. loas showed that % of differences occurred within ~ % and ~ % intervals for circumferential and radial ess. c. 
evaluation in patients with cardiovascular disease regional measures of ess averaged over each patient population (see supplementary figure s ), as well as global values of strain and sr across the cardiac cycle (fig. ) for all subjects in the acdc train set showed a progressive decline in strain values starting with hcm, followed by arv, mi, and dcm. specifically, relative to the healthy group, radial ess was reduced in all patient populations. radial systolic and early-diastolic sr were also reduced in all patient groups, except for systolic sr in hcm. fig. shows both the cine-mri image and the circumferential ess polar map of a healthy subject and two patients with mi. strain values in the healthy polar map have a homogeneous distribution. in contrast, in one mi patient the map indicates a diffuse reduction, and inspection of the myocardium on the cine-mri image shows an anteroseptal infarct that coincides in location with segments with more prominent decreases in strain. in a different mi patient with an infarct located in a similar septal region, strain changes are focal and localized to the anteroseptal wall. table ii normal ranges of strain with deepstrain in healthy subjects. tagging-based measures are shown for the cmac cohort. deepstrain repeatability is shown for two acquisitions (acq). table iii intra-scanner repeatability of global circumferential (circ) and radial (rad) strain measures. iv. discussion learning-based methodologies have the potential to meet the technical challenges associated with myocardial strain analysis.
in this study we developed a fast dl framework for strain analysis based on cine-mri data that does not make assumptions about the underlying physiology, and we benchmarked its segmentation, motion, and strain estimation components against the state-of-the-art. we compared our segmentations to other dl methods, motion estimates to non-learning techniques, and strain measures to a reference tagging-mri technique. we also presented the intra-scanner repeatability of deepstrain-based global and regional strain measures, and showed that these measures were robust to image artifacts in some cases. global and regional applications were also presented to demonstrate the potential clinical utilization of our approach. a. volumetric measures segmentation from mri data is a task particularly well suited for convolutional networks given the excellent soft-tissue contrast, and all top-performing methods on the acdc test set were based on dl approaches (table ). isensee et al. [ ] had remarkable success on geometric metrics, but this and other approaches result in a systematic overestimation of the lv edv and thus underestimation of lvm. in contrast, carson generated less biased measures of lv volumes and mass, with biases that were not significant. although simantiris et al. [ ] obtained the most precise measures, possibly due to their extensive use of augmentation using image intensity transformations, across methods the precision of ef was within the ~ - % [ ] needed when it is used as an index of lv function in clinical trials [ ]. lastly, we showed that the error in our measures of lv edv and lvm was almost half the inter-observer (~ . ml, . g), and comparable to the intra-observer (~ . ml, . g) mae reported in [ ], but further investigations are required to assess the performance on more heterogeneous populations. b. strain measures the application of myocardial strain to quantify abnormal deformation in disease requires accurate definition of normal ranges.
however, previously reported normal ranges vary widely between modalities and techniques, particularly for radial ess [ ]. in this study we showed deepstrain generated strain measures with narrow ci in healthy subjects from across three different datasets (table ). although direct comparison with the literature is difficult due to differences in the datasets, overall our strain measures agreed with several reported results. specifically, circumferential strain is in agreement with studies in healthy participants based on tagging (- . %, n= ) and speckle tracking echocardiography (- %, n= ) datasets [ ], [ ], as well as a recently proposed (- . % basal, n= ) tagging-based dl method [ ]. radial strain is in agreement with tagging-based ( . %, n= ; . % basal, n= ) studies [ ], [ ], but is lower than most reported values [ ]. this is a result of smoothing regularization used during training to prevent overfitting. however, lowering the regularization without increasing the size of the training set would lead to increased epe and wider ci. sr measures derived with deepstrain were also in good agreement with previous tagging-based studies [ ]. fig. . global and regional strain measures of representative subject. (a) regional end-systolic strain measures show visual agreement (orange arrow). (c) global strain and strain rate (sr) measures also show visual agreement. the cmac dataset enabled us to compare our results to non-learning methods using a common dataset.
we found that aepe was lower with tagging-based techniques, reflecting the advantage of estimating cardiac motion from a grid of intrinsic tissue markers (i.e., grid tagging lines). further, the tagging techniques also benefited from the fact that landmarks were placed at the center of the ventricle, whereas motion estimation from tagging data at the myocardial borders and in thin-walled regions of the lv is less accurate due to the spatial resolution of the tagging grid [ ]. in addition, some of the tagging-mri images did not enclose the whole myocardium and some contained imaging artifacts, which resulted in strain artifacts towards the end of the cardiac cycle (fig. c, black arrow). we found that mevis had the lowest aepe, which could be a result of their image term ( ) that penalizes phase shifts in the fourier domain instead of intensity values, an approach that is less affected by desaturation (i.e., fading) of the tagging grid over time. the upf approach also achieved a low aepe using multimodal integration and d tracking to leverage the strengths of both modalities and improve temporal consistency [ ]. although this approach could in principle be recast as a dl technique using recurrent neural networks [ ], this would require a significant increase in the number of learnable parameters, and therefore very large datasets would be needed to avoid overfitting. using mevis as the tagging reference standard, we found no significant differences in measures of radial and circumferential ess (fig. c). temporally, we found significant differences in sr measures between the two techniques that could be due to drift errors in the mevis implementation, i.e., errors that accumulate in sequential implementations in which motion is estimated frame-by-frame [ ].
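the two notions used in this comparison, the average end-point error and drift in frame-by-frame estimation, can be made concrete with a small sketch. it assumes aepe is the mean euclidean distance between estimated and reference landmark positions, and it models drift with a hypothetical constant error added to every frame-to-frame step; the function names and the toy numbers are illustrative and do not reproduce any of the benchmarked methods.

```python
import numpy as np

def aepe(pred, truth):
    """Average end-point error: mean Euclidean distance between landmark sets.

    pred and truth have shape (n_landmarks, n_dims), in the same units (e.g. mm).
    """
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.linalg.norm(pred - truth, axis=-1).mean())

def accumulate_sequential(step_displacements):
    """Sum frame-to-frame displacements to get displacement relative to frame 0.

    step_displacements has shape (n_frames, n_landmarks, n_dims).
    """
    return np.cumsum(np.asarray(step_displacements, dtype=float), axis=0)

# Toy data: 3 landmarks moving 1 mm per frame along x for 4 frames.
true_steps = np.tile(np.array([1.0, 0.0, 0.0]), (4, 3, 1))
# Hypothetical estimator with a constant 0.1 mm error in every step.
estimated_steps = true_steps + np.array([0.1, 0.0, 0.0])

true_total = accumulate_sequential(true_steps)
estimated_total = accumulate_sequential(estimated_steps)

# The per-frame AEPE grows with the number of accumulated steps.
drift = [aepe(estimated_total[t], true_total[t]) for t in range(4)]
print(drift)
```

in this toy setting the error grows linearly with the frame index because every step contributes the same offset; an estimator that maps each frame directly to the reference frame would not accumulate error in this way.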
although we did not observe considerable improvements in aepe compared to tagging- and cine-based methods, an important advantage of our approach is the reduced computational complexity (~ sec on a gpu) relative to the mevis ( - h), iucl ( - h), upf ( h) and inria ( h) approaches [ ]. specifically, because once trained our network does not optimize for a specific test subject (i.e., it does not iterate on the cine-data to generate the desired output), centering, segmentation, and motion estimation for the entire cardiac cycle can be accomplished much faster (< min on a cpu). an additional advantage of non-iterative implementations is that we obtain deterministic results. since this implies the exact same motion estimates are generated given the same input, we expect strain measures not to vary meaningfully if the anatomy and function remain fixed. here we studied this property by evaluating the intra-scanner repeatability, an important aspect to consider when assessing the potential clinical utility of deepstrain. global measures of ess showed excellent repeatability with narrow loas and with absolute rcs of less than % on average, and regional analyses also showed the average rc and arc were less than % in more than half of the polar map segments, with the maximum difference being %. fig. . intra-scanner repeatability of regional myocardial strain measures. (a) average of subject-specific regional end-systolic strain (ess) maps during two acquisitions. (b) average changes between acquisitions.
finally, all sr measures showed good to excellent repeatability, except for circumferential sre, which was moderate. c. clinical evaluation deepstrain could be applied in a wide range of clinical applications, e.g., automated extraction of imaging phenotypes from large-scale databases (e.g., uk biobank [ ]). such phenotypes include global and regional strain, which are important measures in the setting of existing dysfunction with preserved ef [ ]. deepstrain generated measures of global strain and sr over the entire cardiac cycle from a cohort of subjects in < min (fig. ). these results showed that radial sre was reduced in patients with hcm and arv, despite having a normal or increased lv ef. decreased sre with normal ef is suggestive of subclinical lv diastolic dysfunction, which is in agreement with previous findings [ ], [ ]. our results also showed deepstrain-based maps could be used to characterize regional differences between groups (supplementary fig. s ). at an individual level (fig. ), we showed that in mi patients, polar segments with decreased circumferential strain matched myocardial regions with infarcted tissue. further, we showed that the changes in regional strain due to mi can be both diffuse and focal. these abnormalities could be used to discriminate dysfunctional from functional myocardium [ ], or as inputs for downstream classification algorithms [ ]. more generally, deepstrain could be used to extract interpretable features (e.g., strain and sr) for dl diagnostic algorithms [ ], which would make understanding of the pathophysiological basis of classification more attainable [ ]. d. study limitations a limitation of our study was the absence of important patient information (e.g., age), which would be needed for a more complete interpretation of our strain analysis results, for example to assess the differences in strain values found between the healthy subjects from the acdc and cmac datasets.
however, using publicly available data enables the scientific community to more easily reproduce our findings, and compare our results to other techniques. another limitation was the absence of longitudinal analyses, i.e., longitudinal strain was not reported because it is normally derived from long-axis cine-mri data not available in the training dataset. the size of the datasets is another potential limitation. the number of patients used for training is much smaller than the number of trainable parameters, potentially resulting in some degree of overfitting. to correct this, the training set for motion estimation could be expanded by validating the proposed segmentation network on more heterogeneous populations. also, while our repeatability results were promising despite testing in only a small number of subjects, repeatability in patient populations was not shown. e. conclusion we developed an end-to-end learning-based workflow for strain analysis that is fast, operator-independent, and leverages real-world data instead of making explicit assumptions about myocardial tissue properties or geometry. this approach enabled us to derive strain measures from new data without further training or parameter fine-tuning, and our measures were robust to image artifacts, repeatable, and comparable to those derived from dedicated tagging data. these technical and practical attributes position deepstrain as an excellent candidate for use in routine clinical studies or data-driven research. acknowledgment we acknowledge the support of nvidia corporation with the donation of the titan x pascal gpu used for this research. we also thank p. jodoin (acdc) and c. tobon-gomez (cmac) for their assistance with the challenge datasets. references [ ] m. a. konstam and f. m. abboud, “ejection fraction: misunderstood and over-rated (changing the paradigm in categorizing heart failure),” circulation, vol. , no. , pp. – , feb. . [ ] p. claus, a. m. s. omar, g. pedrizzetti, p. p.
sengupta, and e. nagel, “tissue tracking technology for assessing cardiac mechanics,” jacc: cardiovascular imaging, vol. , no. , pp. – , dec. . [ ] o. a. smiseth, h. torp, a. opdahl, k. h. haugaa, and s. urheim, “myocardial strain imaging: how useful is it in clinical decision making?,” eur heart j, vol. , no. , pp. – , apr. . fig. . strain and strain rate measures computed on the acdc train set. fig. . regional strain in healthy and patients with mi. myocardial infarction can result in diffused (center) and focal (right) strain reduction. [ ] m. s. amzulescu, m. de craene, h. langet, a. pasquet, d. vancraeynest, a. c. pouleur, j. l. vanoverschelde, and b. l. gerber, “myocardial strain imaging: review of general principles, validation, and sources of discrepancies,” european heart journal - cardiovascular imaging, mar. . [ ] n. f. osman, s. sampath, e. atalar, and j. l. prince, “imaging longitudinal cardiac strain on short-axis images using strain-encoded mri,” magn. reson. med., vol. , no. , pp. – , aug. . [ ] d. kim, w. d. gilson, c. m. kramer, and f. h. epstein, “myocardial tissue tracking with two-dimensional cine displacement-encoded mr imaging: development and initial evaluation,” radiology, vol. , no. , pp. – , mar. . [ ] n. risum, s. ali, n. t. olsen, c. jons, m. g. khouri, t. k. lauridsen, z. samad, e. j. velazquez, p. sogaard, and j. kisslo, “variability of global left ventricular deformation analysis using vendor dependent and independent two-dimensional speckle-tracking software in adults,” journal of the american society of echocardiography, vol. , no. , pp.
– , nov. . [ ] a. schuster, v.-c. stahnke, c. unterberg-buchwald, j. t. kowallick, p. lamata, m. steinmetz, s. kutty, m. fasshauer, w. staab, j. m. sohns, b. bigalke, c. ritter, g. hasenfuß, p. beerbaum, and j. lotz, “cardiovascular magnetic resonance feature-tracking assessment of myocardial mechanics: intervendor agreement and considerations regarding reproducibility,” clinical radiology, vol. , no. , pp. – , sep. . [ ] wenzhe shi, xiahai zhuang, haiyan wang, s. duckett, d. v. n. luong, c. tobon-gomez, kaipin tung, p. j. edwards, k. s. rhode, r. s. razavi, s. ourselin, and d. rueckert, “a comprehensive cardiac motion estimation framework using both untagged and -d tagged mr images based on nonrigid registration,” ieee trans. med. imaging, vol. , no. , pp. – , jun. . [ ] g. pedrizzetti, p. claus, p. j. kilner, and e. nagel, “principles of cardiovascular magnetic resonance feature tracking and echocardiographic speckle tracking for informed clinical use,” journal of cardiovascular magnetic resonance, vol. , no. , p. , dec. . [ ] m. de craene, g. piella, o. camara, n. duchateau, e. silva, a. doltra, j. d’hooge, j. brugada, m. sitges, and a. f. frangi, “temporal diffeomorphic free-form deformation: application to motion and strain estimation from d echocardiography,” medical image analysis, vol. , no. , pp. – , feb. . [ ] t. mansi, x. pennec, m. sermesant, h. delingette, and n. ayache, “ilogdemons: a demons-based registration algorithm for tracking incompressible elastic biological tissues,” int j comput vis, vol. , no. , pp. – , mar. . [ ] r. avazmohammadi, j. s. soares, d. s. li, t. eperjesi, j. pilla, r. c. gorman, and m. s. sacks, “on the in vivo systolic compressibility of left ventricular free wall myocardium in the normal and infarcted heart,” journal of biomechanics, vol. , p. , jun. . [ ] v. kumar, a. j. ryu, a. manduca, c. rao, r. j. gibbons, b. j. gersh, k. chandrasekaran, s. j. asirvatham, p. a. araoz, j. k. oh, a. c. egbe, a. behfar, b. a. 
borlaug, and n. s. anavekar, “cardiac mri demonstrates compressibility in healthy myocardium but not in myocardium with reduced ejection fraction,” international journal of cardiology, vol. , pp. – , jan. . [ ] b. zhu, j. z. liu, b. r. rosen, and m. s. rosen, “image reconstruction by domain transform manifold learning,” arxiv: . [cs], apr. . [ ] p. dong, b. provencher, n. basim, n. piché, and m. marsh, “forget about cleaning up your micrographs: deep learning segmentation is robust to image artifacts,” microsc microanal, pp. – , jul. . [ ] g. simantiris and g. tziritas, “cardiac mri segmentation with a dilated cnn incorporating domain-specific constraints,” ieee j. sel. top. signal process., vol. , no. , pp. – , oct. . [ ] f. isensee, p. jaeger, p. m. full, i. wolf, s. engelhardt, and k. h. maier-hein, “automatic cardiac disease assessment on cine-mri via time-series segmentation and domain specific features,” arxiv: . [cs], vol. , . [ ] c. zotti, z. luo, a. lalande, and p.-m. jodoin, “convolutional neural network with shape prior applied to cardiac mri segmentation,” ieee j. biomed. health inform., vol. , no. , pp. – , may . [ ] m. baldeon calisto and s. k. lai-yuen, “adaen-net: an ensemble of adaptive d– d fully convolutional networks for medical image segmentation,” neural networks, vol. , pp. – , jun. . [ ] k. hammouda, f. khalifa, h. abdeltawab, a. elnakib, g. a. giridharan, m. zhu, c. k. ng, s. dassanayaka, m. kong, h. e. darwish, t. m. a. mohamed, s. p. jones, and a. el-baz, “a new framework for performing cardiac strain analysis from cine mri imaging in mice,” sci rep, vol. , no. , p. , dec. . [ ] e. puyol-anton, b. ruijsink, w. bai, h. langet, m. de craene, j. a. schnabel, p. piro, a. p. king, and m. sinclair, “fully automated myocardial strain estimation from cine mri using convolutional neural networks,” in ieee th international symposium on biomedical imaging (isbi ), washington, dc, , pp. – . [ ] c. qin, w. bai, j. schlemper, s. e. petersen, s. k. 
piechnik, s. neubauer, and d. rueckert, “joint learning of motion estimation and segmentation for cardiac mr image sequences,” arxiv: . [cs], jun. . [ ] m. qiao, y. wang, y. guo, l. huang, l. xia, and q. tao, “temporally coherent cardiac motion tracking from cine mri: traditional registration method and modern cnn method,” med. phys., vol. , no. , pp. – , sep. . [ ] h. yu, s. sun, h. yu, x. chen, h. shi, t. s. huang, and t. chen, “foal: fast online adaptive learning for cardiac motion estimation,” in ieee/cvf conference on computer vision and pattern recognition (cvpr), seattle, wa, usa, , pp. – . [ ] p. chen, x. chen, e. z. chen, h. yu, t. chen, and s. sun, “anatomy- aware cardiac motion estimation,” arxiv: . [cs, eess], aug. . [ ] b. d. de vos, f. f. berendsen, m. a. viergever, m. staring, and i. išgum, “end-to-end unsupervised deformable image registration with a convolutional neural network,” arxiv: . [cs], vol. , pp. – , . [ ] m. a. morales, d. izquierdo-garcia, i. aganj, j. kalpathy-cramer, b. r. rosen, and c. catana, “implementation and validation of a three- dimensional cardiac motion estimation network,” radiology: artificial intelligence, vol. , no. , p. e , jul. . [ ] a. Østvik, e. smistad, t. espeland, e. a. r. berg, and l. lovstakken, “automatic myocardial strain imaging in echocardiography using deep learning,” in deep learning in medical image analysis and multimodal learning for clinical decision support, vol. , d. stoyanov, z. taylor, g. carneiro, t. syeda-mahmood, a. martel, l. maier-hein, j. m. r. s. tavares, a. bradley, j. p. papa, v. belagiannis, j. c. nascimento, z. lu, s. conjeti, m. moradi, h. greenspan, and a. madabhushi, eds. cham: springer international publishing, , pp. – . [ ] j.-u. voigt, g. pedrizzetti, p. lysyansky, t. h. marwick, h. houle, r. baumann, s. pedri, y. ito, y. abe, s. metz, j. h. song, j. hamilton, p. p. sengupta, t. j. kolias, j. d’hooge, g. p. aurigemma, j. d. thomas, and l. p. 
badano, “definitions for a common standard for d speckle tracking echocardiography: consensus document of the eacvi/ase/industry task force to standardize deformation imaging,” european heart journal - cardiovascular imaging, vol. , no. , pp. – , jan. . [ ] o. bernard, a. lalande, c. zotti, f. cervenansky, x. yang, p.-a. heng, i. cetin, k. lekadir, o. camara, m. a. gonzalez ballester, g. sanroma, s. napel, s. petersen, g. tziritas, e. grinias, m. khened, v. a. kollerathu, g. krishnamurthi, m.-m. rohe, x. pennec, m. sermesant, f. isensee, p. jager, k. h. maier-hein, p. m. full, i. wolf, s. engelhardt, c. f. baumgartner, l. m. koch, j. m. wolterink, i. isgum, y. jang, y. hong, j. patravali, s. jain, o. humbert, and p.-m. jodoin, “deep learning techniques for automatic mri cardiac multi- structures segmentation and diagnosis: is the problem solved?,” ieee trans. med. imaging, vol. , no. , pp. – , nov. . [ ] c. tobon-gomez, m. de craene, k. mcleod, l. tautz, w. shi, a. hennemuth, a. prakosa, h. wang, g. carr-white, s. kapetanakis, a. lutz, v. rasche, t. schaeffter, c. butakoff, o. friman, t. mansi, m. sermesant, x. zhuang, s. ourselin, h.-o. peitgen, x. pennec, r. razavi, d. rueckert, a. f. frangi, and k. s. rhode, “benchmarking framework for myocardial tracking and deformation algorithms: an open access database,” medical image analysis, vol. , no. , pp. – , aug. . [ ] american heart association writing group on myocardial segmentation and registration for cardiac imaging:, m. d. cerqueira, n. j. weissman, v. dilsizian, a. k. jacobs, s. kaul, w. k. laskey, d. j. pennell, j. a. rumberger, t. ryan, and m. s. verani, “standardized myocardial segmentation and nomenclature for tomographic imaging of the heart: a statement for healthcare professionals from the cardiac imaging committee of the council on clinical cardiology of the american heart association,” circulation, vol. , no. , pp. – , jan. .
[ ] m. jaderberg, k. simonyan, a. zisserman, and k. kavukcuoglu, “spatial transformer networks,” arxiv: . [cs], jun. . [ ] s. ioffe and c. szegedy, “batch normalization: accelerating deep network training by reducing internal covariate shift,” arxiv: . [cs], mar. . [ ] b. xu, n. wang, t. chen, and m. li, “empirical evaluation of rectified activations in convolutional network,” arxiv: . [cs, stat], nov. . [ ] k. he, x. zhang, s. ren, and j. sun, “deep residual learning for image recognition,” arxiv: . [cs], dec. . [ ] w. p. segars, g. sturgeon, s. mendonca, j. grimes, and b. m. w. tsui, “ d xcat phantom for multimodality imaging research: d xcat phantom for multimodality imaging research,” medical physics, vol. , no. , pp. – , aug. . [ ] l. wissmann, c. santelli, w. p. segars, and s. kozerke, “mrxcat: realistic numerical phantoms for cardiovascular magnetic resonance,” journal of cardiovascular magnetic resonance, vol. , no. , dec. . [ ] l. tautz, a. hennemuth, and h.-o. peitgen, “motion analysis with quadrature filter based registration of tagged mri sequences,” in statistical atlases and computational models of the heart. imaging and modelling challenges, vol. , o. camara, e. konukoglu, m. pop, k. rhode, m. sermesant, and a. young, eds. berlin, heidelberg: springer berlin heidelberg, , pp. – . [ ] k. mcleod, a. prakosa, t. mansi, m. sermesant, and x. pennec, “an incompressible log-domain demons algorithm for tracking heart tissue,” in statistical atlases and computational models of the heart. imaging and modelling challenges, vol. , o. camara, e. konukoglu, m. pop, k. rhode, m. sermesant, and a. young, eds.
berlin, heidelberg: springer berlin heidelberg, , pp. – . [ ] e. ferdian, a. suinesiaputra, k. fung, n. aung, e. lukaschuk, a. barutcu, e. maclean, j. paiva, s. k. piechnik, s. neubauer, s. e. petersen, and a. a. young, “fully automated myocardial strain estimation from cardiovascular mri–tagged images using a deep learning framework in the uk biobank,” radiology: cardiothoracic imaging, vol. , no. , p. e , feb. . [ ] r. vallat, “pingouin: statistics in python,” joss, vol. , no. , p. , nov. . [ ] n. painchaud, y. skandarani, t. judge, o. bernard, a. lalande, and p.- m. jodoin, “cardiac mri segmentation with strong anatomical guarantees,” in medical image computing and computer assisted intervention – miccai , vol. , d. shen, t. liu, t. m. peters, l. h. staib, c. essert, s. zhou, p.-t. yap, and a. khan, eds. cham: springer international publishing, , pp. – . [ ] m. khened, v. alex, and g. krishnamurthi, “densely connected fully convolutional network for short-axis cardiac cine mr image segmentation and heart diagnosis using random forest,” in statistical atlases and computational models of the heart. acdc and mmwhs challenges, vol. , m. pop, m. sermesant, p.-m. jodoin, a. lalande, x. zhuang, g. yang, a. young, and o. bernard, eds. cham: springer international publishing, , pp. – . [ ] j. a. san román, j. candell-riera, r. arnold, p. l. sánchez, s. aguadé-bruix, j. bermejo, a. revilla, a. villa, h. cuéllar, c. hernández, and f. fernández-avilés, “quantitative analysis of left ventricular function as a tool in clinical research. theoretical basis and methodology,” revista española de cardiología (english edition), vol. , no. , pp. – , may . [ ] j. p. kelly, r. j. mentz, a. mebazaa, a. a. voors, j. butler, l. roessig, m. fiuzat, f. zannad, b. pitt, c. m. o’connor, and c. s. p. lam, “patient selection in heart failure with preserved ejection fraction clinical trials,” journal of the american college of cardiology, vol. , no. , pp. – , apr. . [ ] b. a. venkatesh, s. 
donekal, k. yoneyama, c. wu, v. r. s. fernandes, b. d. rosen, m. l. shehata, r. mcclelland, d. a. bluemke, and j. a. c. lima, “regional myocardial functional patterns: quantitative tagged magnetic resonance imaging in an adult population free of cardiovascular risk factors: the multi-ethnic study of atherosclerosis (mesa): reference values of strain from tagged mri,” j. magn. reson. imaging, vol. , no. , pp. – , jul. . [ ] d. muraru, u. cucchini, s. mihăilă, m. h. miglioranza, p. aruta, g. cavalli, a. cecchetto, s. padayattil-josè, d. peluso, s. iliceto, and l. p. badano, “left ventricular myocardial strain by three-dimensional speckle-tracking echocardiography in healthy subjects: reference values and analysis of their physiologic and technical determinants,” journal of the american society of echocardiography, vol. , no. , pp. - .e , aug. . [ ] z. gan, j. tang, and x. yang, “left ventricle motion estimation based on unsupervised recurrent neural network,” in ieee international conference on bioinformatics and biomedicine (bibm), san diego, ca, usa, , pp. – . [ ] a. fry, t. j. littlejohns, c. sudlow, n. doherty, l. adamska, t. sprosen, r. collins, and n. e. allen, “comparison of sociodemographic and health-related characteristics of uk biobank participants with those of the general population,” american journal of epidemiology, vol. , no. , pp. – , nov. . [ ] s. chen, j. yuan, s. qiao, f. duan, j. zhang, and h. wang, “evaluation of left ventricular diastolic function by global strain rate imaging in patients with obstructive hypertrophic cardiomyopathy: a simultaneous speckle tracking echocardiography and cardiac catheterization study,” echocardiography, vol. , no. , pp. – , may . [ ] a. j. marian and e. braunwald, “hypertrophic cardiomyopathy: genetics, pathogenesis, clinical manifestations, diagnosis, and therapy,” circ res, vol. , no. , pp. – , sep. . [ ] m. j. w. götte, a. c. van rossum, j. w. r. twisk, j. p. a. kuijer, j. t. marcus, and c. a. 
visser, “quantification of regional contractile function after infarction: strain analysis superior to wall thickening analysis in discriminating infarct from remote myocardium,” journal of the american college of cardiology, vol. , no. , pp. – , mar. . [ ] n. zhang, g. yang, z. gao, c. xu, y. zhang, r. shi, j. keegan, l. xu, h. zhang, z. fan, and d. firmin, “deep learning for diagnosis of chronic myocardial infarction on nonenhanced cardiac cine mri,” radiology, vol. , no. , pp. – , jun. . [ ] q. zheng, h. delingette, and n. ayache, “explainable cardiac pathology classification on cine mri with motion characterization by semi-supervised learning of apparent flow,” arxiv: . [cs, stat], mar. . [ ] p. n. kampaktsis and m. vavuranakis, “diastolic function evaluation,” jacc: cardiovascular imaging, vol. , no. , pp. – , jan. .
improving variant calling using population data and deep learning nae-chyun chen , ‡,∗, alexey kolesnikov , sidharth goel , taedong yun , pi-chuan chang , †, and andrew carroll , †,∗ department of computer science, johns hopkins university, baltimore, md , usa google health, palo alto, ca and cambridge, ma , usa corresponding author: cnaechy @jhu.edu; awcarroll@google.com †these authors contributed equally to this work. ‡work performed while an intern at google health. january , abstract large-scale population variant data is often used to filter and aid interpretation of variant calls in a single sample.
these approaches do not incorporate population information directly into the process of variant calling, and are often limited to filtering, which trades recall for precision. in this study, we modify deepvariant to add a new channel encoding population allele frequencies from the genomes project. we show that this model reduces variant calling errors, improving both precision and recall. we assess the impact of using population-specific or diverse reference panels. we achieve the greatest accuracy with diverse panels, suggesting that large, diverse panels are preferable to individual populations, even when the population matches sample ancestry. finally, we show that this benefit generalizes to samples with different ancestry from the training data even when the ancestry is also excluded from the reference panel.

background
variant calling [ – ] identifies the positions in an individual genome which differ from a reference or population, and is used to characterize a single sample or build large research cohorts [ , ]. variant calling is non-trivial, because of sequencing errors, systematic errors in mapping to repetitive and variable regions [ ], and imbalanced sampling of alleles needed to distinguish a heterozygous variant from a homozygous one.

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission.

variant calling can be improved by jointly genotyping multiple samples together [ – ], but the raw sequence data for a cohort is not always available, and this process is computationally expensive. instead, large-scale reference panels from a wide range of populations can provide similar information [ , ].
recent studies use such information to improve alignment accuracy and reduce biases in alignment [ – ], but there has been little work to incorporate population data with variant calling.

because far more variants are transmitted than arise de novo, real variants in a population tend to recur at various frequencies [ ], while false positives are often either not seen elsewhere in a population, or are seen with a consistent signature [ ]. researchers use this knowledge to filter variant calls, often with rules which lose recall for a gain in precision [ ]. more sophisticated machine-learning filtering methods are used in larger cohorts, such as gnomad, but these also trade recall for precision and only operate on variant calls and summary information [ ].

we reason that including population-level information at an earlier stage in variant calling, when the full read-level data is available, might allow for more effective use of population data. to do this, we adapted deepvariant [ ], which represents bam information as a multi-dimensional pileup and uses a convolutional neural network (cnn) to call variants. because deepvariant learns the features important for variant classification directly from the data, it allows us to feed in the population allele information as an additional channel.

we trained population-aware models and compared them with the default deepvariant v . models, which are agnostic of population information. the population-aware approach reduces the number of errors for all tested datasets, including wgs and wes reads, when using the allele frequencies from genomes. it also shows stronger error reduction for lower-coverage read sets. while traditional filtering approaches increase precision at the expense of recall, we observe improvements to both precision and recall with this method.
when incorporating population data, it is also important for fairness and equity to understand how it changes the accuracy of methods for individuals with ancestries outside of those used in the development of the population resources. it is known that many genomic databases have collected more data for the european population than others [ – ]. we demonstrate that even when using frequencies from a genetically distinct population, the population-aware model still performs similarly to the baseline. we find that a reference panel consisting of all ancestries in the genomes project ( genomes) outperforms a reference panel with only one of the genomes population groups, even when that population matches the sample being called. this implies that maximizing the diversity of ancestries in population resources has the potential to improve variant calling for all populations.

the genome in a bottle (giab) truth sets used to train deepvariant are from european, ashkenazi, and asian ancestry. to assess whether the addition of the reference panel information improves variant calling for populations outside of the populations represented in training, we use high quality pacbio hifi [ ] data from the human genome structural variation consortium for an individual of puerto rican ancestry as an evaluation set. we show that an illumina model using the reference panel has superior concordance with the highly accurate pacbio hifi variant calls compared to an illumina model without the reference panel.

results .
population information improves deepvariant performance

deepvariant converts input from a bam file into a pileup image with channels representing 1) bases, 2) base qualities, 3) mapping quality, 4) strand, 5) supports variant, and 6) base differs from reference. we modified deepvariant v . to take an additional input channel, the allele frequency (af) of the variant [ ]. we trained deepvariant models with and without the af channel, with the testing samples held out.

we first compared the whole-genome sequencing (wgs) variant calling accuracy for sample hg , sequenced with x coverage from the precisionfda v truth challenge [ ], using the latest giab v . . truth set [ ] (figure ). hg is not used in the training of these deepvariant models, and so acts as an independent holdout to evaluate their quality. the population-aware model has higher accuracy than default deepvariant v . in both precision and recall for both types of variants. it has an overall error reduction of ( . %). for snps, the error rate (defined as 1-f1 score) decreases from . to . ; for indels, the error rate decreases from . to . . notably, the population-aware model improves the snp false discovery rate (fdr, defined as 1-precision) from . to . , equivalent to an error reduction of , ( . %) variants.

we then down-sampled the hg reads from x to x to evaluate the performance of the models with lower-coverage datasets. the population-aware method demonstrates a larger improvement in accuracy over default deepvariant v . by reducing , ( . %) overall errors. the error rate decreases from . to . for snps, and . to . for indels. similar to using the x read set, the population-aware model shows the strongest improvement in reducing false-positive snps, reducing fdr from . to . , equivalent to , ( . %) errors.

we further evaluated the performance of the models using two whole-exome sequencing (wes) datasets from a recently released set of genome and exome data [ ] (figure ).
for both wes datasets, the population-aware model outperforms deepvariant v . in overall number of errors. it has an overall error reduction of ( . %) for the idt dataset, and ( . %) for the oslo dataset. it has a slightly higher error rate for snps for the oslo dataset, from . to . , but the difference is smaller than the gain for indels for that dataset. the population-aware model tends to have a larger lead on precision for both types of variants compared to the baseline, but still has similar or better recall.

[figure : indel and snp panels, each showing 1-f1, 1-precision (fdr) and 1-recall (fnr) for the v . and af models at each coverage.]
figure : wgs variant calling error rates for hg . all results are evaluated using the giab v . . truth set in the high-confidence regions. v . : deepvariant v . ; af: the population-aware model that uses the allele-frequency channel. the column label suffixes show the average coverage of the read sets. lower values correspond to better accuracy.

[figure : indel and snp panels, each showing 1-f1, 1-precision (fdr) and 1-recall (fnr) for the v . and af models on the idt and oslo datasets.]
figure : wes variant calling error rate for hg . the idt results (“*-idt”) are grch -based and evaluated using the giab v . . truth set; the oslo datasets (“*-oslo”) are grch -based and evaluated using the giab v . . truth set. v . : deepvariant v . ; af: the population-aware model that uses the allele-frequency channel. lower values correspond to better accuracy.

. model-specific errors for population-aware models

intuitively, population information helps deepvariant decide whether to make a call based on the commonness of a variant, especially for cases where the variant calling confidence levels are low. with a population-aware model, a variant caller should be more likely to make a positive variant call for a candidate with high allele frequency, and less likely to make a call when seeing a rare candidate variant.

to understand the influence of allele frequencies in the model, we design an analysis framework to compare a population-agnostic model with a population-aware model. we call this a model-specific error analysis. we stratify the errors into three groups: population-resolved, population-induced and common. the population-resolved variants are called correctly with the allele frequency model, but called incorrectly when using the baseline model. we say such errors are “rescued” by population information. the population-induced errors are specific to the population-aware model, i.e. they are induced by the extra features. the common group contains errors made by both models. the common errors are viewed as ones more difficult to solve without major changes in the data processing pipeline, such as the variant caller, upstream computational methods, or sequencing technology. thus, in this analysis we focus on the first two groups. for simplicity, we only considered bi-allelic calls in this analysis, which account for the majority of overall errors.

we used the x hg wgs dataset to perform the model-specific error analysis.
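the three-way stratification described above amounts to set operations on the erroneous calls of the two models. the sketch below is illustrative only: the function name, the (chromosome, position, ref, alt) key and the example calls are assumptions made for this example and are not taken from the study.

```python
def stratify_errors(baseline_errors, population_aware_errors):
    """Split erroneous calls into the three model-specific groups."""
    baseline = set(baseline_errors)
    aware = set(population_aware_errors)
    return {
        # wrong only with the baseline model: "rescued" by population information
        "population_resolved": baseline - aware,
        # wrong only with the population-aware model
        "population_induced": aware - baseline,
        # wrong with both models
        "common": baseline & aware,
    }


# hypothetical bi-allelic erroneous calls, keyed by (chrom, pos, ref, alt)
baseline_errors = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "GA"),
]
aware_errors = [
    ("chr1", 250, "C", "T"),
    ("chr3", 7, "T", "C"),
]

groups = stratify_errors(baseline_errors, aware_errors)
for name, calls in groups.items():
    print(name, sorted(calls))
```

the same function can be applied separately to false positives and to false negatives, which keeps the two error types apart as in the analysis described here.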
after extracting model-specific erroneous calls, we matched the calls with the genomes variants to obtain associated allele frequencies. we first examined the relationship between allele frequency (af) and variant allele fraction (vaf), which is the fraction of reads supporting an alternate allele in a given sample, of each false-positive call. there is an observable distinction between the population-induced group and the population-resolved group in the vaf-af plots (figure , left and middle panels). among the population-resolved false-positive errors, more than two thirds ( . %) are uncommon (allele frequency ≤ %) among the genomes samples, whereas there are only . % uncommon variants for population-induced false positives. this observation supports the hypothesis that the population-aware model uses allele frequency to adjust its variant calls.

we then investigated bi-allelic false-negative errors, as shown in the right panel in figure . variant allele fraction for false negatives is not always available because many false negatives are not identified as a variant candidate due to reasons including low read coverage, incorrect mapping or insufficient sensitivity in variant candidate discovery. thus, we only evaluated the allele frequency distribution for false negatives. we noticed a significant difference in the number of common variants (with greater than % allele frequency). among all population-resolved false negatives, . % ( , out of , ) are common variants. for population-induced false negatives, . % ( out of ) are uncommon.

the model-specific analysis highlights the difference of the deepvariant models with or without the af channel. with the additional population information, deepvariant is capable of adjusting the calls according to the commonness of a variant and shows improvements in both precision and recall.

figure : errors specific to a population-agnostic model (in blue) and a population-aware model (in red) using x hg wgs data.

. performance on zero-frequency variants

a potential concern for population-aware variant calling models is an increased false-negative rate for novel alleles. since it is not trivial to define a set of truly novel variants in the genomes project, we extracted variants with zero allele frequency to investigate the impact when population information is included in a variant calling model. using the giab v . . truth set, there are , ( . %) snps and , ( . %) indels that have zero allele frequency for sample hg . we then use the zero-frequency variant set to evaluate recall of actual variant calls using hap.py [ ].

we observed that recall on zero-frequency variants is lower than on the remaining variants for all deepvariant models, regardless of variant type and of whether population information is used. with x reads, the false-negative rate (fnr, or 1-recall) of the population-agnostic model is . for snps and . for indels (figure ). the fnrs further increase to . for snps and . for indels when using the population-aware model. when using x reads, the drop in accuracy gets larger for both types of variants. this is consistent with our analysis that the population-aware deepvariant model requires stronger evidence (higher-quality pileup images) to call zero-frequency variants, thus reducing recall. further, the population information has a stronger influence in variant calling for low-coverage datasets. despite the disadvantages, the negative impact on zero-frequency variants is small compared to overall error reduction.
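the error metrics quoted throughout the results (error rate as 1-f1, fdr as 1-precision, fnr as 1-recall) follow from true-positive, false-positive and false-negative counts. the sketch below shows the arithmetic under stated assumptions: the function names and all counts are hypothetical, and the relative error reduction is assumed here to be the drop in the number of errors divided by the number of baseline errors.

```python
def error_metrics(tp, fp, fn):
    """Return the three error rates derived from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "error_rate": 1 - f1,   # 1-f1
        "fdr": 1 - precision,   # false discovery rate
        "fnr": 1 - recall,      # false-negative rate
    }


def relative_error_reduction(baseline_errors, new_errors):
    """Fraction of baseline errors removed by the new model (assumed definition)."""
    return (baseline_errors - new_errors) / baseline_errors


# hypothetical counts, not values from the study
metrics = error_metrics(tp=90, fp=10, fn=30)
reduction = relative_error_reduction(baseline_errors=40, new_errors=30)
print(metrics, reduction)
```

restricting the counts to a subset of truth variants, for example those with zero allele frequency in the reference panel, gives the subset-specific fnr discussed above.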
to better understand the zero-frequency variants, we called variants using the deepvariant pacbio model with the precisionfda v x hg read set sequenced with the pacbio hifi technology [ ]. the fnrs for the zero-frequency variants improve to . for snps and . for indels. the large difference in recall/fnr indicates that many of the zero-frequency variants are hard to genotype using illumina reads, and may not be novel mutations relative to samples in reference panels. in the future, reference panels utilizing high-quality long reads will likely provide better allele frequency estimates and improve the population-aware model performance.

[figure : 1-recall (fnr) for indels and snps with the v . and af models at each coverage.]
figure : the false negative rate (fnr) of zero-frequency variants for hg with different models. lower values correspond to better accuracy.

. assessing biases using different genomes populations

it is important to understand if the inclusion of population information reduces deepvariant’s performance for populations that are not well represented, especially when they have a large genomic difference with the reference panel. we first note that ashkenazi jewish, the ethnicity of hg , is not among the ethnicities collected by genomes. using a testing sample not in the reference panel reduces the risk of bias. second, we ran inference on the population-aware model using reference panels of allele frequencies. we split the genomes samples into five groups based on the superpopulation labels (african, afr; admixed american, amr; east asian, eas; european, eur; south asian, sas) and calculated allele frequencies for each superpopulation.
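as an illustration of the per-superpopulation frequency calculation, the sketch below counts alternate alleles within each group for a single variant. it is a simplified stand-in rather than the pipeline used in the study: the function name, sample names, genotype encoding and labels are hypothetical, and missing alleles are counted as reference so that every variant shares the same denominator.

```python
from collections import defaultdict


def superpopulation_allele_frequencies(genotypes, superpopulation_of, alt_index=1):
    """Compute the frequency of one alternate allele per superpopulation.

    genotypes: sample -> tuple of allele indices (0 = reference, None = missing)
    superpopulation_of: sample -> superpopulation label
    """
    alt_counts = defaultdict(int)
    allele_totals = defaultdict(int)
    for sample, genotype in genotypes.items():
        group = superpopulation_of[sample]
        for allele in genotype:
            # a missing allele (None) still adds to the denominator,
            # which is equivalent to treating it as reference
            allele_totals[group] += 1
            if allele == alt_index:
                alt_counts[group] += 1
    return {group: alt_counts[group] / allele_totals[group] for group in allele_totals}


# hypothetical diploid genotypes for one variant
genotypes = {
    "s1": (0, 1),
    "s2": (1, 1),
    "s3": (None, None),
    "s4": (0, 0),
}
superpopulation_of = {"s1": "eur", "s2": "eur", "s3": "eas", "s4": "eas"}

frequencies = superpopulation_allele_frequencies(genotypes, superpopulation_of)
print(frequencies)
```

mapping every sample to a single label gives the frequency over the whole panel, which corresponds to the "all" setting compared against the individual superpopulations.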
we show that all population-aware approaches outperform the baseline for snps but underperform it for indels when evaluated using hg (figure ). when considering the overall number of errors, only the model inferred with eas frequencies makes more errors than the baseline, but the deficit ( , or . %) is small.

we also compared the performance of using different superpopulation frequencies and observed a correlation between variant calling accuracy and the distance between the tested sample and ethnicity groups. according to the principal component (pc) analysis performed by gnomad v [ ], ashkenazi jewish is closer to the european populations and is farther from east asian and african in the pc -pc space. we observed that using frequencies from a genetically closer population usually resulted in higher variant calling accuracy. using eur frequencies outperforms using other population frequencies, only falling behind using the entire genomes. on the other hand, using eas frequencies results in the highest numbers of errors among all population-aware methods.

[figure : indel and snp panels, each showing 1-f1, 1-precision (fdr) and 1-recall (fnr) for v . , all, eur, amr, sas, afr and eas.]
figure : variant calling accuracy when inferring x illumina reads from hg using default deepvariant v . (v . ), allele frequencies in the entire genomes (all) and five genomes superpopulations (eur, amr, sas, afr and eas). lower values correspond to better accuracy.
we point out that using genomes frequencies from all populations results in the lowest number of errors among all population-aware results, suggesting that using a diverse population is more advantageous than finding a genetically similar group. this finding echoes our previous statement that we anticipate the population-aware variant calling model to improve further with larger-scale and more diverse population callsets.

. silver-standard truth set for hg

genome-in-a-bottle (giab) truth variant sets provide gold standards to benchmark variant callers, but so far only three samples (hg -hg -hg , the ashkenazi trio) have curated calls in difficult-to-map regions, added in the v . . release [ ]. further, the samples are from the same ancestry, making it challenging to perform a generalized benchmarking considering the genetic diversity of the human population. to deal with this difficulty, it is desirable to have other high-quality variant sets from non-giab samples, preferably from ancestries not covered by giab. thus, we called variants using the deepvariant pacbio model with x high-coverage pacbio hifi reads [ ] for hg , a puerto rican (labelled as pur under the amr superpopulation in genomes) sample. the deepvariant pacbio model has a snp f1 score higher than . % and is one of the most accurate models using pacbio hifi data [ ].

[figure : snp panel showing 1-f1, 1-precision (fdr) and 1-recall (fnr) for the v . and af models at each coverage.]
figure : variant calling results when evaluated using hg data, compared to the pacbio-deepvariant silver-standard truth set. lower values correspond to better accuracy.
we used the deepvariant hg pacbio snp calls as a “silver-standard” truth set and benchmarked the performance for models using illumina reads. we excluded the puerto rican population when calculating allele frequencies to avoid biases in favor of the population-aware models. we used x illumina wgs reads sequenced by the new york genome center to test all hg models. because the genomes has a collection of pur samples, we excluded all pur samples and re-calculated allele frequencies for both genomes and the amr superpopulation.

the population-aware model has a lower snp error rate ( . vs. . ), fdr ( . vs. . ) and fnr ( . vs. . ) than the baseline for hg (figure ). the number of snp errors is reduced by , ( . %). similar to the finding using hg , the population-aware model performs strongly with a down-sampled ( x) read set. the error rate for the x read set is reduced from . to . , and the snp error reduction is , ( . %).

we also tested the model using different superpopulation frequencies (figure ). all but the eas population-aware model have lower snp error rates than the baseline. when inferred using the eas allele frequencies, the snp error rate increased from . to . , equivalent to ( . %) more errors. all population-aware models, including eas, outperform the baseline on fdr and only eas has a higher fnr than the baseline ( . vs. . ).

[figure : snp panel showing 1-f1, 1-precision (fdr) and 1-recall (fnr) for v . , all, eur, amr, sas, afr and eas.]
figure : number of snp errors when evaluated using x wgs reads from a puerto rican sample hg . all models other than v . are population-aware, inferred using allele frequencies from different populations. lower values correspond to better accuracy.

discussion

we designed a new population-aware deepvariant model which can incorporate both base- and read-level information with the population information. we find that population-aware models reduce error rates by . % for wgs and . - . % for wes compared to population-agnostic baselines (default deepvariant v . ). the relative advantage of the population-aware model increases at lower coverage ( . % reduction at x and . % at x). the increased accuracy at lower coverage suggests that population information is most valuable in difficult examples, where read-level information alone may not be sufficient for confident calling. in population sequencing projects, this finding could be relevant to the question of whether to sequence more individuals at lower coverage, or fewer at high coverage. when sequencing for a species without a reference panel, it is possible that sequencing more, diverse individuals at lower coverage could still retain comparable accuracy to traditional methods which do not incorporate population information in calling.

we evaluate potential biases introduced by population information in variant calling by comparing population-aware models that use allele frequencies from different genomes superpopulations. this experiment simulates a scenario where the tested sample is genetically distinct from the reference panel. only one population-aware method (inferred with eas frequencies) underperforms the baseline in total number of errors, but with a small deficit. furthermore, using allele frequencies calculated from the entire genomes outperforms population-specific methods. this finding implies that a diverse population can provide more benefits than using a homogeneous one, even when the homogeneous population is more genetically similar to the tested sample.

this finding may inform efforts to build population or country-specific resources.
increasing the number of samples for a given population will improve accuracy for that population, but the inclusion of samples from diverse populations will also improve the resource. we believe that the accuracy of the population-aware model can further improve with a larger and more diverse population callset in the future, reinforcing the benefit of collaboration between nation-scale efforts.

we provide an additional “silver-standard” snp set for a puerto rican sample, hg , from a population not present in the labeled training data. we used high-coverage pacbio hifi reads and an accurate deepvariant pacbio model to generate this high-quality call set. this method can provide high-confidence snp calls for non-giab samples and increase population diversity when assessing variant calling results. similar to the results using hg data, we show that the proposed model has strong performance compared to the baseline, and only suffers slight loss of accuracy when inferred using a distinct population. when more high-coverage pacbio hifi data become available in the future, the high-quality calls generated by deepvariant can provide a more diversified dataset for variant calling benchmarking and downstream analysis.

despite greater overall accuracy, we note that the population-aware model underperforms on variants with zero allele frequencies in genomes. although the disadvantage is small compared to the overall gain, this result suggests that the decision of whether to use population-aware models should consider the end goal.
if reducing potential false positives is a larger concern, the use of a population-aware method could be recommended, but if the goal is to maximize recall of rare or novel variants, traditional methods could be preferred. we also notice that all tested illumina models performed poorly on the zero-frequency variants, regardless of whether population information is used. by analyzing the variants with pacbio reads, we find that many zero-frequency variants in genomes are located in difficult-to-map regions and are likely not genetically novel in the population. this suggests that the power of population-aware methods should increase as large panels of long-read population data become available.

methods

. training the model

we trained the model following the procedure described in [ ], with additional illumina wgs datasets included [ ]. variants in chromosomes to are used as the training examples, and those in chromosome and are used for tuning. variants in chromosome are never used in the training process.

. datasets

the model is evaluated using the giab v . . truth set for hg across whole genomes [ ]. we also generated another high-quality snp set using deepvariant v . and hg pacbio hifi data [ ] across the whole genome. we used the intersection of high-confidence regions of hg , hg , and hg (giab v . . ) as the high-confidence regions for the hg snp set. the read sets used for experiments are listed in table and the read sets for supporting experiments are provided in table .

table : testing datasets.
sample | ethnicity | truth variant | dataset
hg | ashkenazi jewish | v . . (grch ) | x illumina wgs [ ]; x illumina wes [ ]
hg | ashkenazi jewish | v . . (grch ) | x illumina wes [ ]
hg | puerto rican | deepvariant v . pacbio snp calls (grch ) | x illumina wgs (nygc)

table : other datasets used in this study.
sample | ethnicity | dataset
hg | ashkenazi jewish | x pacbio hifi [ ]
hg | puerto rican | x pacbio hifi [ ]

. allele matching algorithm

when incorporating population information in deepvariant, we need to match a variant candidate with a cohort variant. however, this is not a straightforward task since a variant can be represented in multiple formats [ , ]. a common approach is to normalize variants, such as using bcftools norm [ ], but that is not sufficient for complicated cases. we designed an algorithm that constructs local haplotypes and performs precise allele matching (figure ). the algorithm starts by querying all cohort variants $V_c$ that overlap the window $[\mathrm{start}_v, \mathrm{end}_v)$, where $\mathrm{start}_v$ and $\mathrm{end}_v$ are the starting and ending positions of a variant candidate $v$, respectively. the queried cohort variants and the candidate variant form the set $V \equiv v \cup V_c$. then the window is extended to the smallest starting position and the largest ending position within $V$, as $[\mathrm{start}_V, \mathrm{end}_V)$, where $\mathrm{start}_V \equiv \min(\mathrm{start}_u)\ \forall u \in V$ and $\mathrm{end}_V \equiv \max(\mathrm{end}_w)\ \forall w \in V$. the local reference haplotype is queried from the reference genome in the window $[\mathrm{start}_V, \mathrm{end}_V)$. for each variant allele in $V$, its allele haplotype is updated in this window. if there is a perfect match between a cohort allele haplotype and a candidate allele haplotype, the allele frequency of the cohort allele is added to an allele frequency dictionary, using the alternate allele of the candidate variant as key. afterwards, deepvariant looks up the dictionary when processing reads that overlap the candidate variant.

. allele frequency channel for deepvariant

to take full advantage of the cnn-based classifier of deepvariant, allele frequencies need to be encoded in pileup images. we apply a logarithmic transformation to gain resolution for low-frequency signals.
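the allele matching steps described above can be sketched as follows. this is a minimal illustration, not the deepvariant implementation: the class and function names, the 0-based coordinates and the allele frequencies are assumptions, and summing the frequencies of cohort alleles that share a haplotype is a choice made for this example. the sequences follow the worked example in the allele matching figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    start: int        # 0-based start on the reference (assumed convention)
    ref: str          # reference allele
    alts: tuple       # alternate alleles
    afs: tuple = ()   # one allele frequency per alternate allele (cohort only)

    @property
    def end(self):
        # half-open end position, so the variant spans [start, end)
        return self.start + len(self.ref)


def allele_haplotype(reference, win_start, win_end, variant, alt):
    """Build the dash-free haplotype of one allele within the window."""
    return reference[win_start:variant.start] + alt + reference[variant.end:win_end]


def match_allele_frequencies(reference, candidate, cohort):
    """Map each candidate alternate allele to a matched cohort frequency."""
    # 1. cohort variants overlapping the candidate window [start_v, end_v)
    overlapping = [c for c in cohort
                   if c.start < candidate.end and candidate.start < c.end]
    # 2. extend the window to cover every variant in the set
    group = overlapping + [candidate]
    win_start = min(v.start for v in group)
    win_end = max(v.end for v in group)
    # 3. haplotype of every cohort allele within the extended window
    cohort_haplotypes = {}
    for c in overlapping:
        for alt, af in zip(c.alts, c.afs):
            hap = allele_haplotype(reference, win_start, win_end, c, alt)
            cohort_haplotypes[hap] = cohort_haplotypes.get(hap, 0.0) + af
    # 4. exact haplotype match, keyed by the candidate's alternate allele
    matched = {}
    for alt in candidate.alts:
        hap = allele_haplotype(reference, win_start, win_end, candidate, alt)
        if hap in cohort_haplotypes:
            matched[alt] = cohort_haplotypes[hap]
    return matched


reference = "TTTCCATTCCAG"
cohort = [
    Variant(0, "TTTCCA", ("T", "TTTCCATTCCA"), (0.01, 0.002)),  # hypothetical frequencies
    Variant(6, "TTCCAG", ("T",), (0.25,)),                      # hypothetical frequency
]
candidate = Variant(5, "ATTCCAG", ("AT",))

result = match_allele_frequencies(reference, candidate, cohort)
print(result)
```

here the candidate allele and the second cohort variant are written differently but both produce the haplotype TTTCCAT, so the candidate inherits that cohort frequency even though the two records would not match on position and alleles alone.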
for each variant candidate, an additional allele frequency channel is added to the pileup image. in this channel, a read is colored by the transformed frequency of its allele at the variant candidate position. a read can carry multiple alternate alleles with different frequencies, so its color intensity may vary across pileup images, where the variant candidates differ. an alternative method to encode allele frequencies is to include the information as features in the fully-connected layers [ ], but this approach sacrifices the capability to incorporate allele frequencies with base- and read-level information and thus is not adopted.

[figure : worked example showing two cohort variants (ref=tttcca, alt=t,tttccattcca; ref=ttccag, alt=t), a variant candidate (ref=attccag, alt=at), the local reference (tttccattccag), the updated haplotypes, and the resulting allele frequency dictionary for the candidate.]
figure : an example of the allele matching algorithm. this algorithm first queries cohort variants overlapped with the variant candidate. these cohort variants and the candidate determine the window where haplotypes are updated. the frequencies of matched allele haplotypes are then updated for the variant candidate as a dictionary.
in this diagram, haplotypes are padded with dashes to keep the sequences aligned for better visualization. in practice, the allele matching algorithm generates dash-free haplotypes.

to enable the allele frequency channel, users need to set the flag --use_allele_frequency and provide deepvariant cohort variants in vcf format with the flag --population_vcfs.

model-specific error analysis
we compared actual variant calls with giab v . . truth variants using bcftools isec. variants specific to the actual calls are regarded as false positives, and those specific to the truth set are regarded as false negatives. we generated the false-positive and false-negative sets for two models, and then applied bcftools isec again to obtain model-specific false positives and false negatives. for both sets, we applied the allele matching algorithm to obtain allele frequencies for the variants. for the false-positive sets, we extracted variant allele fractions from the vcf files generated by deepvariant.

1000 genomes frequencies from the deepvariant-glnexus pipeline
we used the 1000 genomes reference panel generated with the deepvariant-glnexus pipeline (v ) [ ] for all population-aware experiments, including training the models and running inference. we filled missing genotypes with reference genotypes using bcftools +missing2ref so that all variants have the same denominator.

availability of data and materials
the deepvariant source code is available at https://github.com/google/deepvariant under the bsd- -clause license. the pacbio-based hg snp set is available at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/allele_frequency/hg _snp_set.
the pre-trained population-aware deepvariant models are available at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/allele_frequency/pretrained_model_wgs (wgs) and https://console.cloud.google.com/storage/browser/brain-genomics-public/research/allele_frequency/pretrained_model_wes (wes). the vcf files used in this study are available at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/ kgp/cohort_dv_glnexus_opt/v _missing ref (grch ) and https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/ kgp/cohort_dv_glnexus_opt/v _grch _missing ref (grch ).

ethics approval and consent to participate
not applicable.

consent for publication
not applicable.

competing interests
ak, sg, ty, pc and ac are employees of google llc and own alphabet stock as part of the standard compensation package. this study was funded by google llc.
funding
all compute resources used in this work were provided by google, llc. ak, sg, ty, pc and ac are full-time, salaried employees of google, llc. nc contributed to this work as a salaried intern of google, llc.
acknowledgments
we thank babak alipanahi, gunjan baid, daniel cook, alexander d’amour, hojae lee, cory mclean, maria nattestad and other colleagues at google for their feedback on this manuscript and the project in general. the hg illumina data were generated at the new york genome center with funds provided by nhgri grant um hg - s .

authors’ contributions
nc, ak, pc and ac designed the method. nc, ak and pc implemented the software. nc and pc performed the experiment. nc, ak, sg, ty, pc and ac analyzed the results. nc, pc and ac wrote the manuscript. all authors read and approved the final manuscript.

references
depristo, m. a., banks, e., poplin, r., garimella, k. v., maguire, j. r., hartl, c., philippakis, a. a., del angel, g., rivas, m. a., hanna, m., et al. a framework for variation discovery and genotyping using next-generation dna sequencing data. nature genetics , ( ).
poplin, r., chang, p.-c., alexander, d., schwartz, s., colthurst, t., ku, a., newburger, d., dijamco, j., nguyen, n., afshar, p. t., et al. a universal snp and small-indel variant caller using deep neural networks. nature biotechnology , – ( ).
krusche, p., trigg, l., boutros, p. c., mason, c. e., francisco, m., moore, b. l., gonzalez-porta, m., eberle, m. a., tezak, z., lababidi, s., et al. best practices for benchmarking germline small-variant calls in human genomes. nature biotechnology , – ( ).
karczewski, k. j., francioli, l. c., tiao, g., cummings, b. b., alföldi, j., wang, q., collins, r. l., laricchia, k. m., ganna, a., birnbaum, d. p., et al. the mutational constraint spectrum quantified from variation in , humans. nature , – ( ).
1000 genomes project consortium et al.
a global reference for human genetic variation. nature , – ( ).
li, h. toward better understanding of artifacts in variant calling from high-coverage samples. bioinformatics , – ( ).
lin, m. f., rodeh, o., penn, j., bai, x., reid, j. g., krasheninina, o. & salerno, w. j. glnexus: joint variant calling for large cohort sequencing. biorxiv, ( ).
yun, t., li, h., chang, p.-c., lin, m. f., carroll, a. & mclean, c. y. accurate, scalable cohort variant calls using deepvariant and glnexus. biorxiv ( ).
poplin, r., ruano-rubio, v., depristo, m. a., fennell, t. j., carneiro, m. o., van der auwera, g. a., kling, d. e., gauthier, l. d., levy-moonshine, a., roazen, d., et al. scaling accurate genetic variant discovery to tens of thousands of samples. biorxiv, ( ).
chen, n.-c., solomon, b., mun, t., iyer, s. & langmead, b. reducing reference bias using multiple population reference genomes. biorxiv ( ).
rautiainen, m. & marschall, t. graphaligner: rapid and versatile sequence-to-graph alignment. genome biology , – ( ).
garrison, e., sirén, j., novak, a. m., hickey, g., eizenga, j. m., dawson, e. t., jones, w., garg, s., markello, c., lin, m. f., et al. variation graph toolkit improves read mapping by representing genetic variation in the reference. nature biotechnology , – ( ).
witherspoon, d. j., wooding, s., rogers, a. r., marchani, e. e., watkins, w. s., batzer, m. a. & jorde, l. b. genetic similarities within and between human populations. genetics , – ( ).
abramovs, n., brass, a. & tassabehji, m. hardy-weinberg equilibrium in the large scale genomic sequencing era. frontiers in genetics , ( ).
pedersen, b. s., brown, j. m., dashnow, h., wallace, a. d., velinder, m., tvrdik, t., mao, r., best, h. d., bayrak-toydemir, p. & quinlan, a. r. effective variant filtering and expected candidate variant yield in studies of rare human disease. biorxiv ( ).
sirugo, g., williams, s. m. & tishkoff, s. a. the missing diversity in human genetic studies.
cell , – ( ).
martin, a. r., kanai, m., kamatani, y., okada, y., neale, b. m. & daly, m. j. clinical use of current polygenic risk scores may exacerbate health disparities. nature genetics , – ( ).
mcguire, a. l., gabriel, s., tishkoff, s. a., wonkam, a., chakravarti, a., furlong, e. e., treutlein, b., meissner, a., chang, h. y., lópez-bigas, n., et al. the road ahead in genetics and genomics. nature reviews genetics , – ( ).
wenger, a. m., peluso, p., rowell, w. j., chang, p.-c., hall, r. j., concepcion, g. t., ebler, j., fungtammasan, a., kolesnikov, a., olson, n. d., et al. accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. nature biotechnology , – ( ).
carroll, a. & chang, p.-c. improving the accuracy of genomic analysis with deepvariant . https://ai.googleblog.com/ / /improving-accuracy-of-genomic-analysis.html. . (accessed: - - ).
olson, n. d., wagner, j., mcdaniel, j., stephens, s. h., westreich, s. t., prasanna, a. g., johanson, e., boja, e., maier, e. j., serang, o., et al. precisionfda truth challenge v : calling variants from short- and long-reads in difficult-to-map regions. biorxiv ( ).
wagner, j., olson, n. d., harris, l., khan, z., farek, j., mahmoud, m., stankovic, a., kovacevic, v., wenger, a. m., rowell, w. j., et al. benchmarking challenging small variants with linked and long reads. biorxiv ( ).
baid, g., nattestad, m., kolesnikov, a., goel, s., yang, h., chang, p.-c. & carroll, a. an extensive sequence dataset of gold-standard samples for benchmarking and development. biorxiv. eprint: https://www.biorxiv.org/content/early/ / / / . . . .full.pdf. https://www.biorxiv.org/content/early/ / / / . . . ( ).
porubsky, d., ebert, p., audano, p. a., vollger, m. r., harvey, w. t., marijon, p., ebler, j., munson, k. m., sorensen, m., sulovari, a., et al. fully phased human genome assembly without parental data using single-cell strand sequencing and long reads. nature biotechnology. issn: - . https://doi.org/ . /s - - - (dec. ).
zook, j. m., catoe, d., mcdaniel, j., vang, l., spies, n., sidow, a., weng, z., liu, y., mason, c. e., alexander, n., et al. extensive sequencing of seven human genomes to characterize benchmark reference materials. scientific data , – ( ).
sun, c. & medvedev, p. varmatch: robust matching of small variant datasets using flexible scoring schemes. bioinformatics , – ( ).
li, h. a statistical framework for snp calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. bioinformatics , – ( ).
yi, r., chang, p.-c., baid, g. & carroll, a. learning from data-rich problems: a case study on genetic variant calling. arxiv preprint arxiv: . ( ).
liquidcna: tracking subclonal evolution from longitudinal liquid biopsies using somatic copy number alterations

eszter lakatos*, helen hockings, maximilian mossner, weini huang, michelle lockley, trevor a. graham*

centre for genomics and computational biology, barts cancer institute, queen mary university of london, london, uk
centre for cancer cell and molecular biology, barts cancer institute, queen mary university of london, london, uk
barts health nhs trust, st bartholomew’s hospital, west smithfield, london, uk
school of mathematical sciences, queen mary university of london, london, uk
department of gynaecological oncology, cancer services, university college london hospital, london, uk
* correspondence: e.lakatos@qmul.ac.uk; t.graham@qmul.ac.uk

abstract
cell-free dna (cfdna) measured via liquid biopsies provides a way for minimally-invasive monitoring of tumour evolutionary dynamics during therapy. here we present liquidcna, a method to track subclonal evolution from longitudinally collected cfdna samples based on somatic copy number alterations (scnas).
liquidcna utilises scna profiles derived through cost-effective low-pass whole genome sequencing to automatically and simultaneously genotype and quantify the size of the dominant subclone without requiring prior knowledge of the genetic identity of the emerging clone. we demonstrate the accuracy of liquidcna in synthetically generated sample sets and in vitro and in silico mixtures of cancer cell lines. application in vivo in patients with metastatic lung cancer reveals the progressive emergence of a novel tumour sub-population. liquidcna is straightforward to use, computationally inexpensive and enables continuous monitoring of subclonal evolution to understand and control therapy-induced resistance.

.cc-by-nc-nd . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . /

introduction
liquid biopsies, primarily the analysis of cell-free dna (cfdna) present in blood samples, offer the potential for regular longitudinal and minimally invasive monitoring of cancer dynamics [ , , , , , , ]. circulating cfdna is released into the blood via apoptosis or necrosis of cells. tumour-derived cfdna in the blood is detectable from tumours as small as million cells [ ], it correlates with disease stage [ , ], and it offers the same diagnostic potential as tissue-based biopsies [ ]. cfdna is an aggregate of dna shed from multiple locations and multiple malignant cells across the body, and hence a single sample can provide a comprehensive overview of systemic disease. consequently, cfdna is an exceptional resource for non-invasive tracking of tumour composition and for monitoring response to therapy or clinical relapse.
typically, cfdna analysis has focused on the detection of driver gene single nucleotide variants (snvs), with the size of mutation-bearing clones inferred from the relative sequencing read count at the mutation site. for instance, in high-grade serous ovarian cancer (hgsoc) the frequency of tp mutation in cfdna is a measure of tumour burden and is predictive of treatment response [ ]. in colorectal cancer, kras mutation frequency in cfdna is predictive of response to anti-egfr therapy [ ].

somatic copy number alterations (scnas) are widespread in cancers [ , , ], and have been used extensively to track tumour composition and dynamics over time [ , , , ]. scnas can be detected in cfdna without prior knowledge of the tumour scna profile, through measurement of the relative number of reads mapping within ‘bins’ spaced across the genome [ ]. relative differences in read count between bins can be sensitively detected even when the total read count is low [ , , ], meaning that scnas can be detected with a fraction of the sequencing depth required for snv detection. scna profiling therefore offers a high-throughput and cost-effective means to evaluate cfdna samples [ , , , , , ].

whilst measuring clone sizes based on the frequency of snvs is straightforward, deriving quantitative information on the proportion of the tumour population that carries a particular scna is challenging. tumour cells are not the only contributors to the cfdna pool, and an scna can in theory change the copy number to any non-negative integer value. the total read count per bin is thus a noisy compound function of the relative tumour cell contribution to the total cfdna pool and the specific copy number of the alteration.

here we present a new method to identify and track tumour subclonal evolution based solely on measurement of scnas from longitudinal cfdna samples. our algorithm, named liquidcna, firstly determines the contribution of tumour dna to the total cfdna pool (i.e.
cellularity/purity) and then uses scna data to characterise and quantify the size of the most pervasive (putatively resistant) subclone emerging or contracting over time. the efficacy of the method is demonstrated using synthetic datasets, in vitro cell line mixtures, and in vivo via longitudinal analysis of cfdna from lung cancer patients undergoing targeted treatment.

results

emergent subclone tracking from copy number information
first, we derive a mathematical definition of the problem of tracking an emergent (putatively resistant) tumour subclone from longitudinal cfdna samples, typically taken throughout the course of treatment. we consider a tumour cell population undergoing continuous evolution characterised by two cell types, ancestral tumour cells (a) and an emerging subclone (s). we assume that liquid biopsies contain dna originating from ancestral and subclonal tumour cells, as well as contaminating dna from normal cells (n). the proportion of dna arising from cells of the emergent subclone within the tumour is expressed by the subclonal-ratio, $r_i$, while the overall proportion of tumour-originating dna is termed the purity or tumour fraction of the sample, denoted by $p_i$. we consider that the copy number (cn) profile of each sample has been measured, for example using low-pass whole genome sequencing (lpwgs), so that the genome can be divided into segments, contiguous regions of constant cn.
each measured segment cn in sample i ($C_i^j$) is the combination of each cell population’s cn at the jth genomic location (2 for normal cells, and $C(A)_j$ and $C(S)_j$ for ancestral and subclonal tumour cells, respectively), weighted by the proportions of the three populations (fig. ):

$$C_i^j = 2 + p_i \left( (1 - r_i)\, C(A)_j + r_i\, C(S)_j - 2 \right)$$

we assume that each segment falls into one of three categories depending on its cn in ancestral and subclonal tumour cells. clonal alterations (and unaltered segments) are at the same cn in both tumour populations, and their measured cn is only affected by the purity of a sample. subclonal segments represent scnas that are unique to the emerging subclone. their measured cn is influenced by the subclonal-ratio of a sample, as well as sample purity. finally, segments that do not follow either of these patterns, due to uncertain measurements or ongoing instability, are termed unstable. our aim is to estimate the underlying purity and subclonal-ratio, $p_i$ and $r_i$, from longitudinal cn measurements of clonal and subclonal segments (fig. ).

estimation of subclonal-ratio
estimation is carried out in three steps (fig. a and methods). first, the purity of each sample is assessed using the distribution of segment cn values. we assume that the majority of segments have integer cn in all tumour cells, hence the distribution is expected to have distinct peaks at regular intervals of $p_i$, corresponding to clonal segments at successive integer cn values (fig. b). we derive the purity estimate as the value that minimises the squared error between observed and expected peaks (fig. c). the inferred purity values are used to correct the segment cn values, thus estimating the tumour-specific cn of each segment. liquidcna does not require a mainly diploid tumour genome (i.e. major peak at cn=2) to derive correct estimates, but will derive erroneous conclusions if the cn values
(as measured by the cn quantification software, e.g. qdnaseq [ ]) are incorrectly centred, for example when the major peak is assigned one copy number but its true value is another. to control for this, we recommend a manual check of the cn profile before applying liquidcna, followed by renormalisation to the correct ploidy if required.

next, for every segment we compute the change in cn, Δcn, between each sample and a baseline sample that is assumed to have negligible proportions of the emerging (putatively resistant) subclone, for example a sample taken upon diagnosis or before the start of therapy. Δcn values naturally highlight subclone-associated segments altered in non-baseline samples, as these segments display markedly positive (cn gain compared to baseline) or negative (cn loss) values (fig. d). from these Δcns we then establish the set of segments that are subclonal and the sample ordering that reflects increasing subclonal proportions. to do this, we examine each possible order of samples and classify each segment, according to that order, as clonal (if the variance of its Δcns across samples is below a pre-defined threshold), subclonal (if its Δcn changes monotonically along the order of the samples, i.e. if the Δcns are consistent with an emerging subclone) or unstable (if it does not correlate with sample order) (fig. e). the order with the highest proportion of segments classified as subclonal is selected, and these subclonal segments are used for downstream computation of tumour composition (fig. f).
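the ordering and classification step described above can be sketched as follows. this is a minimal illustration under stated assumptions and not the liquidcna implementation: the function names, the threshold value and the Δcn matrix are hypothetical. "monotone" is taken literally as a non-decreasing or non-increasing sequence without any noise tolerance, and, to distinguish an order from its reverse, a subclonal segment is additionally required to end at least as far from zero as it starts. all sample orders are enumerated exhaustively, which is only practical for a handful of samples.

```python
from itertools import permutations

import numpy as np


def classify_segments(delta_cn, order, clonal_var_threshold):
    """Label each segment as clonal, subclonal or unstable for one sample order.

    delta_cn: array of shape (n_segments, n_samples) holding the change in CN
    of every non-baseline sample relative to the baseline sample.
    """
    labels = []
    for row in delta_cn:
        # clonal: dCN barely varies across samples
        if np.var(row) < clonal_var_threshold:
            labels.append("clonal")
            continue
        ordered = row[list(order)]
        steps = np.diff(ordered)
        monotone = np.all(steps >= 0) or np.all(steps <= 0)
        # assumption: dCN moves away from zero as the subclone grows
        grows = abs(ordered[-1]) >= abs(ordered[0])
        labels.append("subclonal" if monotone and grows else "unstable")
    return labels


def best_sample_order(delta_cn, clonal_var_threshold):
    """Return (subclonal fraction, order, labels) for the best sample order."""
    n_samples = delta_cn.shape[1]
    best = None
    for order in permutations(range(n_samples)):
        labels = classify_segments(delta_cn, order, clonal_var_threshold)
        fraction = labels.count("subclonal") / len(labels)
        if best is None or fraction > best[0]:
            best = (fraction, order, labels)
    return best


# hypothetical dCN values for four segments in three non-baseline samples
delta_cn = np.array([
    [0.3, 0.1, 0.6],     # gain growing with the subclone
    [-0.3, -0.1, -0.6],  # loss growing with the subclone
    [0.01, -0.01, 0.0],  # clonal segment: no change from baseline
    [0.5, -0.4, 0.1],    # erratic segment
])
fraction, order, labels = best_sample_order(delta_cn, clonal_var_threshold=0.01)
print(fraction, order, labels)
```

for these made-up values the selected order is sample 1, then sample 0, then sample 2, under which the first two segments are subclonal, the third is clonal and the fourth is unstable.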
the methodology ensures that the dominant subclone associated with the most pervasive scnas is evaluated and that subclonal-ratio inference is robust to segments with unstable cn.

finally, we compute the relative and absolute subclonal-ratio of each sample using the identified set of subclonal segments. relative subclonal-ratios are defined as the median ratio of segment Δcns compared to the sample with the maximum subclonal proportion (fig. g). the absolute subclonal-ratio is computed based on the assumption that subclonal segment cn values correspond to distinct scnas that differ between ancestral and subclonal cells. the subclonal-ratio of sample i is therefore derived as the shared mean ($r_i$) of a mixture of gaussian distributions with constrained means $-r_i$, $+r_i$, etc., fitted to the Δcn distribution of subclonal segments (fig. h). we also provide the % confidence interval of the absolute subclonal-ratio estimate based on the shared variance of the fitted gaussians (fig. i).

liquidcna outputs both relative and absolute subclonal-ratio measures, since for most applications the relative value holds sufficient information on how the subclonal (putatively resistant) population changes between time-points. relative proportions are also less susceptible to measurement noise in the segment cns, while a combination of low subclonal proportion and high sequencing noise can cause the fitting of absolute subclonal-ratio estimates to fail to converge.

synthetic mixed populations
we first evaluated the performance of liquidcna using synthetic datasets where input values of subclonal proportion and purity were known. we generated synthetic datasets
with characteristics matching typical longitudinal measurements of patients. in order to simulate imperfect measurements, we added varying levels of normally distributed measurement noise (defined by the dimensionless parameter σ) to bin-wise cn values (fig. a-c and methods).

we evaluated the accuracy of the purity estimation on synthetic samples (fig. d), and found that purity p could be estimated within % of the true tumour fraction in % of samples at noise levels σ ≤ . the error of the purity estimate grew as the noise increased (fig. e), and was most pronounced in samples with high noise and low tumour fraction. consequently, we restricted our subsequent analysis to cases of higher purity ($p_i \geq$ . ).

next, we derived subclonal-ratios using purity-corrected cn profiles on the higher purity subset of synthetic mixtures. we set a threshold to filter out clonal segments (see fig. e) such that at least segments were retained and the proportion of retained segments classified as subclonal was maximal following segment classification. fig. f shows the true and estimated subclonal-ratios for synthetic experiments. overall, we found that the subclonal-ratio was estimated with ~ % error, and the accuracy was influenced by measurement noise (fig. g). relative subclonal-ratios (calculated relative to the sample with the highest subclonal proportion) were estimated with higher accuracy (error ~ %, fig. s a-b). we found that computing absolute subclonal-ratios in a two-step process from these values yielded similar results to direct estimation by fitting a gaussian mixture model, and provided an estimate even in cases where the direct estimation did not converge (fig. s c and methods). the proportion of unstable segments, unlike noise, had little effect on the estimation accuracy (fig. s ).
mixtures of ovarian cancer cell lines
next, we evaluated liquidcna on real data derived from in vitro mixtures of two paired high grade serous ovarian cancer (hgsoc) cell lines [ ] (see methods and table s ). hgsoc cells were ideally suited for this evaluation as high levels of chromosomal instability are a hallmark of the disease [ , ]. we anticipated that liquidcna would be most applicable for tracking subclonal evolution in malignancies with high cna burden [ ].

we divided a population of ovcar cells into two aliquots. the first aliquot was untreated and classified as ‘sensitive’. in a process described in detail by hoare et al. [ ], cells from the second aliquot were cultured so that they evolved resistance to platinum-containing chemotherapy and were thus termed ‘resistant’. in addition to the high scna burden inherited from the ancestral sensitive cell line, resistant cells acquired new scnas during the in vitro evolution of resistance (figure a).

we then mixed, in varying known proportions, the genomic dna extracted from the two cell lines, with sensitive cells representing the ancestral and resistant cells the emerging subclonal population. the mixtures were further diluted with dna from blood samples of healthy volunteers assumed to have a diploid genome; this modelled the effect of normal contamination in patient samples (table s ). these dna mixtures were sequenced to mean depth . x and composite scna profiles were generated (see methods).
in addition, we generated further in silico mixtures by sampling and mixing genome-aligned reads from sequencing data from each of the three cell types sequenced individually. in these mixtures, we controlled the total number of reads per sample to study the effect of variable read depth and associated measurement noise.

first, we used liquidcna to estimate the purity of in vitro mixed samples (samples s -s ). the purity of each sample was estimated to be lower than the theoretical mixing proportion (fig. b). in the in silico mixed samples, we found that there was a strong linear relationship between estimated and true purity (fig. c). the underestimation of purity in the in vitro samples might be explained by our definition of theoretical purity in the in vitro and in silico mixing procedures (defined as the proportion of dna weight and the proportion of read counts, respectively). a highly aneuploid genome will likely have a higher weight than a diploid genome, therefore mixing equal weights results in a higher proportion of normal genomes than expected. our purity estimates were in agreement with observed peaks of the cn distribution (fig. s a), further confirming that there was no bias in the estimation. by fitting a linear model to the estimates, the theoretical tumour fraction could be fully recovered, as illustrated by the ‘corrected’ estimates of samples s -s (fig. b). the number of reads (sequencing depth) did not systematically influence the accuracy of estimating tumour fraction, but purity estimates of samples with low tumour fraction were noisier at low read depth (fig. c). in summary, liquidcna provided an accurate estimate for purity values when true purity was above %. decreased measurement accuracy below % purity is consistent with our observations on synthetic data and is similar to reported limitations of other methods quantifying tumour fraction from lpwgs cfdna [ , , ].
therefore, for samples below % predicted purity, we advise discarding the sample from downstream analysis, although low-purity samples may be usable if a very accurate purity estimate can be derived by other means.

next, we inferred the subclonal-ratio for cell line mixtures using purity-corrected Δcn values, with sample s used as the baseline sample for both in vitro and in silico sample sets. we could correctly order cell line mixtures according to subclonal-ratios without any a priori information (fig. s b), and the absolute subclonal-ratio and relative subclonal changes were estimated on average within % and % of the true subclonal percentage, respectively (fig. d,f). in particular, we noted that samples s and s were accurately estimated as having an equal subclonal-ratio, despite originating from different biological replicates with different tumour purity, which was reflected in the small confidence intervals of their estimates. we also note that even though there were no truly unstable segments in this dataset, as measurements were not taken over time, three non-clonal segments were classified as such, probably due to higher noise in their measured cn value.

using datasets of randomly selected in silico samples with million reads, we confirmed that our algorithm could accurately infer the subclonal-ratio of samples, in particular when considering relative proportions (fig. e,g). although the estimation quality
s ), in most cases the estimated absolute and relative subclonal-ratio was within % and % of the true subclonal proportion, respectively. furthermore, we found that cases with high estimation error were typically caused by low-purity samples, which could be easily identified and removed without a priori information, as demonstrated in fig. s . using the known theoretical mixing values of tumour-dna content, instead of data-derived estimates, to derive purity-corrected cn values increased the estimation error, especially in low read count samples (fig. s ). this finding emphasises that non-diploid genomes might bias alternative measurement methods, and that internal consistency in the method of deriving sample characteristics (purity and subclonal-ratio) is crucial when assessing the dynamics of the subclonal population.

subclonal analysis of patient samples

we used liquidcna to analyse emergent subclones in longitudinal cfdna samples from patients with non-small cell lung cancer (nsclc) undergoing therapy, as previously reported by chen and colleagues [ ]. the liquid biopsies were collected as part of the figaro study (go , nct ), a randomised phase ii trial designed to evaluate the efficacy of pictilisib, a selective inhibitor of phosphatidylinositol kinase [ ]. pictilisib or placebo was given in combination with a standard chemotherapy regimen, which was determined based on the subtype of nsclc. blood samples were taken at baseline (day of the first treatment cycle) and at -week intervals up to the end of treatment (eot). dna was isolated from the plasma of liquid biopsies and sequenced using lpwgs to an average depth of . x, as described in detail in [ ]. chen et al. [ ] identified several scnas in eot samples that were absent at baseline and described several genes within these regions that might be associated with resistance.
we sought to apply liquidcna to these cases to corroborate their observations, and further to quantify the size of emergent subclones over time in these patients. we obtained the lpwgs data (fastq files) and performed cn profiling (see methods) on patients with cfdna samples from ≥ time-points (n = ). we identified three patients ( , and ) whose sample series fulfilled the following criteria: (i) a cfdna sample taken on the first day of therapy with purity above ~ %; and (ii) at least two non-baseline samples with purity above ~ %. patients and were in the experimental arm of the study, while patient was assigned to the control arm; all three patients progressed during the course of the trial. we ran liquidcna on data from the three selected patients (discarding samples with purity below % (fig. s )) and examined the genomic segments that liquidcna identified as subclonal relative to baseline samples (fig. ). while we observed a good overlap with the cns previously reported to be associated with subclonal evolution through therapy (figures and s of [ ]), we also found a few segments that were missed or additionally identified by liquidcna. the original study focused on the comparison of pre- and post-treatment and highlighted scnas occurring between the first and last time-points. as our analysis put equal focus on all time-points, it classified some of the previously identified segments as unstable if the cn progression was not consistent with subclone evolution. furthermore, some segments were too small to pass our initial filtering.
on the other hand, liquidcna was able to identify subclonal segments which were at an abnormal cn in the baseline sample, and subsequently showed diploid cn or a further gain/loss in subclonal tumour cells. for example, in the samples from patient , whilst liquidcna identified subclonal scnas on chromosomes , and that overlapped with the findings of the original study, it also detected additional subclonal changes on chromosomes and . however, we did not observe the previously described focal loss on chromosome (harbouring the gene mll ), probably due to its small size. overall, we identified , and subclone-associated scnas in patients , and , respectively. a further segments in patient were classified as non-clonal but 'unstable' as the cn over time was not consistent with the pattern defined by the emerging subclone. as samples from patient had lower purity, these inconsistent cn changes might have resulted from measurement noise. we found that the emerging subclone accounted for to % of the tumour-derived dna in the cfdna in the three patients evaluated. patient showed evidence of a subclonal proportion consistently around %, which could be explained by the samples from this patient having been taken at later time-points. samples from patient obtained at weeks and end of therapy contained below % dna derived from subclonal tumour cells (fig. ). patient , on the other hand, showed a contracting subclone that reduced in proportion from % presence at week to < % at the end of therapy. if the total population size were known (which might be accessible from additional measurements of the tumour-associated cfdna pool), the tumour subclone fractions established here could also be converted into growth rates to enable future predictions of the tumour dynamics.

discussion

we present liquidcna, a computational algorithm to infer longitudinal subclonal dynamics using copy number measurements. our algorithm performs simultaneous analysis of several longitudinal samples to identify sample purity, subclonal scnas and the abundance of an emerging subclone. liquidcna distinguishes between scnas that are associated with the emerging subclone and those showing unstable behaviour, and consequently is not confounded by uncertain cn measurements. we validate our method both on synthetic scna datasets, and on in vitro and in silico mixtures of two ovarian cancer cell lines. we successfully infer the proportion of the dominant subclone in all of the above datasets, with good accuracy across a range of sample qualities defined by the noise level or sequenced reads. in patients with lung cancer, liquidcna applied to lpwgs data derived from longitudinal liquid biopsies (cfdna) shows the emergence of subclones during therapy and identifies genomic regions associated with the emergent tumour cells. we demonstrate that liquidcna can identify and quantify emerging subclones from cfdna samples, therefore enabling tracking of tumour subclone evolution through the course of therapy. deciphering the evolutionary trajectory of cancer can aid prognostic and therapeutic decision-making and further our understanding of therapy-induced drug resistance [ ]. measuring the dynamics of tumour composition is particularly crucial for prospective monitoring during an adaptive therapy regime aiming to control resistant subclones [ , , ].
furthermore, the proportion of cfdna that is tumour-derived (what we term 'purity') is in itself a promising biomarker for determining initial therapy response and prognosis [ , ], as well as for tracking tumour progression during and after therapy [ , , , ]. we note that there are limitations in our liquidcna method. since our inference relies on heterogeneous copy number profiles and subclone-specific scnas, we cannot analyse cancer (sub)types with very low chromosomal instability, for example microsatellite unstable tumours. conversely, extremely high levels of ongoing instability might bias our analysis due to the lack of a stable subclone-associated scna profile, and therefore liquidcna is not suitable for oligo-metastatic disease if spatially separate metastases carry distinct karyotypes. furthermore, the accuracy of our estimation reduces at low purity (below %). however, tumour fractions above this regime were observed in a substantial number of patients, especially in late stage disease where liquidcna can offer the largest benefit [ , , , , , ]. in addition, recent studies have shown that the unique fragment length of tumour-derived cfdna can be utilised to enrich for tumour purity either experimentally or bioinformatically [ , , ]. finally, liquidcna tracks a single dominant subclone associated with the largest set of subclone-specific scnas, and if there are multiple smaller subclones (with fewer or no associated scnas), these will be ignored by the algorithm. in summary, we provide a robust tool to derive quantitative information about dynamic changes in clonal composition from scna measurements derived from cfdna. liquidcna enables real-time non-invasive tracking of subclonal tumour evolution, which can provide new insights into the evolution of scnas and the dynamical emergence of therapy-associated resistance.

acknowledgements

we thank ann-marie baker for reviewing the clarity of the text, and steve gendreau and craig cummings from genentech, inc. for providing access to patient cfdna sequencing results and for their critical comments on the presentation of the data. this work was supported by the wellcome trust (grant /z/ /z to t.a.g.) and cancer research uk (grant a to t.a.g. supporting e.l.; advanced clinician scientist fellowship c /a to m.l.; clinical research training fellowship to h.h.). m.l. also received support from a barts and the london charity strategic research grant ( / ). t.a.g. also received funding from the national institutes of health, national cancer institute (grant u ca ).

author contributions

e.l., w.h., m.l. and t.a.g. conceived and designed the study. m.l. and t.a.g. acquired funding for the study. e.l. developed the inference method and performed bioinformatic analysis. h.h. and m.m. performed in vivo experiments and sequencing. e.l. and t.a.g. wrote the original draft, and all authors reviewed and approved the manuscript.

competing interests

the authors declare no competing interest.
[figure: schematic panels of tumour copy number and measured copy number along genomic segments for successive samples; legend: clonal, subclonal and unstable segments; subclonal/resistant and ancestral/sensitive tumour cells; normal contamination; purity ($p_i$) and subclonal-ratio ($r_i$) of the tumour cell mixture]

figure : schematic of copy number measurements. the first panel shows the scna profile of ancestral (in yellow) and subclonal (in red) tumour cells. at different sampling time-points, the overall tumour scna profile is a mixture of these profiles (second panel), influenced by the composition of tumour-derived dna depicted on the pie-charts. clonal, subclonal and unstable segments are indicated in yellow, red and blue, respectively. note that the cn of clonal segments remains the same. in the liquid biopsies taken at each time-point, contamination from normal cells leads to 'flattened' measured scna profiles (last panel) due to normal cells having a neutral karyotype. this contamination affects the cn of each segment. our aim is to estimate purity ($p_i$) and subclonal-ratio ($r_i$) based on clonal and subclonal scnas.

[figure: panels a-i; flowchart steps: segment cn distribution, purity, purity-corrected segment cns, baseline sample, sample order, segment classification (clonal/normal, unstable, subclonal), maximal subclone sample, relative subclonal-ratio, subclonal-ratio; plot axes: segment cn vs density, purity estimate vs error of fit, segment Δcn per ordered sample, Δcn vs number of segments, subclonal-ratio per sample]

figure : illustration of the estimation algorithm. (a) outline of the steps of the estimation algorithm. (b) purity estimation based on the peaks of the distribution of segment cns. green lines show the peaks expected at an example purity of . . (c) the error of a range of purity estimates, computed from the distance of observed and estimated peaks in (b). each line corresponds to a smoothing kernel applied to the raw segment cn distribution. the optimal purity is indicated with an arrow. (d) change in segment cn values (Δcns) plotted according to an example sample order. the number of subclonal segments computed in (e) is indicated below. (e) classification of segments based on the sample order in (d). segments with low variance are classified as clonal (in grey). non-clonal segments are evaluated for whether they follow a quasi-monotone pattern (indicated by the shaded regions) and classified as unstable (outside of shaded region, in blue) or subclonal (in red). (f) Δcn values plotted according to the optimal sample order maximising subclonal segments. line colours indicate the class of each segment as in (e). (g) relative subclonal-ratio estimation compared to the maximal subclonal-ratio sample (right-most in (f)). points show individual segment-wise estimates, with an example segment highlighted in black. black line shows the median.
(h-i) subclonal-ratios and confidence intervals inferred by fitting a gaussian mixture model to the Δcn distribution of subclonal segments. the components of the best fit with means $-r$ and $r$ are shown in green and magenta in (h).

[figure: panels a-g; plot axes: ancestral cn vs subclonal cn (number of segments), true purity vs estimated purity, noise level (sigma) vs error in purity estimation, true subclonal-ratio vs estimated subclonal-ratio, noise level (sigma) vs error in subclonal-ratio estimation]

figure : estimation of mixtures of synthetic cell populations. (a) parameters used to randomly sample synthetic datasets including simulated measurement noise. the font-size of copy number states indicates their probability. (b) a randomly generated sample. the heatmap depicts the distribution of segment cns in ancestral and subclonal cells, and the proportion of cell populations is shown on the pie-chart (red: subclonal, yellow: ancestral, grey: normal). (c) copy number profile of the sample in (b), with raw bin-wise and segmented copy number values shown in black and red, respectively. (d) estimated purity of , synthetic samples with varying levels of noise ($\sigma$), plotted against the true theoretical purity. the y = x line is indicated with dashes. (e) error of purity estimation (absolute difference to true purity) for samples with noise level indicated on the x axis.
(f) true and estimated subclonal-ratio of synthetic datasets ( , samples) with varying levels of noise ($\sigma$). (g) error in subclonal-ratio estimation for datasets with increasing noise level. box-plot elements in (e) and (g) stand for: center line, median; box limits, upper and lower quartiles; whiskers, . x interquartile range; points, outliers.

[figure: panels a-g; copy number profiles of the ancestral/sensitive cell line (b ) and the subclonal/resistant cell line (b ); plot axes: sample vs subclonal ratio, true purity vs estimated purity, true subclonal-ratio vs estimated subclonal-ratio]

figure : estimation of mixtures of high grade serous ovarian cancer cell lines. (a) copy number profile of the ancestral/sensitive and subclonal/resistant hgsoc cell lines. raw bin-wise and segmented copy number values are shown in black and red, respectively. resistant-specific subclonal scnas are highlighted. (b) purity estimates of samples s -s . corrected values are computed using the linear fit in (c). theoretical purity values are indicated by maroon diamonds. (c) true (theoretical) and estimated tumour purity of in silico hgsoc cell line mixtures. y = x and the linear fit of the estimates (y = . x) are shown with dashed and solid lines, respectively. point shape and shade indicate total number of reads per sample. (d) subclonal-ratio estimates for samples s -s . shaded and empty bars indicate estimates derived using direct (gaussian fit) and two-step (from relative ratios in (f)) methods, respectively.
error bars show % confidence interval of the direct estimate, maroon diamonds indicate theoretical values. (e) true and estimated subclonal-ratio of in silico datasets constructed of samples from (c) with million reads. (f) relative subclonal-ratio estimates for samples s -s , compared to s . estimates from each subclonal segment are shown with dots, the median estimates are indicated by black lines, and true values with maroon diamonds. (g) true and estimated relative subclonal-ratio in the datasets shown in (g).

[figure: panels a-c, one per patient; copy number along chromosomes at baseline, week , week and end of therapy; legend: baseline scna, subclone-associated scna; % subclonal cells per sample]

figure : estimation in cfdna samples from patient data. subclone-specific copy number changes and subclonal-ratio in lung cancer patients (a) , (b) , and (c) from [ ]. left: purity-corrected scna profiles. yellow bars show the cn of each segment in the baseline sample, and red bars indicate subclonal deviations from this value in non-baseline samples. regions of subclone-specific cnas are also indicated by darker shades. right: estimated resistant proportion of each sample with % confidence intervals. note that only samples with > % purity were analysed (cf. s ). a bar of cn= on chromosome (indicated by asterisk) has been omitted from (c) for better visualisation.

methods

formal definition of the problem

copy number measurements

we consider a tumour that consists of two distinct cell populations, ancestral (a) and subclonal (s) tumour cells, and continuously sheds cell-free dna (cfdna) into the blood circulation. a typical scenario would be ancestral cells representing drug-sensitive tumour cells present before cancer therapy, and subclonal cells denoting the emerging subclone with resistance to therapy. the proportion of dna originating from these two cell types changes over time as we take measurements via blood samples (fig. ). since cell-free dna found in blood can also originate from normal (non-tumour) cells of the body, the measured dna is contributed by a mixture of the two tumour cell populations (a and s) and normal cells (n). at each time-point $i$, the proportion of these three populations in the measured sample, $S_i$, depends on the proportion of all tumour-derived dna (the purity of the sample, $p_i$) and the proportion of subclone-derived dna from the tumour (the subclonal-ratio, $r_i$):

$n_i = 1 - p_i; \quad a_i = p_i \cdot (1 - r_i); \quad s_i = p_i \cdot r_i.$ ( )

our aim is to track the dynamics of the subclonal (putatively resistant) population by determining the subclonal-ratio for each time-point, $r_i$, or the change in subclonal-ratio between time-points, $r_i / r_k = r_{ik}$. to this end, we use the copy number values as typically measured by lpwgs of the sequential cfdna samples. let us consider segments: distinct genomic regions with a homogeneous copy number state. we assume that the copy number (cn) state of most segments stays constant over time in a particular population.
therefore the $j$th segment is characterised by a set of three time-independent absolute cn states, $c(n)^j$, $c(a)^j$, $c(s)^j$, corresponding to the local cn in normal, ancestral and subclonal cells, respectively. the copy number of segment $j$ as measured in the $i$th sample, $c_i^j$, is the combination of these three absolute cns, weighted by the proportions of dna derived from the three cell populations at that time-point ($n_i$, $a_i$, $s_i$). we know that normal cells are at a diploid state, hence $c(n)^j = 2$ for all $j$. therefore, using the purity and subclonal-ratio defined in eq. ( ),

$c_i^j = 2 + p_i \left( (1 - r_i)\, c(a)^j + r_i\, c(s)^j - 2 \right).$ ( )

since all cells in a cell population share the absolute cn for a given segment, the values $c(s)^j$ and $c(a)^j$ are always integers. therefore in theory, measured cns from a given sample should be limited to a discrete set of values defined by these integer states, making it possible to solve the set of equations formed by eq. ( ) for $p_i$ and $r_i$ using linear algebra. however, we have to take into account that all real sequencing measurements have a level of imprecision introducing variation on top of this relationship. using the term $\eta_i^j$ to represent the noise in the $i$th measurement of segment $j$, eq. ( ) becomes

$c_i^j = 2 + p_i \left( (1 - r_i)\, c(a)^j + r_i\, c(s)^j - 2 \right) + \eta_i^j,$ ( )

with the magnitude and family of this noise depending on the specifics of the technology used for cn measurement, especially the sequencing depth [ ]. this measurement noise, associated with a continuous distribution, broadens the set of $c_i^j$ values, rendering a linear algebra solution impossible.
hence, our aim becomes to derive an inference of $p_i$ and $r_i$ despite this unknown noise, $\eta_i^j$.

segment classification

each segment can fall into one of three categories depending on its respective copy number states in the two types of cells.

(i) clonal segments have the same absolute cn in ancestral and subclonal tumour cells, $c(a)^j = c(s)^j$. a special case of clonal segments are segments of neutral cn, where $c(a)^j = c(s)^j = 2$.

(ii) subclonal segments have different absolute cns in the ancestral and subclonal tumour population, $c(a)^j \neq c(s)^j$. these segments represent scnas that distinguish the subclone from its ancestor, even though they are not necessarily associated with a selective/phenotypic difference (e.g. drug-resistance) directly.

(iii) unstable segments are neither clonal nor associated with the emergent subclone, and therefore are best described by a time-dependent tumour-wide cn value, $\zeta(t)_i^j$, that does not depend on $r_i$. these segments can arise if a genomic region cannot be measured reliably or if on-going genomic instability introduces novel scnas during the time tracked by our samples. we can assume that the number of such segments is small compared to the total number of measured segments.

depending on whether segments are clonal, subclonal or unstable, their measured cn across samples will change according to the subclonal-ratio and purity of each sample. for simplicity, we omit the term $\eta_i^j$ and its derivatives, but the reader should keep in mind that all equations are subject to measurement noise:

$c_i^j = 2 + p_i \left( c(a)^j - 2 \right)$, if the segment is clonal, ( )

$c_i^j = 2 + p_i \left( c(a)^j - 2 + r_i \left( c(s)^j - c(a)^j \right) \right)$, if the segment is subclonal, ( )

$c_i^j = 2 + p_i\, \zeta(t)_i^j$, if the segment is unstable. ( )

figure illustrates how the measured cn of segments depends on the parameters $r_i$ and $p_i$ highlighted above. in the following sections, we use eqs. ( ) & ( ) to estimate the underlying parameters, $p_i$ and $r_i$, via three steps (fig. ).
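As a worked illustration of the mixture model above, the following minimal sketch computes the noise-free measured CN of a segment as the proportion-weighted mean of the normal, ancestral and subclonal copy numbers. The function names are hypothetical and not part of liquidCNA; the diploid normal state (CN = 2) follows the definition in the text.

```python
def mixture_proportions(purity, ratio):
    """Return the (normal, ancestral, subclonal) DNA proportions of a sample."""
    return 1.0 - purity, purity * (1.0 - ratio), purity * ratio


def measured_cn(purity, ratio, cn_ancestral, cn_subclonal, cn_normal=2):
    """Noise-free measured CN of one segment (hypothetical helper).

    The three absolute CNs are weighted by the DNA proportions of the
    three populations; normal cells are assumed diploid by default.
    """
    normal, ancestral, subclonal = mixture_proportions(purity, ratio)
    return normal * cn_normal + ancestral * cn_ancestral + subclonal * cn_subclonal


# Example values (assumed for illustration): purity 0.5, subclonal-ratio 0.4.
clonal_segment = measured_cn(0.5, 0.4, cn_ancestral=3, cn_subclonal=3)
subclonal_segment = measured_cn(0.5, 0.4, cn_ancestral=3, cn_subclonal=1)
```

With these assumed values, the clonal segment evaluates to 2 + 0.5 · (3 − 2) and the subclonal segment to 2 + 0.5 · (3 − 2 + 0.4 · (1 − 3)), matching the clonal and subclonal cases of the equations above.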
estimation algorithm

purity estimation

purity estimation is carried out based on clonal (including neutral) segments. in general, we expect the majority of segments to fall into this category. consequently, for the majority of segments their measured copy number follows eq. ( ). since $c(a)^j$ can take only integer values, the distribution of segment cns is expected to have distinct peaks at regular intervals of $p_i$. using a peak-finder algorithm on the smoothed distribution of measured cn values, we directly compare the peaks to the values expected at a given purity, $\{2 - p_i,\ 2,\ 2 + p_i,\ 2 + 2p_i,\ \dots\}$, as shown in fig. b. the error of the fit to a purity, $p_i$, is evaluated as the summed squared distance between each expected peak and the closest observed peak,

$\sum_{c(a)} \min \left( \left( 2 + p_i \left( c(a) - 2 \right) \right) - \mathrm{peaks} \right)^2.$ ( )

as the detected peaks of the data depend on the smoothing kernel used on the distribution, we perform this computation for a wide range of smoothing bandwidths ( . × to . × the default value) and derive the purity estimate, $\hat{p}_i$, as the value that minimises the mean and/or median error across the range (fig. c). then, we use the derived $\hat{p}_i$ to re-normalise the measured copy number values and thus eliminate normal contamination. we gain an estimate of the tumour-specific cn, $c(t)_i^j$, a mixture of ancestral and subclonal cns:

$\hat{c}(t)_i^j = \frac{1}{\hat{p}_i} \cdot \left( c_i^j - 2 \right) + 2 \approx c(a)^j + r_i \left( c(s)^j - c(a)^j \right).$ ( )

note that, due to the noise in measurements, peaks from close absolute cns can become indistinguishable in low-purity samples.
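The peak-matching idea described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the liquidCNA implementation: the Gaussian kernel smoother, the bandwidth values, the candidate purity grid (0.10 to 1.00), the set of absolute CN states (1 to 4) and all function names are assumptions made for the example. The candidate grid starts above zero because the error trivially shrinks as all expected peaks collapse onto 2.

```python
import math


def kernel_density(values, grid, bandwidth):
    """Unnormalised Gaussian kernel density of `values` evaluated on `grid`."""
    return [sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
            for x in grid]


def find_peaks(grid, density):
    """Grid positions where the density has a local maximum."""
    return [grid[k] for k in range(1, len(grid) - 1)
            if density[k] > density[k - 1] and density[k] >= density[k + 1]]


def fit_error(purity, peaks, cn_states=(1, 2, 3, 4)):
    """Summed squared distance from each expected peak to the closest observed peak."""
    return sum(min((2 + purity * (c - 2) - peak) ** 2 for peak in peaks)
               for c in cn_states)


def estimate_purity(segment_cns, bandwidths=(0.02, 0.03, 0.04), step=0.005):
    """Purity minimising the mean fit error across smoothing bandwidths."""
    low, high = min(segment_cns) - 0.5, max(segment_cns) + 0.5
    grid = [low + k * step for k in range(int((high - low) / step) + 1)]
    peak_sets = [find_peaks(grid, kernel_density(segment_cns, grid, bw))
                 for bw in bandwidths]

    def mean_error(purity):
        return sum(fit_error(purity, peaks) for peaks in peak_sets) / len(peak_sets)

    candidates = [k / 100 for k in range(10, 101)]  # assumed search range
    return min(candidates, key=mean_error)


def purity_corrected_cn(cn, purity):
    """Remove normal contamination from a measured CN value."""
    return (cn - 2) / purity + 2
```

For example, segment CNs clustered around 1.5, 2.0, 2.5 and 3.0 are matched best by a purity close to 0.5, and `purity_corrected_cn(2.1, 0.5)` recovers a tumour-specific CN of about 2.2.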
because peaks merge in low-purity samples, we expect purity values below % to be indistinguishable (unless high sequencing depth is available) and also advise discarding samples with low purity (typically $p_i <$ . ), as erroneous purity estimations can bias downstream computation.

identifying subclonal segments and sample order

next, we aim to identify the subset of segments with subclone-specific scnas that reflect the changes in subclonal-ratio over time. to easily assess the change in segment cns, we designate a sample as baseline, and compute the change in segment cn, Δcn, between each sample and this baseline sample. typically, the sample taken upon diagnosis or before start of therapy (usually the first time-point, s ) can be used. we can assume that this sample has no or only a negligible population of the emerging subclone, and therefore represents a pure ancestral population: $r_0 \approx 0 \rightarrow c(t)_0^j \approx c(a)^j$. hence the change in cn of a subclonal segment compared to the baseline becomes

$\Delta c(t)_i^j = c(t)_i^j - c(t)_0^j = r_i \left( c(s)^j - c(a)^j \right).$ ( )

furthermore, eq. ( ) provides an informative quantity even if the baseline sample is not pure, as $\Delta c(t)_i^j$ nonetheless describes the change in subclone-specific scnas. in order to uncover which segments are truly subclonal, and how the subclonal-ratio changes over measurements, we need to identify a pervasive pattern across samples, and the subset of segments that consistently follows it. if the samples were taken so that the subclonal population increases over time-points, this pattern would be a monotone increase or decrease for all segments with subclone-specific scnas. while we cannot assume that the samples are taken in order of increasing subclonal proportions (e.g. a change of treatment between sampling times might lead to fluctuating population size in a resistance-associated subclone), we can aim to re-arrange them to follow this rule. consequently, we rephrase our aim as deriving (i) a set of subclonal segments that follow a monotone pattern across ordered samples; and (ii) an ordering of samples that is followed by the maximum number of (subclonal) segments. formally, we are looking for a subset of segments, $\{j_1, j_2, \dots\}$, and a permutation of samples (starting from the designated baseline sample), $S_0, S_i, \dots, S_n$, where for every segment $j \in \{j_1, j_2, \dots\}$ either

$\Delta c(t)_{i+1}^j - \Delta c(t)_i^j > -\epsilon, \ \forall i$ or ( )

$\Delta c(t)_{i+1}^j - \Delta c(t)_i^j < \epsilon, \ \forall i$

holds for a pre-defined accuracy level, $\epsilon$. we use an $\epsilon > 0$ accuracy level to allow for samples with near-equal subclonal-ratio measured with uncertainty. we find that, for typical lpwgs datasets, $\epsilon \approx$ . to . works well to account for the underlying measurement noise. figs. d-f illustrate the derivation of optimal sample order and subclonal segment set. we first separate clonal segments: since these have relative cn values of , apart from some measurement noise, we filter out any segment that has a standard deviation below a pre-defined threshold. we then evaluate eq. ( ) over all remaining segments and over all orderings of the samples. as we expect - time-points per dataset, an exhaustive search of all possible permutations is feasible. given a permutation, each segment is classified according to whether or not it follows eq. ( ), yielding candidate subclone-specific and unstable segments, respectively (fig. e). the optimal sample order is defined as the permutation that maximises the number of subclonal segments (fig. f).
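The exhaustive search over sample orders can be sketched as below. This is a minimal illustration rather than the liquidCNA code: the data layout (a dictionary mapping each segment to its ΔCN values, with the baseline at index 0), the function names, and the default values of the accuracy level `eps` and the clonal standard-deviation threshold `sd_threshold` are all assumptions chosen for the example.

```python
from itertools import permutations
from statistics import pstdev


def is_quasi_monotone(values, eps):
    """True if consecutive differences are all > -eps or all < eps."""
    diffs = [later - earlier for earlier, later in zip(values, values[1:])]
    return all(d > -eps for d in diffs) or all(d < eps for d in diffs)


def classify_segments(delta_cn, order, eps=0.05, sd_threshold=0.03):
    """Label each segment as clonal, subclonal or unstable for one sample order."""
    labels = {}
    for segment, values in delta_cn.items():
        if pstdev(values) < sd_threshold:
            labels[segment] = "clonal"      # near-constant across samples
        elif is_quasi_monotone([values[k] for k in order], eps):
            labels[segment] = "subclonal"   # follows the ordered pattern
        else:
            labels[segment] = "unstable"
    return labels


def best_order(delta_cn, eps=0.05, sd_threshold=0.03):
    """Try every order that starts at the baseline (index 0) and keep the
    one with the most subclonal segments; ties keep the first order found."""
    n_samples = len(next(iter(delta_cn.values())))
    best = None
    for rest in permutations(range(1, n_samples)):
        order = (0,) + rest
        labels = classify_segments(delta_cn, order, eps, sd_threshold)
        score = sum(label == "subclonal" for label in labels.values())
        if best is None or score > best[0]:
            best = (score, order, labels)
    return best
```

Because the baseline is fixed in first position, a dataset with n non-baseline samples requires n! evaluations, which stays small for the handful of time-points typical of longitudinal sampling.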
subclonal-ratio estimation

finally, we use the set of segments identified as subclonal to compute the subclonal-ratio at each time point. we derive the (absolute) subclonal-ratio, $r_i$, for each sample using eq. ( ). as both $c^{(a)}_j$ and $c^{(s)}_j$ are assumed to be integers, and we know that $c^{(a)}_j \neq c^{(s)}_j$ for subclonal segments,

$$\Delta c^{(t)}_{ji} \in \{\ldots, -2r_i, -r_i, r_i, 2r_i, \ldots\}, \quad j \in \{j_1, j_2, \ldots\}. \quad ( )$$

to take into account that the measured $\Delta$cns compared to the baseline, $\Delta \hat{c}^{(t)}_{ji}$, are influenced by noise, we fit these values with a mixture of gaussian distributions whose means follow eq. ( ), as illustrated in fig h. the subclonal-ratio of a sample is derived as the constrained mean parameter, $r_i$, of the gaussian mixture optimising the fit (fig. i). the % confidence interval of the inferred subclonal-ratio is computed based on the (shared) variance of the fitted constrained gaussians. the measurement noise propagated from segment cns can lead to high spread in values, making estimates less robust and rendering the resolution of low subclonal-ratios ($r_i \lesssim$ . ) challenging, occasionally causing the gaussian-fitting step to fail. therefore we also derive relative subclonal-ratios, which allow for a more general application not limited to good-quality samples. in particular, relative values are compared to the maximal sample, since its subclonal-ratio is assumed to be the most robust against measurement noise.
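the constrained gaussian-mixture fit of the absolute subclonal-ratio can be sketched as a one-dimensional maximum-likelihood search. the python sketch below is a simplified stand-in, not the authors' fitting routine: it assumes equal mixture weights, a fixed shared standard deviation `sigma` (the source fits the shared variance), component means at $k \cdot r$ for $k = \pm 1, \ldots, \pm$`max_k`, and a plain grid search over $r$. all names and default values are hypothetical.

```python
import math


def lattice_loglik(deltas, r, sigma, max_k=2):
    """Log-likelihood of delta-CN values under an equal-weight Gaussian
    mixture whose means are constrained to k * r for k = ±1..±max_k."""
    ks = [k for k in range(-max_k, max_k + 1) if k != 0]
    norm = len(ks) * sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for d in deltas:
        dens = sum(math.exp(-0.5 * ((d - k * r) / sigma) ** 2) for k in ks)
        total += math.log(dens / norm + 1e-300)  # guard against log(0)
    return total


def fit_subclonal_ratio(deltas, sigma=0.05, step=0.005, max_k=2):
    """Grid search for the r in (0, 1] maximising the constrained likelihood.

    sigma and step are illustrative assumptions, not values from the source.
    """
    grid = [i * step for i in range(1, int(round(1.0 / step)) + 1)]
    return max(grid, key=lambda r: lattice_loglik(deltas, r, sigma, max_k))


if __name__ == "__main__":
    # Delta-CN values scattered around ±0.3 and ±0.6, i.e. r close to 0.3
    deltas = [0.29, 0.31, -0.30, 0.61, 0.59, -0.58, 0.30]
    print(fit_subclonal_ratio(deltas))
```

note that if all subclonal segments differ by the same copy number step, the lattice alone cannot distinguish $r$ from $r/2$; the example data therefore include both single and double steps.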
we compute the relative deviation of each normalised subclonal tumour segment cn,

$$\Delta c_j^{in} = \frac{\Delta c^{(t)}_{ji}}{\Delta c^{(t)}_{jn}} = \frac{r_i \left( c^{(s)}_j - c^{(a)}_j \right)}{r_n \left( c^{(s)}_j - c^{(a)}_j \right)} = \frac{r_i}{r_n}, \quad ( )$$

giving rise to a distribution of relative subclonal-ratio estimates (fig. g). we derive a point estimate for the relative $r_i$ of each sample as the median of this set,

$$\hat{r}_{in} = \mathrm{median}\left( \Delta c_j^{in} \right), \quad j \in \{j_1, j_2, \ldots\}. \quad ( )$$

absolute subclonal-ratio estimates can then be derived from these relative estimates in a two-step estimation process (as opposed to the direct estimation above): we derive $r_n$ based on eq. ( ), and subsequently compute $\hat{r}_{in} \cdot r_n$ to retrieve $r_i$.

generating synthetic datasets

we constructed synthetic datasets of segments (of length varying between and bins) and time-points as illustrated in fig. a. for each segment, we generated ancestral (sensitive) segment copy number states ($c^{(a)}_j$) by randomly sampling from { , , , , }, with neutral and close-to-neutral states occurring with higher frequency. subclone-specific absolute cns ($c^{(s)}_j$) were assigned by randomly sampling from $c^{(a)}_j$ + {− , − , , , }, with no change (giving rise to clonal segments) having a higher weight. for each sample, $s_i$, we assigned purity and subclonal-ratio randomly from the ranges . < $p_i$ < . and . < $r_i$ < . , with the exception of the baseline samples, where $r_0$ < . . we then recreated the measurement procedure of computing noise-ridden raw cn values in a given segment, $j$, by adding normally distributed noise. the magnitude (standard deviation) of the noise was controlled by the noise level parameter, $\sigma$ (representing differences arising from e.g. sequencing depth), and the cn of the segment (reflecting higher variance in higher cn states):

$$\mathrm{rawc}^{\mathrm{bin}}_i = 2 + p_i \left( (1 - r_i)\, c^{(a)}_j + r_i\, c^{(s)}_j - 2 \right) + \mathrm{normal}\left( 0, f(\sigma, c_{ji}) \right).$$

the final cn value of each segment, $\hat{c}_{ji}$, was computed as the mean of all $\mathrm{rawc}^{\mathrm{bin}}_i$ contained in the segment. in addition, we selected . - % of segments as unstable, and re-sampled their tumour-specific cn value to be independent of $r_i$. fig. b-c show parameters of a synthetic sample and its copy number profile.

generating in vitro and in silico cell line mixtures

hgsoc cell line ovcar was obtained from prof fran balkwill (barts cancer institute, uk) and grown in dmem media containing % fbs and % penicillin/streptomycin. a resistant/subclonal hgsoc cell line (ov cis) was generated by culturing an aliquot of the ancestral ovcar cell line in increasing concentrations of cisplatin. for further details on cell culture and the cell lines, see [ ]. we then extracted genomic dna from both cell lines and from blood samples from healthy volunteers using the qiaamp dna micro kit (qiagen, hilden, germany). genomic dna from the three sources was mixed in varying proportions (table s ), measured as the mass of dna inputted from each source, to a total of ng dna per sample and subjected to sonication with the covaris m system. libraries were prepared using the nebnext ultra ii kit (new england biolabs, hitchin, united kingdom) with cycles of pcr amplification, indexed with unique dual indexing primers and sequenced on illumina novaseq to a mean depth of . x. in silico mixtures were generated by bioinformatically mixing sequencing reads of dna derived from the ancestral/sensitive and subclonal/resistant tumour cell lines and healthy blood cells. similarly to synthetic samples, for each in silico sample we randomly assigned purity, . < $p_i$ < . , and subclonal-ratio, . < $r_i$ < . .
we then sampled reads (using samtools view -s) from aligned read (bam) files of ‘pure’ ancestral, subclonal and normal samples (b , b and n ) in proportions to match $p_i(1 - r_i)$, $p_i r_i$ and $1 - p_i$, respectively. we also varied the total number of reads per sample (as a proxy for sequencing depth and consequently measurement noise), and generated - samples with , , , and million total reads each.

processing lpwgs samples

fastq files derived from lpwgs samples (generated via sequencing cell line mixtures or obtained from [ ]) were aligned to the human reference genome (version hg , using bwa). we then processed bam files using the qdnaseq r package [ ] employing dnacopy for segmentation [ ]. qdnaseq produced two copy number values for each genomic bin: a raw pre-segmentation value and a segmented value grouping bins of equal cn together. the cn of bins on the pre-defined blacklist of qdnaseq and of those with < % mappability was set to na. raw and segmented cn values for all cell line samples are available from https://github.com/elakatos/liquidcna_data. since qdnaseq returns normalised cn values (with neutral state at ), we multiplied all values by before proceeding with the estimation algorithm and re-normalised segment cn values to be centred at exactly. we then re-defined segment boundaries using the ensemble of samples as regions of constant cn in all samples. this way break-points present in only a sub-set of samples (such as a subclone-specific scna) gave rise to segments handled separately for all samples. updated segments with length below mega-bases ( bins of kb (cell line mixtures) or bins of kb (patient cfdna samples)) were excluded from the downstream analysis to filter out short segments sensitive to localised measurement biases. finally, we curated each segment cn by discarding bins with the most extreme . % of raw segment values, and re-calculating the segment cn value as the mean of a normal distribution fitted to the remaining raw cns. we found that this curation had negligible effect for most segments, but successfully improved assigned segment cn values for more error-prone genomic regions.

data availability

aligned sequencing data from hgsoc cell lines and in vitro mixtures (listed in table s ) are available from the european nucleotide archive (accession prjeb ). raw and post-segmentation copy number values for these samples are available from https://github.com/elakatos/liquidcna_data.

code availability

estimation functions of liquidcna implemented in r (version . . ), an illustrative example in a jupyter notebook and code generating and analysing synthetic and in silico data are available from https://github.com/elakatos/liquidcna.

references

[ ] siravegna, g., marsoni, s., siena, s. & bardelli, a. integrating liquid biopsies into the management of cancer. nature reviews clinical oncology , – ( ). url https://doi.org/ . /nrclinonc. . . [ ] ng, s. b. et al.
individualised multiplexed circulating tumour dna assays for monitor- ing of tumour presence in patients after colorectal cancer surgery. scientific reports , – ( ). url https://pubmed.ncbi.nlm.nih.gov/ . [ ] rothwell, d. g. et al. utility of ctdna to support patient selection for early phase clinical trials: the target study. nat med , – ( ). [ ] khan, k. h. et al. longitudinal liquid biopsy and mathematical modeling of clonal evolution forecast time to treatment failure in the prospect-c phase ii colorectal cancer clinical trial. cancer discov , – ( ). [ ] fernandez-garcia, d. et al. plasma cell-free dna (cfdna) as a predictive and prognostic marker in patients with metastatic breast cancer. breast cancer research , ( ). url https://doi.org/ . /s - - - . [ ] conteduca, v. et al. plasma tumour dna as an early indicator of treatment response in metastatic castration-resistant prostate cancer. british journal of cancer ( ). url https://doi.org/ . /s - - - . [ ] nakamura, y. et al. clinical utility of circulating tumor dna sequencing in advanced gastrointestinal cancer: scrum-japan gi-screen and gozila studies. nature medicine ( ). url https://doi.org/ . /s - - - . [ ] diaz, l. a. j. et al. the molecular evolution of acquired resistance to targeted egfr blockade in colorectal cancers. nature , – ( ). [ ] bettegowda, c. et al. detection of circulating tumor dna in early- and late-stage human malignancies. sci transl med , ra ( ). [ ] newman, a. m. et al. an ultrasensitive method for quantitating circulating tumor dna with broad patient coverage. nature medicine , – ( ). url https: //doi.org/ . /nm. . [ ] parkinson, c. a. et al. exploratory analysis of tp mutations in circulating tumour dna as biomarkers of treatment response for patients with relapsed high-grade serous ovarian carcinoma: a retrospective study. plos medicine , e ( ). url https://europepmc.org/articles/pmc . [ ] beroukhim, r. et al. the landscape of somatic copy-number alteration across human cancers. 
nature , – ( ). url https://doi.org/ . /nature . [ ] hanahan, d. & weinberg, r. a. hallmarks of cancer: the next generation. cell , – ( ). url https://doi.org/ . /j.cell. . . . [ ] sansregret, l., vanhaesebroeck, b. & swanton, c. determinants and clinical implications of chromosomal instability in cancer. nature reviews clinical oncology , – ( ). url https://doi.org/ . /nrclinonc. . . [ ] li, x. et al. temporal and spatial evolution of somatic chromosomal alterations: a case-cohort study of barrett’s esophagus. cancer prev res (phila) , – ( ). [ ] hieronymus, h. et al. tumor copy number alteration burden is a pan-cancer prognostic factor associated with recurrence and death. elife ( ). [ ] rubin, c. e. et al. dna aneuploidy in colonic biopsies predicts future development of dysplasia in ulcerative colitis. gastroenterology , – ( ). [ ] zaccaria, s. & raphael, b. j. accurate quantification of copy-number aberrations and whole-genome duplications in multi-sample tumor sequencing data. nature communications , ( ). url https://doi.org/ . /s - - -y. [ ] scheinin, i. et al. dna copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly. genome res , – ( ). [ ] adalsteinsson, v.
a. et al. scalable whole-exome sequencing of cell-free dna reveals high concordance with metastatic tumors. nature communications , ( ). url https://doi.org/ . /s - - -y. [ ] van roy, n. et al. shallow whole genome sequencing on circulating cell-free dna allows reliable noninvasive copy-number profiling in neuroblastoma patients. clin cancer res , – ( ). [ ] hovelson, d. h. et al. rapid, ultra low coverage copy number profiling of cell-free dna as a precision oncology screening strategy. oncotarget , – ( ). [ ] chin, s.-f. et al. shallow whole genome sequencing for robust copy number profiling of formalin-fixed paraffin-embedded breast cancers. experimental and molecular pathology , – ( ). url http://www.sciencedirect.com/science/article/pii/s . [ ] chen, x. et al. low-pass whole-genome sequencing of circulating cell-free dna demonstrates dynamic changes in genomic copy number in a squamous lung cancer clinical cohort. clinical cancer research , – ( ). url https://clincancerres.aacrjournals.org/content/ / / . https://clincancerres.aacrjournals.org/content/ / / .full.pdf. [ ] belic, j. et al.
mfast-seqs as a monitoring and pre-screening tool for tumor-specific aneuploidy in plasma dna. adv exp med biol , – ( ). [ ] vanderstichele, a. et al. chromosomal instability in cell-free dna as a highly specific biomarker for detection of ovarian cancer in women with adnexal masses. clin cancer res , – ( ). [ ] taylor, f., bradford, j., woll, p. j., teare, d. & cox, a. unbiased detection of somatic copy number aberrations in cfdna of lung cancer cases and high-risk controls with low coverage whole genome sequencing. adv exp med biol , – ( ). [ ] wei, t. et al. genome-wide profiling of circulating tumor dna depicts landscape of copy number alterations in pancreatic cancer with liver metastasis. mol oncol , – ( ). [ ] hoare, j. et al. platinum resistance induces diverse evolutionary trajecto- ries in high grade serous ovarian cancer. biorxiv ( ). url https: //www.biorxiv.org/content/early/ / / / . . . . https:// www.biorxiv.org/content/early/ / / / . . . .full.pdf. [ ] nelson, l. et al. a living biobank of ovarian cancer ex vivo models reveals profound mitotic heterogeneity. nature communications , ( ). url https://doi. org/ . /s - - - . [ ] network, c. g. a. r. integrated genomic analyses of ovarian carcinoma. nature , – ( ). url https://pubmed.ncbi.nlm.nih.gov/ . [ ] soria, j.-c. et al. a phase ib dose-escalation study of the safety and pharmacoki- netics of pictilisib in combination with either paclitaxel and carboplatin (with or without bevacizumab) or pemetrexed and cisplatin (with or without bevacizumab) in patients with advanced non–small cell lung cancer. european journal of cancer , – ( ). url http://www.sciencedirect.com/science/article/pii/ s . [ ] housman, g. et al. drug resistance in cancer: an overview. cancers (basel) , – ( ). [ ] gatenby, r. a., silva, a. s., gillies, r. j. & frieden, b. r. adaptive therapy. cancer res , – ( ). [ ] enriquez-navas, p. m., wojtkowiak, j. w. & gatenby, r. a. application of evolu- tionary principles to cancer therapy. 
cancer res , – ( ). [ ] zhang, j., cunningham, j. j., brown, j. s. & gatenby, r. a. integrating evolutionary dynamics into treatment of metastatic castrate-resistant prostate cancer. nature communications , ( ). url https://doi.org/ . /s - - - . [ ] choudhury, a. d. et al. tumor fraction in cell-free dna as a biomarker in prostate cancer. jci insight ( ). url https://doi.org/ . /jci.insight. . [ ] phallen, j. et al. direct detection of early-stage cancers using circulating tumor dna. science translational medicine , eaan ( ). url https://pubmed.ncbi.nlm.nih.gov/ . [ ] mouliere, f. et al. high fragmentation characterizes tumour-derived circulating dna. plos one , – ( ). url https://doi.org/ . /journal.pone. . [ ] underhill, h. r. et al. fragment length of circulating tumor dna. plos genetics , – ( ). url https://doi.org/ . /journal.pgen. . [ ] mouliere, f. et al. enhanced detection of circulating tumor dna by fragment size analysis. science translational medicine ( ). url https://stm.sciencemag.org/content/ / /eaat . https://stm.sciencemag.org/content/ / /eaat .full.pdf. [ ] venkatraman, e. s.
& olshen, a. b. a faster circular binary segmentation algorithm for the analysis of array cgh data. bioinformatics , – ( ).

analysis and forecasting of global rt-pcr primers for sars-cov-

gowri nayar ,*, edward e. seabolt , mark kunitomi , akshay agarwal , kristen l. beck , vandana mukherjee , and james h. kaufman
ibm research, san jose, , usa
*gowri.nayar@ibm.com
+these authors contributed equally to this work

abstract

rapid tests for active sars-cov- infections rely on reverse transcription polymerase chain reaction (rt-pcr). rt-pcr uses reverse transcription of rna into complementary dna (cdna) and amplification of specific dna (primer and probe) targets using polymerase chain reaction (pcr). the technology makes rapid and specific identification of the virus possible based on nucleic acid sequence homology, and is much faster than tissue culture or animal cell models. however, the technique can lose sensitivity over time as the virus evolves and the target sequences diverge from the selective primer sequences. different primer sequences have been adopted in different geographic regions.
as we rely on these existing rt-pcr primers to track and manage the spread of the coronavirus, it is imperative to understand how sars-cov- mutations, over time and geographically, diverge from the primers in use today. in this study, we analyze the performance of the sars-cov- primers in use today by measuring the number of mismatches between primer sequence and genome targets over time and spatially. we find that there is a growing number of mismatches, an increase of % per month, as well as a high specificity of the virus based on geographic location.

introduction

as the sars-cov- pandemic grows, rapid testing is an essential method for controlling its spread and determining readiness for the re-opening of public life. rapid tests for active sars-cov- infections are based on reverse transcription polymerase chain reaction (rt-pcr). these tests consist of a forward primer, reverse primer, and probe that together are used to amplify the signal from the targeted virus within a sample. the approach supports rapid and specific identification of the virus, and does not depend on tissue culture or animal cell models. however, rna viruses evolve over time and a specific pcr test may lose sensitivity as the genotypic distribution of the virus changes or shifts. phylodynamic studies suggest the mutation rate of sars-cov- is in the range . x – to . x – substitutions per site per year, approximately . % variation increase per month, consistent with mutation rates reported for other coronaviridae. – sequence drift also leads to geospatial differences in the virus, resulting in varying test sensitivity by region. this study investigates the effectiveness of current sars-cov- pcr tests as the virus develops in space and time, and projects how the performance of each may change as the virus undergoes mutation.
by taking a global perspective, using specific pcr protocols from several different countries together with genomic data from around the globe, our analysis shows how the existing tests respond differently over both time and location. by analyzing the number of mismatches of the pcr primers with respect to the sequenced sars-cov- genomes, we can measure how the targeted proteins are mutating. this provides an understanding of possible shortcomings of current tests, and suggests how often we may need to update those tests in the future. through this work, we observe an average rate of amino acid sequence change of approximately % per month for the targeted proteins. furthermore, we see that the virus genotype is spatially differentiated to the point that inter-country pcr testing already leads to a much higher rate of mismatches.

in support of the global pandemic response, several countries have published their rt-pcr protocols. we have collected the primer sequences and protocols developed for six different regions – usa, germany, china, hong kong, japan, and thailand – as provided by the who . for all six protocols, we collect the forward, reverse, and probe sequences for each specific gene target. table details the different gene targets for each protocol. most commonly, the pcr tests target the nucleoprotein (np), followed by targets in the rna-directed rna polymerase (rdrp) gene, and the envelope small membrane protein (e protein). np is a structural protein that encapsidates the negative strand rna. for other rna viruses including influenza, the np sequence is often used for species identification. rna-dependent rna polymerase (rdrp) is an enzyme that catalyzes the replication of rna from an rna template. the membrane-associated rdrp is an essential protein for coronavirus replication, and may be a primary target for the antiviral drug remdesivir. the e protein is a small membrane protein involved in assembly, budding, envelope formation, and pathogenesis. the sars-cov e protein also forms a ca2+-permeable ion channel that alters homeostasis within cells, which leads to the overproduction of il- beta , .

results

primer comparison

using these methods, we observed high sequence homology for at least % of all genomes for most of the pcrs, showing that each primer is able to detect most of the sars-cov- genomes sequenced at the time of this report. table shows the percent of genomes hit by each pcr test, labelled by the country and target gene region. the america rp is an additional primer/probe set to detect the human rnase p gene to control for non-viral genes in the sample, and therefore, as expected, % of the sars-cov- genomes match with this set. however, when we look at the number of mismatches for each pcr for those hit genomes, we can see that there is a significant difference in performance between the tests. figure shows the number of mismatches for all genomes created by each pcr, where we can see the range varying from , created by the american n primer, to mismatches, created by the french ip primer. thus we observe that the number of mismatches can be used as a proxy for the amount of variation found within the gene sequences that are being targeted by the worldwide tests.

time analysis

following the methods described in section , all genomes that fall within the day range are segmented by date of collection and analyzed for mismatches to the various primer tests.
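the per-day mismatch measure used in this analysis can be sketched in a few lines of python. the study's actual alignment procedure is described in a methods section not reproduced here, so the sketch below substitutes a simple ungapped sliding-window hamming distance for illustration; the function names and the `(collection_date, sequence)` record layout are assumptions, not part of the original work.

```python
from collections import defaultdict


def min_mismatches(primer, genome):
    """Smallest Hamming distance between the primer and any window of the
    genome with the same length (ungapped comparison, an assumption here)."""
    m = len(primer)
    if len(genome) < m:
        raise ValueError("genome is shorter than the primer")
    return min(
        sum(a != b for a, b in zip(primer, genome[i:i + m]))
        for i in range(len(genome) - m + 1)
    )


def daily_average_mismatches(primer, records):
    """Average mismatches per collection date.

    records: iterable of (collection_date, genome_sequence) pairs.
    Each day's total is normalised by the number of genomes sampled that day.
    """
    totals = defaultdict(lambda: [0, 0])  # date -> [sum of mismatches, count]
    for date, sequence in records:
        totals[date][0] += min_mismatches(primer, sequence)
        totals[date][1] += 1
    return {date: s / n for date, (s, n) in sorted(totals.items())}


if __name__ == "__main__":
    toy_records = [
        ("2020-03-01", "TTACGTTT"),  # exact primer site
        ("2020-03-01", "TTACCTTT"),  # one substitution in the site
        ("2020-03-02", "GGACGAGG"),  # one substitution in the site
    ]
    print(daily_average_mismatches("ACGT", toy_records))
```

a real pipeline would also need to handle reverse-complement primers, ambiguous bases and insertions or deletions, which this toy comparison ignores.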
figure shows the average number of mismatches seen for all primers each day within this range, normalized by the number of genomes sampled on each day. from this analysis, we can see an average of . mismatches, with a % increase in mismatches over the day time range. this corresponds to a ∼ % increase per month. to estimate the mutation rate from figure , we calculate the best-fit line using least squares, which results in an r value of . . this mutation rate is consistent with the expected rate of mutation of the sars-cov- virus. – figure shows the distribution of total, and time-averaged, mismatches for each primer set over time. the figure indicates a larger distribution of mismatches for primer sets that target nucleoprotein regions. it is important to note that the total number of mismatches is increasing and that many of these mismatches are being sustained in the evolving population. if there is a trend, genomes that occur close in time should show a smaller change in mismatches than genomes that occur further apart in time. figure shows this comparison between delta time and delta mismatches for every pair of genomes for the france pcr targeting the rdrp gene (ip ). the graphs for the other pcrs may be found in the supplemental files. each point represents a pairwise comparison, with the difference in mismatches plotted against the difference in time. we observe that the delta mismatches grow in variance as the genomes occur further apart in time. furthermore, the pearson coefficient is . between mismatches and the number of genomes sampled in a day for each pcr. this positive linear relationship between the number of genomes and the number of mismatches per day shows that the mismatches occur uniformly across the genomes sampled within a day (rather than a few genomes creating noise in the signal).
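the least-squares trend line and the pearson coefficient mentioned above can be computed with standard formulas. the following python sketch is illustrative only: the helper names and the toy numbers are assumptions, and none of the values correspond to results from the study.

```python
import math


def least_squares_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def relative_increase(slope, intercept, x_start, x_end):
    """Fractional change of the fitted line between two time points."""
    start = slope * x_start + intercept
    end = slope * x_end + intercept
    return (end - start) / start


if __name__ == "__main__":
    days = [0, 1, 2, 3, 4]                # toy collection days
    avg_mm = [1.0, 1.1, 1.2, 1.3, 1.4]    # toy daily average mismatches
    slope, intercept = least_squares_line(days, avg_mm)
    print(slope, intercept, pearson_r(days, avg_mm))
    print(relative_increase(slope, intercept, days[0], days[-1]))
```

expressing the fitted change relative to the starting value of the line is one way to turn a slope into a percentage increase over the sampled period; whether the study used this exact convention is not stated in the text.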
the data indicate that the virus demonstrated sequence variability in the targeted gene regions and that this variability causes sequence mismatches to increase over time.

geographical analysis

geographical stratification is occurring as the sars-cov- virus mutates within each geographic location. following the methods described in section , geospatial analysis is conducted to identify patterns in mismatches found in genomes sequenced within versus outside the country of primer origin. figure shows the number of mismatches, normalized by the number of genomes within each category, for each pcr, grouped by same and other countries. there are countries in which the number of mismatches in the country is lower than the number of mismatches that occur with genomes sampled outside of the country. this shows that the virus displays localized tendencies within the targeted gene regions, in addition to the spike glycoprotein region. the two outliers, the hong kong and france primers, show a higher percent of mismatches within the country than from other countries. figure shows the average number of mismatches over time, grouped by the genomes sampled within and outside the country, for one american primer. while the in-country average number of mismatches shows low variability, the out-country average number of mismatches shows an increasing diversity in these targeted regions. the full set of graphs for each pcr tested is available in the supplement.

clade analysis

figure shows the number of mismatches for each pcr per clade, normalized by the number of genomes in the pcr and clade. this shows definite trends which confirm the geographic specificity of the virus; for example, the american nucleoprotein primers have the highest number of mismatches for clade a, which nextstrain defines as originating from predominantly asian genomes, while the chinese primer has the lowest number of mismatches for this clade. however, the clades are defined by specific mutations at nucleotide locations, which overlap with the primer binding region for only . % of the genomes. therefore, the relationship between the primer mismatches and the genome clades is correlational rather than causal.

discussion

by taking a global perspective on both the sars-cov- genomes and the common rt-pcr protocols, we are able to highlight important trends within the data. we observe an increasing number of mismatches between the primer and target genome sequence as time progresses. we can also see that the number of mismatches is higher for genomes sampled outside of the country that designed the test than for genomes sampled within it. while these metrics do not quantify the performance of the test, they demonstrate a growing divergence between the targeted gene sequences and the test primers. as shown by d. bru et al. , a single mutation can result in an underestimation of the gene copy number by up to -fold. our results reveal, today, an average of . mismatches between the primer and target sequences, with a growth of % each month. understanding copy number is critical to correct interpretation of a pcr assay. if the genome being tested has sufficient mismatches, this can lead to an erroneous copy number and, therefore, a misinterpretation of the assay result.
in the case of sars-cov- , each targeted gene sequence has at least different sequence variants. with this sequence diversity of the targeted genes, the pcr primers may not amplify every variant at the same rate, which can lead to false negatives. the given primers have an average length of bases, and it has been demonstrated for primers of this length that to mismatches reduce the yield by approximately percent . our data indicate that this level of mismatches will be reached within months, or sooner if the rate of infection, and thus mutation, increases significantly.

the results of this study also demonstrate that each primer target develops a different number of mismatches over time (see figure ). from the total number of mismatches per primer target, we can see that the nucleoprotein targets from america, china, hong kong, and thailand develop the greatest number of mismatches. furthermore, when looking at the distribution of the average number of mismatches over time, the primers targeting nucleoprotein have the widest distribution. the results indicate that primers targeting the envelope small membrane protein and the rna-dependent rna polymerase are the most resistant to mismatches. this may suggest more stable targets for future primer designs.

the mutations that lead to mismatches between pcr primers and their gene targets reflect the sequence evolution of the virus. comparing the difference in collection time of two genomes with the number of mismatches by which they differ shows evidence for this evolution (figure ). genomes collected on the same day (delta time = ) have approximately zero difference, while genomes collected at delta time = days have an average of . mismatches per nucleotide. this is consistent with the observed increase in the number of mismatches over time, and shows that evolution of sars-cov- genomes is being sustained.
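the pairwise comparison described above (difference in collection time versus difference in mismatches) can be sketched as follows. this is a minimal illustration under stated assumptions, not the authors' implementation: the function name `mismatch_difference_by_delta` and the input layout (one collection date and one mismatch count per genome, for a single primer) are hypothetical, and the sketch uses raw mismatch counts rather than the per-nucleotide normalization reported in the text.

```python
from collections import defaultdict
from datetime import date
from itertools import combinations


def mismatch_difference_by_delta(genomes):
    """Average absolute mismatch difference for each delta time (in days).

    `genomes` is a list of (collection_date, mismatch_count) pairs for one
    primer. This layout is an assumption made for illustration only.
    """
    diffs = defaultdict(list)
    # Compare every unordered pair of genomes exactly once.
    for (day_a, mism_a), (day_b, mism_b) in combinations(genomes, 2):
        delta_days = abs((day_b - day_a).days)
        diffs[delta_days].append(abs(mism_b - mism_a))
    # Mean difference per delta time, ordered by increasing delta.
    return {delta: sum(v) / len(v) for delta, v in sorted(diffs.items())}


# Illustrative (made-up) data: two genomes from the same day, one ten days later.
example = [
    (date(2020, 3, 1), 0),
    (date(2020, 3, 1), 0),
    (date(2020, 3, 11), 2),
]
print(mismatch_difference_by_delta(example))
```

an upward trend in the returned averages as delta time grows would correspond to the increasing slope described for figure ; dividing each difference by the primer length would give a per-nucleotide value.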
the continual branching of the genetic tree due to mutation is further supported by the analysis of the number of mutations within and outside the country that designed the particular primer. figure shows that most countries' primers perform better when tested against genomes sequenced within the country than against globally sequenced genomes. in two cases, hong kong and france, the primers have a smaller percentage of mismatches with genomes outside the country. for france, the ip primer target, a region of the rdrp gene, creates a disproportionate number of mismatches when compared to genomes sequenced within france. this suggests that this region of the genome has deviated more from the original reference used to generate the primer set. hong kong has the smallest number of genomes sequenced within the country in this dataset, so it is possible that the larger percentage of mismatches for genomes within versus outside the country is an artifact of bias in the data.

nextstrain categorizes the various genetic phylogenies by clade, which is designed to denote long-term genetic changes based on mutation. each clade definition requires significant geographic spread and frequency. this study shows that less than . % of the regions of the genome that define the clades overlap with the regions that the primers target. this indicates that variations in the primer target sequences have not yet reached large enough statistical significance to define a new clade in the nextstrain phylogeny, although the variants that are present in the primer region may cause a decrease in amplification signal within the assay.

with the emergence of specific mutations that are spreading at faster rates, this analysis becomes more important in evaluating the possible need for primer re-design. the emerging b. . . strain contains mutations in the regions encoding the envelope small membrane protein and the nucleoprotein, both targeted by the current primers.
with the number of cases of sars-cov- globally, it is highly probable that the genome will mutate in the primer target regions.

methods

data description

gisaid has emerged as a leading source of sars-cov- genomes, containing the largest number of genome sequences from around the world, with metadata about the location and time of collection . sars-cov- genomes from the gisaid repository were curated, collecting high quality genomes within the date range aug , – july , . while this date range precedes the start of the current outbreak, the genome sequences from the earlier points in time serve as a control for comparison. we define high quality genomes as those with less than % n within the sequence and less than . % unique non-synonymous mutations. by taking these measures, we reduce the noise generated by random mutations or sequencing errors found within the genome. this resulted in a set of , sars-cov- genomes, for which we evaluated primer homology.

the who has published primers from seven countries: china, france, usa, japan, germany, hong kong, and thailand . each published protocol is an rt-pcr assay method, and for each primer set, a forward, reverse and probe sequence is provided . for this study, we use the sequences as provided, with no modifications.

pcr primer comparison

using the primer sequences and sars-cov- genomes described above, we perform a sequence comparison. specifically, we used blastn with parameters similar to primer-blast . this procedure was verified to account for full alignments of the forward, reverse, and probe sequences of primers .
the blast results are then parsed, ensuring that the forward, reverse, and probe sequences match a given genome and that the probe sequence is matched spatially in the forward and reverse directions on the genome. the number of mismatches is then aggregated for each pcr sequence and genome. this metric does not necessarily predict whether the pcr test would generate a positive or negative outcome for the particular genome, but rather measures variability within the targeted gene region. since all genomes included in this corpus are associated with sars-cov- , it can be assumed that they were collected by a positive assay. mutations in the targeted gene region, over time, can affect the sensitivity of the primers.

time analysis methods

for each regional test, the primers each target a particular section of the genome derived from various reference genomes. however, as replication and mutation of the virus occur, these targeted regions of circulating virus genomes accumulate sequence differences from the reference. thus, the efficacy of the primer may decrease over time. as more mutations accumulate, it is important to measure the rate of mismatch growth between primer sequence and targeted section as a function of time. from this rate it is possible to anticipate when target sequences used in a regional test should be updated. to estimate the mutation rate of the targeted genes over time, we group the genomes by their date of sampling and aggregate the number of mismatches for each day. in order to reduce noise from days with few genomes collected, for any time-based analysis we consider only those days that have over unique genomes sequenced. with this restriction, data are available for the time range jan , – july , , for a total of days. this process removes outlier data that was sequenced prior to the start of the pandemic, including sequences that were collected from non-human hosts.
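the daily grouping and filtering described above can be sketched as follows. this is a hedged illustration rather than the authors' pipeline: the record layout, the function names `daily_mean_mismatches` and `linear_slope`, and the use of an ordinary least-squares line to summarize mismatch growth are all assumptions. the minimum number of genomes per day is left as a required parameter because its value is not recoverable from the text.

```python
from collections import defaultdict
from datetime import date


def daily_mean_mismatches(records, min_genomes_per_day):
    """Mean mismatches per collection day, keeping only well-sampled days.

    `records` is an iterable of (genome_id, collection_date, mismatch_count)
    tuples (a hypothetical layout). A day is kept only if it has more than
    `min_genomes_per_day` unique genomes.
    """
    by_day = defaultdict(dict)
    for genome_id, day, mismatches in records:
        by_day[day][genome_id] = mismatches  # one entry per unique genome
    return {
        day: sum(genomes.values()) / len(genomes)
        for day, genomes in sorted(by_day.items())
        if len(genomes) > min_genomes_per_day
    }


def linear_slope(daily_means):
    """Least-squares slope of mean mismatches versus days since the first day."""
    days = sorted(daily_means)
    if len(days) < 2:
        raise ValueError("need at least two days to estimate a slope")
    xs = [(d - days[0]).days for d in days]
    ys = [daily_means[d] for d in days]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Illustrative (made-up) records; the threshold of 1 is for the example only.
records = [
    ("g1", date(2020, 1, 1), 0),
    ("g2", date(2020, 1, 1), 2),
    ("g3", date(2020, 1, 2), 5),  # only one genome on this day, so it is dropped
    ("g4", date(2020, 1, 3), 2),
    ("g5", date(2020, 1, 3), 4),
]
means = daily_mean_mismatches(records, min_genomes_per_day=1)
print(means, linear_slope(means))
```

the slope is in mismatches per day; a sustained positive value is the kind of signal the text suggests could be used to anticipate when a regional test's target sequences should be updated.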
geographical analysis methods

as the virus has spread throughout the world, we see particular mutations that are specific to outbreaks by geospatial location. as studies using bayesian coalescent analysis have shown, high evolutionary rates and fast population growth of the sars-cov- virus result in increasing diversification of the virus by geographic location . to understand how the pcr tests respond differently for genomes collected by country, we first extract the country of sampling for each genome from the fasta header provided by gisaid and then group the number of mismatches found in the genome by in country versus out of country.

clade analysis methods

sars-cov- genomes have been categorized into clades to define groups of mutations. for this analysis, we use the clades as indicated by nextstrain, which are defined by frequency and geographic spread. their script to categorize genomes within the specific clade definitions was used to classify each genome within the dataset . furthermore, nextstrain publishes the genome locus that defines each clade, and these loci were compared to the genome locations that the primers bind to. by grouping the number of mismatches for each pcr by the genomes’ clade, we see how different genetic variations affect the pcr test performance.

references

. hill v., r. a. phylodynamic analysis of sars-cov- | update - - . virological.org ( ). https://virological.org/t/phylodynamic-analysis-of-sars-cov- -update- - - / . . gytis, e. a., dudas. mers-cov spillover at the camel-human interface. elife ( ). . cotten, e. a., matthew.
spread, circulation, and evolution of the middle east respiratory syndrome coronavirus. mbio ( ). . baric, e. a., ralph s. episodic evolution mediates interspecies transfer of a murine coronavirus. j. virology , – ( ). . organization, w. h. who in-house assays ( ). . burger h, e. a. sequence of the nucleoprotein gene of influenza a/parrot/ulster/ . virusres , – , doi: https: //doi.org/ . / - ( ) - ( ). . y gao, e. a. structure of the rna-dependent rna polymerase from covid- virus. science – ( ). . elfiky, a. ribavirin, remdesivir, sofosbuvir, galidesivir, and tenofovir against sars-cov- rna dependent rna polymerase (rdrp): a molecular docking study. life sci. ( ). . schoeman d, e. a. coronavirus envelope protein: current knowledge. virol j ( ). . surya w, e. a. mers coronavirus envelope protein has a single transmembrane domain that forms pentameric ion channels. virus res. ( ). . nieto-torres jl, e. a. severe acute respiratory syndrome coronavirus e protein transports calcium ions and activates the nlrp inflammasome. virology ( ). . d. bru, l. p., f. martin-laurent. quantification of the detrimental effect of a single primer-template mismatch by real-time pcr using the s rrna gene as an example. appl. environ. microbiol. doi: https://doi.org/ . /aem. - ( ). . cindy christopherson, s. k., john sninsky. phylodynamic analysis of sars-cov- genomes- - jan- . nucleic acids res. ( ). . shu, y. & mccauley, j. gisaid: global initiative on sharing all influenza data–from vision to reality. eurosurveillance , ( ). . seabolt, e., nayar, g. et al. ibm functional genomics platform, a cloud-based platform for studying microbial life at scale. ieee/acm transactions on comput. biol. bioinforma. doi: . /tcbb. . ( ). . camacho, c., coulouris, v., g.and avagyan et al. blast+: architecture and applications. bmc bioinforma. doi: . / - - - ( ). . ye, j., coulouris, g., zaretskaya, i. et al. primer-blast: a tool to design target-specific primers for polymerase chain reaction. 
bmc bioinforma. , doi: https://doi.org/ . / - - - ( ). . castells, e. a., m. evidence of increasing diversification of emerging sars-cov- strains. j med virol doi: https://doi.org/ . /jmv. ( ). . hadfield, j. et al. nextstrain: real-time tracking of pathogen evolution. bioinformatics doi: https://doi.org/ . /bioinformatics/bty ( ).

acknowledgements

the authors would like to acknowledge the gisaid initiative and ncbi for the provision of data.

author contributions statement

g.n. conceived the experiment and analysis, m.k. verified the results, e.s. was the architect of the platform used, a.a and k.l.b. performed genome quality analysis, j.h.k and v.m. provided scientific guidance and domain specific knowledge.

additional information

competing interests: the corresponding author is responsible for submitting a competing interests statement on behalf of all authors of the paper. this statement must be included in the submitted article file.

table . targeted genes, by name, for the primers from the countries in the study.

country      target
usa          nucleoprotein
china        orf ab, nucleoprotein
germany      rna-directed rna polymerase, envelope small membrane protein
hong kong    nucleoprotein
thailand     nucleoprotein
france       rna-directed rna polymerase (ip , ip ), envelope small membrane protein
japan        nucleoprotein

table . percent of genomes that are hit by the described pcr test, identified by the country and target gene. * indicates that the primer is designed to separate out any errant samples within the assay.

pcr                    percent of hit genomes
america|rp             *
china|orf ab           .
japan|niid -ncov n     .
america| -ncov n       .
hongkong|hku-n         .
thailand|wh-nic-n      .
china|n                .
germany|e sarbeco      .
france|e sarbeco       .
france|ncov ip         .
america| -ncov n       .
france|ncov ip         .
america| -ncov n       .

figure . total number of mismatches each pcr test creates when tested against the full corpus of sars-cov- genomes. each pcr test is identified by the country of use and the targeted gene name.

figure . average number of mismatches for all genomes and all pcr primers, separated by the day on which the genome is collected. the dates shown are aggregated over every day period.

figure . distribution of mismatches for each primer. a shows the total number of mismatches aggregated for each day within the time range. b shows the number of mismatches for each day averaged by the number of genomes that occur on a day within the time range.

figure . change in number of mismatches between two occurrences over delta time between the two occurrences for the ip primer developed in france. the increasing slope shows that mutations are being sustained as we compare genomes that occur further apart in time. graphs for all primers are included in the supplement.

figure . number of mismatches for each pcr test tested on all sars-cov- genomes, split between genomes collected within the same country as the test and outside the country. for japan, % of genomes, both in and out of the country, have mismatch and are therefore not shown in the figure. for out of the pcr tests, there is a higher number of mismatches for genomes that occur outside the country than for genomes that occur inside the country.

figure . number of mismatches in and out of country for an american nucleoprotein primer, separated by time of genome collection. all other primers are included in the supplement.

figure . average number of mutations for each pcr test that occur within each clade, as defined by nextstrain.
comprehensive comparison of transcriptomes in sars-cov- infection: alternative entry routes and innate immune responses

yingying cao ∗, xintian xu , simo kitanovski , lina song , jun wang , pei hao , ∗, daniel hoffmann ∗

bioinformatics and computational biophysics, faculty of biology and center for medical biotechnology, university of duisburg-essen, essen , germany
key laboratory of molecular virology and immunology, institut pasteur of shanghai, center for biosafety mega-science, chinese academy of sciences, shanghai , china
translational skin cancer research, german consortium for translational cancer research, essen, germany
the joint program in infection and immunity: a. guangzhou women and children’s medical center, guangzhou medical university, guangzhou , china; b. institut pasteur of shanghai, chinese academy of sciences, shanghai , china

∗to whom correspondence should be addressed; e-mail: daniel.hoffmann@uni-due.de, phao@ips.ac.cn, yingying.cao@uni-due.de.

biorxiv preprint, this version posted january , . doi: https://doi.org/ . / . . . the copyright holder for this preprint (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission.

the pathogenesis of covid- emerges as complex, with multiple factors leading to injury of different organs. several studies on underlying cellular processes have produced contradictory claims, e.g. on sars-cov- cell entry or innate immune responses. however, clarity in these matters is imperative for therapy development.
we therefore performed a meta-study with a diverse set of transcriptomes under infections with sars-cov- , sars-cov and mers-cov, including data from different cells and covid- patients. using these data, we investigated viral entry routes and innate immune responses. first, our analyses support the existence of cell entry mechanisms for sars and sars-cov- other than the ace route, as shown by inefficient infection of cells that do not express ace ; moreover, expression of tmprss /tmprss is not necessary for efficient sars-cov- infection, as shown by efficient infection of a cells transduced with a vector expressing human ace . second, we find that innate immune responses in terms of interferons and interferon-stimulated genes are strong in relevant cells, for example calu cells, but vary markedly with cell type, virus dose, and virus type.

introduction

coronaviruses are non-segmented positive-sense rna viruses with a genome of around kilobases. the genome has a ’ cap structure along with a ’ poly(a) tail, which allows it to act as mrna for translation of the replicase polyproteins. the replicase gene occupies approximately two thirds of the entire genome and encodes non-structural proteins (nsps). the remaining third of the genome contains open reading frames (orfs) that encode accessory proteins and four structural proteins: spike (s), envelope (e), membrane (m), and nucleocapsid (n) ( ).

over the past years, three epidemics or pandemics of life-threatening diseases have been caused by three closely related coronaviruses: severe acute respiratory syndrome coronavirus (sars-cov), which emerged with nearly % mortality ( , ) in - and spread to countries before being contained; middle east respiratory syndrome coronavirus (mers-cov), with mortality around % ( , ), starting in and since then spreading to countries; and sars-cov- , emerging in late ( ), which has caused many millions of confirmed cases and > million deaths worldwide ( ). infection with sars-cov, mers-cov or sars-cov- can cause a severe acute respiratory illness with similar symptoms, including fever, cough, and shortness of breath.

sars-cov- is a new coronavirus, but its similarity to sars-cov (amino acid sequences about % identical ( )) and mers-cov suggests comparisons to these earlier epidemics. despite the difference in the total number of cases caused by sars-cov and sars-cov- ( , ) due to different transmission rates, the outbreak caused by sars-cov- resembles the outbreak of sars: both emerged in winter and were linked to exposure to wild animals sold at markets. although mers-cov has high morbidity and mortality rates, the lack of autopsies from mers-cov cases has hindered our understanding of mers-cov pathogenesis in humans.

to date, no specific anti-sars-cov- , anti-sars-cov or anti-mers-cov therapeutics have been approved for human use. there are several points of attack for potential anti-sars-cov- /sars-cov/mers-cov therapies, e.g. intervention on cell entry mechanisms to prevent virus invasion, or acting on the host immune system to kill the infected cells and thus prevent replication of the invading viruses. a better understanding of virus entry mechanisms and the immune responses can therefore guide the development of novel therapeutics.

virus entry into host cells is the first step of the viral life cycle. it is an essential component of cross-species transmission and an important determinant of virus pathogenesis and infectivity ( , ), and also constitutes an antiviral target for treatment and prevention ( ). it seems that sars-cov and sars-cov- use similar virus entry mechanisms ( ).
the infection of target cells by sars-cov or sars-cov- was initially identified to occur by cell-surface membrane fusion ( , ). some later studies have shown that sars-cov can infect cells through receptor mediated endocytosis ( , ) as well. both mechanisms require the s protein of sars-cov or sars-cov- to bind to angiotensin converting enzyme (ace ), and the s protein of mers-cov to bind to dipeptidyl peptidase (dpp ) ( ), through their respective receptor-binding domains (rbd) ( ). in addition to ace and dpp , some recent studies suggest that there are possibly other coronavirus-associated receptors and factors that facilitate the infection of sars-cov- ( ), including the cell surface proteins basigin (bsg or cd ) ( ), and cd ( ). recently, clinical data have revealed that sars-cov- can infect several organs where ace expression could not be detected in healthy individuals ( , ), which highlights the need for closer inspection of virus entry mechanisms.

the binding of the s protein to a cell-surface receptor is not sufficient for infection of the host cell ( ). in the cell-surface membrane fusion mechanism, after binding to the receptor, the s protein requires proteolytic activation by cell surface proteases like tmprss , tmprss , or other members of the tmprss family ( , , ), followed by the fusion of virus and target cell membranes. in the alternative receptor mediated endocytosis mechanism, the endocytosed virion is subjected to an activation step in the endosome, resulting in the fusion of virus and endosome membranes and the release of the viral genome into the cytoplasm.
the endosomal cysteine proteases cathepsin b (ctsb) and cathepsin l (ctsl) ( ) might be involved in the fusion of virus and endosome membranes. availability of these proteases in target cells largely determines whether viruses infect the cells through cell-surface membrane fusion or receptor mediated endocytosis. how the presence of these proteases impacts the efficiency of infection with sars-cov- , sars-cov and mers-cov still remains elusive.

when the virus enters a cell, it may trigger an innate immune response, a crucial component of the defense against viral invasion. compounds that regulate innate immune responses can be introduced as antiviral agents ( ). the innate immune response is initiated when pattern recognition receptors (prrs) such as toll-like receptors (tlrs) and cytoplasmic retinoic acid-inducible gene i (rig-i) like receptors (rlrs) recognize molecular structures of the invading virus ( , ). this pattern recognition activates several signaling pathways and then downstream transcription factors such as interferon regulatory factors (irfs) and nuclear factor κb (nf-κb). transcriptional activation of irfs and nf-κb stimulates the expression of type i (α or β) and type iii (λ) interferons (ifns). ifn-α (ifna , ifna , etc), ifn-β (ifnb ) and ifn-λ (ifnl - ) are important cytokines of the innate immune responses. ifns bind and induce signaling through their corresponding receptors (ifnar for ifn-α/β and ifnlr for ifn-λ), and subsequently induce expression of ifn-stimulated genes (isgs) (e.g. mx , isg and oasl) and pro-inflammatory chemokines (e.g. cxcl and ccl ) to suppress viral replication and dissemination ( , ).
dysregulated inflammatory host response results in acute respiratory distress syndrome (ards), a leading cause of covid- mortality ( ). one attractive therapy option to combat covid- is to harness the ifn-mediated innate immune responses. clinical trials with type i and type iii ifns for treatment of covid- have been conducted and many more are still ongoing ( , ). in this regard, the kinetics of the secretion of ifns in the course of sars-cov- infection needs to be defined. unfortunately, some results on the host innate immune responses to sars-cov- are apparently at odds with each other ( – ), e.g. it is unclear whether sars-cov- infection induces low ifns and moderate isgs ( ), or robust ifn responses and markedly elevated expression of isgs ( – ). this has to be clarified. the use of ifns as a treatment in covid- is now a subject of debate as well ( ). thus, the kinetics of ifn secretion relative to the kinetics of virus replication need to be thoroughly examined to better understand the biology of ifns in the course of sars-cov- infection and thus provide guidance to identify the temporal window of therapeutic opportunity. we have collected and analyzed a diverse set of publicly available transcriptome data ( , – ): ( ) bulk rna-seq data with different types of cells, including human non-small cell lung carcinoma cell line (h ), human lung fibroblast-derived cells (mrc ), human alveo- lar basal epithelial carcinoma cell line (a ), a cells transduced with a vector expressing (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . 
human ace (a -ace ), primary normal human bronchial epithelial cells (nhbe), hetero- geneous human epithelial colorectal adenocarcinoma cells (caco ), and african green monkey (chlorocebus sabaeus) kidney epithelial cells (vero e ) infected with sars-cov- , sars- cov and mers-cov (table ); ( ) rna-seq data of lung samples, peripheral blood mononu- clear cell (pbmc) samples, and bronchoalveolar lavage fluid (balf) samples of covid- patients and their corresponding healthy controls (table and table ). using this collection, we systemically evaluated the replication and transcription status of virus in these cells, ex- pression levels of coronavirus-associated receptors and factors, as well as the innate immune responses of these cells during virus infection. results different infection efficiency of sars-cov- , sars-cov and mers-cov in different cell types the rna-seq data for all samples can be aligned to the genome of the corresponding virus to evaluate the infection efficiency in cells, estimated by the mapping rate to the virus genome, i.e. the percentages of viral rnas in intracellular rnas. to assess the infection efficiency of sars-cov- , sars-cov, and mers-cov in different types of cells, we collected and analyzed a comprehensive public datasets of rna-seq data of cells infected with these viruses at hours post infection (hpi) with comparable multiplicity of cellular infection (moi) (table ). moi refers to the number of viruses that are added per cell in infection experiments. for example, if viruses are added to cells, the moi is . our analysis shows that the infection efficiency of viruses can be both cell type dependent and virus dose dependent (fig. ). mers-cov can efficiently infect mrc and vero e cells. however, the infection efficiency is influenced strongly by moi in the same type of cells. cells infected with low moi, say . , have significantly lower mapping rates than those with high (which was not certified by peer review) is the author/funder. 
moi, say (fig. ). for sars-cov and sars-cov- , the infection efficiency is influenced strongly by cell type. for sars-cov- , there is efficient virus infection in a -ace , calu , caco , and vero e cells, but not in a , h , or nhbe cells (fig. and table s ). the mapping rates in a , h , and nhbe cells are low even at high mois (fig. and table s ). similar to sars-cov- , the infection by sars-cov is also cell type dependent: vero e cells and calu cells show high mapping rates to the sars-cov genome, but the mapping rates of sars-cov in mrc and h cells are close to zero even at the high moi of (fig. and table s ). since “total rna” (see methods/data collection) includes additional negative-strand templates of virus, the mapping rates are usually much higher than those obtained with the polya+ selection method under the same conditions (fig. and table s ).

evidence for multiple entry mechanisms for sars-cov- and sars-cov

to examine the detailed replication and transcription status of these viruses in the cells, we calculated the number of reads (depth) mapped to each site of the corresponding virus genome (fig. ). for better comparison, these read numbers were log transformed. the replication and transcription of mers-cov, sars-cov- and sars-cov share an uneven pattern of expression along the genome, typically with a minimum depth in the first half of the viral genome, and the maximum towards the end. the parts with very high levels include, in particular, the coding regions for structural proteins (s, e, m, and n proteins), as well as the first coding regions with nsp and nsp . interestingly, there is an exception for balf samples in covid- patients, which show a more irregular, fluctuating behavior along the genome (fig. b).
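The per-site depth profile described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: it assumes the three-column, tab-separated text output of `samtools depth` (reference name, 1-based position, depth), and because the base of the logarithm is not recoverable from the text, it assumes log10 with a pseudocount of 1. The function names `parse_depth` and `log_depth` and the toy input are hypothetical.

```python
import math


def parse_depth(lines):
    """Parse `samtools depth` text output into {position: depth}.

    Each line is expected to hold: reference name, 1-based position, depth.
    """
    depths = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _ref, pos, depth = line.split("\t")[:3]
        depths[int(pos)] = int(depth)
    return depths


def log_depth(depths, genome_length, pseudocount=1):
    """Return log10(depth + pseudocount) for every site of the genome.

    By default `samtools depth` omits sites with zero coverage, so missing
    positions are treated as depth 0 here.
    ASSUMPTION: log base 10 and pseudocount 1 are illustrative choices.
    """
    return [
        math.log10(depths.get(pos, 0) + pseudocount)
        for pos in range(1, genome_length + 1)
    ]


# Toy input: a 4-site "genome" with one uncovered site (position 3).
example = ["virus\t1\t9", "virus\t2\t99", "virus\t4\t999"]
profile = log_depth(parse_depth(example), genome_length=4)
print(profile)
```

Plotting `profile` against position gives a coverage curve of the kind discussed in the text; the pseudocount keeps uncovered sites finite on the log scale.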
the deviation from the cellular expression pattern is not surprising because balf is not a well-organized tissue but a mixture of many components, some of which will probably digest viral rna. interestingly, the mentioned uneven transcription pattern of efficient infections with sars-cov- , sars-cov, and mers-cov is also visible for inefficient infection with sars-cov- in a , nhbe, and h cells, and sars-cov in h and mrc cells (fig. c, d), although there the total mapping rates to their corresponding virus genomes are much lower (fig. ).

to further elucidate the corresponding entry mechanisms for different types of cells, we examined the expression levels of those receptors and proteases that have already been described as facilitating target cell infection (fig. ). our analysis shows that mers-cov can efficiently infect mrc and vero e cells (fig. and fig. e) that both express dpp (fig. a), though compared to vero e cells, mrc cells infected with mers-cov have higher expression levels of dpp (fig. a), but lower mapping rates to the virus genome (fig. ). these observations show that higher expression levels of the receptor (dpp ) do not guarantee higher mers-cov infection efficiency in cells. this is also true for the sars-cov- receptor ace , which is expressed three orders of magnitude higher in a -ace cells than in vero e cells (fig. b), while both cells produce about the same amount of virus (fig. ). although sars-cov- can efficiently infect a -ace cells (fig. and fig. ), there is no expression of tmprss or tmprss (fig. c, d), needed for the canonical cell-surface membrane fusion mechanism (fig. j). however, there are considerable expression levels of ctsb and ctsl (fig.
e, f), which are involved in endocytosis (fig. j). in a , h , and mrc cells, which do express small amounts of sars-cov- and sars-cov virus (fig. , fig. c, d), there is no ace expression at all (fig. b). this could point to an alternative ace -independent entry mechanism for sars-cov- and sars-cov (fig. j). since there were already reports about alternative sars-cov- receptors such as bsg/cd and cd ( , ), we examined their expression in these cells as well (fig. g, h). for all cells, the expression of bsg is at the same level of - (fig. g), and the expression of cd is very low. certainly, cd and bsg alone cannot explain the differences in virus expression (fig. ), nor can we exclude other low efficiency entry mechanisms. it could be, for example, that relatively inefficient alternative entry paths are often present but in some cells masked by more efficient entry via ace /tmprss.

to gain a comprehensive overview we clustered cells with respect to gene expression levels of coronavirus-associated receptors and factors (fig. i), and summarized conceivable mechanisms accordingly (fig. j). since all cells show high expression levels of ctsb and ctsl, the major differences between these cells lie in the expression levels of ace , tmprss and tmprss . cell-surface membrane fusion (fig. j, a) might be mainly used in sars-cov- infection of calu , caco , and nhbe cells where there is low to moderate expression of ace and moderate expression of tmprss and tmprss . endocytosis (fig. j, b) might be mainly used in sars-cov- infection of a -ace cells where ace is expressed at high levels but there is no expression of tmprss or tmprss . an alternative ace -independent way (fig.
j, c) in absence of ace , tmprss , or tmprss could be mainly employed in sars-cov- infection of mrc , a , and h cells. note that although the expression pattern of coronavirus-associated receptors and factors of nhbe cells is similar to that in caco cells, nhbe cells are not infected efficiently by sars-cov- . vero e cells have moderate expression of ace , and low expression of tmprss and tmprss , so all these entry mechanisms mentioned above could contribute to sars-cov- infection of vero e cells.

strength of ifn/isg response varies between cell lines and viruses, with strong response to sars-cov- in relevant cells

as a virus enters a cell, it may trigger an innate immune response, i.e. the cell may start expression of various types of innate immunity molecules at different strengths. there is currently an intense debate about which of these molecules, especially ifns and isgs, are expressed how strongly ( – ). we therefore focused in our analysis on innate immunity molecules such as ifns, isgs, and pro-inflammatory cytokines. to broaden the basis for conclusions, we analyzed, apart from cell lines, bulk rna-seq data of lung, pbmc, and balf samples of covid- patients, and single-cell rna-seq data of balf samples from moderate and severe covid- patients; for each type of patient data, we also included healthy controls. gene expressions were compared quantitatively in terms of tpm (transcripts per million), as well as log fold changes (logfc) with respect to healthy controls (human samples) or mock-infected cultures (cell lines) (fig. s , fig. s ). the heatmap and clustering dendrogram of the logfc of ifns, isgs and pro-inflammatory cytokines in fig.
a reveal broadly two groups of samples with fundamentally different expression of isgs, ifns, and pro-inflammatory cytokines.

the top cluster in fig. a contains samples that show weaker innate immune response, including the two pbmc samples of covid- patients, a , nhbe, caco , and h cells infected with sars-cov- and a -ace cells infected with sars-cov- at lower moi ( . ), mrc cells infected with sars, mrc and vero e cells infected with mers. the bottom cluster in fig. a contains samples that show stronger innate immune response, including balf and lung samples of covid- patients, calu cells infected with sars-cov- , a -ace cells infected with sars-cov- at higher moi ( ), as well as vero e cells infected with sars-cov- and sars. most of the samples in the bottom part show markedly elevated levels of isgs and elevated pro-inflammatory cytokines. an exception in the bottom cluster are four samples, namely lung. / and balf. / , with a mixture of up- and down-regulation of isgs and pro-inflammatory cytokines. in this respect, these four samples from patients with unknown covid- severity differ from the balf samples from moderate and severe covid- patients.

the expression levels of ifns are not upregulated either in most of these lung, pbmc and balf samples of covid- patients, for which no information about the severity of infection is available. however, we estimated the severity of their infection by aligning all the samples to the sars-cov- virus genome. there are no ( . %) reads mapping to the sars-cov- genome in the pbmc samples. for the two balf samples, there are low mapping rates ( . % and . %) to the sars-cov- genome.
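The mapping rate used here as a proxy for infection efficiency is, by the definition given in the results, the percentage of sequenced reads that align to the virus genome. A minimal sketch of that arithmetic follows; the function name `mapping_rate`, the sample names and the read counts are hypothetical and do not come from the study.

```python
def mapping_rate(mapped_reads, total_reads):
    """Percentage of reads in a sample that align to the virus genome."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= mapped_reads <= total_reads:
        raise ValueError("mapped_reads must lie between 0 and total_reads")
    return 100.0 * mapped_reads / total_reads


# HYPOTHETICAL counts as (virus-mapped reads, total reads), for illustration only.
samples = {"sample_a": (5, 1000), "sample_b": (250, 1000)}
rates = {name: mapping_rate(m, t) for name, (m, t) in samples.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}% of reads map to the virus genome")
```

In practice the two counts would be taken from the aligner's summary for each sample, so a sample with a rate near zero is read as weakly infected or uninfected.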
the expression levels of ace in these tissues (pbmc, lung and balf samples) of healthy individuals are around zero (fig. s ), which explains why there are almost no virus reads in these tissues. one of the two lung samples (accession number: samn ) has slightly upregulated ifnl (fig. s ), which had been ignored in the original publication ( ), although the total mapping rates to the virus genome are both . % for these two lung samples. we then checked the detailed coverage along the virus genome. there were a small number of virus reads aligned to the sars-cov- genome in this sample (fig. s ). different from other lung samples that did not express ace , this lung sample expressed ace at a considerable level ( . tpm, table s ). this result implies that when sars-cov- enters the lung successfully, or when the lung tissue chosen for sequencing is successfully infected by sars-cov- , ifns (at least ifnl ) can be upregulated.

calu cells infected with sars-cov and sars-cov- , and a -ace cells infected with sars-cov- at a high moi of have upregulated ifnb , ifnl , ifnl and ifnl (fig. b-e). a , h , nhbe (fig. b-e), and mrc cells (fig. s ), which do not support efficient virus infection, show no upregulation of ifns. low levels of ifn expression are also observed in caco cells, which are efficiently infected with sars-cov and sars-cov- . the same is true for a -ace cells infected with sars-cov- at low moi of . . in vero e cells ifnl is upregulated as well when infected with sars-cov and sars-cov- , but not with mers-cov (fig. f). in balf samples of moderate and severe covid- patients, upregulation of ifns was not as obvious as in calu cells, but is still present in some patients.
these observations demonstrate that the innate immune response depends in complex ways on cell line, viral dose, and virus.

several studies ( – ) reported robust ifn responses and markedly elevated expression of isgs in sars-cov- infection of different cells and patient samples. conversely, the study by ( ) concluded that weak ifn response and moderate isg expression are characteristic for sars-cov- infection. this apparent contradiction can be resolved if we consider that ref. ( ) generalized from patient samples and cells that were only weakly infected, and that in such cases the host, in fact, responds with low levels of ifns and isgs. on the other hand, ref. ( ) treated efficiently infected cells, such as calu and a -ace (at moi of ) as exceptions. however, our meta-analysis shows that these are not exceptions but typical for severely infected target cells that have robust ifn responses and isg expressions (cluster in fig. a).

discussion

one attractive potential anti-sars-cov- therapy is intervention in the cell entry mechanisms ( ). however, the entry mechanisms of sars-cov- into human cells are partly unknown. during the last few months scientists have confirmed that sars-cov- and sars-cov both use human ace as entry receptor, and human proteases like tmprss and tmprss ( , , ), and lysosomal proteases like ctsb and ctsl ( ) as entry activators. since ace is beneficial in cardiovascular diseases such as hypertension or heart failure ( ), treatments targeting ace could have a negative effect. inhibitors of ctsl ( ) or tmprss ( ) are seen as potential treatment options for sars-cov and sars-cov- . however, recently alternate coronavirus-associated receptors and factors including bsg/cd ( ) and cd ( ) have been proposed to facilitate virus invasion. additionally, clinical data of sars-cov- infection
have shown that sars-cov- can infect several organs where ace expression could not be detected ( , ), urging us to explore other potential entry routes.

first, our analyses here have shown that even without expression of tmprss or tmprss , high sars-cov- infection efficiency in cells is possible (fig. a, c) with considerable expression levels of ctsb and ctsl (fig. e, f). this suggests receptor mediated endocytosis ( , , ) as an alternative major entry mechanism. given this tmprss-independent route, tmprss inhibitors will likely not provide complete protection. the studies designed to predict the tropism of sars-cov- by profiling the expression levels of ace and tmprss across healthy tissues ( , ) may need to be reconsidered as well.

second, the evidence presented in our study suggests a further, possibly undiscovered entry mechanism for sars-cov- and sars-cov (fig. ). although bsg/cd has been recently proposed as an alternate receptor ( ), later experiments reported there was no evidence supporting the role of bsg/cd as a putative spike-binding receptor ( ). the expression patterns of bsg/cd in different types of cells observed in our study could not explain the difference in virus loads observed in these cells either. cd and cd l were recently reported as attachment factors to contribute to sars-cov- infection in human cells as well ( ). however, cd expression in the cell lines included here is low. another reasonable hypothesis could be that the inefficient ace -independent entry mechanism we observed could be macropinocytosis, one endocytic pathway that does not require receptors ( ). until now there is still no direct evidence for macropinocytosis involvement in the sars-cov- and sars-cov entry mechanism. to confirm such an involvement, specific experiments are needed.
moreover, this ace -independent entry mechanism only enables inefficient infection by sars-cov and sars-cov- (fig. ) and therefore cannot be a major entry mechanism.

fig. j summarizes the outcomes of our study with respect to entry mechanisms. the observations with the broad range of transcriptome data can only be explained if there are several entry routes. this is certainly a challenge to be reckoned with in the development of antiviral therapeutics ( ).

another attractive potential anti-sars-cov- point of attack is supporting the human innate immune system to kill the infected cells and thus disrupt viral replication. not surprisingly, research in this area is flourishing but sometimes generates conflicting results, especially on the involvement of type i and iii ifns and isgs ( – ). the results of our analyses could help to dissolve the confusion on the involvement of ifns and isgs. we found that immune responses in calu cells infected with sars-cov and sars-cov- resemble those of balf samples of moderate and severe covid- patients, with elevated levels of type i and iii ifns, robust isg induction as well as markedly elevated pro-inflammatory cytokines, in agreement with recent studies ( – ). however, this picture differs from the one reported by ( ) with low levels of ifns and moderate isgs. this latter study was partially based on a cells and nhbe cells with nearly no ace expression and very low mapping rate to the viral genome, and lung samples of two patients (both show . % mapping rate to virus genome). hence, given that there was no efficient virus infection in these cells, the low levels of ifns and isgs were to be expected.
however, in one of the lung samples sequenced by ( ) (accession number: samn ), we observed a slight upregulation of ifnl (fig. s ), which was ignored in the original publication, together with considerable ace expression (table s ) ( . tpm), and a few virus reads aligned to the sars-cov- genome (fig. s ). this result suggests that levels of ifns and isgs are associated with viral load and severity of virus infection.

we found low induction of ifns and moderate expression of isgs in pbmc samples and balf samples of covid- patients (fig. , fig. s ). in these pbmc samples, there are no ( . %) virus reads mapping to the sars-cov- genome. the failure to detect virus reads in these three pbmc samples can be explained by the absence of efficient entry routes (e.g. no expression of ace in pbmc samples of healthy individuals, fig. s ), or with the cell types being otherwise incompatible with viral replication. this observation is consistent with the studies on sars-cov ( – ) with abortive infections of macrophages, monocytes, and dendritic cells; moreover, replication of sars-cov in pbmc samples is also self-limiting. however, due to the limited number of pbmc, balf and lung samples included in this study, and the lack of information on infection stage and infection severity of these covid- patients, the assessment of ifns and isgs as well as the infection of sars-cov- in these samples may not be representative of host response against sars-cov- . future studies that also include other affected organs of more patients with different infection stages and severity are necessary for a better understanding of the immune responses.

several unexpected observations need further investigation.
first, a -ace and caco cells are efficiently infected with low moi of . and . , respectively (fig. ), but fail to upregulate ifn expression (fig. b-e). their cellular immune responses are more similar to those of cells that cannot support efficient virus infection (fig. a). these results suggest that in caco and a -ace cells the invasion of sars-cov- or sars-cov at low moi shuts down or fails to activate the innate immune system.

based on the results observed above, multiple factors including disease severity, different organs, cell types and virus dose contribute to the variability in the innate immune responses. for a better characterization of the innate immune responses, a more comprehensive profiling is necessary, including of patients with infections in different stages, different levels of severity, and different clinical outcomes of the infection. further, a larger array of cell types should be profiled over time after infection with different virus doses. in this way we would be better able to understand the kinetics of ifns and isgs in response to sars-cov- infection.

in summary, our study has comparatively analyzed an extensive data collection from different cell types infected with sars-cov- , sars-cov and mers-cov, and from covid- patients. we have presented evidence for multiple sars-cov- entry mechanisms. we could also dissolve apparent conflicts on innate immune responses in sars-cov- infection ( – ), by drawing upon a larger set of cell types and infection severity. the results emphasize the complexity of interactions between host and sars-cov- , offer new insights into pathogenesis of sars-cov- , and can inform development of antiviral drugs.
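The comparisons of ifn and isg expression discussed above are expressed as tpm and logfc. As a rough illustration of those two quantities, the sketch below computes tpm from read counts and gene lengths and then a log fold change against a control. It is not the authors' pipeline, which used kallisto, sleuth and scater; the base of the logarithm and the pseudocount are not recoverable from the text, so log2 and a pseudocount of 1 are assumptions, and the gene names, counts and lengths are made up.

```python
import math


def tpm(counts, lengths):
    """Transcripts per million from raw counts and gene lengths (in bases).

    Each count is first divided by its gene length, and the resulting
    rates are then rescaled so that they sum to one million.
    """
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("all counts are zero")
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


def log_fc(tpm_infected, tpm_control, pseudocount=1.0):
    """Log fold change of an infected sample relative to its control.

    ASSUMPTION: log base 2 and pseudocount 1.0 are illustrative choices.
    """
    return math.log2((tpm_infected + pseudocount) / (tpm_control + pseudocount))


# HYPOTHETICAL two-gene example with equal gene lengths.
counts = {"gene_a": 10, "gene_b": 20}
lengths = {"gene_a": 1000, "gene_b": 1000}
expression = tpm(counts, lengths)
print(expression)

# A gene at 3 tpm in infected cells versus 1 tpm in the mock control.
print(log_fc(3.0, 1.0))
```

The pseudocount keeps the ratio defined when a gene is not expressed in the control, which matters for genes such as ifns that are often near zero in mock-infected cultures.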
materials and methods

data collection

after the successful release of the virus genome into the cytoplasm, a negative-strand genomic-length rna is synthesized as the template for replication. negative-strand subgenome-length mrnas are formed as well from the virus genome as discontinuous rnas, and used as the templates for transcription. in the public data we collected for the analysis, there are two main library preparation methods to remove the highly abundant ribosomal rnas (rrna) from total rna before sequencing. one is polya+ selection, the other is rrna-depletion ( ). it is known that coronavirus genomic and subgenomic mrnas carry a polya tail at their ’ ends, so in the polya+ rna-seq, we have ( ) virus genomic sequence from virus replication, i.e. replicated genomic rnas from negative-strand as template, and ( ) subgenomic mrnas from virus transcription; in the rrna-depletion rna-seq we have ( ) virus genomic sequence from virus replication: both replicated genomic rnas from negative-strand as template and the negative-strand templates themselves, and ( ) subgenomic mrnas from virus transcription. polya+ selection was used unless otherwise stated in this study; “total rna” is used to specify that the rrna-depletion method was used to prepare the sequencing libraries.

the raw fastq data of different cell types infected with sars-cov- , sars-cov and mers-cov, and lung samples of covid- patients and healthy controls were retrieved from ncbi ( ) (https://www.ncbi.nlm.nih.gov/) and ena ( ) (https://www.ebi.ac.uk/ena) (accession numbers gse ( ), gse , gse ( ) and gse ( )).
the raw fastq data of pbmc and balf samples of covid- patients and corresponding controls were downloaded from big data center ( ) (https://bigd.big.ac.cn/) (accession number cra ) ( ), and the raw fastq data for balf healthy control samples were downloaded from ncbi (accession numbers srr , srr , and srr under project prjna ( )). the preprocessed single cell rna-seq data of balf samples from severe covid- patients and moderate covid- patients were downloaded from ncbi with accession number gse ( ). the preprocessed single cell rna-seq data of a balf sample from a healthy control was retrieved from ncbi (accession number gsm under project prjna ( )). detailed information about these public datasets is available in the supplementary file: supplementary.pdf.

for analysis, the human grch release transcriptome and the green monkey (chlorocebus sabaeus) chlsab . release transcriptome and their corresponding annotation gtf files were downloaded from ensembl ( ) (https://www.ensembl.org). the reference virus genomes were downloaded from ncbi: sars-cov- (genbank: mn . ), sars-cov (genbank: ay . ), mers-cov (genbank: jx . ).

data analysis workflow

the workflow of this study is summarized in fig. s and fig. s in the supplementary file: supplementary.pdf. the quality of the raw fastq data was examined with fastqc ( ). trimmomatic- . ( ) was used to remove adapters and filter out low quality reads with parameters “-threads -phred illuminaclip:adapters.fasta: : : headcrop: leading: trailing: slidingwindow: : minlen: ”. the clean rna sequencing reads were then pseudo-aligned to the reference transcriptome and quantified using kallisto (version . . ) ( ) with parameters “-b --single -l -s ” for single-end sequencing data
and with parameter “-b ” for paired-end sequencing data. expression levels were calculated and summarized as transcripts per million (tpm) on gene levels with sleuth ( ), and logfc was then calculated for each condition. the single cell rna-seq data were summarized across all cells to obtain “pseudo-bulk” samples. r packages edaseq ( ) and org.hs.eg.db ( ) were used to obtain gene length, and tpm was calculated with the “calculatetpm” function of r package scater ( ). logfc was then calculated for each patient.

the clean rna-seq data were also aligned to the virus genome with bowtie ( ) (version . . ) to create aligned bam files, and the mapping rates to the virus genomes were obtained as well. samtools ( ) (version . ) was then used for sorting and indexing the aligned bam files. the “samtools depth” command was used to produce the number of aligned reads per site along the virus genome.

the heatmap in fig. i was made with the pheatmap r package ( ). the “complete” clustering method was used for clustering the rows and “euclidean” distance was used to measure the cluster distance. the heatmap in fig. a was made with the complexheatmap r package ( ). the “complete” clustering method was used for clustering the rows and columns and “euclidean” distance was used to measure the cluster distance.

references

. a. r. fehr, s. perlman, coronaviruses (springer, ), pp. – .
. t. kuiken, r. a. fouchier, m. schutten, g. f. rimmelzwaan, g. van amerongen, d. van riel, j. d. laman, t. de jong, g. van doornum, w. lim, a. e. ling, p. k. chan, j. s. tam, m. c. zambon, r. gopal, c. drosten, s. van der werf, n. escriou, j. c. manuguerra, k. stöhr, j. s. peiris, a. d. osterhaus, newly discovered coronavirus as the primary cause of severe acute respiratory syndrome. the lancet , – ( ).
. who, summary of probable sars cases with onset of illness from november to july .
. a. m. zaki, s. van boheemen, t. m. bestebroer, a. d. osterhaus, r. a. fouchier, isolation of a novel coronavirus from a man with pneumonia in saudi arabia. new england journal of medicine , – ( ).
. who, middle east respiratory syndrome coronavirus (mers-cov) – saudi arabia.
. f. wu, s. zhao, b. yu, y. m. chen, w. wang, z. g. song, y. hu, z. w. tao, j. h. tian, y. y. pei, m. l. yuan, y. l. zhang, f. h. dai, y. liu, q. m. wang, j. j. zheng, l. xu, e. c. holmes, y. z. zhang, a new coronavirus associated with human respiratory disease in china. nature , – ( ).
. who, who coronavirus disease (covid- ) dashboard.
. x. xu, p. chen, j. wang, j. feng, h. zhou, x. li, w. zhong, p. hao, evolution of the novel coronavirus from the ongoing wuhan outbreak and modeling of its spike protein for risk of human transmission. science china life sciences , – ( ).
. s. belouzard, j. k. millet, b. n. licitra, g. r. whittaker, mechanisms of coronavirus cell entry mediated by the viral spike protein. viruses , – ( ).
. z. lou, y. sun, z. rao, current progress in antiviral strategies. trends in pharmacological sciences , – ( ).
. e. teissier, f. penin, e.-i. pécheur, targeting cell entry of enveloped viruses as an antiviral strategy. molecules , – ( ).
. i. s. mahmoud, y. b. jarrar, w. alshaer, s. ismail, sars-cov- entry in host cells-multiple targets for treatment and prevention. biochimie ( ).
. z. qinfen, c. jinming, h. xiaojun, z. huanying, h. jicheng, f. ling, l. kunpeng, z.
jingqiang, the life cycle of sars coronavirus in vero e cells. journal of medical virology , – ( ).
. m. hoffmann, h. kleine-weber, s. schroeder, n. krüger, t. herrler, s. erichsen, t. s. schiergens, g. herrler, n. h. wu, a. nitsche, m. a. müller, c. drosten, s. pöhlmann, sars-cov- cell entry depends on ace and tmprss and is blocked by a clinically proven protease inhibitor. cell ( ).
. z.-y. yang, y. huang, l. ganesh, k. leung, w.-p. kong, o. schwartz, k. subbarao, g. j. nabel, ph-dependent entry of severe acute respiratory syndrome coronavirus is mediated by the spike glycoprotein and enhanced by dendritic cell transfer through dc-sign. journal of virology , – ( ).
. h. wang, p. yang, k. liu, f. guo, y. zhang, g. zhang, c. jiang, sars coronavirus entry into host cells through a novel clathrin-and caveolae-independent endocytic pathway. cell research , – ( ).
. w. widagdo, s. sooksawasdi na ayudhya, g. b. hundie, b. l. haagmans, host determinants of mers-cov transmission and pathogenesis. viruses , ( ).
. f. li, structure, function, and evolution of coronavirus spike proteins. annual review of virology , – ( ).
. m. singh, v. bansal, c. feschotte, a single-cell rna expression map of human coronavirus entry factors. biorxiv ( ).
. k. wang, w. chen, y.-s. zhou, j.-q. lian, z. zhang, p. du, l. gong, y. zhang, h.-y. cui, j.-j. geng, b. wang, x.-x. sun, c.-f. wang, x. yang, p. lin, y.-q. deng, d. wei, x.-m. yang, y.-m. zhu, k. zhang, z.-h. zheng, j.-l. miao, t. guo, y. shi, j. zhang, l. fu, q.-y. wang, h. bian, p. zhu, z.-n. chen, sars-cov- invades host cells via a novel route: cd -spike protein. biorxiv ( ).
. r. amraie, m. a. napoleon, w. yin, j. berrigan, e. suder, g. zhao, j. olejnik, s.
gummuluru, e. muhlberger, v. chitalia, n. rahimi, cd l/l-sign and cd /dc-sign act as receptors for sars-cov- and are differentially expressed in lung and kidney epithelial and endothelial cells. biorxiv ( ).
. f. hikmet, l. méar, Å. edvinsson, p. micke, m. uhlén, c. lindskog, the protein expression profile of ace in human tissues. molecular systems biology , e ( ).
. l. zou, f. ruan, m. huang, l. liang, h. huang, z. hong, j. yu, m. kang, y. song, j. xia, q. guo, t. song, j. he, h. l. yen, m. peiris, j. wu, sars-cov- viral load in upper respiratory specimens of infected patients. new england journal of medicine , – ( ).
. g. simmons, j. d. reeves, a. j. rennekamp, s. m. amberg, a. j. piefer, p. bates, characterization of severe acute respiratory syndrome-associated coronavirus (sars-cov) spike glycoprotein-mediated viral entry. proceedings of the national academy of sciences , – ( ).
. r. zang, m. f. g. castro, b. t. mccune, q. zeng, p. w. rothlauf, n. m. sonnek, z. liu, k. f. brulois, x. wang, h. b. greenberg, m. s. diamond, m. a. ciorba, s. p. whelan, s. ding, tmprss and tmprss promote sars-cov- infection of human small intestinal enterocytes. science immunology ( ).
. p. zmora, m. hoffmann, h. kollmus, a.-s. moldenhauer, o. danov, a. braun, m. winkler, k. schughart, s. pöhlmann, tmprss a activates the influenza a virus hemagglutinin and the mers coronavirus spike protein and is insensitive against blockade by hai- . journal of biological chemistry , – ( ).
. x. ou, y. liu, x. lei, p. li, d. mi, l. ren, l. guo, r. guo, t. chen, j. hu, z. xiang, z. mu, x. chen, j. chen, k. hu, q. jin, j. wang, z.
qian, characterization of spike glycoprotein of sars-cov- on virus entry and its immune cross-reactivity with sars-cov. nature communications , – ( ).
y.-m. loo, m. gale jr, immune signaling by rig-i-like receptors. immunity , – ( ).
a. g. bowie, i. r. haga, the role of toll-like receptors in the host response to viruses. molecular immunology , – ( ).
c. chiang, m. u. gack, post-translational control of intracellular pathogen sensing pathways. trends in immunology , – ( ).
a. park, a. iwasaki, type i and type iii interferons–induction, signaling, evasion, and application to combat covid- . cell host & microbe ( ).
q. ruan, k. yang, w. wang, l. jiang, j. song, clinical predictors of mortality due to covid- based on an analysis of data of patients from wuhan, china. intensive care medicine , – ( ).
i. f. n. hung, k. c. lung, e. y. k. tso, r. liu, t. w. h. chung, m. y. chu, y. y. ng, j. lo, j. chan, a. r. tam, h. p. shum, v. chan, a. k. l. wu, k. m. sin, w. s. leung, w. l. law, d. c. lung, s. sin, p. yeung, c. c. y. yip, r. r. zhang, a. y. f. fung, e. y. w. yan, k. h. leung, j. d. ip, a. w. h. chu, w. m. chan, a. c. k. ng, r. lee, k. fung, a. yeung, t. c. wu, j. w. m. chan, w. w. yan, w. m. chan, j. f. w. chan, a. k. w. lie, o. t. y. tsang, v. c. c. cheng, t. l. que, c. s. lau, k. h. chan, k. k. w. to, k. y. yuen, triple combination of interferon beta- b, lopinavir–ritonavir, and ribavirin in the treatment of patients admitted to hospital with covid- : an open-label, randomised, phase trial. the lancet , – ( ).
e. andreakos, s. tsiodras, covid- : lambda interferon against viral load and hyperinflammation. embo molecular medicine p. e ( ).
d. blanco-melo, b. e. nilsson-payant, w.
c. liu, s. uhl, d. hoagland, r. møller, t. x. jordan, k. oishi, m. panis, d. sachs, t. t. wang, r. e. schwartz, j. k. lim, r. a. albrecht, b. r. tenoever, imbalanced host response to sars-cov- drives development of covid- . cell ( ).
z. zhou, l. ren, l. zhang, j. zhong, y. xiao, z. jia, l. guo, j. yang, c. wang, s. jiang, d. yang, g. zhang, h. li, f. chen, y. xu, m. chen, z. gao, j. yang, j. dong, b. liu, x. zhang, w. wang, k. he, q. jin, m. li, j. wang, heightened innate immune responses in the respiratory tract of covid- patients. cell host & microbe ( ).
a. broggi, s. ghosh, b. sposito, r. spreafico, f. balzarini, a. lo cascio, n. clementi, m. de santis, n. mancini, f. granucci, i. zanoni, type iii interferons disrupt the lung epithelial barrier upon viral recognition. science ( ).
l. wei, s. ming, b. zou, y. wu, z. hong, z. li, x. zheng, m. huang, l. luo, j. liang, x. wen, t. chen, q. liang, l. kuang, h. shan, x. huang, viral invasion and type i interferon response characterize the immunophenotypes during covid- infection. available at ssrn ( ).
j. y. zhang, x. m. wang, x. xing, z. xu, c. zhang, j. w. song, x. fan, p. xia, j. l. fu, s. y. wang, r. n. xu, x. p. dai, l. shi, l. huang, t. j. jiang, m. shi, y. zhang, a. zumla, m. maeurer, f. bai, f. s. wang, single-cell landscape of immunological responses in patients with covid- . nature immunology pp. – ( ).
e. sallard, f. x. lescure, y. yazdanpanah, f. mentre, n. peiffer-smadja, type interferons as a potential treatment against covid- . antiviral research p. ( ).
e. wyler, k. mösbauer, v. franke, a. diag, t. g. lina, r. arsie, f. klironomos, d. koppstein, s. ayoub, c. buccitelli, a. richter, i. legnini, a. ivanov, t. mari, s. d.
giudice, p. p. jan, a. m. marcel, d. niemeyer, m. selbach, a. akalin, n. rajewsky, c. drosten, m. landthaler, bulk and single-cell gene expression profiling of sars-cov- infected human cell lines identifies molecular targets for therapeutic intervention. biorxiv ( ).
y. xiong, y. liu, l. cao, d. wang, m. guo, a. jiang, d. guo, w. hu, j. yang, z. tang, h. wu, y. lin, m. zhang, q. zhang, m. shi, y. liu, y. zhou, k. lan, y. chen, transcriptomic characteristics of bronchoalveolar lavage fluid and peripheral blood mononuclear cells in covid- patients. emerging microbes & infections , – ( ).
d. michalovich, n. rodriguez-perez, s. smolinska, m. pirozynski, d. mayhew, s. uddin, s. van horn, m. sokolowska, c. altunbulakli, a. eljaszewicz, b. pugin, w. barcik, m. kurnik-lucka, k. a. saunders, k. d. simpson, p. schmid-grendelmeier, r. ferstl, r. frei, n. sievi, m. kohler, p. gajdanowicz, k. b. graversen, k. lindholm bøgh, m. jutel, j. r. brown, c. a. akdis, e. m. hessel, l. o’mahony, obesity and disease severity magnify disturbed microbiome-immune interactions in asthma patients. nature communications , – ( ).
m. liao, y. liu, j. yuan, y. wen, g. xu, j. zhao, l. cheng, j. li, x. wang, f. wang, l. liu, i. amit, s. zhang, z. zhang, single-cell landscape of bronchoalveolar immune cells in patients with covid- . nature medicine pp. – ( ).
c. morse, t. tabib, j. sembrat, k. l. buschur, h. t. bittar, e. valenzi, y. jiang, d. j. kass, k. gibson, w. chen, a. mora, p. v. benos, m. rojas, r. lafyatis, proliferating spp /mertk-expressing macrophages in idiopathic pulmonary fibrosis. european respiratory journal ( ).
c. tikellis, m.
thomas, angiotensin-converting enzyme (ace ) is a key modulator of the renin angiotensin system in health and disease. international journal of peptides ( ).
g. simmons, d. n. gosalia, a. j. rennekamp, j. d. reeves, s. l. diamond, p. bates, inhibitors of cathepsin l prevent severe acute respiratory syndrome coronavirus entry. proceedings of the national academy of sciences , – ( ).
s. lukassen, r. l. chua, t. trefzer, n. c. kahn, m. a. schneider, t. muley, h. winter, m. meister, c. veith, a. w. boots, b. p. hennig, m. kreuter, c. conrad, r. eils, sars-cov- receptor ace and tmprss are primarily expressed in bronchial transient secretory cells. the embo journal , e ( ).
r. ueha, t. sato, t. goto, a. yamauchi, k. kondo, t. yamasoba, expression of ace and tmprss proteins in the upper and lower aerodigestive tracts of rats. biorxiv ( ).
j. shilts, g. j. wright, no evidence for basigin/cd as a direct sars-cov- spike binding receptor. biorxiv ( ).
j. mercer, a. helenius, virus entry by macropinocytosis. nature cell biology , – ( ).
d. l. mckee, a. sternberg, u. stange, s. laufer, c. naujokat, candidate drugs against sars-cov- and covid- . pharmacological research p. ( ).
h. k. law, c. y. cheung, h. y. ng, s. f. sia, y. o. chan, w. luk, j. m. nicholls, j. peiris, y. l. lau, chemokine up-regulation in sars-coronavirus–infected, monocyte-derived human dendritic cells. blood , – ( ).
c. y. cheung, l. l. m. poon, i. h. y. ng, w. luk, s.-f. sia, m. h. s. wu, k.-h. chan, k.-y. yuen, s. gordon, y. guan, j. s. m. peiris, cytokine responses in severe acute respiratory syndrome coronavirus-infected macrophages in vitro: possible relevance to pathogenesis. journal of virology , – ( ).
l. li, j.
wo, j. shao, h. zhu, n. wu, m. li, h. yao, m. hu, r. h. dennin, sars-coronavirus replicates in mononuclear cells of peripheral blood (pbmcs) from sars patients. journal of clinical virology , – ( ).
w. zhao, x. he, k. a. hoadley, j. s. parker, d. n. hayes, c. m. perou, comparison of rna-seq by poly (a) capture, ribosomal rna depletion, and dna microarray for expression profiling. bmc genomics , – ( ).
e. w. sayers, r. agarwala, e. e. bolton, j. r. brister, k. canese, k. clark, r. connor, n. fiorini, k. funk, t. hefferon, j. b. holmes, s. kim, a. kimchi, p. a. kitts, s. lathrop, z. lu, t. l. madden, a. marchler-bauer, l. phan, v. a. schneider, c. l. schoch, k. d. pruitt, j. ostell, database resources of the national center for biotechnology information. nucleic acids research , d –d ( ).
r. leinonen, r. akhtar, e. birney, l. bower, a. cerdeno-tárraga, y. cheng, i. cleland, n. faruque, n. goodgame, r. gibson, g. hoad, m. jang, n. pakseresht, s. plaister, r. radhakrishnan, k. reddy, s. sobhany, p. t. hoopen, r. vaughan, v. zalunin, g. cochrane, the european nucleotide archive. nucleic acids research , d –d ( ).
l. riva, s. yuan, x. yin, l. martin-sancho, n. matsunaga, l. pache, s. burgstaller-muehlbacher, p. d. de jesus, p. teriete, m. v. hull, m. w. chang, j. f. w. chan, j. cao, v. k. m. poon, k. m. herbert, k. cheng, t. t. h. nguyen, a. rubanov, y. pu, c. nguyen, a. choi, r. rathnasinghe, m. schotsaert, l. miorin, m. dejosez, t. p. zwaka, k. y. sit, l. martinez-sobrido, w. c. liu, k. m. white, m. e. chapman, e. k. lendy, r. j. glynne, r. albrecht, e. ruppin, a. d. mesecar, j. r. johnson, c. benner, r. sun, p. g. schultz, a. i. su, a. garcía-sastre, a. k. chatterjee, k. y. yuen, s. k.
chanda, discovery of sars-cov- antiviral drugs through large-scale compound repurposing. nature , – ( ).
z. zhang, et al., database resources of the national genomics data center in . nucleic acids research , d ( ).
a. d. yates, et al., ensembl . nucleic acids research , d –d ( ).
s. andrews, fastqc: a quality control tool for high throughput sequence data ( ).
a. m. bolger, m. lohse, b. usadel, trimmomatic: a flexible trimmer for illumina sequence data. bioinformatics , – ( ).
n. l. bray, h. pimentel, p. melsted, l. pachter, near-optimal probabilistic rna-seq quantification. nature biotechnology , – ( ).
h. pimentel, n. l. bray, s. puente, p. melsted, l. pachter, differential analysis of rna-seq incorporating quantification uncertainty. nature methods , ( ).
d. risso, k. schwartz, g. sherlock, s. dudoit, gc-content normalization for rna-seq data. bmc bioinformatics , ( ).
m. carlson, s. falcon, h. pages, n. li, org.hs.eg.db: genome wide annotation for human. r package version ( ).
d. j. mccarthy, k. r. campbell, a. t. lun, q. f. wills, scater: pre-processing, quality control, normalization and visualization of single-cell rna-seq data in r. bioinformatics , – ( ).
b. langmead, s. l. salzberg, fast gapped-read alignment with bowtie . nature methods , ( ).
h. li, b. handsaker, a. wysoker, t. fennell, j. ruan, n. homer, g. marth, g. abecasis, r. durbin, the sequence alignment/map format and samtools. bioinformatics , – ( ).
r. kolde, pheatmap: pretty heatmaps ( ). r package version . . .
z. gu, r. eils, m. schlesner, complex heatmaps reveal patterns and correlations in multidimensional genomic data. bioinformatics , – ( ).
acknowledgements: the authors thank professor ke xu from wuhan university and professor dimitri lavillette from institut pasteur of shanghai for helpful conversations.
funding: this work was partially funded by grant kl b (secovit) of the german federal ministry of education and research.
author contributions: pei hao and yingying cao conceived the research. daniel hoffmann, pei hao, and yingying cao designed the analyses. yingying cao and xintian xu conducted the analyses. all authors wrote the manuscript.
competing interests: the authors declare that they have no competing financial interests.
data and materials availability: additional data and materials are available online.
figures and tables:
table . data of cell lines (cells) included in this study
columns: virus; virus strain; virus dose (moi); time; replicates; species of origin; cell type; library preparation; accession number
sars-cov- usa-wa / h homo sapiens nhbe polya+ selection gse
mock mock mock h homo sapiens nhbe polya+ selection gse
sars-cov- usa-wa / . h homo sapiens a polya+ selection gse
mock mock mock h homo sapiens a polya+ selection gse
sars-cov- usa-wa / h homo sapiens a polya+ selection gse
mock mock mock h homo sapiens a polya+ selection gse
sars-cov- usa-wa / . h homo sapiens a -ace polya+ selection gse
mock mock mock h homo sapiens a -ace polya+ selection gse
sars-cov- usa-wa / h homo sapiens a -ace polya+ selection gse
mock mock mock h homo sapiens a -ace polya+ selection gse
sars-cov- usa-wa / h homo sapiens calu polya+ selection gse
mock mock mock h homo sapiens calu polya+ selection gse
sars-cov- munich/bavpat / . h homo sapiens calu rrna-depletion gse
mock mock mock h homo sapiens calu rrna-depletion gse
sars-cov- munich/bavpat / . h homo sapiens calu polya+ selection gse
mock mock mock h homo sapiens calu polya+ selection gse
sars-cov- munich/bavpat / . h homo sapiens caco polya+ selection gse
mock mock mock h homo sapiens caco polya+ selection gse
sars-cov- munich/bavpat / . h homo sapiens h polya+ selection gse
mock mock mock h^ homo sapiens h polya+ selection gse
sars-cov- usa-wa / . h * chlorocebus sabaeus vero e rrna-depletion gse
mock mock mock h chlorocebus sabaeus vero e rrna-depletion gse
sars-cov frankfurt strain . h homo sapiens calu polya+ selection gse
sars-cov frankfurt strain . h homo sapiens calu rrna-depletion gse
sars-cov frankfurt strain . h homo sapiens caco polya+ selection gse
sars-cov frankfurt strain . h homo sapiens h polya+ selection gse
sars-cov urbani strain . h homo sapiens mrc polya+ selection gse
sars-cov urbani strain h homo sapiens mrc polya+ selection gse
sars-cov urbani strain . h chlorocebus sabaeus vero e polya+ selection gse
sars-cov urbani strain h chlorocebus sabaeus vero e polya+ selection gse
mers-cov emc/ . h homo sapiens mrc polya+ selection gse
mers-cov emc/ h homo sapiens mrc polya+ selection gse
mers-cov emc/ . h chlorocebus sabaeus vero e polya+ selection gse
mers-cov emc/ h chlorocebus sabaeus vero e polya+ selection gse
mock mock mock h homo sapiens mrc polya+ selection gse
mock mock mock h homo sapiens vero e polya+ selection gse
^no corresponding h mock control samples for h cells; h mock control samples were used instead.
* there are three replicates, but when the manuscript was in preparation only two of them were available for downloading.
table . data of covid- patients included in this study
columns: individuals; tissue; data type; accession number
bronchoalveolar lavage fluid from covid- patients bulk rna-seq cra
bronchoalveolar lavage fluid from healthy negative control bulk rna-seq prjna ^
peripheral blood mononuclear cells from covid- patients bulk rna-seq cra
peripheral blood mononuclear cells from healthy negative control bulk rna-seq cra
lung biopsy from postmortem covid- patients bulk rna-seq gse
lung biopsy from healthy negative control bulk rna-seq gse
bronchoalveolar lavage fluid from covid- patients (severe) single cell rna-seq gse
bronchoalveolar lavage fluid from covid- patients (moderate) single cell rna-seq gse
bronchoalveolar lavage fluid from healthy negative control single cell rna-seq prjna *
^three samples under project prjna : srr , srr , and srr were used.
* one sample with accession number gsm under project prjna was used.
[figure: dot and bar plot; x-axis: condition (cell line and moi); y-axis: mapping rate to virus genome (%); legend: mers-cov, sars-cov, sars-cov- ]
fig. . mapping rate to virus genome. the dots represent the mapping rates to the virus genome for each individual replicate under the given conditions (cell line, moi, and virus).
bar heights are mean mapping rates to the virus genome for each condition.
fig. . the number of reads mapped to the corresponding virus genome. (a-e) the dot plots show the number of reads mapped to each site of the corresponding virus genome. the annotation of the genome of each virus is from ncbi (sars: gcf_ . , sars-cov- : gcf_ . , mers: gcf_ . ). labels in grey title bars correspond to conditions as in fig. .
[figure: panels a-h, dot plots of log (tpm+ ) per cell line for each receptor or protease gene; panel i, heatmap of log (tpm+ ) for receptors and proteases across cell types, with clusters a, b, c; panel j, schematic of entry mechanisms]
fig. . the expression levels of the receptors and proteases. (a-h) each dot represents the expression value in each sample. (i) heatmap of the expression levels of coronavirus associated receptors and factors of different cell types. labels a, b, c mark cell clusters that likely share entry routes sketched in panel j. (j) entry mechanisms involved in sars-cov- entry into cells. schematic is based on a figure by vega asensio - own work, cc by-sa . , https://commons.wikimedia.org/w/index.php?curid= .
[figure: panel a, heatmap of logfc across infected cell lines and patient samples (pbmc, balf, lung) for genes including the tlr, irf, ifn, jak, stat, isg, oas, ifit, ccl and cxcl families; panels b-g, dot plots of tpm for ifnb and ifnl genes at mock and moi conditions per cell line and virus, and in balf samples (healthy, moderate, severe)]
fig. . expression levels of genes related to immune responses. (a) heatmap of the logfc of ifns, isgs and pro-inflammatory cytokines.
the clustering of samples produces a cluster (top) with little ifn/isg expression comprising mers infections and non-infectable cells/sars-cov- / (except for caco cells), and a cluster (bottom) with strong ifn/isg expression comprising sars-cov- / infectable cells and patient samples. (b-g) expression levels of ifns. each dot represents the expression value of a sample. bars indicate mean expression levels (in tpm) of the respective ifn at different moi values.
supplementary materials:
additional information about public data
all data can be downloaded from public repositories; the three main sources are ncbi ( ) (https://www.ncbi.nlm.nih.gov/), ena ( ) (https://www.ebi.ac.uk/ena) and big data center ( ) (https://bigd.big.ac.cn/).
gse dataset ( )
from this dataset we downloaded: biological triplicates of primary human lung epithelium (nhbe) which were mock treated or infected with sars-cov- (usa-wa / ) at an moi of ; biological triplicates of transformed lung alveolar (a ) cells which were mock treated or infected with sars-cov- (usa-wa / ) at an moi of . or ; biological triplicates of transformed lung alveolar (a ) cells transduced with a vector expressing human ace , which were also mock treated or infected with sars-cov- (usa-wa / ) at an moi of . or ; biological triplicates of transformed lung-derived calu- cells which were mock treated or infected with sars-cov- (usa-wa / ) at an moi of ; covid- patient samples: uninfected human lung biopsies derived from one male (age ) and one female (age ) and used as control biological replicates, and lung samples derived from a single male covid- deceased patient (age ) which were processed in technical replicates.
the polya+ selection library preparation method was used to remove rrnas before sequencing.
gse dataset ( )
from this dataset we downloaded biological replicates of calu- , caco- and h cells which were mock treated or infected with sars-cov- (patient isolate betacov/munich/bavpat / /epi_isl_ ) or sars-cov (frankfurt strain) at an moi of . . for caco- and h cells, the polya+ selection library preparation method was used to remove rrnas before sequencing. for calu- cells, two library preparation methods, polya+ selection and rrna-depletion, were each used to remove rrnas before sequencing.
gse dataset ( )
from this dataset we downloaded rna sequencing data of vero e cells which were either mock-infected or infected with sars-cov- usa-wa / (moi = . ), with three replicates. however, when we downloaded the data, one sample with accession number gsm was not available for downloading. cells were harvested at hours after infection, and the rrna-depletion method was used to prepare rna for sequencing.
gse dataset
from this dataset we downloaded biological triplicates of mrc and vero e cells which were mock treated or infected with sars-cov (urbani strain) or mers-cov (emc/ ) at an moi of . or . the polya+ selection library preparation method was used to remove rrnas before sequencing.
cra dataset ( )
this dataset is publicly available at https://bigd.big.ac.cn/gsa/browse/cra . from this dataset we downloaded the raw fastq data of pbmc and balf samples of covid- patients and corresponding pbmc controls.
prjna dataset ( )
from this dataset we downloaded the raw fastq data for balf healthy control samples with accession numbers srr , srr , and srr .
gse dataset ( )
from this dataset we downloaded the preprocessed single cell rna-seq data of balf samples from severe covid- patients and mild covid- patients.
prjna dataset ( )
from this dataset we downloaded the preprocessed single cell rna-seq data of a balf sample from a healthy control with accession number gsm .
supplementary figures
fig. s . workflow of bulk rna-seq.
[flowchart: bulk rna-seq raw data; fastqc and trimmomatic; bulk rna-seq clean data; align to virus genome (bowtie , samtools) to obtain reads coverage along virus genome; pseudoalign to host transcriptome (kallisto, sleuth) to obtain gene level tpm values]
fig. s . workflow of single cell rna-seq data.
[flowchart: count matrix of scrna-seq of covid- patients; sum counts across all cells to obtain “pseudo-bulk” samples; obtain gene length (edaseq, org.hs.eg.db); obtain gene level tpm values (scater)]
fig. s . expression levels of ifns in mrc cells infected with sars-cov and mers-cov.
[figure: dot plots of tpm for ifnb and ifnl genes in mrc cells at mock and two moi conditions, for sars-cov and mers-cov]
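the single cell workflow above (sum counts across all cells into a “pseudo-bulk” sample, obtain gene lengths, derive gene level tpm values) can be sketched roughly as follows. this is a minimal python illustration of the standard tpm definition, not the authors' code: the workflow names the r packages edaseq, org.hs.eg.db and scater for these steps, whereas the function name `pseudo_bulk_tpm` and the toy numbers below are hypothetical and chosen only for illustration.

```python
def pseudo_bulk_tpm(cell_counts, gene_lengths_bp):
    """Return gene-level TPM values for one pseudo-bulk sample.

    cell_counts: list of per-cell rows, each with one raw count per gene.
    gene_lengths_bp: one gene length (in base pairs) per gene.
    Hypothetical helper for illustration; not part of any package.
    """
    n_genes = len(gene_lengths_bp)
    if any(len(row) != n_genes for row in cell_counts):
        raise ValueError("each cell needs one count per gene")
    if any(length <= 0 for length in gene_lengths_bp):
        raise ValueError("gene lengths must be positive")

    # Step 1: sum counts across all cells to form the pseudo-bulk sample.
    pseudo_bulk = [sum(row[g] for row in cell_counts) for g in range(n_genes)]

    # Step 2: normalise by gene length (counts per kilobase).
    per_kb = [
        count / (length / 1000.0)
        for count, length in zip(pseudo_bulk, gene_lengths_bp)
    ]

    # Step 3: rescale so that the values sum to one million.
    total = sum(per_kb)
    if total == 0:
        return [0.0] * n_genes
    return [value / total * 1e6 for value in per_kb]


# Toy input (made-up numbers): 2 cells x 3 genes.
toy_counts = [[1, 0, 2], [1, 4, 0]]
toy_lengths = [1000, 2000, 1000]
toy_tpm = pseudo_bulk_tpm(toy_counts, toy_lengths)
```

because tpm divides by gene length before rescaling, the second toy gene (twice as long, twice the summed counts) ends up with the same value as the other two; values are comparable across genes within a sample and always sum to one million.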
fig. s . expression levels of ifns in balf samples of patients.
[figure: panels a-d, dot plots of tpm for ifnb and ifnl genes in healthy.balf and patient.balf samples]
fig. s . expression levels of ifns in pbmc samples of patients.
[figure: panels a-d, dot plots of tpm for ifnb and ifnl genes in healthy.pbmc and patient.pbmc samples]
fig. s . expression levels of ifns in lung samples of patients.
[figure: panels a-d, dot plots of tpm for ifnb and ifnl genes in healthy.lung and patient.lung samples]
fig. s . the number of reads mapped to the sars-cov- genome in lung samples of patients.
● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●●●●●●●●●●●●●●●●●●●● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●● ●●●●●●●●●●● ●●●●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●●●●●●●●●● ● ●● ●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●● ●●●● ● ●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●● ● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●● ●●●●●●●● ●●●●●●●●●●●●●● ●●●●●●●●●●●●● ●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●● ●● ●●●● ●●●●●●●●●● ●●●●●●●● ●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● nsp nsp nsp nsp −nsp s orf a e m orf − n 
●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● nsp nsp nsp nsp −nsp s orf a e m orf − n samn samn genomic position s a r s − c o v − r e a d s (l o g ) sars−cov− (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . fig. s . the expression levels of ace in the pbmc, lung and balf samples of healthy individuals. ●●● ●●● ● ● h e a lth y. b a l f h e a lth y. p b m c h e a lth y. l u n g t p m ace additional files that are too large to be embedded into the .tex file: table s to tables .xlsx table s to tables .csv (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . capsule network for protein ubiquitination site prediction capsule network for protein ubiquitination site prediction qiyi huang , ¶ jiulei jiang ¶ yin luo * weimin li & ying wang (school of computer science and engineering, north minzu university, yinchuan , ningxia, china) (school of life sciences, east china normal university, shanghai , china) (school of computer science and engineering, changshu institute of technology, suzhou , jiangsu, china) (school of computer engineering and science, shanghai university, shanghai , china) *corresporending author. e-mail: yluo@bio.ecnu.edu.cn(yl) ¶ these authors contributed equally to this work. & this author also contributed equally to this work. copyright: © huang et al. 
this is an open-access article distributed under the terms of the creative commons attribution license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

funding: this project is supported by the national key r&d program of china ( yfe ), national natural science foundation of china ( ), national statistical science research project ( ly ).

competing interests: the authors have declared that no competing interests exist.

biorxiv preprint, this version posted january , . doi: https://doi.org/ . / . . . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by . international license (http://creativecommons.org/licenses/by/ . /).

abstract

ubiquitination modification is one of the most important protein posttranslational modifications used in many biological processes. traditional ubiquitination site determination methods are expensive and time-consuming, whereas calculation-based prediction methods can accurately and efficiently predict ubiquitination sites. this study used a convolutional neural network and a capsule network to design a deep learning model, “caps-ubi,” for multispecies ubiquitination site prediction. two encoding methods, one-of-k and the amino acid continuous type, were used to characterize the sequence pattern of ubiquitination sites. the proposed caps-ubi predictor achieved an accuracy of . , a sensitivity of . , a specificity of . , a matthews correlation coefficient of . , and an area under receiver operating characteristic curve value of . , which outperformed the other tested predictors.

introduction

ubiquitination is an important posttranslational modification of proteins, consisting of the covalent binding of ubiquitin to a variety of cellular proteins. ubiquitin was discovered in by
goldstein et al. [ ]; it is a small protein composed of amino acids [ ]. ubiquitination is the process of covalently binding the lysine of a substrate protein to the small ubiquitin molecule under the action of a series of enzymes. three enzymes are involved in the process: e activation, e conjugation, and e ligation. ubiquitination plays a very important role in basic processes such as signal transduction, cell diseases, dna repair, and transcription regulation [ – ]. because of these biological roles, identifying potential ubiquitination sites helps to understand protein regulation and molecular mechanisms. determining ubiquitination sites with traditional experimental techniques such as mass spectrometry [ ] and antibody recognition [ ] is costly and time-consuming. therefore, it is necessary to develop computational methods that can accurately and efficiently recognize protein ubiquitination sites.

in recent years, several computational methods have been developed to predict potential ubiquitination sites. huang et al. [ ] used amino acid composition (aac), a position weighting matrix, amino acid pair composition (aapc), a position-specific scoring matrix (pssm), and other information to develop a predictor called ubisite using a support vector machine (svm). nguyen et al. [ ] used an svm to combine three kinds of information (aac, evolution information, and aapc) to develop a predictor. qiu et al. [ ] developed a predictor called “iubiq-lys” that applies sequence evolution information and a gray system model. chen et al. [ ] also applied an svm to build the ubiprober predictor. wang et al. [ ] introduced physical–chemical attributes into an svm to develop the esa-ubisite predictor. radivojac et al. [ ] developed the predictor ubpred using a random forest algorithm. lee et al. [ ] developed ubsite using efficient radial basis functions.
all of those machine learning-based methods and predictors have promoted the development of ubiquitination site prediction research and achieved good prediction performance. however, most of them rely on manual feature selection, which may lead to imperfect features [ ], and their datasets are small despite the large volume of accumulated biomedical data. deep learning can handle large-scale data well: its multilayer networks and nonlinear mapping operations can fit the complex structure of data. in recent years, deep learning has developed rapidly [ ] and has been successfully applied in various fields of bioinformatics [ , ]. some methods based on deep learning have been used for ubiquitination site identification. for example, fu et al. [ ] applied one-hot and composition of k-spaced amino acid pairs encoding methods to develop deepubi with text-cnn. liu et al. [ ] used deep transfer learning methods to develop the deeptl-ubi predictor for multispecies ubiquitination site prediction. he et al. [ ] established a multimodal predictor using one-hot, physical–chemical properties of amino acids, and a pssm.

although various ubiquitination site predictors and tools have been developed, there are still some limitations, and their accuracy and other performance measures can be further improved. in this paper, a deep learning model, “caps-ubi,” is proposed that uses a capsule network for protein ubiquitination site prediction.
in caps-ubi, the protein fragments are first encoded with the one-of-k and amino acid continuous methods. then three convolutional layers and the capsule network layer are used as a feature extractor to obtain the functional domains in the protein fragments and finally to produce the prediction result. caps-ubi improves on the prediction performance of existing tools. researchers could use the predictor to select potential ubiquitination candidate sites and verify them experimentally, which narrows the range of protein candidates and saves time.

materials and methods

benchmark dataset

the ubiquitination dataset came from the largest online protein lysine modification database, plmd . , which contains protein lysine modifications. the database has , proteins and , protein lysine modification sites, including , proteins and , ubiquitination sites. to eliminate errors caused by homologous sequences, we used cd-hit [ ] to filter out homologous sequences with sequence similarities greater than %. we obtained , proteins and , ubiquitination sites, which were used as the positive sample set. based on those annotated sequences, , nonubiquitinated sites were extracted from the proteins as the negative sample set, and cd-hit- d [ ] was used to filter out sequences whose similarity to the positive sample set was greater than %. to train a balanced model, we randomly selected the same number of negative samples as there were positive samples, then used % of the data as the training and validation sets and % as the independent test set. finally, , ubiquitination sites and , nonubiquitination sites were obtained. the final data division is shown in table .

table . data of protein ubiquitination sites
dataset | no. of positive data | no. of negative data
training | , | ,
validation | , | ,
testing | , | ,
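the balancing and hold-out steps described above can be sketched in a few lines of python. this is only an illustration under stated assumptions, not the authors' code: the function name `balance_and_split`, the `test_fraction` parameter and the `seed` are hypothetical, the actual split percentages are not legible in this text, and the homology filtering with cd-hit is assumed to have been done beforehand.

```python
import random

def balance_and_split(positives, negatives, test_fraction, seed=0):
    """Undersample negatives to match positives, then hold out a test set.

    `positives` and `negatives` are lists of fragments (any hashable items).
    `test_fraction` is a hypothetical parameter: the share kept as the
    independent test set. Requires len(negatives) >= len(positives).
    """
    rng = random.Random(seed)
    # Draw as many negatives as there are positives (balanced classes).
    sampled_neg = rng.sample(negatives, k=len(positives))
    data = [(frag, 1) for frag in positives] + [(frag, 0) for frag in sampled_neg]
    rng.shuffle(data)
    n_test = int(round(len(data) * test_fraction))
    # First slice is the held-out test set, the rest is training + validation.
    return data[n_test:], data[:n_test]

# Toy usage with made-up fragment identifiers.
train_val, test = balance_and_split(
    positives=["pos%d" % i for i in range(10)],
    negatives=["neg%d" % i for i in range(50)],
    test_fraction=0.2,
)
```

the split here is a plain random shuffle; a stratified split would keep the class ratio identical in every subset, which may or may not be what the original pipeline did.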
input sequence coding

the coding method directly affects the quality of the prediction results; a good feature can extract the correlation between the ubiquitination feature and the targets from peptide sequences [ ]. encoding converts the sequence information of a protein into numerical information on which deep learning can operate. in this study, two methods were used to encode the amino acid sequence around the protein ubiquitination site: one-of-k encoding and amino acid continuous encoding.

one-of-k encoding

the one-of-k encoding method was adopted for protein fragments, and each protein fragment was encoded into an m × k matrix, where m is the number of amino acids in each sequence (that is, the length of the input sequence) and k is the number of amino acid types. there are kinds of common amino acids. when the length of the input sequence did not reach the window length, it was filled in with a “-” on the left or right side of the protein fragment, and the “-” was treated as another amino acid, so each sequence consisted of amino acids.

continuous coding of amino acids

the continuous amino acid coding method [ ] was proposed by venkatarajan and braun; it uses physical-chemical properties to quantitatively characterize amino acids. they used five principal components to characterize the variation in the physical-chemical properties of amino acids. in this paper, each amino acid is represented by a d vector, in which the first d represents the five principal components shown in table of [ ] and the last d represents the gap in the input protein fragment with a length of m.
the gap is represented by a dash “-”, meaning that when the sequence length does not reach the window length, the bit is coded as ; otherwise, it is . finally, each protein fragment is coded into an m × d matrix. this continuous coding scheme takes the physical and chemical properties of amino acids into account and has a smaller dimension than one-of-k coding. the smaller input dimension leads to a relatively simple network structure, which helps to avoid overfitting.

capsule network

in a cnn, the pooling layer can extract valuable information from the data, but some location information is lost [ ]. also, a cnn outputs scalar values in neurons, and the information represented by scalar neurons is limited and cannot reflect the spatial position relation of the internal features of the neural network. to solve the problems of scalar neurons, in hinton proposed a deep learning architecture called a capsule network [ ]. the main building module of a capsule network is the capsule [ ], a group of neurons whose outputs form a vector. the length of the capsule represents the probability that an entity exists (the longer the capsule, the greater the probability), and the direction of the capsule represents the state of the entity. the capsule network provides a unique and powerful deep learning building block that can better model the complex relations within a neural network.
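to make the "length as probability, direction as state" idea concrete, here is a minimal pure-python sketch of the squash nonlinearity and the routing-by-agreement loop as published by sabour et al., which this section builds on. it is a sketch under assumptions, not the caps-ubi implementation: the function names are hypothetical, `n_iter=3` is the default of the original capsule paper (the value used in this study is not legible here), and nested lists stand in for tensors.

```python
import math

def squash(s):
    """Shrink a vector's length into [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0] * len(s)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_routing(u_hat, n_iter=3):
    """Routing by agreement.

    u_hat[i][j] is the prediction vector from lower capsule i to upper
    capsule j. Returns the list of upper-capsule output vectors v_j.
    """
    n_lower, n_upper, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_upper for _ in range(n_lower)]   # routing logits b_ij start at 0
    v = []
    for _ in range(n_iter):
        c = [softmax(row) for row in b]             # c_ij: softmax over upper capsules j
        v = []
        for j in range(n_upper):
            # s_j = sum_i c_ij * u_hat_j|i
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_lower))
                   for d in range(dim)]
            v.append(squash(s_j))                   # v_j = squash(s_j)
        for i in range(n_lower):
            for j in range(n_upper):
                # Agreement update: b_ij += u_hat_j|i . v_j
                b[i][j] += sum(a * w for a, w in zip(u_hat[i][j], v[j]))
    return v
```

in this sketch, lower capsules whose predictions agree on an upper capsule push its output length toward 1, while conflicting predictions cancel and leave a short output vector.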
a cnn uses scalar activation functions, such as the rectified linear unit (relu), sigmoid, and tanh, whereas the capsule network uses a vector activation function called squash. the calculation equation is

\[ v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \, \frac{s_j}{\|s_j\|} \]

where $v_j$ is the output of capsule $j$, and $s_j$ is the weighted sum of the input vectors of capsule $j$. this function compresses the vector length to the interval [0, 1), which can be regarded as a kind of compression and reallocation of the vector length. except in the first capsule layer, the input $s_j$ of a capsule is the weighted sum of the prediction vectors $\hat{u}_{j|i}$ from the capsules in the layer below, and each prediction vector is calculated by multiplying the output $u_i$ of a lower-layer capsule by the weight matrix $W_{ij}$:

\[ s_j = \sum_i c_{ij} \, \hat{u}_{j|i} \]

\[ \hat{u}_{j|i} = W_{ij} \, u_i \]

where $c_{ij}$ is the coupling coefficient, which is obtained by a softmax transformation from $b_{ij}$; its calculation equation is

\[ c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \]

in eq. ( ), the coupling coefficients between capsule $i$ in the previous layer and all capsules in the next layer sum to 1. the coupling coefficient is obtained through a dynamic routing mechanism; the pseudocode is as follows:

procedure routing($\hat{u}_{j|i}$, r, l)
    for all capsules i in layer l and capsules j in layer (l + 1): $b_{ij} \leftarrow 0$
    for r iterations do:
        for all capsules i in layer l: $c_i \leftarrow \mathrm{softmax}(b_i)$
        for all capsules j in layer (l + 1): $s_j \leftarrow \sum_i c_{ij} \hat{u}_{j|i}$
        for all capsules j in layer (l + 1): $v_j \leftarrow \mathrm{squash}(s_j)$
        for all capsules i in layer l and capsules j in layer (l + 1): $b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j$
    return $v_j$

the loss function of the capsule network is the margin loss function, and the calculation equation is

\[ L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda \, (1 - T_k) \max(0, \|v_k\| - m^-)^2 \]

where $k$ indexes the $K$ categories, $T_k$ is the real label (ubiquitinated to and nonubiquitinated to ), and $\|v_k\|$ is the output length of the kth capsule, which is the probability of predicting the kth class. the upper boundary $m^+$ is . ; the first term penalizes a class that is present but whose capsule is shorter than $m^+$ (a false negative). the lower boundary $m^-$ is . ; the second term penalizes a class that is absent but whose capsule is longer than $m^-$ (a false positive). $\lambda$ is a proportional coefficient of . , which down-weights the loss for categories that do not appear, to prevent the capsule vector lengths of all categories from shrinking in the early stage of training. the total loss is the sum of the losses of the $K$ categories.

architecture design

as shown in figure , the structure of the proposed model contains two identical subnetworks that process the one-of-k and amino acid continuous encoding modes. after each subnetwork is trained on its own encoding, the features of the two are merged to give the final output. each subnetwork consists of the same three d convolutional layers (conv , conv , conv ) and a capsule network layer. the first convolutional layer (conv ) of the network is a d convolutional layer, which comprises convolution kernels with a size of and a step size of that use the relu activation function. a convolution kernel with a length of first appeared in network in network [ ]; a convolution kernel with a length of can reduce the complexity of the model and can make the network deeper
and wider. applied in this study, it acted as a feature filter and could pool features in the two encoding modes. the second convolutional layer, conv , is a conventional convolutional layer with d convolution kernels with a length of and a step size of , which functions as a local feature detector that converts the protein sequence input into corresponding local features. the output of conv can be understood as the functional domain characteristics of the protein, and it is used as the input of the next layer, conv . the third convolutional layer, conv , has d convolution kernels with a size of and a step size of . it uses the relu activation function and a dropout mechanism with a random deletion rate of . . the dropout mechanism is used to prevent the model from overfitting and to increase the generalization ability of the model. these two convolutional layers are used to increase the feature representation ability of the capsule network and convert the original features of protein fragments into more advanced and abstract features. then the local features of conv are used as the input of the primarycapsule network layer. the dimension of each capsule in the primarycapsule layer is , the step size is , the convolution kernel length is , and the squash activation function is used. the last layer, labelcapsule, is a capsule with a dimension of , which is used to represent the two states of the input protein fragment: ubiquitination site or non-ubiquitination site. finally, the outputs of the two subnetworks are merged as the final prediction result.

figure .
network structure of the proposed model

model training

for model training, we used the adam [ ] optimization algorithm. adam adapts the learning rate of each parameter, which speeds up training and improves the stability of the model. the learning rate was . , the exponential decay rate of the first-moment estimate was . , and the exponential decay rate of the second-moment estimate was . . the dynamic routing mechanism was consistent with that in the original paper [ ]. the number of routing iterations was , and the margin (boundary) loss function shown in eq. ( ) was used as the loss function of the model. the number of model training iterations was epochs. the deep learning framework used by this model was keras . . . keras is a highly modular deep learning framework based on theano and written in python; it supports both cpu and gpu. the programming language was python . , and the model was trained and tested on a windows system equipped with an nvidia rtx gpu.

result

model evaluation and performance indicators

a confusion matrix is a visual display tool used to evaluate the quality of classification models. each row of the matrix represents the actual condition of the sample, and each column represents the sample condition predicted by the model. there are four values in the matrix, as shown in the following equations, where fn is the number of false negatives, fp is the number of false positives, tn is the number of true negatives, and tp is the number of true positives.
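as a sketch of how the usual indicators follow from these four counts, the snippet below computes sensitivity, specificity, accuracy and the matthews correlation coefficient using their standard textbook definitions. the function name `classification_metrics` and the counts in the usage line are hypothetical and only for illustration; they are not results from this study.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Standard indicators derived from a binary confusion matrix.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    sn = tp / (tp + fn)                    # sensitivity: recall on positive samples
    sp = tn / (tn + fp)                    # specificity: recall on negative samples
    acc = (tp + tn) / (tp + tn + fp + fn)  # overall accuracy
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sn": sn, "sp": sp, "acc": acc, "mcc": mcc}

# Made-up counts, for illustration only.
metrics = classification_metrics(tp=40, tn=30, fp=20, fn=10)
```

unlike accuracy, the mcc uses all four cells of the matrix, which is why it is often preferred when the positive and negative classes are imbalanced.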
the following indicators based on the confusion matrix are usually used to evaluate the prediction performance of the model:

\[ Sn = \frac{TP}{TP + FN} \]

\[ Sp = \frac{TN}{TN + FP} \]

\[ Acc = \frac{TP + TN}{TP + TN + FP + FN} \]

\[ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}} \]

among them, sn stands for sensitivity, which evaluates the prediction performance on positive samples; sp is the specificity, which evaluates the prediction performance on negative samples; acc is the accuracy, the fraction of all samples that are predicted correctly; and mcc is the matthews correlation coefficient, which is an overall evaluation of the model. the receiver operating characteristic (roc) curve and the area under the curve (auc) for the roc curve are usually used to evaluate binary classifiers: the larger the auc value, the better the model performance.

experimental results

first, we did many experiments on the selection of the window size of protein fragments. because the correlation information between amino acids had a direct effect on the prediction results, we needed to determine an appropriate window size. previous studies directly used empirical values such as , , or . however, different data models and classifiers tend to have different window sizes [ ]. therefore, a window length of n was selected from a range of to , and we did a series of experiments with the different window lengths. for each window length, we encoded all training data into two input modes and trained their respective subnetworks. according to the prediction results of the validation set, we selected the appropriate window size.
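for each candidate window length, fragments have to be cut around every lysine, padded with “-” where the protein ends, and then encoded. the sketch below shows one way to do this for the one-of-k mode. it rests on assumptions that are not taken from the source: the names `lysine_windows` and `one_of_k` are hypothetical, the window is assumed to be odd with the lysine at its centre, and the alphabet is assumed to be the 20 common amino acids plus the “-” padding symbol.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 common amino acids (assumed alphabet)
ALPHABET = AMINO_ACIDS + "-"          # '-' is the padding symbol, treated as one more type

def lysine_windows(sequence, window):
    """Return (position, fragment) pairs, one per lysine (K), centred on the K.

    Fragments that run past either end of the protein are padded with '-'.
    """
    if window % 2 != 1:
        raise ValueError("window must be odd so that the lysine is centred")
    half = window // 2
    padded = "-" * half + sequence + "-" * half
    return [(pos, padded[pos:pos + window])
            for pos, residue in enumerate(sequence) if residue == "K"]

def one_of_k(fragment):
    """Encode a fragment as an m x k matrix of 0/1 values (k = 21 here).

    Symbols outside ALPHABET (for example 'X') give an all-zero row.
    """
    return [[1 if symbol == aa else 0 for symbol in ALPHABET] for aa in fragment]

# Toy usage: a five-residue sequence with two lysines and a window of 5.
fragments = lysine_windows("MKTAK", window=5)
encoded = [one_of_k(fragment) for _, fragment in fragments]
```

repeating this for each window length in the tested range, training on the encoded fragments, and comparing validation accuracy reproduces the selection procedure in outline.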
figure shows the performance of various window sizes in the one-of-k and amino acid continuous encoding modes.

figure . accuracy of the validation set for various window lengths

in figure , the abscissa represents the window length, and the ordinate represents the accuracy of the model. it can be seen from figure that when the window length was , the two encoding modes had the highest accuracy. therefore, we set the window length of this model to .

to compare the performance of the model under different encoding schemes, we compared the capsule network with a cnn that had a similar layer structure and was trained on a training set of the same size. the cnn structure replaced only the primarycapsule layer with the conv layer. we set the labelcapsule layer to a × fully connected layer. the comparison results are shown in table .

table . comparison of various coding schemes
feature | model | acc (%) | sn (%) | sp (%) | auc | mcc
one-of-k | capsnet | . | . | . | . | .
one-of-k | cnn | . | . | . | . | .
amino acid continuous | capsnet | . | . | . | . | .
amino acid continuous | cnn | . | . | . | . | .
one-of-k and amino acid continuous | capsnet | . | . | . | . | .
one-of-k and amino acid continuous | cnn | . | . | . | . | .
acc: accuracy of the model; sn: sensitivity of the model; sp: specificity of the model; auc: area under curve; mcc: matthews correlation coefficient

from table , it can be concluded that the capsule network’s accuracies were . , . , and . percentage points higher than those of the cnn under the one-of-k, amino acid continuous, and combined one-of-k and amino acid continuous types, indicating that the capsule network's modeling of internal hierarchical relations gives it an advantage over the cnn.
among them, the performance under the combined one-of-k and amino acid continuous encoding modes is the best on the capsule network: this proposed caps-ubi model achieved an accuracy, sensitivity, specificity, area under curve, and matthews correlation coefficient of . %, . %, . %, . , and . , respectively. the proposed caps-ubi was trained on balanced data. the roc curve of caps-ubi on the test set is shown in figure , which indicates that its predictions were very close to the true labels.

figure . receiver operating characteristic curve of caps-ubi on the test set

the model was trained on balanced data; however, in an experimentally verified ubiquitination dataset and nonubiquitination dataset [ ], the ratio of positive peptides to negative peptides was : , so we also tested caps-ubi using natural-distribution data. the test results are shown in table . according to the test results, the performance was slightly worse than that under the balanced data.

table . results of testing caps-ubi under natural-distribution data
protein fragment | acc (%) | sn (%) | sp (%) | auc | mcc | positive–negative ratio
, . . . . . :
, , . . . . :
acc: accuracy of the model; sn: sensitivity of the model; sp: specificity of the model; auc: area under curve; mcc: matthews correlation coefficient

comparison with other methods

in the past years, many researchers have contributed to the prediction and research of protein ubiquitination sites. we compared the proposed model with other sequence-based prediction tools.
the corresponding data and results are shown in table , which shows that the performance of the caps-ubi model exceeded that of the best-performing deep learning model, deepubi, and several other prediction models. the accuracy, sensitivity, specificity, area under curve, and matthews correlation coefficient of caps-ubi were . , . , . , . , and . higher, respectively, than those of deepubi.

table . proposed caps-ubi compared with other methods
predictor | acc (%) | sn (%) | sp (%) | auc | mcc
ubipred | . | . | . | . | .
ubsite | . | . | , | – | –
cksaap_ubsite | . | . | . | . | .
ubiprober | – | . | . | . | .
iubiq-lys | . | . | . | – | .
deepubi | . | . | , | . | .
caps-ubi | . | . | . | . | .
acc: accuracy of the model; sn: sensitivity of the model; sp: specificity of the model; auc: area under curve; mcc: matthews correlation coefficient

conclusion and outlook

in this paper, a new deep learning model for predicting protein ubiquitination sites is proposed, using one-of-k and amino acid continuous coding modes. we used the largest available protein ubiquitination site dataset, and the experimental results above verify the effectiveness of this model. the operation of the model has four main steps: encoding protein sequences, constructing convolutional layers, constructing a capsule network layer, and constructing an output layer. the capsule network introduces a new building block for deep learning. relative to a cnn, the capsule network, which uses a dynamic routing mechanism to update parameters, requires more training time, but the time required for prediction is similar. the capsule network can also characterize the
complex relations among amino acids in various sequence positions and can explore the internal data distribution related to biochemical significance. the proposed caps-ubi prediction tool will facilitate the sequence analysis of ubiquitination and can also be used to identify other posttranslational modification sites in proteins. in the future, we will study other features that may better extract sample attributes to construct deeper models.

references

goldstein g, scheid m, hammerling u, schlesinger dh, niall hd, boyse ea. isolation of a polypeptide that has lymphocyte-differentiating properties and is probably represented universally in living cells. proc natl acad sci u s a. ; : - .
wilkinson kd. the discovery of ubiquitin-dependent proteolysis. proc natl acad sci u s a. ; : - .
hicke l, schubert hl, hill cp. ubiquitin-binding domains. nat rev mol cell biol. ; : .
hicke l. protein regulation by monoubiquitin. nat rev mol cell biol. ; : - .
pickart cm. ubiquitin enters the new millennium. mol cell. ; : - .
haglund k, dikic i. ubiquitylation and cell signaling. embo j. ; : - .
peng j, schwartz d, elias je, et al. a proteomics approach to understanding protein ubiquitination. nat biotechnol. ; : - .
gentry ms, worby ca, dixon je. insights into lafora disease: malin is an e ubiquitin ligase that ubiquitinates and promotes the degradation of laforin. proc natl acad sci u s a. ; ( ): - .
huang ch, su mg, kao hj, jhong jh, weng sl, lee ty. ubisite: incorporating two-layered machine learning method with substrate motifs to predict ubiquitin-conjugation site on lysines. bmc syst biol. ; suppl (suppl ): .
nguyen vn, huang ky, huang ch, lai kr, lee ty. a new scheme to characterize and identify protein ubiquitination sites. ieee/acm trans comput biol bioinform. ; : - .
qiu wr, xiao x, lin wz, chou kc. iubiq-lys: prediction of lysine ubiquitination sites in proteins by extracting sequence evolution information via a gray system model.
j biomol struct dyn. ; : - . . chen x, qiu jd, shi sp, suo sb, huang sy, liang rp. incorporating key position and amino acid residue features to identify general and species-specific ubiquitin conjugation .cc-by . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . / sites. bioinformatics. ; : - . . wang jr, huang wl, tsai mj, hsu kt, huang hl, ho sy. esa-ubisite: accurate prediction of human ubiquitination sites by identifying a set of effective negatives. bioinformatics. ; : - . . radivojac p, vacic v, haynes c, et al. identification, analysis, and prediction of protein ubiquitination sites. proteins. ; ( ): - . . lee ty, chen sa, hung hy, ou yy. incorporating distant sequence features and radial basis function networks to identify ubiquitin conjugation sites. plos one. ; :e . . wang d, zeng s, xu c, et al. musitedeep: a deep-learning framework for general and kinase specific phosphorylation site prediction. bioinformatics. ; : - . . shaw d, chen h, jiang t. deepisofun: a deep domain adaptation approach to predict isoform functions. bioinformatics. ; ( ): - . . sun, d. , wang, m. , feng, h. , & li, a. . ( ). prognosis prediction of human breast cancer by integrating deep neural network and support vector machine: supervised feature extraction and classification for breast cancer prognosis prediction. th international congress on image and signal processing, biomedical engineering and informatics (cisp-bmei). ieee. . fu h, yang y, wang x, wang h, xu y. deepubi: a deep learning framework for prediction of ubiquitination sites in proteins. bmc bioinformatics. ; : . . liu y, li a, zhao xm, wang m. 
deeptl-ubi: a novel deep transfer learning method for effectively predicting ubiquitination sites of multiple species. methods. ;s - ( ) - . . he f, wang r, li j, bao l, xu d, zhao x. large-scale prediction of protein ubiquitination sites using a multimodal deep architecture. bmc syst biol. ; (suppl ): . . huang y, niu b, gao y, fu l, li w. cd-hit suite: a web server for clustering and comparing biological sequences. bioinformatics. ; : - . . huang ch, su mg, kao hj, jhong jh, weng sl, lee ty. ubisite: incorporating two-layered machine learning method with substrate motifs to predict ubiquitin-conjugation site on lysines. bmc syst biol. ; suppl (suppl ): . . plewczynski d, tkacz a, wyrwicz ls, rychlewski l. automotif server: prediction of single residue post-translational modifications in proteins. bioinformatics. ; : - . . venkatarajan m s , braun w . new quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical–chemical properties[j]. molecular modeling annual, , ( ): - . .cc-by . international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . / . dombetzki la. an overview over capsule networks. network architectures and services . . sabour s , frosst n , hinton g e . dynamic routing between capsules[j]. . . hinton,g.e. et al. ( ) transforming auto-encoders. international conference on artifificial neural networks. springer, finland, pp. – . . lin m., chen q., yan s. network in network[j]. arxiv preprint arxiv: . , : . kingma,d. and ba,j. ( ) adam: a method for stochastic optimization, arxiv preprint arxiv: . .cc-by . 
international licensemade available under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is the copyright holder for this preprintthis version posted january , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . / triplex and other dna motifs show motif-specific associations with mitochondrial dna deletions and species lifespan triplex and other dna motifs show motif-specific associations with mitochondrial dna deletions and species lifespan. authors kamil pabis . georg august university of göttingen, göttingen, germany. mail: kamil.pabis@gmail.com abstract the “theory of resistant biomolecules” posits that long-lived species show resistance to molecular damage at the level of their biomolecules. here, we test this hypothesis in the context of mitochondrial dna (mtdna) as it implies that predicted mutagenic dna motifs should be inversely correlated with species maximum lifespan (mls). first, we confirmed that guanine-quadruplex and direct repeat (dr) motifs are mutagenic, as they associate with mtdna deletions in the human major arc of mtdna, while also adding mirror repeat (mr) and intramolecular triplex motifs to a growing list of potentially mutagenic features. what is more, triplex motifs showed disease-specific associations with deletions and an apparent interaction with guanine-quadruplex motifs. surprisingly, even though dr, mr and guanine-quadruplex motifs were associated with mtdna deletions, their correlation with mls was explained by the biased base composition of mtdna. only triplex motifs negatively correlated with mls even after adjusting for body mass, phylogeny, mtdna base composition and effective number of codons. 
taken together, our work highlights the importance of base composition for the comparative biogerontology of mtdna and suggests that future research on mitochondrial triplex motifs is warranted.

abbreviations
bps, mtdna deletion break points
dr, direct repeats
er, everted repeats
gq, guanine-quadruplexes
ir, inverted repeats
mls, species maximum lifespan
mr, mirror repeats
nbmst, non-b dna motif search tool
nc, number of effective codons
pgls, phylogenetic generalized least squares
sd, standard deviation
trip, triplex forming motif
xr, any repeat half-site or motif
mtdna, mitochondrial dna

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted january , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission.

introduction

macromolecular damage to lipids, proteins and dna accumulates with aging (richardson and schadt , gladyshev ), whereas cells isolated from long-lived species are resistant to genotoxic and cytotoxic drugs, giving rise to the multistress resistance theory of aging (miller , hamilton and miller ). by extension of this idea, the “theory of resistant biomolecules” posits that lipids, proteins and dna itself should be resilient in long-lived species (pamplona and barja ). in support of this theory, it was shown that long-lived species possess membranes that contain fewer lipids with reactive double bonds (valencak and ruf ) and perhaps a lower content of oxidation-prone cysteine and methionine in mitochondrially encoded proteins (see aledo et al. for a discussion).
mitochondrial dna (mtdna) mutations constitute one type of macromolecular damage that accumulates over time. point mutations accumulate in proliferative tissues like the colon and in some progeroid mice (kauppila et al. ), while the accumulation of mtdna deletions in postmitotic tissues may underpin certain age-related diseases like parkinson’s and sarcopenia (lawless et al. , bender et al. ). if the theory of resistant biomolecules can be generalized, the mtdna of long-lived species should resist both point mutation and deletion formation. however, we will focus on deletions because they are more pathogenic than point mutations at the same level of heteroplasmy (gamamge et al. ) and human tissues do not accumulate high levels of point mutations observed in progeroid mouse models (khrapko et al. ). since deletion formation depends on the primary sequence of the mtdna (sequence motifs) it is amenable to bioinformatic methods. ever since a link between direct repeat (dr) motifs and deletion formation became known, variations of the theory of resistant biomolecules have been tested, although not necessarily under this name. it was reasoned that long-lived species evolved to resist deletion formation and mtdna instability by reducing the number of mutagenic motifs in their mtdna (khaidakov et al. , yang et al. ). we aim to extend these findings by re-evaluating and establishing new candidate motifs, which we then correlate with species maximum lifespan (mls). studying multiple motif classes at once also allows us to reveal relationships between potentially overlapping mtdna motifs that may affect the data. we define candidate motifs as those that are associated with deletion formation inside the major arc of human mtdna, because during asynchronous replication the major arc is single stranded for extended periods of time (persson et al. ) which should favor the formation of secondary structures. 
finally, we test if these motifs correlate with the mls of mammals, birds and ray-finned fishes after correcting for potential biases, especially global mtdna base composition, which is an important confounder (aledo et al. ) yet is neglected in some studies (yang et al. ).

the choice of motifs to study is based on biological plausibility and published literature, which is briefly reviewed below. mutagenic motifs include repeats as well as guanine-quadruplex (gq)- and triplex-forming motifs. dr motifs can lead to dna instability through strand-slippage if two dr motifs mispair during replication (persson et al. ). inverted repeat (ir), g-quadruplex and triplex motifs instead destabilize progression of the replication fork through the formation of stable secondary structures. these structures include hairpins for ir motifs (tremblay-belzile et al. ), triple-stranded dna for triplex motifs and bulky stacks of guanines for g-quadruplex motifs (bacolla et al. ; fig. ). mirror repeat (mr) and everted repeat (er) motifs, in contrast, do not allow stable watson-crick base pairing and are thus less likely to be mutagenic, although a subset of mr motifs may form triplex structures (kamat et al. ).

thus, many motifs can be mutagenic in principle, but what is the evidence that these motifs are related to mtdna instability, particularly deletions, and mls? paradoxically, although drs are the motif most consistently associated with mtdna deletion breakpoints (bps), and despite preliminary reports of a link with lifespan (khaidakov et al. , lakshmanan et al. , yang et al. ), no correlation with species mls was seen in recent studies (lakshmanan et al. ). in contrast, with the exception of one preprint (mikhailova et al.
), irs are not known to be associated with mtdna deletions (dong et al. ), although they do show a negative relationship with species mls (yang et al. ) and may contribute to inversions (tremblay-belzile et al. ). whether age-related mtdna inversions underlie any pathology, however, requires further study. finally, g-quadruplex motifs are associated with both deletions (dong et al. ) and point mutations (butler et al. ), but no study has tested if they correlate with mls. triplex motifs are poorly studied, with one report finding no association between these motifs and deletions (oliveira et al. ).

based on these studies we decided to test the theory of resistant biomolecules by quantifying dr, mr, ir, er, g-quadruplex- and triplex-forming motifs. we stipulate that if a motif class played a causal role in aging, it should be involved in deletion formation and its abundance should be negatively correlated with species mls.

figure
a. direct repeat, both half-sites have the same orientation.
b. inverted repeat, the half-sites are complementary and have mirror symmetry.
c. everted repeat, the half-sites are complementary.
d. mirror repeat, the half-sites have mirror symmetry.
e. triplex motifs can form a triple helical dna structure also called h-dna.
f. in a g-quadruplex multiple g-quartets (depicted as blue rectangles) stack on top of each other.
adapted from gurusaran et al. ( ) and khristich and mirkin ( ) with permission. half-sites shown in red.

methods

detection of dna motifs

repeats were detected by a script written in r (vr- . . ). briefly, to find all repeats with n base pairs (bp), the mtdna light strand is truncated by to n bp and each of the n truncated mtdnas is then split every n bp.
this generates every possible substring (and thus repeat) of length n. in the next step, duplicate strings are removed. afterwards we can find dr (a substring with at least two matches in the mtdna), mr (at least one match in the mtdna and on its reverse), ir (at least one match in the mtdna and on its reverse-complement) and er motifs (at least one match in the mtdna and on its complement). overlapping and duplicate repeats were not counted for the correlation between repeats and mls. the code for the analyses performed in this paper can be found on github (pabisk/aging_triplex ). unless stated otherwise, all analyses were performed in r. g-quadruplex motifs were detected by the pqsfinder package (v . . , hon et al. ). intramolecular triplex-forming motifs were detected by the triplex package (v . . , hon et al. ) and duplicates were removed. we also compared the data with two other publicly available tools, triplexator (buske et al. ), and with the non-b dna motif search tool (nbmst; cer et al. ). triplexator was run on a virtual machine in an oracle vm virtualbox (v . ) in -ss mode on the human mitochondrial genome and its reverse complement, the results were combined and overlapping motifs from the output were removed. we used the web interface of nbmst to detect mirror repeats/triplexes (v . ). association between motifs and major arc deletions the major arc was defined as the region between position and of the human mtdna (nc_ . ). the following deletions and their breakpoints were located in this region and included: deletions from the mitobreak database (damas et al. , mtdna breakpoints.xlsx), from persson et al. ( ) and from hjelm et al. ( ). each deletion is defined by two breakpoints. a breakpoint pair was considered to associate with a motif if the motif fell within a defined window around one or both breakpoints, depending on the analysis. 
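the four repeat definitions given in the motif-detection paragraph above can be illustrated with a short sketch. the authors' script is written in r; the python below is only an assumed re-implementation of the stated definitions, and the function names (`kmers`, `classify_repeats`) and the toy sequences are hypothetical. it uses a sliding window rather than the truncate-and-split procedure, which should yield the same set of substrings of length n.

```python
from collections import Counter

# Translation table for the complement of a DNA string (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmers(seq, n):
    """Return every substring of length n, using overlapping windows."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def classify_repeats(seq, n):
    """Map each unique n-mer of seq to the repeat classes it belongs to.

    DR: at least two matches in the sequence itself.
    MR: a match in the sequence and on its reverse.
    IR: a match in the sequence and on its reverse-complement.
    ER: a match in the sequence and on its complement.
    """
    counts = Counter(kmers(seq, n))
    complement = seq.translate(COMPLEMENT)
    on_reverse = set(kmers(seq[::-1], n))
    on_complement = set(kmers(complement, n))
    on_revcomp = set(kmers(complement[::-1], n))

    classes = {}
    for kmer, count in counts.items():
        labels = set()
        if count >= 2:
            labels.add("DR")
        if kmer in on_reverse:
            labels.add("MR")
        if kmer in on_revcomp:
            labels.add("IR")
        if kmer in on_complement:
            labels.add("ER")
        if labels:
            classes[kmer] = labels
    return classes
```

this naive version does not exclude repeats whose half-sites overlap (for example a substring that is its own mirror image), which the text states were not counted, so an additional position-based filter would be needed before the counts could be compared with those in the paper.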
the window size was chosen in relation to the length of the studied motifs ( bp for repeats and bp for other motifs). three different motif orientations relative to the breakpoints were considered. for motifs with half-sites (i.e. repeats) there were two orientations: either both half-sites at any one breakpoint of a deletion, or one half-site per breakpoint of a deletion. motifs with overlapping half-sites were not counted. in the third case, distinct g-quadruplex and triplex motifs could associate with one or both breakpoints of a deletion, but were counted at most once, since the latter case is sufficiently rare.

in order to exclude overlapping “hybrid” motifs, mr and dr motifs with the same sequence were removed, whereas triplex and g-quadruplex motifs were removed if they were in proximity. to generate controls, the mtdna deletions as a whole were randomly redistributed inside the major arc which, because of the fixed deletion size, allowed us to approximate the original distribution of breakpoints (as suggested by oliveira et al. ). significance was determined via one-sample t-test in prism (v . ) by comparing actual breakpoints to such randomized controls. alternative controls were generated by shifting each breakpoint by bp towards the midpoint of the major arc or as in fig. s .

cancer associated breakpoints

we obtained all autosomal breakpoints available from the catalogue of somatic mutations in cancer (cosmic; release v , th august ), which includes deletions, inversions, duplications and other abnormalities (n= in total). after removing breakpoints whose sequences could not be retrieved (< .
%), we quantified the number of predicted g-quadruplex and triplex motifs in a bp window centered on the breakpoints using default settings for the detection of these motifs. sequences of breakpoint regions were obtained from the grch build of the human genome using the bsgenome package (v . . ). each breakpoint shifted by + bps served as its own control. lifespan, base composition and life history traits we included three phylogenetic classes in our analysis for which we had sufficient data (n> ), mammals, birds and ray-finned fishes (actinopterygii). mls and body mass were determined from the anage database (tacutu et al. ) and, for mammals, supplemented with data from pacifici et al. ( ). the mtdna accessions were obtained from an updated version of mitoage (unpublished; toren et al. ). species were excluded if body mass data was unavailable, if the sequence could not be obtained using the genbankr package (v . . ), or if the extracted cytochrome b dna sequence did not allow for an alignment, precluding phylogenetic correction. the species data can be found in the supplementary (species data.xlsx). we analyzed the full mtdna sequence, heuristically defined as the mtdna sequence between the first and last encoded trna, excluding the d-loop, which is rarely involved in repeat-mediated deletion formation (yang et al. ). the effective number of codons was calculated using wright’s nc (smith et al. ). base composition was calculated for the light-strand. gc skew was calculated as the fraction (g − c)/(g + c) and at skew as (a − t)/(a + t). all correlations are pearson’s r. partial correlations were performed using the ppcor package (v . ). phylogenetic generalised least squares and phylogenetic correction observed correlations between traits and lifespan can be spurious due to shared species ancestry (speakman ). to correct for this, we use phylogenetic generalised least squares (pgls) implemented in the caper package (v . . ). 
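the skew definitions given in the base composition paragraph above translate directly into code. the following python sketch is illustrative only, since the authors worked in r; the function names `base_composition` and `_ratio` and the example sequence are hypothetical, and gc content is assumed to be the usual fraction (g + c) / (a + c + g + t).

```python
from collections import Counter


def _ratio(numerator, denominator):
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def base_composition(light_strand):
    """Return GC content, GC skew and AT skew for a DNA sequence.

    GC skew = (G - C) / (G + C) and AT skew = (A - T) / (A + T),
    as defined in the text. Symbols other than A, C, G, T are ignored.
    """
    counts = Counter(light_strand.upper())
    a, c, g, t = (counts[base] for base in "ACGT")
    return {
        "gc_content": _ratio(g + c, a + c + g + t),
        "gc_skew": _ratio(g - c, g + c),
        "at_skew": _ratio(a - t, a + t),
    }


example = base_composition("AAATGGGC")
print(example)
```

for the toy sequence above (three a, one t, three g, one c) all three values equal 0.5. these per-species values are the kind of covariates that the adjusted models described in the text would take as input.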
species phylogenetic trees were constructed via neighbor joining based on aligned cytochrome b dna sequences using clustal omega from the msa package (v . . ). in the resulting mammalian and bird trees, four branch edge lengths were equal to zero; these were set to the lowest non-zero value in the dataset.

results

direct repeats and mirror repeats are over-represented at mtdna deletion breakpoints

in order to define candidate mtdna motifs that could be linked with lifespan, we started by reanalyzing motifs that associate with mtdna deletion breakpoints reported in the mitobreak database (damas et al. ; fig. s ; mtdna breakpoints.xlsx). in the analysis below, we consider dr and ir motifs, which are thought to be mutagenic, as well as mr and er motifs, which are so far not known to be mutagenic, and we pool all to bp long repeats, since the data is similar between different repeat lengths (fig. s ). as shown by others, we found that dr motifs often flank mtdna deletions (fig. a). in contrast, no strong association was seen for er and ir motifs, even considering a larger window around the breakpoint to allow for the fact that irs could bridge and destabilize mtdna over long distances (persson et al. ; fig. s ).

surprisingly, we also found mr motifs flanking deletion breakpoints more often than expected by chance (fig. a). however, dr and mr motifs are known to correlate with each other (shamanskiy et al. ; fig. b) and indeed we noticed a large sequence overlap between mr and dr motifs (fig. b), which could explain an apparent over-representation of mrs at breakpoints. removal of overlapping mr-dr hybrid motifs confirmed this suspicion. after this correction, the degree of enrichment was strongly attenuated (fig.
c) and the total number of breakpoints flanked by mr motifs was reduced by > %. nevertheless, long mr motifs remained particularly over-represented around deletions (fig. s ).

since the prior analysis only considered motifs that flank both breakpoints, we next tested the idea that ir and other motifs could be mutagenic if both half-sites are found at any of the breakpoints. however, in this analysis no motif class showed enrichment around breakpoints (fig. d).

figure
direct repeat (dr) and mirror repeat (mr) motifs are significantly enriched around actual deletion breakpoints (bps) compared to reshuffled bps, but the same is not true for inverted repeat (ir) and everted repeat (er) motifs (a, d). the surprising correlation between mr motifs and deletion bps is attenuated when mrs that have the same sequence as dr motifs are removed (b, c). controls were generated by reshuffling the deletion bps while maintaining their distribution (n= , mean ±sd shown). the schematic drawings above (a, d) depict the orientation of the repeat (xr) half-sites in relation to the bps. *** p < . ; ** p < . by one sample t-test.
a) the number of deletions associated with dr, mr, ir or er motifs at both bps compared with reshuffled controls.
b) venn diagram showing the number of mr, dr and hybrid mr-dr motifs that were identified within the major arc.
c) the number of deletions associated with mr motifs, before (mr) and after removal of hybrid mr-dr motifs (mrdr-), compared with reshuffled controls.
d) the number of deletions associated with dr, mr, ir or er motifs at either bp compared with reshuffled controls.
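the reshuffled controls used in this figure can be sketched as follows. each deletion is moved as a whole to a random position inside the major arc, which keeps its size fixed, and the number of deletions with a motif near a breakpoint is then recounted. this python sketch is an illustration under stated assumptions, not the authors' r code: the function names, the arc coordinates, the window size, the number of controls and the toy data are all hypothetical, and only the "motif at either breakpoint" case is shown.

```python
import random


def count_associated(deletions, motif_positions, window):
    """Count deletions with a motif within `window` bp of either breakpoint."""
    hits = 0
    for start, end in deletions:
        if any(abs(m - start) <= window or abs(m - end) <= window
               for m in motif_positions):
            hits += 1
    return hits


def reshuffle(deletions, arc_start, arc_end, rng):
    """Move each deletion to a random place in the arc, keeping its size."""
    shuffled = []
    for start, end in deletions:
        size = end - start
        new_start = rng.randint(arc_start, arc_end - size)
        shuffled.append((new_start, new_start + size))
    return shuffled


def control_counts(deletions, motif_positions, window,
                   arc_start, arc_end, n_controls=20, seed=0):
    """Return the association count for each of n_controls reshuffled sets."""
    rng = random.Random(seed)
    return [
        count_associated(reshuffle(deletions, arc_start, arc_end, rng),
                         motif_positions, window)
        for _ in range(n_controls)
    ]


# Toy data: two deletions (start, end) and two motif midpoints.
toy_deletions = [(100, 500), (200, 900)]
toy_motifs = [105, 2000]
observed = count_associated(toy_deletions, toy_motifs, window=10)
controls = control_counts(toy_deletions, toy_motifs, 10, 0, 5000)
```

the observed count would then be compared with the distribution of control counts, for example with a one-sample t-test as described in the methods.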
predicted triplex-forming motifs are over-represented at mtdna breakpoints

given the association between mr motifs and breakpoints we decided to analyze triplex motifs, a special case of homopurine and homopyrimidine mirror repeats (khristich and mirkin , bissler ), and their association with deletion breakpoints in the mitobreak database.

here, we use the triplex package to predict intramolecular triplex motifs because it has several advantages compared to other software (hon et al. ). for example, using the nbmst tool, as in a previous study of mtdna instability (oliveira et al. ), we identified only two potential triplex motifs within the major arc, and these did not overlap with the six motifs identified by the triplex package (table s ). in contrast, using triplexator (buske et al. ) we were able to detect four of the six triplex motifs and the motifs detected by triplexator were also enriched at breakpoints (table s ).

we noticed that predicted triplexes are g-rich and thus could be related to g-quadruplex motifs (doluca et al. ). in a comparison of the two motif types, however, we found several differences (table s , s ). triplex motifs were shorter and less abundant than predicted g-quadruplexes, associated with fewer breakpoints altogether (fig. ) and, in contrast to g-quadruplexes, which were almost exclusive to the g-rich mtdna heavy-strand, triplex motifs were also common on the light-strand.

the six triplex motifs detected by the triplex package were significantly enriched around deletion breakpoints, and when we excluded triplex-g-quadruplex hybrid motifs the result was attenuated but remained significant (fig. a).
given the higher risk of spurious findings with only six motifs, we repeated the analysis using a relaxed definition of triplex and the results were fundamentally unchanged (fig. b). furthermore, our results were not sensitive to reasonable changes in the size of the search window around breakpoints (fig. s a, b), motif quality scores (fig. s c, d) or inclusion of overlapping motifs (fig. s e-g).

analogous to the situation with mr motifs, we tested if overlapping triplex-dr hybrid motifs could bias our results. given the rarity of triplex motifs and the many drs in the mitochondrial genome, we chose an alternative approach rather than excluding triplex motifs that overlapped any dr half-site. we compared the fraction of triplex and g-quadruplex positive deletions associated with drs (gq+, dr+ and trip+, dr+) and not associated with drs (gq+, dr- and trip+, dr-). we considered a deletion to be dr+ if both breakpoints were flanked by the same dr sequence. in this case, only % of trip+ deletions associated with drs whereas % of gq+ deletions did (table s ).

figure
triplex motifs are significantly enriched around actual breakpoints (bps) compared to reshuffled bps (a, b) even after removal of g-quadruplex (gq)-triplex hybrid motifs (tripgq-). the number of unique triplex motifs, gq motifs and of hybrid triplex-gq motifs, within the mtdna major arc, is shown in the venn diagrams above (a, b). enrichment of gq motifs around bps is shown for comparison in (c). controls were generated by reshuffling the deletion bps while maintaining their distribution (n= , mean ±sd shown). the schematic drawing above (c) depicts the orientation of the gq and triplex motifs (xr) in relation to the bps. *** p < .
by one sample t-test. a) the number of deletion bps associated with triplex motifs compared with reshuffled controls. analysis including (left side) or excluding triplex-gq hybrid motifs (right side). b) same as (a) but with relaxed criteria for the detection of triplex motifs (min score= ) and gq motifs (min score= ). c) the number of deletion bps associated with gq motifs compared with reshuffled controls. relaxed settings (left side, min score= ) and default settings (right side, min score= ). triplex forming motifs may be associated with mitochondrial disease breakpoints next, we sought to validate our findings on two recently published next generation sequencing datasets (hjelm et al. , persson et al. ; mtdna breakpoints.xlsx; table s ). we were able to confirm the enrichment of dr (fig. s a, s a), mr (fig. s a, s a) and g-quadruplex motifs (fig. a, b; s c, d) around deletion breakpoints. additionally, we confirmed that hybrid mr-dr motifs are responsible in large part for the enrichment of mr motifs around breakpoints (fig. s b, s b). in contrast, we found that triplex motifs were not consistently enriched around breakpoints in the dataset of hjelm et al. (fig. s c, d), which is based on post-mortem brain samples from patients without overt mitochondrial disease, whereas we saw enrichment in the dataset by persson et al. (fig. a, b), which is based on muscle biopsies from patients with mitochondrial disease. this unexpected discrepancy prompted us to take a second look at the mitobreak data. in this dataset triplex motifs were significantly more enriched at breakpoints in the mtdna single deletion subgroup compared to the healthy tissues subgroup (fig. s ). in addition, we found more broadly that mitochondrial disease status might explain the heterogenous results across datasets we have seen (fig. c). further strengthening our findings, triplex motifs were enriched in the mitobreak and persson et al. 
dataset regardless of the breakpoint shuffling method chosen and of our statistical assumptions (fig. s ). what is more, triplex motifs were also enriched at breakpoints when we pooled all three datasets (fig. d), although to a lesser extent. finally, g-quadruplex motifs close to triplex motifs were more strongly enriched at deletion breakpoints than solitary g-quadruplex motifs (fig. e; fig. s ), suggesting that triplex formation may further contribute to dna instability.

figure
in the persson et al. ( ) dataset, triplex and g-quadruplex (gq) motifs are enriched around deletion breakpoints (bps), using either default (a) or relaxed scoring criteria (b). although triplex motifs predominate in mitochondrial disease datasets (c), we also find that triplex motifs are significantly enriched around bps (d) after pooling the data from mitobreak, persson et al. ( ) and hjelm et al. ( ). finally, gq and triplex motifs in proximity show stronger enrichment around bps than either motif type in isolation (e). controls were generated by reshuffling the deletion bps while maintaining their distribution (n= , mean ±sd shown). the schematic drawing above (d) depicts the orientation of the motifs (xr) in relation to the bps. *** p< . , ** p< . by one sample t-test.
a) the number of deletion bps associated with gq and triplex motifs compared with reshuffled controls (min score = default).
b) the number of deletion bps associated with gq and triplex motifs compared with reshuffled controls (min score = relaxed).
c) the number of deletion bps associated with triplex motifs (relaxed settings, min score= ) stratified by mitochondrial disease status. mitobreak data includes single and multiple mitochondrial deletion syndromes.
d) the number of deletion bps associated with triplex motifs, or with triplex motifs excluding triplex-gq hybrid motifs (tripgq-), compared with reshuffled controls. default settings (left side, min score= ) and relaxed settings (right side, min score= ).
e) the fold-enrichment of gq and triplex motifs around deletion bps is shown. motifs were considered overlapping if their midpoints were within bp.

repeats and lifespan: no support for the theory of resistant biomolecules

for our analysis, we focus on bp long repeat motifs because short repeats are less likely to allow stable base pairing, longer repeats are rare (fig. s ), and results considering repeat motifs of different lengths usually agree with each other (table s ; yang et al. ). to allow comparability with other studies (lakshmanan et al. ) we analyzed non d-loop motifs, but results for major arc motifs are numerically similar (table s ).

first, consistent with yang et al. ( ), we found that ir motifs show a negative correlation with the mls of mammals in the unadjusted model. in addition, we identified er motifs, a class of symmetrically related repeats, that show an even stronger inverse relationship with longevity (fig. a; table ). however, these inverse correlations vanished after taking into account body mass, base composition and phylogeny in a pgls model (table ).

second, in agreement with lakshmanan et al. ( ), we found that dr motifs do not correlate with the mls of mammals. the same was true for the symmetrically related mr motifs. just as with ir motifs, modest inverse correlations vanished in the fully adjusted model (table ).
we also found the same null results in two other vertebrate classes, birds and ray-finned fishes (table s ). to gain hints as to causality, we finally tested if longer repeats, allowing more stable base pairing, show stronger correlations with mls, but to our surprise we noticed the opposite (fig. s a-d).

considering all four types of repeats together, we noticed that repeats with both half-sites on the same strand (dr and mr) or with half-sites on opposite strands (ir and er) were correlated with each other (fig. b) and with the same mtdna compositional biases (fig. c). thus, for dr and mr motifs, an apparent relationship with mls may be explained by their inverse relationship with gc content and for ir and er motifs by an inverse relationship with gc content and a positive relationship with gc skew.

figure the number of everted repeat (er) motifs is negatively correlated with species mls in an unadjusted analysis (a). repeats with a similar orientation correlate with each other (b). direct repeat (dr) and mirror repeat (mr) motifs have a similar orientation since both half-sites are found on the same strand, whereas in the case of er and inverted repeat (ir) motifs the half-sites are on opposite strands. finally, we show the major mtdna compositional biases that co-vary with the four repeat classes (c) and may explain an apparent correlation with mls. data are for bp long repeats and pearson’s r is shown in (a-c).

table . correlation between potentially mutagenic motifs and species lifespan
motif type        raw    adjusted
dr bp             - .    .
mr bp             - .    - .
ir bp             - .    .
er bp             - .    - .
triplex default   - .    - . **
triplex relaxed   - .    - . ^
gq default        .      .
gq relaxed        .      - .
** the adjusted model takes into account body mass, gc content, gc skew, at skew and number of effective codons. significant correlations in the raw or adjusted model are bolded/underlined (p< . ). the pgls model additionally considers phylogeny. ^denotes p-values of .
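the difference between the raw and adjusted columns can be illustrated with a minimal sketch on synthetic data. it assumes a single hypothetical confounder (labelled gc content here) that drives both motif counts and lifespan, and it adjusts by correlating ordinary least-squares residuals. this is a simplification and not the pgls model described above, since it ignores phylogeny; all variable names, effect sizes and the sample size are invented for illustration.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def residualize(y, covariates):
    """Residuals of y after an ordinary least-squares fit on the covariates (plus intercept)."""
    design = np.column_stack([np.ones(len(y)), covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef


def adjusted_r(x, y, covariates):
    """Partial correlation of x and y after removing the linear effect of the covariates."""
    return pearson_r(residualize(x, covariates), residualize(y, covariates))


# Synthetic, hypothetical data: one confounder drives both variables,
# and there is no direct link between motif count and lifespan.
rng = np.random.default_rng(0)
n_species = 500
gc_content = rng.normal(size=n_species)
motif_count = -0.8 * gc_content + rng.normal(scale=0.5, size=n_species)
log_mls = 0.8 * gc_content + rng.normal(scale=0.5, size=n_species)

raw = pearson_r(motif_count, log_mls)
adjusted = adjusted_r(motif_count, log_mls, gc_content)
print(f"raw r = {raw:.2f}, adjusted r = {adjusted:.2f}")
```

in this toy setting the raw correlation is strongly negative while the adjusted correlation is close to zero, which mirrors the pattern reported for the repeat motifs. a phylogenetically informed analysis would additionally model the covariance between related species.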