key: cord-0812562-kevrp8rg authors: Luczak-Roesch, Markus title: Networks of information token recurrences derived from genomic sequences may reveal hidden patterns in epidemic outbreaks: A case study of the 2019-nCoV coronavirus. date: 2020-02-11 journal: nan DOI: 10.1101/2020.02.07.20021139 sha: d5f14e63b86239f491c9dd1749cff01dc4d633cb doc_id: 812562 cord_uid: kevrp8rg Profiling the genetic evolution and dynamic spreading of viruses is a crucial task when responding to epidemic outbreaks. We aim to devise novel ways to model, visualise and analyse the temporal dynamics of epidemic outbreaks in order to help researchers and other people involved in crisis response to make well-informed and targeted decisions about from which geographical locations and time periods more genetic samples may be required to fully understand the outbreak. Our approach relies on the application of Transcendental Information Cascades to a set of temporally ordered nucleotide sequences and we apply it to real-world data that was collected during the currently ongoing outbreak of the novel 2019-nCoV coronavirus. We assess information-theoretic and network-theoretic measures that characterise the resulting complex network and suggest touching points and temporal pathways that are of interest for deeper investigation by geneticists and epidemiologists. How far a virus will eventually spread at local, regional, national and international levels is of central concern during an epidemic outbreak due to the severe consequences large-scale epidemics have on human well-being [5] , the stability and coherence of social systems [11] , and the global economy [23] . A thorough understanding of the genetic characteristics of a virus is crucial to understand the way it is (or may be) transmitted (e.g. human-to-human transmissibility) in order to be able to anticipate the potential reach of an outbreak and develop effective countermeasures. The recent outbreak of a new type of coronavirus -2019-nCoV -provides plenty of examples of the way researchers seek to quickly approach the aforementioned problems of genetic decomposition of the virus [2, 8] and modelling of its transmission dynamics [15] . Here we present a novel approach to gain insights into the transmission dynamics of an epidemic outbreak. Our method is based on Transcendental Information Cascades [16, 17] and combines gene sequence analysis with temporal data mining to uncover potential relationships between a virus' genetic evolution and distinct occurrences of infections. We apply this new approach to publicly available genomic sequence data from the currently ongoing outbreak of the 2019-nCoV coronavirus. Our results show pathways of similarity between certain clusters of 2019-nCoV virus samples taken in different geographical locations, and indicate coronavirus cases that are candidates for further investigation into the human touch points of these patients in order to derive a detailed understanding of how the virus may have spread, and how and why it has genetically evolved. Our analysis provides a unique view to the dynamics of the 2019-nCoV coronavirus outbreak that complements knowledge obtained using other state-of-the-art methods such as Bayesian evolutionary analysis [9] . We obtained a total of 87 genomic sequences from human subjects that were available via the GISAID EpiFlu Database™ 1 as of Monday, February 10th 2020. These data were preprocessed using the R programming language. First, we derived a metadata object that allows to store the sequence identifier, the collection location, the collection date, and the raw nucleotide sequence. The metadata object was then filtered to keep only those sequences for which the collection date was not empty and which featured a length that fell within a margin of 1, 000 of the estimated 30, 000 nucleotides that are currently assumed to make up the genomic code of the 2019-nCoV coronavirus (i.e. we cover the range of nucleotide sequences from 29, 000 to 31, 000 in length). This left us with a total of 82 genomic sequences. We then exported a single CSV file containing only the raw nucleotide sequences in ascending order by the collection date of the sample. This structure is the standard input format for the genes-CODON-samplesequence tokeniser featured in the Transcendental Information Cascades R toolchain that is available as free and open scientific software 2 . Transcendental Information Cascades (TICs) are a method that first transforms any kind of sequential data into a network of recurring information tokens, to then exploit this network as a kind of analytical middle ground between views to the source data that would otherwise only be accessible individually through distinct analytical frameworks [16, 17] . The approach falls into the category of contemporary approaches that make use of network theory for the study of dynamical complex systems [4, 19, 21] . In the field of phylogenetic and phylogeographic analysis our approach can be compared to haplotype networks for example [1, 18] but with the benefit that the resulting network preserves access to temporal patterns at both macroscopic (e.g. whole generation time periods) and microscopic (e.g. short time period bursts) resolution. At an abstract level a Transcendental Information Cascade is constructed from some Figure 1 . Example of the unique codon tokenisation used in this study. The example sequence comprises three window steps and shows how the +1 (marked in blue), +2 (marked in orange) and +3 (marked in black) reading frames lead to seven unique codon tokens extracted from the nucleotide sequence. sequentially ordered source data in the following way: At first the data from each distinct sequence step is processed to extract unique information tokens. These tokens form the identifier set for that particular sequence step. Then a directed network is generated with one vertex for each sequence step and edges between any two vertices that have a particular token from their identifier set in common and that does not occur in any identifier set of vertices that represent sequence steps between these two. In other words, identifier sets represent information token co-occurrences and edges represent information token recurrences. As the information tokens of interest in this study we encode unique codon identifiers as (a) their position when sliding a window of size 3 in steps of size 3 over the nucleotide sequence (e.g. "pos1", "pos2", ...), (b) a flag that indicates the reading frame ("+1", "+2" or "+3") that captures the respective triplet in the current window, and (c) the actual matched nucleotide triplet that constitutes the matched codon. The use of this tokenisation is motivated by previous work on phylogenetic profiling and gene sequencing [6, 13] that suggested that codons suit well for assessing similarities (and dissimilarities) at the gene, chromosome, or genome levels [6, 14, 22] . An example of the particular encoding we use is visualised in Figure 1 . It is this tokenisation of the original nucleotide sequences that we use in the TIC approach to construct the information token recurrence network from. This means in our TIC instantiation the nodes represent distinct nucleotide sequences of the virus obtained from patient samples ordered by the date of their collection by researchers (random order in case of same date collections). The edges are unique codon identifier recurrences matched between different nucleotide sequences. The TIC network we derive in the way described before can be analysed in a variety of ways. Information-theoretic and network-theoretic measures have been introduced as the base framework to start the analysis from [16] . In particular we perform (1) an analysis of the information token entropy and evenness over time, (2) an analysis of network clusters of sequences that can be detected in the TIC network, and (3) an analysis of the intra-cluster and inter-cluster nucleotide similarity of the original sequences. We perform an analysis of information token entropy and information token evenness in order to understand whether the virus evolution can be considered an open or a closed system. Therefore we assess both measures in an accumulated fashion for each progression step through the ordered nucleotide sequences. In a closed system the information token entropy should strive towards an equilibrium of maximum entropy and not feature any wave-like up and down patterns. If we observe wave-like patterns it means nucleotide sequences at later progression stages feature codon identifiers that have not been observed in earlier nucleotide sequences. So to measure information token entropy we assess which unique codon identifiers were seen up to a progression step and how often these were seen. This allows to compute the probability for every codon identifier, which then can be used to determine the information token entropy up to that point following the definition of Shannon entropy given as follows: Network analysis The Transcendental Information Cascade network we constructed can be mathematically and visually analysed in order to find structures that are of particular significance for a single epidemic outbreak or to compare different outbreaks structurally. In this study we focus on the former due to the lack of comparative data about other epidemic outbreaks. We first perform a cluster analysis using the random walk apporach by Blondel et al. [3] on the weighted network that we can construct from our original TIC network by collapsing all edges between the same vertices to unique edges weighted by the sum of the collapsed edges (see Figure 2 for a schematic example and Figure 4 for the actual networks we construct following this approach). Afterwards, we analyse visually the network using the open source software Gephi. In Gephi we scale vertices by their degree and edges by their weight. Furthermore, we colour vertices by their cluster membership. We then run the Yifan Hu [12] layout algorithm (optimal distance: 10,000; relative strength: 0.2; initial step size: 20; step ratio: 0.95; quadtree max level: 50; theta: 0.8; convergence threshold: 1 * 10 −4 ; adaptive cooling: enabled) to visualise the network. We repeat the exact same process to construct and visualise three further networks (cf. Cluster evaluation In order to evaluate the meaningfulness of the TIC network against some baseline, we computer the intra-cluster and inter-cluster similarity of the raw nucleotide sequences. This is done by computing all pairwise comparisons of sequences within individual clusters (intra-cluster similarity) and comparison between all sequences from a cluster with all sequences that are not within that cluster (inter-cluster similarity) using the compareStrings function provided by the Biostrings R package [20] . The similarity of two sequences is then simply defined as follows: number of dif f erent nucleotides length of compared nucleotide sequence It is important to note that not all sequences we obtained had the exact same length (mean length of nucleotide sequences: 29861.95; standard deviation: 28.40705). Hence, we performed this comparison only on the sub-sequence from the first nucleotide position to the maximum nucleotide position of the shorter of the two sequences a and b. Information-theoretic analysis Figure 3 shows the graphs for the information token entropy and information token Informationtheoretic assessment. Both entropy and evenness graphs feature wave-like patterns, indicating that the set of observed unique codon identifiers grows with increasing progression stages while the distribution of all observed unique codon identifiers remains rather even. evenness. As one can see, these feature the aforementioned wave-like patterns, which means the system can be regarded open in the sense that over time the number of unique codon identifiers grows. It is important to note that the evenness of the distribution of codon identifiers is relatively high, converging around a value of 0.95. This means that those codon identifiers that have been observed are quite evenly distributed. These results allow the interpretation that (a) there is evolving variation in the genomic sequences we studied, but (b) there is no indication for the dominance of a particular variance. The weighted network derived from the original TIC network features a modularity of 0.762 and a diameter of 7, which mainly reflects its small size and its particular structure that results from preserving temporal order in the sense that edges can only ever direct forward in time (i.e. to vertices that represent a later stage). However, with an average degree of 8.671 we find that there is certainly structure that differs from just simple step-by-step forward paths that lead to the consecutively next node. This latter result suggests that some codon paths branch and merge, emphasising that there is an evolving variation in the studied genomic sequences. From the cluster analysis in combination with a visual inspection of the resulting graph (please refer to Figure 6 as well as Table 11 in the appendix) we find the following patterns that are noteworthy: • At the center of the network one cluster emerges (dark red, cluster 0) that has only few stronger connections within the cluster but features connections to all other clusters and within itself through lower weighted edges. This cluster has no temporal focus, it comprises samples that were collected early as well as later during the outbreak. While the genome sequences that represent the beginning of the progression stage path in this cluster are from China, it then transitions geographically to Australia, Japan, Korea, Germany, Taiwan and Singapore. • The light red cluster (cluster 1) comprises a coherent sequence of samples taken from patients in the original outbreak region (Wuhan) and is temporally focused on the early progression stages. • The orange cluster starting from the bottom center of the network (cluster 2) is again first a sequence of samples from the original outbreak region in Wuhan, transitioning to Thailand (Nonthaburi), then back to China (Guangdong), before finishing in France. • Cluster 4 (light green) is one of the smallest clusters but is spread fairly wide temporally. It features 4 samples taken in China and then transitions geographically to Singapore and France. • In dark green (cluster 5) we can identify a cluster of genome data that was collected in Chonggqing, the USA and Singapore. • The light blue cluster (cluster 6) is a temporally quite coherent sequence that links samples taken in Guangdong, Guangzhou and Foshan. • A kind of international cluster (samples from Australia, the USA, England, Finland and Belgium) is represented by the dark blue cluster (cluster 7). Figure 4 . Weighted TIC network. From the original TIC network we generate a weighted network by collapsing all edges between the same vertices to unique edges weighted by the sum of the collapsed edges. The colours represent the nine identified network clusters. Node size represents the degree and edge thickness represents edge weight (in the sense that higher edge weight indicates a higher amount of codon identifiers shared by two vertices). Numeric indices before the genome identifiers represent the progression stage index of the particular vertex in the sequence of all ordered nucleotide sequences we analysed. We suggest that while there are some obvious pathways of high similarity due to temporal order of the samples taken, there are still good indications for geographicaldependencies and transition phases that deserve further investigation. Transition points into and out of cluster 0 are particularly interesting in this regard, especially when the clusters linking in our out feature samples taken outside of China. We suggest that it may be worth to qualitatively assess the human-to-human connections the respective subjects had from which these samples were taken, and the places they have visited if there are no known links between the samples so far. Triangulating the full TIC network with the networks that focus on individual reading frame similarity only (please refer to Figures 6, 7 and 8 as well as Tables 13, 15 and 17 in the appendix) shows that the macroscopic structure is mostly consistent but much sparser in terms of low-weighted edges. Inter-and intra-cluster similarity From our measurement of the intra-cluster similarity (see Table 1 ) we find that the genomic similarity at the level of nucleotides is low in the central cluster 0, which is expected because this is the largest of all clusters with low temporal coherence. However, cluster 5 also features a slightly lower average similarity with comparably high standard deviation, which indicates that this cluster may be slightly less coherent. The barchart of the inter-cluster analysis results shown in Figure 5 reveals that there are some cluster pairs with high inter-cluster similarity, that the inter-cluster similarity is different between the +3 reading frame network and the other two reading frame networks, and that the pattern of the +3 reading frame network inter-cluster similarity is similar to the one for the entire TIC network. In particular we find that in the +1 and +2 reading frame networks the clusters 1, 4, 6 and 7 are quite similar at the nucleotide level compared to all other clusters that are pairwise distinct. For the entire TIC network and the +3 reading frame network instead we find that clusters 1, 3, 5 and 7 show this kind of similarity. Altogether, these results indicate that (a) in some instances the clustering based on structural characteristics of the TIC network overrides actual genomic similarity, and (b) the choice of the reading frames during TIC construction has an impact on the resulting network that needs to be further investigated. In this paper we presented a novel way to model, visualise and analyse the temporal dynamics of an epidemic outbreak. We applied Transcendental Information Cascades to capture the recurrence and co-occurrence of unique codon identifiers over distinct and temporally ordered nucleotide sequences derived from genetic samples taken from different human subjects. Our case study was focused on the recent outbreak of the novel 2019-nCoV coronavirus. We find clusters and evolution pathways of virus genomes in the constructed Transcendental Information Cascade that are candidates to be investigated further for potential human-to-human transmission paths and genetic evolution of the 2019-nCoV coronavirus. Our study was performed as a snapshot case study on a limited sample of virus data. While this was at the time of this study all data that was available, we will need to re-execute this analysis continuously as more data becomes available to check whether any of our results change. Furthermore, our results indicate that experiments with 7/33 . CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.02.07.20021139 doi: medRxiv preprint Figure 5 . Overview of inter-cluster similarities for (a) the full TIC network, (b) the +1 reading frame network, (c) the +2 reading frame network, and (d) the +3 reading frame network. . CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.02.07.20021139 doi: medRxiv preprint alternative tokenisations are needed, such as weighting the +1, +2 and +3 reading frames differently (or omitting certain frames entirely) depending on their functional role [10] . We also suggest that it will be important to perform a wider variety of analysis of the generated TIC data (e.g. testing of different network clustering methods, application of dynamical systems analyses techniques) to get more profound insights. Our analysis also does not take measurement biases into account that may results from different techniques used by different laboratories for genome extraction and analysis [7] . In summary, the approach presented in this paper is exploratory and hypotheses generating in nature. It does not confirm or reject any of the hypotheses we generated about the potential transmission and evolution pathways nor does it provide causal explanation for the patterns found. However, the purpose of our work is to provide an alternative view to the epidemic dynamics of the 2019-nCoV outbreak (and potential other outbreaks in the future) that is hidden to state-of-the-art methods, and that may trigger geneticists and epidemiologists to look at pathways in epidemic outbreaks they would not normally be aware of. The software developed for this particular analysis is part of the open scientific software "R toolchain to construct and analyse Transcendental Information Cascades" that is available via https://github.com/vuw-c2lab/transcendental-information-cascades/ tree/master/input/2020-02-03-2019nCoV. For reproduction purposes it is necessary to obtain the raw genomic data used in this study from the GISAID EpiFlu Database™. For further reference we provide the raw results of some of our analysis. We particularly point to the tables that provide references for relationships between nucleotide sequences and the cluster membership of nucleotide sequences. . CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.02.07.20021139 doi: medRxiv preprint . CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.02.07.20021139 doi: medRxiv preprint Table 9 . Edge weight between distinct nucleotide sequences in the Transcendental Information Cascade we constructed (contd.). . CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . 1 801289 2 197353 3 166842 4 123821 5 92366 6 88539 7 87805 20 55715 10 51884 9 32851 22 27681 27 26443 24 24707 14 11886 26 7172 11 7013 13 6058 8 5205 15 3772 12 3304 17 3124 23 2893 16 2240 43 2168 25 2044 29 1968 21 1892 19 1811 28 1807 34 1682 31 1623 18 1601 45 1511 46 1445 30 1422 56 1402 44 1363 60 1344 72 1256 73 1176 38 1172 58 1159 36 1125 39 1071 37 1030 41 1011 55 961 63 911 57 899 59 836 48 832 47 732 71 713 35 685 74 617 42 616 62 529 64 505 53 477 49 474 40 468 33 459 54 405 65 401 61 363 70 358 32 351 78 342 77 323 79 284 75 265 51 4 50 2 67 2 52 1 68 1 . CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.02.07.20021139 doi: medRxiv preprint Table 11 . Cluster memberships of the nucleotide sequences (full TIC network). In-degree Out-degree Degree Cluster Figure 6 . +1 reading frame network. . CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.02.07.20021139 doi: medRxiv preprint Table 13 . Cluster memberships of the nucleotide sequences (network for +1 reading frames). In . CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.02.07.20021139 doi: medRxiv preprint Figure 8 . +3 reading frame network. . CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.02.07.20021139 doi: medRxiv preprint Table 17 . Cluster memberships of the nucleotide sequences (network for +3 reading frames). In-degree Out-degree Degree Cluster . CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.02.07.20021139 doi: medRxiv preprint 28/33 . CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.02.07.20021139 doi: medRxiv preprint Median-joining networks for inferring intraspecific phylogenies The 2019-new coronavirus epidemic: evidence for virus evolution Fast unfolding of communities in large networks The structure and dynamics of multilayer networks Ebola virus disease 2013-2014 outbreak in west africa: an analysis of the epidemic spread and response Finding functional features in saccharomyces genomes by phylogenetic footprinting Towards standardisation of cell-free dna measurement in plasma: controls for extraction efficiency, fragment size bias and quantification Genomic and protein structure modelling analysis depicts the origin and infectivity of 2019-ncov, a new coronavirus which caused a pneumonia outbreak in wuhan How translational accuracy influences reading frame maintenance Evolved disease-avoidance mechanisms and contemporary xenophobic attitudes Efficient, high-quality force-directed graph drawing Sequencing and comparison of yeast species to identify genes and regulatory elements Similarity analysis of dna sequences based on codon usage Transmission dynamics of 2019 novel coronavirus (2019-ncov). bioRxiv What an entangled web we weave: An information-centric approach to time-evolving socio-technical systems From coincidence to purposeful flow? properties of transcendental information cascades A bayesian approach to phylogeographic clustering Detecting sequences of system states in temporal networks Biostrings: Efficient manipulation of biological strings Nonlinearity+ networks: A 2020 vision The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications -BetaCoV/Beijing/IV DC-BJ-005/2020 43 -BetaCoV/Shandong/IVDC We want to thank the laboratories and researchers who contributed the genomic research data used in this study and obtained via the GISAID EpiFlu Database™ 3 on Monday, February 10th 2020. The following tables contain information about the origination of the research data used in this study and give credit to the collecting and submitting researchers and laboratories. BetaCoV/Belgium/GHB-0302 1/2020