key: cord-0253348-n08wb3cs authors: Muñoz-Baena, Laura; Poon, Art F. Y. title: Clustering and visualizing the distribution of overlapping reading frames in virus genomes date: 2021-06-11 journal: bioRxiv DOI: 10.1101/2021.06.10.447953 sha: 569365be479094ac25312e5afab48975345e70c0 doc_id: 253348 cord_uid: n08wb3cs Gene overlap occurs when two or more genes are encoded by the same nucleotides. This phenomenon is found in all taxonomic domains, but is particularly common in viruses, where it may increase the information content of compact genomes or influence the creation of new genes. Here we report a global comparative study of overlapping reading frames (OvRFs) of 12,609 virus reference genomes in the NCBI database. We retrieved metadata associated with all annotated reading frames in each genome record to calculate the number, length, and frameshift of OvRFs. Our results show that while the number of OvRFs increases with genome length, they tend to be shorter in longer genomes. The majority of overlaps involve +2 frameshifts, predominantly found in ds-DNA viruses. However, the longest overlaps involve no shift in reading frame (+0), increasing the selective burden of the same nucleotide positions within codons, instead of exposing additional sites to purifying selection. Next, we develop a new graph-based representation of the distribution of OvRFs among the reading frames of genomes in a given virus family. In the absence of an unambiguous partition of reading frames by homology at this taxonomic level, we used an alignment-free k-mer based approach to cluster protein coding sequences by similarity. We connect these clusters with two types of directed edges to indicate (1) that constituent reading frames are adjacent in one or more genomes, and (2) that the reading frames overlap. These adjacency graphs not only provide a natural visualization scheme, but also a novel statistical framework for analyzing the effects of gene- and genome-level attributes on the frequencies of overlaps. Viruses are an enormous part of the natural world, representing the majority of entities in our 1 planet that undergo organic evolution. For instance, a recent study estimated the existence of over the entire genome is the unit of observation. each family from the NCBI virus database. Our objective was to identify homology among protein 23 coding sequences that may be highly divergent and inconsistently annotated at the family level of 24 virus diversity. We also needed to be able to accommodate gene duplication and divergence in 25 DNA viruses, as well as unique reading frames with no homologs in other genomes (i.e., accessory 1 genes, ORFans [18]). As a result, we decided to use an alignment-free method to compute k-mer-2 based similarity scores between every pair of reading frames within a virus family ( Figure S2 ). We 3 used Python to map each protein sequence to a dictionary of k-mer counts for k = {1, 2, 3} as a 4 compact representation of the sparse feature vector. Let W (s) represent the set of all k-mers (words) 5 in a sequence s, and let f (s, w) represent the frequency of k-mer w in s. Using the quantities, we 6 calculated the Bray-Curtis distance [19] between sequences s and t: This k-mer distance performed relatively well at the task of protein classification in a recent bench-8 marking study of alignment-free methods [20] , where it was implemented as the intersection dis-9 tance in the AFKS toolkit [21] . Intuitively, this measure reflects the overlap of two frequency dis-10 tributions, normalized by the total area of each distribution. The resulting distance matrix was used 11 as input for the t-distributed stochastic neighbor embedding (t-SNE) method implemented in the 12 R package Rtsne [22] . This dimensionality reduction method embeds the data points into a lower-13 dimensional space in such a way that the pairwise distances are preserved as much as possible. 14 Next, we generated a new distance matrix from the coordinates of the embedded points and then 15 used hierarchical clustering using the R function hclust with Ward's criterion [23] ('ward.D2'). 16 Combining dimensionality reduction and clustering methods is frequently used in combination 17 because distance measures have unexpected properties in high dimensional feature spaces [24] . 18 Finally, we used the R function cutree to extract clusters by applying a height cutoff to the 19 dendrogram produced by hclust. Increasing the number of clusters by lowering this cutoff accom-20 modates more ORFans. Conversely, raising the cutoff reduces the number of false positive clusters 21 (reading frames that should not be classified as ORFans). To determine an optimal cutoff for a 22 given virus family, we selected the height that balances two quantities. Let f (i, j) be the number 23 of reading frames assigned to cluster j ∈ {1, . . . , K} in genome i ∈ {1, . . . , N}. First, we calculated 24 the mean proportion of reading frames with unique cluster assignments per genome: where I(x) is an indicator function that assumes a value of 1 if x is true, and 0 otherwise. 1 Second, we calculated the mean frequency of a cluster assignment across genomes: annotations in these records, we were able to assign 9,982 (79.2%) of the genomes to Baltimore gions, however; e.g., the simian hemorrhagic fever virus genome (Genbank accession NC 003092) 12 encodes 33 reading frames of which 36% are involved in an overlap. In contrast, the mean number of nucleotides in overlapping regions was negatively correlated 14 with genome length overall (Spearman's ρ = −0.52, P < 10 −12 ; Figure 1C ). We note that this The median overlap length for +1 OvRFs was 14 nt (IQR 8 to 26 nt 7 We used an alignment-free k-mer-based method [20] to partition all amino acid sequences from 8 genomes in a given virus family into clusters of homology (Supplementary Figure S2 ). In brief, for 9 each virus family we calculated a k-mer distance [19] between every pair of amino acid sequences. 10 We projected the resulting distance matrix into two dimensions by t-distributed stochastic neighbor 11 embedding (t-SNE), and then applied hierarchical clustering to the distances in the 2D plane (Fig-12 ure 3A). The clustering threshold was determined by balancing the mean frequency of a cluster 13 across genomes against the mean number of unique clusters per genome ( Figure 3B ). 14 We propose a graph-based approach to characterize the distribution of OvRFs in the context of 15 coding sequences in the genome. This approach provides a framework for quantifying overlapping 16 regions at a finer resolution within virus families, and is a natural method for visualizing differences 17 between them. Each node in the graph corresponds to a cluster of homologous coding sequences. Nodes are connected by two sets of directed edges (arrows; Figure 3C ). The first set represent Figure 3B displays the distribution of cluster assignments across coding sequences in the genomes. 1 We noted that one of the genomes (bovine adenovirus type 2, NC 002513) had an unusually long 2 non-coding region. We subsequently determined that this reference genome record was not com-3 pletely annotated, removed it from the dataset and repeated our analysis, resulting in 39 clusters. The adjacency graph for Adenoviridae features a relatively conserved gene ordering that cor-5 responds to clusters 10 to 20 ( Figure 3C ). In other words, this part of the graph has a mostly linear In Adenoviridae, OvRFs vary from 1 to 19 nt in length, with a median length of 10 nt (IQR 26 8 -12 nt). By visualizing clusters of homologous coding sequences as a graph, we can see that transitivity and Eigenvector centrality as edge-level attributes, and used a stepwise model selection 15 procedure to determine which combination of attributes was best supported by the data as predictor 16 variables. The results of fitting these regression models to each graph are summarized in Figure 17 5. Effect size estimates varied substantially among virus families. For example, centrality was 18 significantly associated with higher probabilities of overlapping reading frames for Geminiviridae, 19 Papillomaviridae and Rhabdoviridae, but with lower probabilities in Coronaviridae. A cluster 20 with high centrality is connected to many other clusters that are also of high degree size. In previous observation that clusters comprising a core 'backbone' in the adjacency graph tended to 1 be associated with fewer overlap edges. The number of genome sequences for different viruses in public databases is rapidly growing, virus genomes. We observed that the majority of non-splicing OvRFs are short (e.g., less than 10 10 nucleotides). However, the small overlaps in our study were predominantly by 1 or 4 nucleotides, 11 whereas previous work [8] reported peaks at slightly longer lengths (3 and 7 nucleotides, respec-12 tively). We also confirmed previous reports [2, 3, 5] that the number of OvRFs increases with 13 genome length, whereas OvRFs tend to be shorter in longer genomes. These trends are consistent 14 with the compression theory that proposes that overlapping genes are a significant mechanism for 15 reducing genome lengths [3]. However, we must be cautious about interpreting these patterns be-16 cause, like previous work, there is no adjustment for non-independence among observations due 17 to evolutionary homology, i.e., identity by descent. This can be mitigated in part by examining 18 correlations within virus families. At this level, we did not find significant evidence of a consistent 19 association between overlap and genome lengths (Supplementary Figure S3) . In the absence of a standard notation for frameshifts in overlapping reading frames, past stud- strands. However, we used −1 and −2 to indicate that the codons on the opposite strand are shifted 25 by one and two nucleotides, respectively, relative to the −0 frame, which we consider to be a more 1 intuitive notation. In an analysis of 701 RNA virus genomes, Belshaw and collaborators [3] previ-2 ously reported that most overlapping genes consist of +1 and +2 frameshifts (+1 and −1 in their 3 notation). We observed similar results in our analysis of 5,972 RNA virus genomes, which we 4 further stratified by Baltimore class (Figure 2 ). Note that, unlike some earlier work, we included 5 +0 overlaps in our database. Even though the reading frames share codons, they yield different 6 gene products such that the codons are exposed to different selective environments. These cases selection. This is countered by a narrower repertoire of protein sequences and structures. In our data, antisense frameshifts (i.e., −0, −1 and −2) account for only the 6.6% of all over- recently determined that the −1 frameshift (−2 in their notation) was the most constrained, in that 20 the codons used in one reading frame limit the amino acids that can be encoded in the other. It also 21 minimizes the expected frequency of stop codons in the opposing strand, but −1 overlaps were not 22 significantly longer than other types in our data. One of the key challenges for extending our comparative analysis to the level of individual 24 reading frames was the assignment of reading frames into clusters of homology. This is compli- genes were simply not annotated. We subsequently confirmed this using a gene prediction and 10 homology search analysis and discarded this reference genome from our analysis. Given the abundance of genomic diversity at the level of virus families, the assignment of 12 reading frames into homologous groups is not unambiguous. Thus, we utilized an alignment-free 13 approach to cluster the coding sequences in the reference genomes for each virus family. There are 14 a large number of alignment-free methods that extract k-mers from two input sequences (see [20] 15 for a recent review). We chose the Bray-Curtis distance (also known as the intersection distance) 16 because it performed comparatively well at the task of protein classification in a recent bench-17 marking study [20] . However, the classification analysis in that study was performed on protein 18 databases curated to span a broad range of relationships, including both cases of evolutionary and 19 structural homology. In addition, it was necessary to apply a threshold to our hierarchical clus- 3). On the other hand, cluster assignments tended to become more variable towards the 5' and 3' 10 ends of the genome in association with an increasing frequency of OvRFs. Similarly, OvRFs in 11 Coronaviridae tended to be associated with pairs of clusters representing accessory genes that were 12 less frequently adjacent in these genomes (Figure 4) to calculate the k-mer distance between proteins followed by an R script to designate each protein to a cluster according to homology. Finally, we used a Python script to generate dot files using Graphviz. B. Adjacency plot interpretation. Each one of the proteins that constitute a genome is assigned to a different cluster with homologous proteins from other species. From each cluster, we draw arrows in gray that represent adjacent proteins and arrows in blue that represent overlapping proteins. The width of the arrow is proportional to the number of proteins related between the two clusters. One cluster can have entries adjacent to proteins in different clusters. In this example, cluster 3 has proteins adjacent to proteins in cluster 2 and cluster 5. Figure S4 : Word clouds of protein names mapped to clusters for Adenoviridae. de novo metazoan mitochondrial genome annotation. Molecular phylogenetics and evolution Prokaryotic Virus Orthologous Groups (pVOGs): 4 a resource for comparative genomics and protein family annotation Biopython: freely avail-8 able Python tools for computational molecular biology and bioinformatics Database resources of the national center for biotechnology information The combinatorics of overlapping genes Vi-15 ralZone: a knowledge resource to understand virus diversity. Nucleic acids research Identification and investigation of ORFans in the viral world. Bmc Ge-18 nomics An ordination of the upland forest communities of southern Wisconsin. 20 Ecological monographs Benchmarking 22 of alignment-free sequence comparison methods A survey and evaluations of histogram-based statistics in 26. Hunter JD. Matplotlib: A 2D graphics environment Eigenvector-centralitya node-centrality? Social networks The VGAM package for categorical data analysis Evolution of genome 16 size and complexity in the Rhabdoviridae Nucleotide 18 sequence, genome organization, and transcription map of bovine adenovirus type 3 Genetic content and evolution of adenoviruses