key: cord-0759172-tijoafv4 authors: Campbell, K.; Gifford, R.J.; Singer, J.; Hill, V.; O’Toole, A.; Rambaut, A.; Hampson, K.; Brunker, K. title: Making Genomic Surveillance Deliver: A Lineage Classification and Nomenclature System to Inform Rabies Elimination date: 2021-10-14 journal: bioRxiv DOI: 10.1101/2021.10.13.464180 sha: b70e508c52024456c7feab2684960b30f6ba6b89 doc_id: 759172 cord_uid: tijoafv4 The availability of pathogen sequence data and use of genomic surveillance is rapidly increasing. Genomic tools and classification systems need updating to reflect this. Here, rabies virus is used as an example to showcase the potential value of updated genomic tools to enhance surveillance to better understand epidemiological dynamics and improve disease control. Previous studies have described the evolutionary history of rabies virus; however, the resulting taxonomy lacks the definition necessary to identify incursions, lineage turnover and transmission routes at high resolution. Here we propose a lineage classification system based on the dynamic nomenclature used for SARS-CoV-2, defining a lineage by phylogenetic methods, for tracking virus spread and comparing sequences across geographic areas. We demonstrate this system through application to the globally distributed Cosmopolitan clade of rabies virus, defining 73 total lineages within the clade, beyond the 22 previously reported. We further show how integration of this tool with a new rabies virus sequence data resource (RABV-GLUE) enables rapid application, for example, highlighting lineage dynamics relevant to control and elimination programmes, such as identifying importations and their sources, and areas of persistence and transmission, including transboundary incursions. This system and the tools developed should be useful for coordinating and targeting control programmes and monitoring progress as we work towards eliminating dog-mediated rabies, as well as having potential for broad application to the surveillance of other viruses. Author Summary The importance of the ability to track the diversity and spread of viruses in a universal way that can be clearly communicated has been highlighted during the SARS-CoV-2 pandemic. This, accompanied with the increase in the availability and use of pathogen sequence data, means the development of new genomic tools and classification systems can strengthen outbreak response and disease control. Here, we present an easy-to-use objective and transferable classification tool for tracking viruses at high resolution. We use rabies virus, a neglected zoonotic disease that causes around 59,000 human deaths each year, as an example use case of this tool. Applying our tool to a global clade of rabies virus, we find an over 200% increase in the definition at which we can classify the virus, allowing us to identify areas of persistence and transmission that were not previously apparent, and patterns of virus spread. Insights from the application of this tool should prove valuable in targeting vaccination campaigns and improving surveillance as countries work towards the elimination of dog-mediated rabies. Rabies virus (RABV) causes around 59,000 deaths [1] and costs in excess of $8.6 billion per year [2] , with a near 100% mortality rate after the onset of symptoms [3] . Post-exposure prophylaxis is highly effective in preventing rabies if administered quickly following exposure [4] but this is not always possible given its high cost and limited accessibility. Rabies can occur in all species of mammal, but up to 99% of human rabies cases arise from bites from infected domestic dogs [5, 6] . Vaccinating dogs to interrupt transmission is therefore paramount [7] and a major focus of 'Zero by 30', the global strategy to eliminate human deaths from dog-mediated rabies by 2030 [8] . The focus of 'Zero by 30' is on dog-mediated rabies, but spill over from dogs into other carnivores often occurs, generally causing only short-lived chains of transmission [9] . Genomic surveillance is therefore vital to identify and monitor any wildlife reservoirs that may emerge [10] . To achieve the 'Zero by 30' goal, effective and coordinated surveillance is essential. Genomic surveillance can complement routine epidemiological surveillance through the insights it can provide on the lineages circulating in an area and any sources of incursions [11] . Sequence data proved useful in understanding rabies dynamics in Bangui, the capital of the Central African Republic [12] . Instead of sustained transmission in the city, local extinction occurred on three occasions with introductions from surrounding areas reseeding circulation, showing the need to expand control efforts across a larger geographic area [12] . Genomic surveillance can also reveal host shifts and other unusual dynamics. When rabies in Arctic foxes underwent a host shift to infect red foxes (Vulpes vulpes), the virus spread rapidly throughout Canada, resulting in epidemics in the 1950s and 1960s before being apparently eliminated in 1990 [13] . However, between 2015 and 2017 a number of rabies cases occurred in wildlife which were genetically similar to sequences from these historic fox rabies epidemics [13] . Analysis revealed that although several lineages were eliminated, one persisted and underwent a host switch to skunks [13] . To monitor how an epidemiological situation is changing first requires characterization of the current situation. Several well-defined RABV clades circulate globally, within two major phylogenetic 5 groups; bat-related and dog-related [1] . The dog-related group is split into between 5 and 7 different clades, depending upon the classification [1, 14] . This includes the Cosmopolitan clade; over 9500 publicly available sequences (including sequences of all lengths) split into 22 subclades, present in over 100 countries [14, 16] . These discrepancies in clade numbers are illustrative of issues surrounding the interpretation and classification of RABV phylogenetic data as a universal naming system does not extend past high-level classification, with no set rules for defining lineages. Additionally, genomic data availability varies. Increasingly studies are focusing on whole genome sequences (WGS) given the greater resolution they provide, but the vast majority of studies thus far have generated shorter, partial gene sequences [15] [16] [17] . A lineage designation system therefore needs to be able to incorporate all of this data to provide maximal contextual information. These terms, and others relevant to this study, are defined in box 1. Clade -Monophyletic group of sequences with a single ancestor. Subclade -A smaller monophyletic group contained within a larger clade. Lineage -A group of genetically related sequences defined by statistical support of their placement in a phylogeny and genetic differences from a common ancestor. Major lineage -A lineage named with a letter -e.g. A1 or Cosmopolitan AF1b_A1, that can be the first iteration of a lineage, or a lineage that has undergone significant evolution to become a new major lineage. Minor lineage -A lineage named with numbers (following the major clade nomenclature) -e.g. A1.1.1 or Cosmopolitan AF1b_A1.1.1 This is a major lineage that has undergone evolution, but not enough to become a new major lineage. Lineage designation -An initial step to designate lineages based on a set of reference sequences, or the first set of data from an area, defined by an existing set of rules, and to name them accordingly. This is completed once to form a reference set of sequences to be used for lineage assignment, and may need to be updated as genetic diversity accumulates. Lineage assignment -Identifying which existing lineage, defined by the initial lineage designation step, a new sequence belongs to. [14] . The NEE subclade has been expanded to show major and minor lineages from the updated MADDOG classification system. Although the current phylogenetic classifications generally work well at a global level, they lack resolution for surveillance at more local or regional scales. The existing system generally represents where the different clades originally emerged and their early geographical distribution [1] . However, these clades do not always remain isolated to particular areas, and their dispersal is affected by human movements [18] , leading to frequent introductions of lineages to new areas and their subsequent cocirculation [18] . Moreover, control and elimination efforts will further affect the distribution and diversity of circulating lineages. In many regions, a limited number of subclades appear to circulate [19] , therefore, without higher lineage resolution it becomes difficult to identify patterns like lineage extinction that are to be expected, and therefore to monitor the impacts of control. Rambaut et al. (2020) propose a universal virus nomenclature system to address these issues, which they apply to SARS-CoV-2 [20] . The system outlines a set of rules for classifying and naming viral lineages to produce a standardised tool to identify viral diversity on global and local scales, to track lineage emergence and transmission, and to allow for coherent updates as lineage turnover occurs [20] . Here, we adapt and apply these rules to a large diverse major clade of RABV to produce an updated, dynamic, classification system for the virus. The increased definition provided allows for more detailed genomic surveillance that can be used to monitor the circulation of viral lineages and identify incursions, unusual transmission routes and potential host shift events. RABV-GLUE is a 'sequence data resource' developed to support rabies elimination efforts by enabling efficient dissemination and utilization of RABV sequence data using GLUE -a bioinformatics environment for managing and interpreting virus sequence data [21, 22] . GLUE supports development of virus-specific 'projects' comprising curated sets of sequences, genome feature annotations, alignments, reference clades and phylogenies with associated metadata, with options to upload sequences to GenBank. Loading projects into the GLUE 'engine' creates a relational database representing complex semantic links between data items so computational analyses can be precisely and widely replicated. RABV-GLUE projects can be installed locally on all commonly used computing platforms and are fully containerised via Docker [23] . GLUE's command layer provides a mechanism for retrieving and analysing data by coordinating interactions between the RABV-GLUE database and commonly used bioinformatics software tools (e.g. MAAFT, RAXML). This allows experienced bioinformaticians to quickly establish local RABV sequence databases and integrate these resources into existing bioinformatic pipelines, tailoring functionalities to their specific needs. Hosting the RABV-GLUE project in an openly accessible online version control system (e.g., GitHub) provides a mechanism for managing its ongoing development by multiple collaborators, following practices established in the software industry. GLUE projects can also be developed into interactive, user-friendly web services through a graphical user interface. An online interface to RABV-GLUE is available at http://rabv-glue.cvr.gla.ac.uk/ (Fig 2) . Here we use sequence data and classifications from RABV-GLUE, to provide the basis for the development of an updated classification system which will be integrated into RABV-GLUE, enhancing its capacity to provide detailed, high resolution lineage information about user input sequences. All available RABV sequences (n=23,386), irrespective of sequence length and clade, and their associated metadata were downloaded from the RABV-GLUE website (http://rabv.glue.cvr.ac.uk). Many metadata entries stripped from Genbank and added to the RABV-GLUE database were missing information necessary for analysis, including the sample collection year, host, and location (n=9650). Records with missing information were searched manually and any information that could be found from primary publications was updated into RABV-GLUE. The dataset of all sequences was then filtered to only include sequences designated to the Cosmopolitan clade by RABV-GLUE, excluding vaccine strains. For the purposes of these analyses we considered WGS, covering >90% of the genome (at least 10,000 nucleotides (nt)), and nucleoprotein (N) gene sequences (1300 nt), thus excluding smaller partial gene fragments. Sequences were aligned using MAFFT [24] with the FFT-NS-2 algorithm [24] . Maximum Likelihood trees were constructed using IQTREE2 with model selection [25] and 100 bootstrap replicates [26] . Ancestral sequence reconstruction was then undertaken using Treetime ancestral [27] . A custom R function for lineage designation was developed which requires the tree, corresponding alignment, ancestral sequences and metadata. This returns summary statistics about each sequence, its lineage designation, and details about each of the designated lineages. Lineage defining nodes are identified according to thresholds on bootstrap support (>70) and cluster size (>10 descendents of >95% coverage at genome level or of the gene of interest, excluding gaps and ambiguous bases). For each node still in consideration, the ancestral sequence is extracted. To define a new lineage, there must be at least one common mutation between all descendents that is different to the ancestral sequence. The algorithm for lineage designation is summarised in Fig 3. The parameters for designation of partial genome sequences were refined to optimise the comparability with the whole genome designations. Existing phylogenetic groupings are already in place for RABV [14, 28] and sequences are automatically defined by these clades in RABV-GLUE which acts as a useful baseline for further classification , and subsequent MAD DOG lineage designations are named according to the rules in Rambaut et al. [21] , starting from lineage A1 (which in this instance would be Cosmopolitan A1). Any lineages descended from this become A1.1 or A1.2 etc and any descended from these become Once the designation step has been performed to construct a set of reference sequences with defined lineages, new sequences can be compared to the reference set and assigned to lineages. New sequences (imported individually or in bulk) are added to the existing reference alignment using MAFFT and for each new sequence the two closest references are identified and used to transfer a lineage assignment. MAD DOG has been developed as a command-line based tool to undertake all the steps for lineage designation and assignment outlined above, given a set of input sequences (and corresponding metadata for designation).The assignment tool also outputs information about the assigned lineage including which countries it has been sampled in and when the first and most recent sequences for that lineage were recorded. Additionally, this can be implemented through an R package (MADDOG), full details of which can be found at DOI: 10.5281/zenodo.5503916. MAD DOG was used to designate lineages for sequence subsets from the Cosmopolitan clade, with its performance tested on whole and partial genome sequences. The resulting designations were explored on a global scale and on a specific geographic area (Tanzania) to test functionality and interpret lineage assignments. The Cosmopolitan dataset comprised 650 WGS, excluding vaccine strains, spanning 46 countries from 1950-2018. The N gene data comprised an additional 1476 sequences spanning an additional 25 countries from 1950-2020. Running the lineage designation on Cosmopolitan WGS resulted in 73 lineages, more than a 3-fold increase in definition compared to 22 previously defined subclades ( Fig 4) . The baseline MAD DOG designation agreed with the previously determined global phylogenetic groupings, with the exception of clusters containing <10 sequences (a required threshold for MAD DOG lineage designation). However, our tool provided a much deeper characterisation beyond the subclade level [1] . Some subclades are split into multiple lineages providing increased definition; four showed at least a 5-fold increase in definition, and two showed over a 10-fold increase, in line with sampling density (Table 1, Fig 4b) . The AF1b subclade (purple in Figs 4 and 5) is split into 20 lineages, and the ME1a (orange) subclade is split into 13, indicating more than a 1000% increase in definition in both cases. Some subclades show no further definition due to the small number of sequences available to represent this subset of RABV diversity. WGS capture greater variation, and are able to differentiate between samples that are identical at a partial genome level [16] . In this study, although all sequences were unique at the whole genome level, only 78% of these were unique at an N gene level (n=509). While whole genomes offer better phylogenetic resolution, the N gene is commonly targeted by diagnostic laboratories providing a greater number of sequences and wider spatio-temporal coverage to detect and define lineages not captured by the still limited number of WGS. In total, there are 1469 WGS from 82 countries available on RABV-GLUE but 7304 N gene sequences from 109 countries ( Figure S2 ). This highlights the volume of additional data available when using partial genome sequences. However, the proportion of studies sequencing RABV whole genomes is increasing, as are studies using at least N gene length sequences instead of smaller fragments ( Figure S2 ). The higher resolution of this classification system can be used to 'zoom in' on unusual cases and look at the historical and geographical context of a lineage. Here, we illustrate the example of a human rabies case (GenBank accession: KC737850) reported from the USA in 2011 [28] , where dog-associated rabies has been eliminated but wildlife variants are endemic. The case report explains that this individual moved from Brazil to the USA, and many years later developed rabies [28] . In this situation the most likely cause of rabies was either exposure to wildlife rabies in the USA, or exposure to rabies in Brazil, which seemed unlikely given the long incubation period. Sequence data shows that the latter scenario is correct (Fig 6) , with the RABV sequence from the patient showing high similarity to dog RABV sequences in Brazil as reported by Boland et al. [29] . Using additional N gene sequences, we can examine this lineage in more detail. A sequence from Peru in 1999 (accession: KU938904) (Fig 6) , was listed in RABV-GLUE as being from the common vampire bat, Desmodus rotundus but the original GenBank record states 'livestock case infected by bat'. In this case it appears to have been misassigned as a bat host by GLUE's automatic curation from Genbank. However, the identification of a dog lineage in a bat is unusual. Follow up with researchers from the original study revealed this was mislabelled on Genbank and was in fact most likely a livestock case infected by a dog [29] . The lineage classification system allows resolution of these unusual cases, and identification of sequences that are incorrectly labelled. Rabies was first documented in Tanzania in the 1930's [30] . Although present in North Africa for centuries, the virus is thought to have become endemic and spread within sub-Saharan Africa following European colonisation in the twentieth century [30] . The literature suggests at least two historical introductions to Tanzania; from the North and the South of the country [18] . With the classification system, it is possible to identify spread over time on a finer scale, and therefore evaluate potential support for these theories. The lineage AF1b_A1.2 appears to support the spread of RABV into Tanzania from southern Africa (Fig 7a) , with the first sequence from South Africa in 1981 (KX148103). By 1992 the lineage was detected in Zimbabwe (KT336434), as part of a study looking at potential incursions between South Africa and Zimbabwe [31] , before being sequenced in Tanzania in 2010 [20] . Routine monitoring also allows us to identify that the lineage has persisted in South Africa, as the sequence MT454644 from 2017 is designated to the AF1b_A1.2 lineage [32] . The Tanzania dataset comprised 205 WGS from 1996-2018. Running the lineage designation on these sequences alone i.e. without wider geographic context, limited the designation of less commonly sampled lineages that may be more prevalent in other countries. The inclusion of sequences from Eastern and Southern Africa (known to belong to the same global subclade) resulted in an additional 2 lineage designations that could not be defined using Tanzania sequences alone. Therefore, additional relevant context to provide an informative reference set for initial lineage designation proved to be important for consistent, higher resolution lineage designations. This revealed 16 lineages present in Tanzania (Table S5) We aimed to apply the dynamic nomenclature system proposed by Rambaut There are challenges in adapting the universal classification system developed for SARS-CoV-2 to RABV [20] . The spatial and temporal density of sequences is very different. More than 4 million WGS from over 185 countries are available for SARS-CoV-2 in a time span of less than two years [34, 35] , whereas RABV sequences span 7 decades but are much sparser. Moreover, SARS-CoV-2 lineages were designated from the emergence of this pathogen in human populations and have been updated as variation accumulated. In contrast the RABV lineage designation is retrospective; accounting for greater diversity and time. As the naming system calls for a new major lineage after 4 iterations of minor lineages (A1. Asian, Africa-2, Africa-3, Arctic and Indian Subcontinent) [14] . However, since then publicly available RABV genomes have tripled and the way in which genomic surveillance is utilised has advanced from (usually) retrospective studies defining broad phylogenetic patterns to the scale of informing local control efforts in real-time. Therefore, we used this study as the baseline for designating coarse phylogenetic groups (using the existing clade names) but showcase the added value of our high resolution lineage designation tool for detailed epidemiological investigations. Sequence data has become increasingly accessible and we use 650 WGS just from the Cosmopolitan clade (over 50% of total dog related WGS available), to provide an improved characterisation of RABV diversity that supports our aim of interpreting sequence data for local and national surveillance. This approach follows from other local studies, for example, Brunker et al. (2015) identify 5 lineages circulating in Tanzania based on phylogenetic methods and genetic distances, showing their co-circulation and identifying historical introductions and human-mediated movement of infected animals [20] . However, different studies define and name lineages in different ways. A study from Cameroon used geographic area as part of its definition of viral lineages [34] while Talbi et al. [37] identify 8 "subtypes" within the Africa 2 clade (named A-H) defined by phylogenetic placement with Bayesian posterior support; thus splitting the subclade further, but not to a high resolution [37] . These examples highlight how the lack of a universal lineage definition and naming system makes comparison between studies challenging, as the scale and method for lineage assignment and designation are not comparable. An area that highlights the use of MAD DOG is the whole genome analysis of the Africa-1b subclade (AF1b), which is seen to be present in 9 African countries, with sequences spanning 1981-2018. Eleven of the 20 AF1b lineages have only been seen in Tanzania, 2 have only been seen in South Africa and the remainder have been detected in multiple countries (Table S1 ). This subclade has good definition due to the large number of sequences, with over 10% of these WGS from a large South African study [32] and several studies in Tanzania [11, 18, 36] that generated over 80% of the available AF1b WGS. When incorporating N gene data for AF1b, the limited diversity compared to WGS means additional N gene sequences do not translate into designated lineages (Table 1) , as also evident in the ME2 and NEE subclades. The definition of designated lineages therefore depends both on the breadth of the data (the length of the sequences) and the depth of the data (number of sequences). Overall, a 25% decrease in definition, and in some instances up to 50%, is seen when using N gene sequences alone compared to WGS of the same sequence set. Additionally, higher bootstrap values are required for robust designations at this level due to the limited diversity. Despite the lower resolution provided by N gene data, the large volume of partial genome sequences makes their inclusion essential for representing previously identified diversity. As with the WGS designated lineages, those subclades not seen are due to having fewer than 10 sequences available (table 1) . A significant area of difference between the whole genome and N gene designations, that highlights the importance of using all available data, is found in the Africa-1a (AF1a) subclade. When only using whole genomes (21 WGS), the AF1a subclade splits into 2 lineages seen across multiple countries. With all the available N gene data (310 sequences) AF1a is split into 14 lineages, with distinct lineages present in Algeria and Morocco [37] . The first N gene sequence in the AF1a subclade is from 1982, whereas the first WGS is from 2 years later. Although the difference here is only two years, in other subclades, such as AM2b, the earliest available N gene sequences are from two decades earlier, thus excluding N gene sequences would exclude considerable historical data, and limit inference about historical patterns of dispersal. Applying the MAD DOG lineage designation on a local scale allows in-country transmission routes and persistence of lineages to be seen at new depth. Tanzania is used as an example, which has a large number of WGS with detailed metadata. This provides greater inference to identify the uses and potential issues of Vital Signs: Trends in Human Rabies Deaths and Exposures -United States Estimating the Global Burden of Endemic Canine Rabies Global epidemiology of canine rabies: past, present, and future prospects Characterization of rabies post-exposure prophylaxis in a region of the eastern Amazon, state of Pará, Brazil The origin and phylogeography of dog rabies virus Rabies Control and Treatment: From Prophylaxis to Strategies with Curative Potential Against Rabies launches global plan to achieve zero rabies human deaths Exploring reservoir dynamics: a case study of rabies in the Serengeti ecosystem Using host traits to predict reservoir host species of rabies virus Landscape attributes governing local transmission of an endemic zoonosis: Rabies virus in domestic dogs Revealing the Microscale Signature of Endemic Zoonotic Disease Transmission in an African Urban Setting Origins of the arctic fox variant rabies viruses responsible for recent cases of the disease in southern Ontario Large-Scale Phylogenomic Analysis Reveals the Complex Evolutionary History of Rabies Virus in Multiple Carnivore Hosts Defining objective clusters for rabies virus sequences using affinity propagation clustering Molecular diversity and evolutionary history of rabies virus strains circulating in the Balkans Application of high-throughput sequencing to whole rabies viral genome characterisation and its use for phylogenetic reevaluation of a raccoon strain incursion into the province of Ontario Portable Rabies Virus Sequencing in Canine Rabies Endemic Countries Using the Oxford Nanopore MinION Elucidating the phylodynamics of endemic rabies virus in eastern Africa using whole-genome sequencing A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology GLUE: a flexible software system for virus sequence data Docker: lightweight Linux containers for consistent development and deployment MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform ModelFinder: Fast model selection for accurate phylogenetic estimates Von Haeseler A. Ultrafast approximation for phylogenetic bootstrap TreeTime: Maximum-likelihood phylodynamic analysis Molecular Inferences Suggest Multiple Host Shifts of Rabies Viruses from Bats to Mesocarnivores in Arizona during Phylogenetic and Epidemiologic Evidence of Multiyear Incubation in Human Rabies Host-pathogen evolutionary signatures reveal dynamics and future invasions of vampire bat rabies Emergence of Lyssaviruses in the Old World: The Case of Africa Wildlife and Emerging Zoonotic Diseases: The Biology, Circumstances and Consequences of Cross-Species Transmission Complete Genome Sequences of Six South African Rabies Viruses Complete Coding Sequences of 23 South African Domestic and Wildlife Rabies Viruses One million coronavirus sequences: popular genome site hits mega milestone Molecular characterization and phylogenetic relatedness of dog-derived Rabies Viruses circulating in Cameroon between Evolutionary history and dynamics of dog rabies virus in western and central Africa Rapid incountry sequencing of whole virus genomes to inform rabies elimination programmes Phylodynamics and Human-Mediated Dispersal of a Zoonotic Virus Potential for Rabies Control through Dog Vaccination in Wildlife-Abundant Communities of Tanzania Saudi Arabia