key: cord-0020006-x5pf930f authors: de Oliveira Andrade, Felipe; Cucco, Marina Silveira; Borba, Melina Mosquera Navarro; Neto, Reinaldo Conceição; Gois, Luana Leandro; de Almeida Rego, Filipe Ferreira; Santos, Luciane Amorim; Barreto, Fernanda Khouri title: An overview of sequencing technology platforms applied to HTLV-1 studies: a systematic review date: 2021-08-20 journal: Arch Virol DOI: 10.1007/s00705-021-05204-w sha: 06bed62d99908fa5e74d19140400a4a914fb5434 doc_id: 20006 cord_uid: x5pf930f Human T-lymphotropic virus type 1 (HTLV-1) was the first human retrovirus described. The viral factors involved in the different clinical manifestations of infected individuals are still unknown, and in this sense, sequencing technologies can support viral genome studies, contributing to a better understanding of infection outcome. Currently, several sequencing technologies are available with different approaches. To understand the methodological advances in the HTLV-1 field, it is necessary to organize a synthesis by a rigorous review. This systematic literature review describes different technologies used to generate HTLV-1 sequences. The review follows the PRISMA guidelines, and the search for articles was performed in PubMed, Lilacs, Embase, and SciELO databases. From the 574 articles found in search, 62 were selected. The articles showed that, even with the emergence of new sequencing technologies, the traditional Sanger method continues to be the most commonly used methodology for generating HTLV-1 genome sequences. There are many questions that remain unanswered in the field of HTLV-1 research, and this reflects on the small number of studies using next-generation sequencing technologies, which could help address these gaps. The data compiled and analyzed here can help research on HTLV-1, assisting in the choice of sequencing technologies. It is estimated that 5-10 million people worldwide are infected with human T-lymphotropic virus type 1 (HTLV-1) [1] [2] [3] [4] . 
Infected individuals can develop HTLV-1-associated pathologies such as adult T-cell leukemia/lymphoma (ATLL in 2-5% of patients), HTLV-1-associated myelopathy/tropical spastic paraparesis (HAM/TSP in 0.25-3.8% of patients), HTLV-1-associated infectious dermatitis (IDH), and other inflammatory diseases such as uveitis and pneumonitis, or they can be classified as asymptomatic carriers [5] [6] [7]. The factors involved in the development of a particular clinical manifestation have not yet been elucidated, and HTLV-1-infected individuals remain without specific treatment [8] [9] [10]. The HTLV-1 genome is composed of two flanking regions, known as long terminal repeats (5' and 3' LTR), and the structural genes gag, pol, and env. There is also a non-structural region, pX, adjacent to the 3' LTR that encodes the regulatory and accessory proteins Tax, Rex, and HBZ [11]. Molecular characterization of the viral genome, based on sequencing combined with bioinformatics analysis, provides information on genomic regions such as viral integration sites and allows identification of mutations and epigenetic changes [12]. This information is important for the development of HTLV-1-specific vaccines and therapies. Although HTLV-1 was the first human retrovirus described, the number of HTLV-1 sequences that have been determined is considerably smaller than for other important retroviruses, such as human immunodeficiency virus 1 (HIV-1). In March 2021, there were 1,048,465 published HIV-1 sequences, while for HTLV-1 there were only 9,980 sequences available in the GenBank database. Studies of viral modifications that could be associated with different manifestations in human hosts require clinical and epidemiological information about the patients. However, most studies do not provide all the information necessary to connect viral mutations with the clinical status of the patient. 
Even with some sequences already published, an investment in the generation of more HTLV-1 sequences would allow the identification of new mutations that affect infection, which might be helpful for developing new diagnostic strategies. In 1975, Sanger presented the first DNA sequencing technique, which was widely adopted and is still being used today. This technique is based on the use of modified chain terminators, which are dideoxynucleotides (ddNTPs) [13]. Sequencing techniques later evolved further, resulting in the emergence of next-generation sequencing (NGS), starting with second-generation technology. This technology brought new methodologies for determining nucleotide sequences with greater efficiency and speed, using systems such as 454 from Roche Applied Science, Solexa from Illumina, and Ion Torrent, which expanded the ways of sequencing genetic material [14, 15]. The main examples of second-generation technology are pyrosequencing and sequencing by synthesis (SBS). In this generation, the DNA polymerase acts in conjunction with a chemiluminescent enzyme, which emits chemiluminescent signals as the strand complementary to the DNA template is synthesized, allowing the determination of the sequence [16]. Recently, a third generation has emerged, represented by nanopore sequencing (Oxford Nanopore Technologies) and Pacific Biosciences (PacBio) methodologies [15]. Unlike other sequencing technologies, these methods can be used to sequence single DNA molecules and to produce longer reads in a shorter time than was possible in the previous generations [17]. The nanopore method stands out for generating long reads not only on larger devices such as GridION and PromethION but also on small portable devices such as MinION and Flongle. This technique is based on the passage of genetic material through a nanopore embedded in a membrane, where the electrical signal produced during the passage of each nucleotide is detected [18]. 
It should be noted that, in recent years, there has been significant technological diversification in genome sequencing, with more efficient, cheaper, and faster devices. Investigating which sequencing technology is most used to determine HTLV-1 genome sequences allows us to understand the limitations and possibilities of research carried out on the viral genome. This may help to fill the gaps in our knowledge about this virus, such as the factors involved in the development of HTLV-1-associated diseases. Considering the importance of the technological choice for sequencing, in this article, we review the different technologies used to generate HTLV-1 sequences and the contributions of these techniques to new investigations of this retrovirus. This study consists of a systematic literature review carried out in accordance with the guidelines of Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). A systematic search was conducted for studies in which partial or total sequencing of the HTLV-1 genome was performed. The articles were searched in the PubMed, Lilacs, SciELO, and EMBASE databases in May 2021. The search algorithm used was composed of subjects from the DeCS/MeSH database and additional keywords: ("Human T lymphotropic virus 1" OR "HTLV-1") AND "sequence*" AND ("molecular sequence data" OR "sequencing"). All titles retrieved with the search algorithm were cross-checked to identify possible duplicate studies. The following inclusion criteria were applied to select studies: (i) the articles were in Portuguese, English, or Spanish; (ii) they were original studies; and (iii) the articles presented complete or partial genome sequences of HTLV-1. Articles published since 2000 were included. 
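As a minimal sketch, the search string quoted above can be assembled from its three AND-ed groups, which makes the structure of the query explicit. The helper names `or_group` and `build_query` are illustrative assumptions and are not part of the review protocol; each database may also require its own field tags, which are not shown here.

```python
def or_group(terms):
    """Quote each term, join with OR, and wrap the group in parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query():
    """Assemble the boolean search string reported in the methods."""
    virus = or_group(["Human T lymphotropic virus 1", "HTLV-1"])
    wildcard = '"sequence*"'  # truncation term, kept exactly as reported
    method = or_group(["molecular sequence data", "sequencing"])
    return " AND ".join([virus, wildcard, method])


QUERY = build_query()
print(QUERY)
```

Keeping the groups separate makes it easier to adapt one group (for example, the method terms) without touching the others.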
The exclusion criteria were as follows: (i) studies not specifying the sequencing method, (ii) studies that did not generate HTLV-1 sequences or did not specify the number of sequences generated, (iii) animal studies, and (iv) studies in which genome sequencing was performed using a cell line. The HTLV-1 subtypes were not considered as a criterion for selection of articles. The articles found on the platforms were initially filtered and selected by reading the title and abstract. Subsequently, a new selection was made by reading the full text. After reading and analyzing the selected articles, the data were collected and included in this review. The search for published studies was performed independently by two authors (F.O.A. and M.S.C.), and disagreements about all outcomes were resolved by consensus among all authors. After reading the selected articles, the following content was extracted from each one: (1) basic information (title, authors, year, objectives), (2) study design, (3) sequencing technology and method, (4) subjects (sample origin and region of the HTLV-1 genome sequenced), and (5) number of sequences generated. The data collected from the articles were tabulated using Microsoft Excel. The figures generated in this work were produced using the programs Adobe Photoshop and Microsoft PowerPoint (2019 versions). This study was registered with the International Prospective Register of Systematic Reviews (PROSPERO) under the number CRD42020218387. The search for studies identified a total of 574 articles, of which 350 were available in PubMed, 213 in EMBASE, one in SciELO, and 10 in Lilacs. Of these, 65 were excluded due to duplication, 404 were excluded after selection by title and abstract, and 43 after reading the full text. Ultimately, 62 articles were included in the systematic review (Fig. 1) . The articles indicated the use of three sequencing methodologies: Sanger, Illumina, and Ion Torrent. 
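The selection counts reported above can be checked with a short worked calculation. The sketch below only restates the numbers given in the text (records per database and exclusions at each stage); the function name `prisma_flow` and the stage labels are illustrative assumptions.

```python
# Records identified per database, as reported in the text.
identified = {"PubMed": 350, "EMBASE": 213, "SciELO": 1, "Lilacs": 10}

# Exclusions at each successive stage, as reported in the text.
exclusions = [
    ("after removing duplicates", 65),
    ("after title/abstract screening", 404),
    ("after full-text reading", 43),
]


def prisma_flow(identified, exclusions):
    """Return (stage label, records remaining) for each stage of selection."""
    remaining = sum(identified.values())
    steps = [("identified", remaining)]
    for label, removed in exclusions:
        remaining -= removed
        steps.append((label, remaining))
    return steps


flow = prisma_flow(identified, exclusions)
for label, remaining in flow:
    print(f"{label}: {remaining}")
```

The last stage should match the number of articles included in the review (62), which confirms that the per-database and per-stage counts are internally consistent.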
Sanger sequencing, which is the first generation of sequencing, was the most frequently used technique. Even after the emergence of NGS methodologies in 2004, it was observed that most published HTLV-1 studies continued to use the Sanger method preferentially. Among the 62 articles used in this review, 59 used the Sanger method, and, of these, 40 were carried out after 2004 (Table 1). In most of the studies, a partial HTLV-1 genome sequence was determined. Of the 59 articles that used the Sanger method, 56 reported partial genome sequences, and three reported complete sequencing of the HTLV-1 genome. Of the four articles that used NGS [12, 75, 78, 79], two reported partial genome sequences and two reported complete genome sequences. Another important aspect of these articles was the difference in the number of sequences generated for each region of the HTLV-1 genome: 1258 sequences of LTR, 89 sequences of gag, 124 of pol, 777 of env, and 1420 of the pX region (Fig. 2). It is important to highlight that there are four different overlapping open reading frames (ORFs) in the pX region that encode regulatory proteins and the HTLV-1 bZIP factor gene (hbz), which is transcribed in the antisense direction from a promoter present in the 3' LTR. The numbers of sequences generated for each ORF and for hbz are as follows: ORF-I, 311; ORF-II, 54; ORF-III, 54; ORF-IV, 1153; hbz, 10. In addition, 14 partial genome sequences with the precise regions not described were found, and 228 complete HTLV-1 genome sequences were reported. Brazil is the country with the largest number of sequences generated, distributed through 26 sequencing studies, followed by Japan, with seven. Colombia and France had four studies each, and Argentina and Chile each had three. Two studies each were performed in Gabon and Spain, and in Cuba, India, Israel, Italy, Mozambique, the UK, Portugal, and Russia, only one study was performed. 
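To make the imbalance between genomic regions easier to see, the per-region counts reported above (Fig. 2) can be converted into shares. This is only an illustrative calculation: the total used here is the sum of the region-level counts given in the text, not the number of distinct samples, and the variable names are assumptions.

```python
# Sequences generated per genomic region, as reported in the text (Fig. 2).
region_counts = {"LTR": 1258, "gag": 89, "pol": 124, "env": 777, "pX": 1420}

# Sum of region-level sequences; one sample may contribute to several regions.
total = sum(region_counts.values())

# Percentage share of each region, rounded to one decimal place.
shares = {region: round(100 * n / total, 1) for region, n in region_counts.items()}

# Regions ordered from most to least sequenced.
ranking = sorted(region_counts, key=region_counts.get, reverse=True)

for region in ranking:
    print(f"{region}: {region_counts[region]} sequences ({shares[region]}%)")
```

Under this calculation, pX and LTR together account for well over half of the region-level sequences, while gag and pol are the least represented.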
Finally, there were also six articles that did not provide information about the origins of the sequences (Fig. 3). It is important to note that, although third-generation sequencing is a state-of-the-art technique employed for genome sequencing of other retroviruses such as HIV [80], no published study using it for HTLV-1 was found. In addition, most of the studies refer to information generated more than 10 years ago, in which more than 3,000 sequences were generated, while the most recent studies generated only approximately 800 sequences (Fig. 4). Another point to be highlighted is the lack of clinical information about the patients included in the studies. Of the 62 articles included, 60% did not report the clinical status of the studied population. In the 41 years since the discovery of HTLV-1, no effective therapeutic treatments or vaccines have been developed, and it is still not clear what determines different infection outcomes. During this period, diverse sequencing technologies have become available. The central aim of this systematic review was to summarize the different technologies used in the HTLV-1 field in order to guide the decision-making processes on the generation of new HTLV-1 genome sequences. The Sanger method was the most commonly used for generating HTLV-1 sequences, followed by Illumina and Ion Torrent. All of these techniques have advantages and disadvantages. The characteristics of HTLV-1, as well as the specific aspects of each method, must be taken into consideration. One important aspect of HTLV-1 infection is that, after infection, the virus integrates into the host cell DNA as a provirus. Unlike in HIV infection, circulating viral RNA is not easily detected in the plasma or serum in HTLV-1 infection, and additional techniques are usually needed prior to sequencing, such as PBMC separation and nested PCR [81, 82]. 
In this sense, the HTLV-1 sample extraction and preparation steps are an important point to consider when choosing a sequencing platform. Among the sequencing technology platforms, Sanger sequencing is considered the gold standard because of its low error rate, despite being a first-generation method with a high cost. Furthermore, it is possible to assess the sequencing quality based on other parameters, such as sequence length, sequencing depth, and GC content. One study reported that sequencing quality is more stable and GC depth distribution is better with Ion Torrent than with HiSeq 2000 [83]. Importantly, even when the goal is to sequence larger regions and/or the complete proviral genome, technologies such as Illumina and Ion Torrent produce short sequence reads. This read size, as well as the polymerase chain reaction (PCR) step, can impair the understanding of an essential aspect of HTLV-1 infection: clonality. While in patients with ATLL there is a monoclonal pattern, in patients with IDH or HAM/TSP, and in asymptomatic carriers, a polyclonal pattern is found [84, 85]. Therefore, the short read length may make it difficult to identify viral quasispecies and may give an unrealistic biological picture. Sequencing of viral genomes is important for understanding the infection process [86]. Therefore, reliance on a few older sequencing methods, despite the emergence of more innovative, faster, and often less expensive technologies, makes the goal of developing better alternatives for infection control and the understanding of viral pathogenesis increasingly distant. In addition, animal models are important in HTLV-1 research and have allowed significant advances in the understanding of viral infection and pathogenesis. Each animal model has its advantages. 
Rats are used in studies involving HAM/TSP, and non-human primates are used in studies analyzing the immune response and viral persistence [87]. The emergence of new sequencing protocols has led to a reduction in the time required and production costs [88]. Despite that, no article included in this study used more recent technologies, such as third-generation sequencing. MinION and PacBio could be interesting alternatives because of their shorter processing time, although the sequences they provide are of moderate quality compared with older methods. These methods can be useful in HTLV-1 research, increasing the number of partial and/or complete sequences available on the platforms and contributing to a better understanding of the virus-host relationship. In addition to the predominance of the older techniques, most of the studies focused on sequencing specific regions of the genome, with few studies generating complete genome sequences. The LTR and pX regions were the most frequently sequenced. This could be because of the importance of the LTR for subtyping and the fact that pX encodes the HTLV-1 regulatory proteins. In this context, it is relevant to point out that complete genome sequencing is essential for the identification of gene functions and their involvement in disease as well as for vaccine development. This systematic review demonstrated a deficit in the number of HTLV-1 sequences. This study has an important limitation, since sequences can be deposited in databases such as GenBank without necessarily being associated with a published article. Nevertheless, our data corroborate an ongoing study carried out by our group that highlights the deficit of complete HTLV-1 genome sequences available in the GenBank database. In that study, we verified that only 242 complete HTLV-1 genome sequences were available in the GenBank database, and most of these sequences did not include clinical and epidemiological information about the patient. 
On the other hand, the majority of studies provided geographical information about the samples sequenced. Most of them were from endemic regions such as Japan and Brazil. Another country that deserves attention is Colombia. The Colombian island of Tumaco has a high population density and a very high prevalence of HAM/TSP, which is why this region is a focus of study of HTLV-1 [33]. Moreover, few articles from Africa were found, although it is the continent with the highest endemicity of HTLV-1 [1]. The European continent also contributes to the generation of HTLV-1 sequences, although relatively few articles describe the sequencing. Some studies did not report the origin of the sequence, which limits their epidemiological value. The sum of studies from each country does not correspond to the number of articles included, because some studies include samples from different countries, such as Bandeira et al., 2018 [71]. Interestingly, only 21 articles included in this review were published in the last 10 years, which corresponds to about one third (21 of 62) of the total number of studies, revealing that there is still low investment in research in the HTLV-1 field. Encouraging more investment in HTLV-1 studies may contribute to an increased number of HTLV-1 sequences generated in different geographic regions, and this can assist in the understanding of the global and regional distribution of this virus [1]. There are gaps to be filled in relation to information on HTLV-1 infection. Although it was the first human retrovirus described and has been proven to be associated with the development of diseases, studies on the pathogenesis and treatment of this virus are not encouraged, and worse, investment in research is decreasing [89], demonstrating that HTLV-1 is still a neglected virus [90, 91]. Thus, more investment in HTLV-1 research and the implementation of worldwide prevention strategies will be the main driver for the eradication of these infections. 
The analysis of the articles selected for this systematic review showed that the number of studies sequencing the HTLV-1 genome is much lower than for other retroviruses, and most of these studies still opt for Sanger sequencing despite the emergence of new methodologies. This demonstrates a lack of investment in this field. It is important to note that Sanger sequencing has advantages over other methods. However, NGS techniques also have characteristics that may be important for answering questions that remain about HTLV-1 infection. Investments in HTLV-1 research are needed, mainly in the use of more current methodologies, since these build on and improve upon the lessons learned from the previous generation.

Epidemiological aspects and world distribution of HTLV-1 infection Detection and isolation of type C retrovirus particles from fresh and cultured lymphocytes of a patient with cutaneous T-cell lymphoma Comparative seroepidemiology of HTLV-I and HTLV-III in the French West Indies and some African countries Adult T-cell leukemia: antigen in an ATL cell line and detection of antibodies to the antigen in human sera Antibodies to human T-lymphotropic virus type-I in patients with tropical spastic paraparesis Clinical, pathologic, and immunologic features of human T-lymphotrophic virus type I-associated infective dermatitis in children Isolation and characterization of retrovirus from cell lines of human adult T-cell leukemia and its implication in the disease Analyses of HTLV-1 sequences suggest interaction between ORF-I mutations and HAM/TSP outcome Assessment of genetic diversity of HTLV-1 ORF-I sequences collected from patients with different clinical profiles Molecular characterization of HTLV-1 genomic region hbz from patients with different clinical conditions A fully annotated genome sequence of human T-cell lymphotropic virus type 1 (HTLV-1) The nature of the HTLV-1 provirus in naturally infected individuals analyzed by the 
viral DNA-capture-seq approach A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase Next-generation DNA sequencing techniques Generations of sequencing technologies: from first to next generation Genome sequencing PacBio sequencing and its applications The oxford nanopore MinION: delivery of nanopore sequencing to the genomics community Host sequences flanking the human T-cell leukemia virus type 1 provirus in vivo Up-regulation of human T lymphotropic virus type 1 (HTLV-1) tax/rex mRNA in infected lung tissues Comparative molecular analysis of HTLV-I proviral DNA in HTLV-I infected members of a family with a discordant HTLV-I-associated myelopathy in monozygotic twins Chimeric matrix proteins encoded by defective proviruses with large internal deletions in human T-Cell leukemia virus type 1-infected humans Ancient HTLV type 1 provirus DNA of Andean mummy HTLV-I/HTLV-II coinfection in an AIDS patient from São Paulo, Brazil Somatic mutation in human T-cell leukemia virus type 1 provirus and flanking cellular sequences during clonal expansion in vivo Existence of escape mutant in HTLV-I tax during the development of adult T-cell leukemia HTLV-1 proviruses encoding non-functional TAX in adult T-cell leukemia Envelope sequence variation and phylogenetic relations of human T cell lymphotropic virus type 1 from endemic areas of Colombia Amazon region Seroprevalence and molecular epidemiology of HTLV-1 isolates from HIV-1 co-infected women in Feira de Santana, Bahia, Brazil Genetic characterization of human T-cell lymphotropic virus type 1 in Mozambique: transcontinental lineages drive the HTLV-1 endemic Genome epidemiology and tropical spastic paraparesis associated with human T-cell lymphotropic virus type 1 Molecular characterization of human T cell leukemia virus type 1 subtypes in a group of infected individuals diagnosed in Portugal and Spain Complete genome sequence of Central Africa human T-cell lymphotropic virus subtype 1b Tax 
gene characterization of human T-Lymphotropic virus type 1 strains from Brazilian HIV-coinfected patients Phylogenetic and similarity analysis of HTLV-1 isolates from HIV-coinfected patients from the South and Southeast regions of Brazil Molecular characterization of HTLV-1 gp46 glycoprotein from health carriers and HAM/TSP infected individuals Molecular study of HBZ and gp21 human T cell leukemia virus type 1 proteins isolated from different clinical profile infected individuals Phylogenetic analysis of human T cell lymphotropic virus type 1 isolated from Cuban individuals High prevalence of HTLV-1 infection among Japanese immigrants in nonendemic area of Brazil Prevalence and phylogenetic analysis of HTLV-1 in a segregated population in Iran Human T-lymphotropic virus 1aA circulation and risk factors for sexually transmitted infections in an Amazon geographic area with lowest human development index (Marajó Island, Northern Brazil) HTLV-1 and -2 in a first-time blood donor population in Northeastern Brazil: prevalence, molecular characterization, and evidence of intrafamilial transmission Complete sequence of human T cell leukemia virus type 1 in ATLL patients from Northeast Iran, Mashhad revealed a prematurely terminated protease and an elongated pX open reading frame III Provirus mutations of human T-lymphotropic virus 1 and 2 (HTLV-1 and HTLV-2) in HIV-1-coinfected individuals Deep sequencing analysis of human T cell lymphotropic virus type 1 long terminal repeat 5' region from patients with tropical spastic paraparesis/human T cell lymphotropic virus type 1-associated myelopathy and asymptomatic carriers The origin of HTLV-1 in southern Bahia by phylogenetic, mtDNA and β-globin analysis Human T-cell leukemia virus type 1 infection among Japanese immigrants and their descendants living in Southeast Brazil: a call for preventive and control responses Molecular characterization of human T-cell lymphotropic virus type 1 full and partial genomes by illumina 
massively parallel sequencing technology Complete genome sequence of human T-cell lymphotropic type 1 from patients with different clinical profiles, including infective dermatitis Dynamic nanopore long-read sequencing analysis of HIV-1 splicing events during the early steps of infection Detection of human T-cell lymphotropic virus type 1 in plasma samples HTLV-1 viral RNA is detected rarely in plasma of HTLV-1 infected subjects Comparison of next-generation sequencing systems HTLV-1 clonality in adult T-cell leukaemia and non-malignant HTLV-1 infection Clonal expansion of human T-cell leukemia virus type I-infected cells in asymptomatic and symptomatic carriers without malignancy Next-generation sequencing technology in clinical virology Animal models utilized in HTLV-1 research The sequence of sequencers: the history of sequencing DNA Time to eradicate HTLV-1: an open letter to WHO Sequence note: nucleotide sequence analyses of partial envgp46 gene of human T-lymphotropic virus type I from inhabitants of Fujian Province in Southeast China Nucleotide sequence analysis of a full-length human T-cell leukemia virus type I from adult T-cell leukemia cells: a prematurely terminated PX open reading frame II