key: cord-0266265-bbgoya97
authors: Agostinetto, Giulia; Sandionigi, Anna; Chahed, Adam; Brusati, Alberto; Parladori, Elena; Balech, Bachir; Bruno, Antonia; Pescini, Dario; Casiraghi, Maurizio
title: ExTaxsI: an exploration tool of biodiversity molecular data
date: 2020-11-06
journal: bioRxiv
DOI: 10.1101/2020.11.05.369983
sha: 5bd56832366d8a16ab771b582362145035dee78e
doc_id: 266265
cord_uid: bbgoya97

Background The increasing availability of multi-omics data is leading to continual revision of estimates of existing biodiversity. In particular, molecular data make it possible to characterize previously unknown species and to enrich the information on species already observed with new genomic data. For this reason, managing and visualizing existing molecular data and their related metadata through easy-to-use IT tools has become a key point for the development of future research. The more users are able to access biodiversity-related information, the greater the ability of the scientific community to expand knowledge in this area. Results In our research we focused on the development of ExTaxsI (Exploring Taxonomies Information), an IT tool able to retrieve biodiversity data stored in NCBI databases and provide a simple and explorable visualization. Through the three case studies presented here, we show how an efficient organization of data already available can yield new information that serves as a starting point for new research. Our approach was also able to highlight the limits in the availability of distribution data, a key factor to consider in the experimental design phase of broad-spectrum studies such as metagenomics. Conclusions ExTaxsI can easily produce explorable visualizations of molecular data and their metadata, with the aim of helping researchers improve experimental designs and highlighting the main gaps in the coverage of available data.

In recent years, studies investigating biodiversity at large scale have started to create and incorporate molecular data. In particular, the spread of metagenomic studies (e.g. metabarcoding) has contributed to an exponential increase in genomic data availability. Thanks to this large amount of new information it is possible to expand our knowledge and enhance our scientific investigation capacity in many fields of research [47], ranging from macro-ecology and ecosystem monitoring, to food safety control, forensics [...] punctually updated exist only for a few molecular markers, such as SILVA for 16S and 18S genes [48], BOLD for animals and plants [50] or UNITE for the Fungi domain [44]. However, these data resources are not representative of all the genomic and taxonomic diversity collected to date. On the other hand, although GenBank still gathers the majority of the genetic data and related metadata currently available [3, 5, 30], such information is not always easy to access without specific bioinformatics skills, which is a limiting factor for a large audience of scientists.

With the aim of helping biologists improve experimental designs and of encouraging data exploration and exploitation, we have developed a tool, ExTaxsI (Exploring Taxonomies Information), to facilitate the integration of molecular data with their associated metadata, possibly retrieved from heterogeneous sources. Moreover, its easy-to-use interface will help researchers and practitioners in the visualization phase. ExTaxsI can both query the NCBI Nucleotide database for molecular data and accept data from an external source, exploiting the standard taxonomy notation. The tool is linked to the NCBI taxonomy database [17] and the ETE toolkit [26], in order to produce standard formats readable by the most common software dealing with taxonomic information [4, 7, 8, 38, 51, 56], such as the QIIME2 platform [7].
The tool is applicable to any marker, gene name or taxonomic group, so it is possible to create non-standard marker gene databases usable in metagenomic/metabarcoding taxonomic assignment tools [7]. In addition, thanks to the integration of the NCBI query tool [11], ExTaxsI can reorganize personal datasets in a standardized format in order to easily describe the taxonomic variability and geographic provenance of records.

2 ExTaxsI

ExTaxsI is a bioinformatic tool aimed to elaborate and visualize molecular and [...] format) dataset and iii) their related taxonomy classification paths/datasets, thanks to the integration of NCBI taxonomy data, iv) the creation of genetic marker lists coming from different studies and finally v) the production of interactive plots starting from NCBI query search results or directly from offline taxonomic files, including representative graphs for the exploration of taxonomy and refinement of biogeographical data, creating geographical maps with the locations of the species analyzed (Figure 1). It is important to note that ExTaxsI outputs are compatible with other tools for taxonomic assignment purposes, such as the QIIME2 platform [7].

Figure 1. ExTaxsI pipeline: module 1 (orange) searches and creates files and databases; module 2 (green) processes georeferenced or taxonomic data for the creation of graphs and plots; module 3 (blue) converts taxonomic data into taxonomic ID (TaxID) and vice versa.

The communication with the NCBI server is mediated by the Entrez module [11], implemented in the Biopython library [10], which allows searching, downloading and parsing query results. To ease the interaction with NCBI, when fewer than 2500 requests are needed the search key is composed of a single query; otherwise the query is split into groups of 2500, generating temporary files that are merged into a single output file at the end of the process.
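The batching strategy just described can be illustrated with a minimal sketch. The function names here (`split_into_batches`, `fetch_in_batches`, `entrez_fasta_fetch`) are hypothetical and are not taken from the ExTaxsI code base. The sketch merges partial results in memory rather than through temporary files, and the Biopython-based fetcher is only one plausible way to retrieve each batch.

```python
from typing import Callable, List

BATCH_SIZE = 2500  # group size stated in the text


def split_into_batches(ids: List[str], size: int = BATCH_SIZE) -> List[List[str]]:
    """Return consecutive groups of at most `size` identifiers."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def fetch_in_batches(ids: List[str],
                     fetch: Callable[[List[str]], str],
                     size: int = BATCH_SIZE) -> str:
    """Call `fetch` once per batch and concatenate the partial results."""
    parts = [fetch(batch) for batch in split_into_batches(ids, size)]
    return "".join(parts)


def entrez_fasta_fetch(batch: List[str]) -> str:
    """Hypothetical fetcher based on Biopython; requires network access."""
    from Bio import Entrez  # imported lazily so the helpers above run without Biopython

    Entrez.email = "you@example.org"  # placeholder contact address requested by NCBI
    handle = Entrez.efetch(db="nucleotide", id=",".join(batch),
                           rettype="fasta", retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()
```

A call such as `fetch_in_batches(accessions, entrez_fasta_fetch)` would then return a single multi-FASTA string. Passing the fetcher as a parameter keeps the batching logic testable offline.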

Regarding taxonomy handling, the ETE toolkit was exploited [26]. In particular, ETE makes it possible to create a local taxonomy database and keep it up to date, extracting the six main ranks (phylum, class, order, family, genus and species). If the organism is poorly described or is an unknown species, the Taxonomy ID (TaxID) of its ancestor (known as the parent TaxID) in the ETE taxonomic tree is used and converted into the corresponding scientific name. It is important to underline that all queries are carried out locally, avoiding unnecessary delays and allowing the integration of the tool in genomic and metagenomic pipelines.

Since ExTaxsI is a taxonomy-focused data exploration tool, we designed three scenarios of increasing complexity to challenge it with increasing taxonomic variability and number of accession entries. The first scenario is a query exploring data with i) low taxonomic variability and a high number of expected entries (1 species, more than 300,000 entries). The second scenario involves ii) high taxonomic variability and a large expected number of entries (about 500 species, more than 300,000 entries). The third and most complex scenario explores iii) a complete case study with taxonomic input intersected with molecular data. As case studies for the first two scenarios, we focused on taxa of interest in marine fisheries: 1) the cod species Gadus morhua, for which a worldwide economic interest exists, and 2) its taxonomic group at order level, the Gadiformes, which supports long-standing commercial fisheries and aquaculture. These two case studies evaluate the capacity to explore data and to fill in the geographic distribution of a species, also surveying the available gene information in view of a genetic survey (e.g. a DNA metabarcoding study). With the third use case, we aimed to demonstrate the flexibility of ExTaxsI in a different context: a genetic exploration of the data available in NCBI for the SARS-CoV-2 virus, a very recent topic that has involved many research groups and led to huge amounts of data being collected and deposited in public sources [6]. A large-scale exploration of data related to this topic can potentially improve the reliability of results and provide valuable evidence to inform decisions on public health protection, both now and, most importantly, in the future.

The first scenario is the case of the species Gadus morhua, also called Atlantic cod. Gadus morhua is a large, cold-adapted teleost fish that supports long-standing commercial fisheries and aquaculture [27, 28, 33, 34, 54].

ExTaxsI retrieved a total of 366,963 accessions using the taxonomy ID through the following query: "txid8049[ORGN]" (where 8049 is the Gadus morhua TaxID; 18 June 2020). Only 53,695 entries carried a 'gene' tag that ExTaxsI could investigate. Since the query targets a single species, we chose to represent the results of the gene survey (Figure 2) and the world map plot (Figure 3). Regarding gene distribution, the most abundant gene is CYTB (985 accessions), followed by COI (434) and ND2 (311). The remaining most abundant genes are the other ND portions and Cytochrome Oxidase fragments (COIII and COII), all belonging to the mitochondrial genome. These results show the greater effort devoted to sequencing "standard" barcoding markers, with whole mitochondrial genomes being sequenced only moderately. The remaining genes in the retrieved list and the distribution of their relative accession frequencies (see the complete list in Additional file 1) demonstrate that the entire genome of this species was sequenced. These results are in line with those obtained by Knudsen and colleagues (2019), who developed specific primers for CYTB amplification, as it is a widely used marker in fish molecular characterization.
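The query string quoted above follows the Entrez field-tag syntax, with the TaxID written as `txid<ID>` under the `[ORGN]` field. As a hedged illustration, the sketch below composes such a key and optionally restricts it to gene names through the `[GENE]` field. The helper `build_query` is hypothetical, and the way gene terms are combined is an assumption, not a description of how ExTaxsI builds its search key.

```python
from typing import Optional, Sequence


def build_query(taxid: int, genes: Optional[Sequence[str]] = None) -> str:
    """Compose an Entrez search key for one TaxID, optionally limited to genes."""
    query = f"txid{taxid}[ORGN]"
    if genes:
        # Assumption: alternative gene names are OR-ed and then AND-ed with the taxon
        gene_terms = " OR ".join(f"{gene}[GENE]" for gene in genes)
        query = f"{query} AND ({gene_terms})"
    return query


print(build_query(8049))                   # the Gadus morhua query used in the text
print(build_query(8049, ["COI", "CYTB"]))  # same taxon, restricted to two markers
```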

Regarding the geographic area, the Gadidae family has a circumpolar distribution, comprising species occurring principally in northern and cool seas [28]. Further, as reported by Jorde and colleagues (2018), four distinct stocks of the Atlantic cod can be recognized in Norway: (1) the oceanic Northeast Arctic cod, (2) coastal cod north of 62°N, (3) coastal cod south of 62°N, and (4) a North Sea/Skagerrak stock, the most densely populated region in Norway [28]. This geographic distribution is partly visible [...]

The second scenario takes as an example the order Gadiformes (phylum: Chordata; class: Actinopterygii), a major group of organisms in marine fisheries. It includes many important food fishes, variously marketed as cods, hakes, grenadiers, moras, moray cods, pelagic cods, codlets and eucla cods [43]. It is a vast group comprising more than 500 species, which contribute more than a quarter of the world's marine fish catch [13, 43].

As shown, Gadidae is the most abundant family in terms of the number of accessions available: a total of 380,658 accessions populate this group, followed by the Merlucciidae (3,196) and Macrouridae (1,581) families. These results are in accordance with the literature, as Gadidae is a primarily marine, bottom-dwelling family of fishes in the order Gadiformes with great commercial importance [33, 43]. Further, considering the ScatterPlot in Additional file 3, the interactive visualization allowed us to explore the taxonomic distribution of the available accessions, dynamically changing the rank to be explored. This feature reveals that the genus Gadus is the most abundant in the entire dataset, with the species Gadus morhua accounting for 94.43% of all the data. This is an expected result, as Gadus morhua is documented to be a key species both in the North Atlantic ecosystem and in commercial fisheries, with increasing aquaculture production in several countries [28]. Considering the genetic information retrieved by ExTaxsI, a total of 28,839 unique genes were found among the 60,703 completely tagged accessions. A ranking of the ten most abundant genes is reported in Figure 2. As shown in the figure, the first position is held by the COI gene, a marker widely used in metabarcoding projects (Knudsen et al. 2019) that deal mainly with animal detection [47], followed by CYTB and ND2 [47].

To conclude on these two case studies, the tool was able to accurately portray the state of the art of the genetic information available in NCBI. Comparing the most abundant genes found among the records, a slight discrepancy between the two taxa explored is visible (Figure 2), highlighting the kind of findings that the survey can report. In general, the detection of mitochondrial genes, coding for Cytochrome [...] [23, 24, 42]. To date, considering the subjects of our case studies, diverse studies have used COI or CYTB barcoding to identify seafood products and explore broad patterns in fish mislabelling [9, 16, 18, 40, 49, 59, 61].

Regarding the extraction of geographic metadata from NCBI records, the completeness and collection of data can drastically improve biogeographic and ecological research, making it possible not only to explore sampling areas, but also to improve phylogeographic investigations, biodiversity monitoring and environmental genomics strategies [12, 47].

The imbalance between the number of records and the number of explorable genes is in some cases due to the incompleteness of the 'gene' tag. In recent years genome sequences have started to play a key role in public repositories, which make sequences available for sharing and reuse. The submission process can be challenging, and errors can affect the availability of the data. For this reason, there is wide interest in integrating standardized procedures into the annotation process [19]. The promotion of FAIR principles and best practices can help avoid error propagation in sequence databases [46, 58], making the data fully explorable in the future.

[...] [36]. The pandemic linked to SARS-CoV-2 highlighted hidden virus reservoirs in wild animals and their potential to occasionally spill over into human populations [36]. A detailed understanding of this process is crucial to prevent future spillover events. As reported in the seminal paper of Andersen and colleagues (2020) [2], the risk of future re-emergence events increases if SARS-CoV-2 pre-adapted in another animal species. SARS-CoV-2 probably originated from Rhinolophus affinis bats, with the pangolin (Manis javanica) as intermediate host [2]. Recently, other animal species have been proposed as possible intermediate hosts between bats and humans. To date, ACE2 (Angiotensin-converting enzyme 2), the receptor that binds the receptor-binding domain (RBD) of the SARS-CoV-2 S protein [35], is reported as crucial in host invasion.

To test our approach and explore the genetic information available in NCBI, we [...]

Lastly, we explored the data available for SARS-CoV-2 (Figure 4) using the following query: "txid2697049" (where 2697049 is the TaxID of severe acute respiratory syndrome coronavirus 2; 29 June 2020). We obtained a total of 8,137 accessions. The top ten genes retrieved are shown in Figure 4c. In particular, the number of accessions per gene is quite similar among the top ten, probably because of the large number of genomes deposited in the database. The three most represented genes in the database are ORF1AB (7892), followed by two important structural proteins: S (7829), the spike or surface glycoprotein, and N fragments (7817), the nucleocapsid protein. Considering ORF1AB, several studies have demonstrated its pivotal role among coronaviruses [55], providing a clinical target to break down SARS-CoV-2 infection [31]. Regarding the second and third results, the nucleocapsid phosphoprotein is involved in packaging the RNA into virus particles and protects the viral genome. For these reasons, it has been suggested as an antiviral drug target [20, 60]. The spike glycoprotein, instead, is located on the outside of the virus particle, mediating its attachment and promoting entry into the host cell. It also gives the viruses their crown-like appearance. In the most recent research, the S protein was found to be an important target for antigen-based diagnostic tests, antibody therapies and vaccine development [45, 53]. The entry of SARS-CoV-2 is mediated by further processes, for example the activity of the protease TMPRSS2 [25]. Also in this case, the use of ExTaxsI could help unearth similar proteases in possible intermediate hosts, revealing new insights into the mechanism of infection.

As also documented by Khailany et al. (2020) [31], the huge and rapidly growing amounts of data collected in the last few months necessitate a large-scale exploration of the data. The rapid increase in data releases may give important insights into SARS-CoV-2 behaviour in its host species, helping to improve not only our knowledge but also models to predict COVID-19 outcomes and the identification of new drug targets.

ExTaxsI provides an easy-to-use standalone tool able to interact with NCBI databases and personal datasets, offering instruments to standardize taxonomy information and visualize vast quantities of data spread across different taxonomic levels. It also provides interactive visualization plots, easily shareable in HTML format.

The user-oriented interrogation of NCBI databases may help researchers involved in environmental genomics, from phylogeographic studies to DNA metabarcoding surveys, and also in projects related to human health, as we demonstrated with the SARS-CoV-2 case study.

With this work, we hope to meet the needs of a vast group of researchers by providing an instrument that is easy to install on common laptops and directly connected with NCBI databases. In our opinion, the data management ability of ExTaxsI, together with its interactive visual exploration, can substantially improve the experimental design phase and the awareness of [...]

No specific system requirements are needed for the installation of ExTaxsI; however, for the correct functioning of the software we suggest a minimum of 4 GB of RAM. Moreover, to successfully run ExTaxsI, the following Python libraries must be installed: NumPy, SciPy, Matplotlib, ipython, Pandas, SymPy, nose, genutils and Plotly, in addition to the ETE toolkit [26]. To install compatible versions of all dependencies, we provide a requirements list on the GitHub page (https://github.com/qLSLab/extaxsi), with a detailed guideline for directly setting up a conda environment.
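Before launching the tool, it can be convenient to check that the listed libraries are importable in the active environment. The following is a minimal sketch using only the standard library. The import names in `REQUIRED` are assumptions, since a package name and its import name can differ (in particular for genutils), and Biopython is included because the Entrez module is used for NCBI communication.

```python
import importlib.util
from typing import Dict, List

# Display name -> assumed import name (these import names are assumptions)
REQUIRED: Dict[str, str] = {
    "NumPy": "numpy",
    "SciPy": "scipy",
    "Matplotlib": "matplotlib",
    "ipython": "IPython",
    "Pandas": "pandas",
    "SymPy": "sympy",
    "nose": "nose",
    "genutils": "genutils",
    "Plotly": "plotly",
    "ETE toolkit": "ete3",
    "Biopython": "Bio",
}


def missing_packages(required: Dict[str, str] = REQUIRED) -> List[str]:
    """Return the display names of packages that cannot be found."""
    return sorted(name for name, module in required.items()
                  if importlib.util.find_spec(module) is None)


if __name__ == "__main__":
    missing = missing_packages()
    print("All dependencies found" if not missing
          else "Missing: " + ", ".join(missing))
```

The requirements list in the repository remains the authoritative source for package versions; this check only reports what is absent.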

Regarding the organization of the tool, ExTaxsI is designed as separate but interconnected modules, so that users can start from different points of its workflow and additional modules can be integrated more simply in the future.

The 'Database' module allows users to create multi-FASTA files composed of nucleotide sequences, taxonomic lists, gene names and their related accessions, starting from either a single query or, in batch mode, csv/tsv files (Figure 1). After the type of input is indicated, the tool asks, except for accession file input, whether the user wants to integrate the query with one or more gene names (or other details). This step allows the user to restrict the search in NCBI if needed. In general, the output formats are i) a multi-FASTA file (a widely used format for molecular sequences) and ii) a text file in TSV format with two columns: the accession code, followed by the taxonomy path of each accession at the six main levels (phylum, class, order, family, genus and species) separated by semicolons. When requested by the user, the gene name output is a TSV file consisting of a table with two columns, one listing the genes and the other the frequency of each gene in the analyzed records. The tool also provides a summary table containing the most frequent genes from a list of TaxIDs, accessions or organisms. In addition, it is possible to create a barplot of the top ten entries of this summary table, downloadable as a PNG file.

[...] format files.
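The two-column gene frequency table produced by the 'Database' module can be sketched as follows. The function name `gene_frequency_tsv`, the header labels and the upper-casing of gene names are assumptions made for illustration. The input is a hypothetical list holding the gene tags found in each record.

```python
import csv
import io
from collections import Counter
from typing import Iterable, Optional, Sequence


def gene_frequency_tsv(genes_per_record: Iterable[Sequence[str]],
                       top: Optional[int] = None) -> str:
    """Count gene tags across records and render a two-column TSV table."""
    # Assumption: names are upper-cased so that 'cytb' and 'CYTB' are counted together
    counts = Counter(gene.upper() for genes in genes_per_record for gene in genes)
    # Most frequent first; ties broken alphabetically for a stable output
    rows = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    if top is not None:
        rows = rows[:top]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene", "frequency"])
    writer.writerows(rows)
    return buffer.getvalue()


records = [["CYTB", "COI"], ["cytb"], [], ["ND2", "COI", "CYTB"]]
print(gene_frequency_tsv(records, top=10))
```

Calling the function with `top=10` yields the rows that would feed a top-ten barplot, while `top=None` keeps the complete list.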
In detail, ScatterPlot uses taxonomy as input to produce a graph that indicates the quantity of each individual taxonomic unit; the interactive plot enables the user to: i) choose the taxonomic level to be displayed using the buttons located under the graph; ii) hover over points to show details, such as the number of records within taxa, the names of selected taxa and the name of the higher taxon from which they derive. The plot also allows comparing more data on mouse-over, highlighting an area of interest with the zoom function, and viewing a specific group or removing taxa from the graph. SunBurst, instead, creates from a taxonomy input an Expansion Pie that allows exploring the taxonomy by clicking on the taxonomic group of interest, which shows the underlying taxa in a new SunBurst. Also in this case, hovering over points shows the number of records within taxa. Regarding the world map plot, the initial input is processed in order to obtain geographic data. The tool exploits the 'Country' metadata stored in NCBI records to produce a map indicating the position of each entry. In this step, based on the type of geographic data obtained, ExTaxsI divides the results into two arrays: i) an array of coordinates (if coordinates are present in the record) or ii) an array of state names (if coordinates are not present in the record). External sources can also be processed and added to the map. In each map, coordinates are indicated by green X signs and states by red circles.
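The split of geographic metadata into a coordinate array and a state-name array can be sketched as follows. The record layout (dictionaries with `lat_lon` and `country` keys, mirroring the GenBank source qualifiers of the same names) and the helper names are assumptions for illustration. The coordinate parser only handles the decimal `"59.91 N 10.75 E"` style.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

# Matches strings such as "59.91 N 10.75 E"
LAT_LON = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([NS])\s+(\d+(?:\.\d+)?)\s*([EW])\s*$")


def parse_lat_lon(text: str) -> Optional[Tuple[float, float]]:
    """Convert a 'lat N/S lon E/W' string into signed decimal degrees."""
    match = LAT_LON.match(text)
    if match is None:
        return None
    lat = float(match.group(1)) * (1 if match.group(2) == "N" else -1)
    lon = float(match.group(3)) * (1 if match.group(4) == "E" else -1)
    return lat, lon


def split_geodata(records: Iterable[Dict[str, str]]
                  ) -> Tuple[List[Tuple[float, float]], List[str]]:
    """Separate records into coordinates and, failing that, state names."""
    coordinates: List[Tuple[float, float]] = []
    states: List[str] = []
    for record in records:
        point = parse_lat_lon(record.get("lat_lon", ""))
        if point is not None:
            coordinates.append(point)
        elif record.get("country"):
            # Keep only the state, dropping any locality given after a colon
            states.append(record["country"].split(":")[0].strip())
    return coordinates, states


example = [
    {"lat_lon": "59.91 N 10.75 E", "country": "Norway: Oslo"},
    {"country": "Iceland"},
    {},  # no geographic metadata at all: ignored
]
print(split_geodata(example))
```

Records with usable coordinates would be drawn as point markers, and the remaining ones would be aggregated by state.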

When multiple taxa are plotted, each symbol can have a legend that summarizes the data downloaded with the same country or coordinate description. Further, it is possible to see both the genes and the counts available among the accessions represented.

This module (module 3 in Figure 1) converts TaxIDs to the six main taxonomic ranks (phylum, class, order, family, genus and species) and vice versa; it can convert single manual inputs or multiple inputs from a tsv/csv file containing a list of TaxIDs.
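The conversion between a TaxID and the six main ranks can be illustrated with a toy example. ExTaxsI relies on the local ETE taxonomy database for this step. The sketch below replaces it with a small in-memory table so that it runs without any download. All TaxIDs in the table are invented except 8049 (Gadus morhua, quoted in the text), and the `NA` placeholder for a missing rank is an assumption.

```python
from typing import Dict, Optional, Tuple

MAIN_RANKS = ("phylum", "class", "order", "family", "genus", "species")

# Toy table: TaxID -> (parent TaxID, rank, scientific name).
# Every ID except 8049 is invented for illustration.
TOY_TAXONOMY: Dict[int, Tuple[Optional[int], str, str]] = {
    1: (None, "no rank", "root"),
    10: (1, "phylum", "Chordata"),
    20: (10, "class", "Actinopterygii"),
    30: (20, "order", "Gadiformes"),
    40: (30, "family", "Gadidae"),
    50: (40, "genus", "Gadus"),
    8049: (50, "species", "Gadus morhua"),
    9000: (50, "no rank", "unclassified Gadus"),
}


def main_ranks_path(taxid: int, taxonomy=TOY_TAXONOMY) -> str:
    """Walk up the parent links and return the six main ranks, ';'-separated."""
    found: Dict[str, str] = {}
    current: Optional[int] = taxid
    while current is not None:
        parent, rank, name = taxonomy[current]
        if rank in MAIN_RANKS:
            found[rank] = name
        current = parent
    return ";".join(found.get(rank, "NA") for rank in MAIN_RANKS)


def name_to_taxid(name: str, taxonomy=TOY_TAXONOMY) -> Optional[int]:
    """Reverse lookup: scientific name to TaxID (None if unknown)."""
    for taxid, (_, _, scientific_name) in taxonomy.items():
        if scientific_name == name:
            return taxid
    return None


print(main_ranks_path(8049))
print(main_ranks_path(9000))  # poorly described entry: ranks come from its ancestors
print(name_to_taxid("Gadus morhua"))
```

For an entry without a species-level rank, the path is filled from its ancestors, which mirrors the parent-TaxID fallback described for poorly characterized organisms.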

A new genomic blueprint of the human gut microbiota

The proximal origin of sars-cov-2

Its2 database v: Twice as much

Metaxa2: improved identification and taxonomic classification of small and large subunit rrna in metagenomic data

Genbank nucleic acids res. jan

Connecting data, tools and people across europe: Elixir's response to the covid-19 pandemic

Reproducible, interactive, scalable and extensible microbiome data science using qiime 2

Blast+: architecture and applications

Marketplace substitution of atlantic salmon for pacific salmon in washington state detected by dna barcoding

Biopython: freely available python tools for computational molecular biology and bioinformatics

Database resources of the national center for biotechnology information

Ecosystems monitoring powered by environmental genomics: a review of current strategies with an implementation roadmap

Global coordination and standardisation in marine biodiversity through the world register of marine species (worms) and related databases

Sars-cov-2: Structural diversity, phylogeny, and potential animal host identification of spike glycoprotein

Environmental dna metabarcoding: Transforming how we survey animal and plant communities

Dna barcoding for detecting market substitution in salted cod fillets and battered cod chunks

The ncbi taxonomy database

Dna barcoding coupled to hrm analysis as a new and simple tool for the authentication of gadidae fish species

Genome annotation generator: a simple tool for generating and correcting wgs annotation tables for ncbi submission

A sars-cov-2 protein interaction map reveals targets for drug repurposing

Skills and knowledge for data-intensive environmental research

A decadal view of biodiversity informatics: challenges and priorities

Barcoding animal life: cytochrome c oxidase subunit 1 divergences among closely related species

Comparison of dna extraction and pcr setup methods for use in high-throughput dna barcoding of fish species

Sars-cov-2 cell entry depends on ace2 and tmprss2 and is blocked by a clinically proven protease inhibitor

Ete 3: reconstruction, analysis, and visualization of phylogenomic data

Large-scale sequence analyses of atlantic cod

Who is fishing on what stock: population-of-origin of individual cod (gadus morhua) in commercial and recreational fisheries

Issues and suggestions for the development of a biodiversity data visualization support tool

Bcdatabaser: on-the-fly reference database creation for (meta-) barcoding

Genomic characterization of a novel sars-cov-2

The architecture of sars-cov-2 transcriptome

Species-specific detection and quantification of environmental dna from marine fishes in the baltic sea

Cod: a Biography of the Fish that Changed the world

Functional assessment of cell entry and receptor usage for sars-cov-2 and other lineage b betacoronaviruses

Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding

Sars-cov-2 spike protein favors ace2 from bovidae and cricetidae

Swarm v2: highly-scalable and high-resolution amplicon clustering

Ecoinformatics: supporting ecology as a data-intensive science

Smoke, mirrors, and mislabeled cod: poor transparency in the european seafood industry

Mgnify: the microbiome analysis resource in 2020

Development of a cox1 based pcr-rflp method for fish species identification

The unite database for molecular identification of fungi: handling dark taxa and parallel taxonomic classifications

Gene of the month: the 2019-ncov/sars-cov-2 novel coronavirus spike protein

Ncbi-compliant genome submissions: tips and tricks to save time and money

Scaling up: A guide to high-throughput genomic approaches for biodiversity analysis

Silva: a comprehensive online resource for quality checked and aligned ribosomal rna sequence data compatible with arb

Dna-based methods for the identification of commercial fish and seafood species. Comprehensive reviews in food science and food safety

Bold: The barcode of life data system

Vsearch: a versatile open source tool for metagenomics

Past, present, and future perspectives of environmental dna (edna) metabarcoding: A systematic review in methods, monitoring, and applications of global edna

Sars-cov-2 spike protein: an optimal immunological target for vaccines

The genome sequence of atlantic cod reveals a unique immune system

Receptor recognition by the novel coronavirus from wuhan: an analysis based on decade-long structural studies of sars coronavirus

Naive bayesian classifier for rapid assignment of rrna sequences into the new bacterial taxonomy

Nine simple ways to make it easier to (re) use your data

The fair guiding principles for scientific data management and stewardship. Scientific data

Dna barcoding detects market substitution in north american seafood

A new coronavirus associated with human respiratory disease in china

Potential use of dna barcodes in regulatory science: applications of the regulatory fish encyclopedia