key: cord-0874243-qqyg9um9
authors: Zhu, Xun; Chang, Ti-Cheng; Webby, Richard; Wu, Gang
title: idCOV: a pipeline for quick clade identification of SARS-CoV-2 isolates
date: 2020-10-09
journal: bioRxiv
DOI: 10.1101/2020.10.08.330456
sha: 71b3d58a2ae82662a5dd7f945e3e1ffd6d21daf5
doc_id: 874243
cord_uid: qqyg9um9

idCOV is a phylogenetic pipeline for quickly identifying the clades of SARS-CoV-2 virus isolates from raw sequencing data, based on a selected list of clade-defining markers. Using a public dataset, we show that idCOV, working directly from user-uploaded FastQ files, makes calls equivalent to those annotated by Nextstrain.org in all three common clade systems. A web interface and an equivalent command-line interface are available. idCOV can be deployed in any Linux environment, including personal computers, HPC clusters and the cloud. The source code is available at https://github.com/xz-stjude/idcov. Installation documentation can be found at https://github.com/xz-stjude/idcov/blob/master/README.md.

The ongoing Coronavirus disease 2019 (Covid-19) pandemic has resulted in over 734,000 deaths, affecting more than 188 countries and territories (CSSE, 2020; Dong, et al., 2020). The general impact of the disease calls for a quick response from the local health system at every stage of its transmission.

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2; previously known as 2019-nCoV) is the pathogenic cause of Covid-19 (Lescure, et al., 2020). Quickly [...] can also provide more granular information for contact tracing and for epidemiological studies to understand disease severity and host genetic susceptibility to different lineages of SARS-CoV-2.

In order to quickly identify the clade of an isolate of SARS-CoV-2 given its sequencing FastQ files, we have developed a bioinformatics pipeline and a user-friendly web interface. An equivalent command-line interface is also provided for batch-processing many samples in a terminal environment.
We introduce this system as idCOV.

Table D) shows the marker coverages. When a marker is covered by fewer than 10 reads, the marker is deemed undetermined and denoted by a question mark (?).

References

CSSE. COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE) at Johns
Nextflow enables reproducible computational workflows
An interactive web-based dashboard to track COVID-19 in real time
GISAID - Clade and lineage nomenclature aids in genomic epidemiology of active hCoV-19 viruses
Haplotype-based variant detection from short-read sequencing
Introductions and early spread of SARS-CoV-2 in the New York City area
Year-letter Genetic Clade Naming for SARS-CoV-2 on Nextstrain.org
Datahike is a durable Datalog database powered by an efficient Datalog query engine
Clinical and virological data of the first cases of COVID-19 in Europe: a case series. The Lancet Infectious
Fast and accurate short read alignment with Burrows-Wheeler transform
The sequence alignment/map format and SAMtools
Tracking the COVID-19 pandemic in Australia using genomics
A new coronavirus associated with human respiratory disease in China
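
The coverage rule stated in the text (a marker covered by fewer than 10 reads is reported as undetermined and shown as `?`) can be illustrated with a minimal sketch. This is not idCOV's actual implementation: the function names `call_marker` and `call_markers`, the input layout (marker position mapped to an observed allele and a read depth), and the example positions are all assumptions made for illustration. Only the threshold of 10 reads and the `?` symbol come from the text.

```python
from typing import Dict, Tuple

# Threshold taken from the text: fewer than 10 covering reads -> undetermined.
MIN_READS = 10
# Symbol the text uses for an undetermined marker.
UNDETERMINED = "?"


def call_marker(allele: str, depth: int, min_reads: int = MIN_READS) -> str:
    """Return the observed allele, or '?' if the marker is under-covered.

    `allele` and `depth` are hypothetical inputs; how idCOV derives them
    from the alignment is not described in this section.
    """
    if depth < min_reads:
        return UNDETERMINED
    return allele


def call_markers(observations: Dict[int, Tuple[str, int]]) -> Dict[int, str]:
    """Apply the coverage rule to every marker position in `observations`.

    `observations` maps a genome position to (observed allele, read depth).
    """
    return {
        position: call_marker(allele, depth)
        for position, (allele, depth) in observations.items()
    }


if __name__ == "__main__":
    # Illustrative positions and depths only, not data from the paper.
    example = {241: ("T", 152), 3037: ("T", 9), 23403: ("G", 10)}
    for position, call in call_markers(example).items():
        print(position, call)
```

In this sketch a depth of exactly 10 is treated as sufficient, following the wording "fewer than 10 reads"; a marker at depth 9 is reported as `?`.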