key: cord-0925330-2c9mzdb3 authors: Navarro Gonzalez, Jairo; Zweig, Ann S; Speir, Matthew L; Schmelter, Daniel; Rosenbloom, Kate R; Raney, Brian J; Powell, Conner C; Nassar, Luis R; Maulding, Nathan D; Lee, Christopher M; Lee, Brian T; Hinrichs, Angie S; Fyfe, Alastair C; Fernandes, Jason D; Diekhans, Mark; Clawson, Hiram; Casper, Jonathan; Benet-Pagès, Anna; Barber, Galt P; Haussler, David; Kuhn, Robert M; Haeussler, Maximilian; Kent, W James title: The UCSC Genome Browser database: 2021 update date: 2020-11-22 journal: Nucleic Acids Res DOI: 10.1093/nar/gkaa1070 sha: dc6b2d73daf632e42079a4bc27629c7792f4f4a8 doc_id: 925330 cord_uid: 2c9mzdb3 For more than two decades, the UCSC Genome Browser database (https://genome.ucsc.edu) has provided high-quality genomics data visualization and genome annotations to the research community. As the field of genomics grows and more data become available, new modes of display are required to accommodate new technologies. New features released this past year include a Hi-C heatmap display, a phased family trio display for VCF files, and various track visualization improvements. Striving to keep data up-to-date, new updates to gene annotations include GENCODE Genes, NCBI RefSeq Genes, and Ensembl Genes. New data tracks added for human and mouse genomes include the ENCODE registry of candidate cis-regulatory elements, promoters from the Eukaryotic Promoter Database, and NCBI RefSeq Select and Matched Annotation from NCBI and EMBL-EBI (MANE). Within weeks of learning about the outbreak of coronavirus, UCSC released a genome browser, with detailed annotation tracks, for the SARS-CoV-2 RNA reference assembly. Since the debut of the UCSC Genome Browser (1) in 2001, the web-based data visualization tool has served as a digital microscope to cross-reference, interpret and analyze genome assemblies. 
From base pairs to contigs to chromosomes, the visualization tool allows genome annotations to be positioned alongside the genomic DNA itself for a large number of vertebrate species and other clades of life. In this era of big data, the UCSC Genome Browser team aspires to quickly incorporate and contextualize vast amounts of genomic information. Apart from incorporating data from researchers and consortia, the Browser also provides tools that let users view and compare their own data with ease. Custom tracks allow users to quickly view a dataset, and track hubs allow users to extensively organize their data and share it privately using a URL. Saving a session and sharing the session URL with a colleague allows easy access to the pre-configured views of an interactive Browser image (2). Public data access also enables creators to submit their hub to our list of available 'public hubs' (https://genome.ucsc.edu/cgi-bin/hgHubConnect) or 'public sessions' (https://genome.ucsc.edu/cgi-bin/hgPublicSessions). Accessing the underlying track data can be achieved in a variety of ways. The Table Browser (3) and RESTful API are useful to extract data from a region in many file formats such as BED or wiggle. The public MySQL server allows users to query data tables directly, and table dumps are available on the download server (https://hgdownload.soe.ucsc.edu/downloads.html) to enable bulk download and local processing of information in our database tables. Binary indexed files, liftOver files, and other large files can be found in the /gbdb/ directory hierarchy on the download server (https://hgdownload.soe.ucsc.edu/downloads.html#gbdb). Currently, 211 genome assemblies are available on the UCSC Genome Browser, representing 107 different species. In early 2020, as a response to the urgency of supporting biomedical research for COVID-19, the SARS-CoV-2 genome assembly was released along with relevant biomedical datasets (4).
With the growing number of datasets related to the RNA genome causing the pandemic, a COVID-19 landing page (https://genome.ucsc.edu/covid19.html) was created to consolidate and serve as a directory for certain information and research resources. Given the constant production of new datasets from researchers around the world, the UCSC Genome Browser team has added support for new data types and several new display features, some of which have been suggested by the user community. New features including Hi-C, vcfPhasedTrio and bigDbSnp data visualizations are designed to assist in the interpretation of genetic variants in clinical and research settings. As always, all data and software are freely available for personal, non-profit, and academic research use. Updating existing data tracks and displaying new annotations is a key goal for the UCSC Genome Browser team as a means to better serve the genomics community. The addition of new vertebrate genome assemblies ensures that new sequences are incorporated into the Browser as consortia work to resolve gaps and repetitive regions and to update chromosome assemblies. In the past year, considerable resources were expended to upgrade the user experience for clinical variant interpretation. A primary focus of this effort was to make the detailed information readily available via mouse-overs, rather than requiring navigation to the details page. Figure 1 shows a composite of the mouse-overs for the ClinVar Short Variants and Copy Number Variants (5), presenting the key information underlying the variant, without a click-through to the details page. On the configuration page for several tracks (ClinVar Variants, Database of Genomic Variants (6), and Leiden Open Variation Database Public Variants (7)), filters were added to allow the display of specified subsets of the data: variant type, molecular consequence and clinical significance.
Five new tracks were created to support the assessment of sequence variants in a clinical context: gnomAD Constraint Metrics (metrics of pathogenicity per-gene and transcript regions) (Variation Group) (8); gnomAD Structural Variants (allele frequencies of SVs in the common population) (Variation Group) (8); dbVar Curated Common Structural Variants (Variation Group) (9); Automatic Variant evidence Database (AVADA) variants extracted from full-text publications (Phenotype and Literature Group) (10); the ClinGen track collection, including Gene Dosage Sensitivity (haploinsufficiency and triplosensitivity) (Phenotype and Literature Group) (11), and Problematic Regions (regions known to cause short-read sequencing analysis artifacts) (Mapping and Sequencing Group). As new annotations are released by collaborators, the Browser team updates the corresponding tracks with the latest data. Using an automated system for many of these updates, the data are incorporated into the Browser soon after they are released at the source. Automated processes also check the data for consistency, flagging when updates indicate changes in data formats or unexpected changes in the number of records. This year, as indicated in Table 1, gene model tracks were updated for human, mouse, and other vertebrate genomes. The GENCODE Genes v32 and vM23 correspond to the default gene set, formerly named Known Gene, for human and mouse. While releasing the most useful genome annotations and assemblies to the Browser is a high priority, the sheer volume of new data exceeds our capacity to build tracks for everything. The Browser Track Hub mechanism allows users to view and share genomes and annotations without our intervention. In the past year, 17 hubs were added to the 'public hubs' listing, as shown in Table 2. Numerous other hubs were created and shared among colleagues, but not added to our public hub listing.
A new sharing mechanism for the NCBI RefSeq assembly hubs (http://hgdownload.soe.ucsc.edu/hubs/) is now available and utilizes short links similar to the new session URLs described in the 2020 Genome Browser update (2). For example, using the RefSeq assembly accession for an elephant genome (GCF_000001905.1), a URL can be constructed such as https://genome.ucsc.edu/h/GCF_000001905.1, and will display the African savanna elephant assembly hub. A total of four genome assemblies have been added to the Genome Browser within the last year; two of these are new to the Browser. In collaboration with the Monterey Bay Aquarium, the genome assembly for Gidget, a southern sea otter (enhLutNer1), was created and released. The other new genome assembly was the coronavirus, SARS-CoV-2 (wuhCor1), released as part of the effort to consolidate sequence and annotation information in one place for the virus and vaccine research communities. The assemblies for horse (equCab3), rhesus macaque (rheMac10) and gorilla (gorGor6) were updated. Amidst the coronavirus pandemic, the SARS-CoV-2 assembly browser was added with datasets from major annotation databases: Protein Data Bank (12), non-coding RNA families (13), Immune Epitope Database (14), Global Initiative on Sharing All Influenza Data (GISAID) (15), and Universal Protein Resource (16). The addition of an RNA virus genome required changes to our BLAT tool to make searching feasible, and changes to the genomic display to substitute uracil for thymine. The default view and tracks for the SARS-CoV-2 genome browser are shown in Figure 2. These datasets include a wide array of information such as gene annotations, variant data, antibody epitope mappings, single-nucleotide variants (SNVs), and locally produced multiple genome alignments. Primer sets for the SARS-CoV-2 virus were added for RT-PCR, CRISPR and sequencing.
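The short-link convention for NCBI RefSeq assembly hubs described above amounts to appending the assembly accession to a fixed prefix. A minimal Python sketch follows; the function name `assembly_hub_short_link` and the accession pattern it checks (`GCF_`, nine digits, a version suffix) are assumptions made for illustration and are not part of the Browser software.

```python
import re

# Assumed shape of a RefSeq assembly accession: "GCF_", nine digits, ".", version.
REFSEQ_ACCESSION = re.compile(r"^GCF_\d{9}\.\d+$")

SHORT_LINK_PREFIX = "https://genome.ucsc.edu/h/"


def assembly_hub_short_link(accession):
    """Return the short link for a RefSeq assembly hub (hypothetical helper)."""
    if not REFSEQ_ACCESSION.match(accession):
        raise ValueError(f"not a RefSeq assembly accession: {accession!r}")
    return SHORT_LINK_PREFIX + accession


if __name__ == "__main__":
    # The elephant assembly used as the example in the text.
    print(assembly_hub_short_link("GCF_000001905.1"))
```

Rejecting malformed accessions early is a design choice of this sketch: an accession that has lost its underscore would otherwise yield a link that does not resolve.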
We also added a Problematic Sites track with locations where masking or caution is recommended for analysis, for example, variants that are most likely sequencing artifacts and should not be used for phylogeny building (17). To accommodate the vast number of datasets being generated by researchers worldwide, crowd-sourced community annotations were added as a track. Using this mechanism, anyone can add annotations using a simple Google Sheets spreadsheet linked from the track documentation. Information about tracks released can be found in the Nature Genetics paper released in September 2020 (4). In the human browsers (GRCh37/hg19 and GRCh38/hg38), tracks presenting meta-analysis of SARS-CoV-2 infection susceptibility and disease severity in humans were added from the COVID-19 Host Genetics Initiative (18). The new 'lollipop' display is used to highlight the genomic positions of SNPs that have significant effects on the phenotypes studied. The height and colors of the lollipop items represent the statistical significance along with the effect direction and size. Items can be optionally filtered by the number of studies where the SNP was identified as significant, the minimum −log10 P-value and the effect size. Last year, we began to incorporate official patch sequences from the Genome Reference Consortium into the hg38 assembly (2). This year, we added patch sequences from GenBank (19) to the hg19 assembly, and also introduced a new mitochondrial sequence, chrMT. The original hg19 genome assembly was released at UCSC along with the GenBank sequence NC_001807, designated chrM, as the mitochondrial sequence. However, the sequence preferred by the community is the revised Cambridge Reference Sequence (rCRS) NC_012920 (20). This sequence, along with many annotations, has been added to hg19 as chrMT (https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chrMT:1-16569). The original chrM sequence remains.
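The chrMT link quoted above follows the usual pattern of a Browser view URL: a `db` parameter naming the assembly and a `position` parameter of the form `chrom:start-end`. As a rough sketch, such links can be assembled programmatically; the helper name `hgtracks_url` and its argument checks are assumptions for illustration only.

```python
from urllib.parse import urlencode

HGTRACKS = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def hgtracks_url(db, chrom, start, end):
    """Build a Browser view URL for a region (hypothetical helper).

    start and end are the 1-based coordinates shown in the position box.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    query = urlencode({"db": db, "position": f"{chrom}:{start}-{end}"}, safe=":")
    return f"{HGTRACKS}?{query}"


if __name__ == "__main__":
    # Full-length view of the rCRS mitochondrial sequence added to hg19.
    print(hgtracks_url("hg19", "chrMT", 1, 16569))
```

Passing `safe=":"` keeps the colon in the position string unescaped, so the output matches the form of the link given in the text.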
The patch sequences added to hg19 correspond to Genome Reference Consortium's human build 37 patch release 13 (GRCh37.p13) and can be viewed using two tracks, the Reference Assembly Fix Patch Sequence Alignments and Reference Assembly Alternate Haplotype Sequence Alignments (Mapping and Sequencing). Adding the patch sequences to the hg19 genome can cause problems for short read aligners, because some sequences appearing in alternate haplotypes now appear as repeats to aligners, when in fact they are unique in the genome, just not in the genome database. In response, an 'analysis set' version of the hg19 genome FASTA files (https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/analysisSet/) has been added to the bigZips directory, along with indices for BWA (21, 22), Bowtie2 (23), and Hisat2 (24). This analysis set is identical to NCBI's analysis set but with UCSC style sequence names. The ENCODE candidate cis-regulatory elements (cCREs) combined from all cell types track (Regulation Group) was added to the human (hg38) and mouse genomes (GRCm38/mm10) this year. The registry of cCREs is a core result of the integrative analysis of epigenomic and transcriptomic data sets produced over nearly two decades by the ENCODE Consortium (25) and the Roadmap Epigenomics project (26), and is featured in the July 2020 special issue of Nature marking the results from phase 3 of the ENCODE project (27). The transcription start sites from the Eukaryotic Promoter Database (28, 29) were incorporated into the Promoters from EPDnew track (Expression Group) for human (hg38 and hg19) and mouse (mm10) assemblies. These tracks represent experimentally validated promoters based on gene transcript models obtained from multiple sources (HGNC (30), GENCODE (31), Ensembl (32) and RefSeq (33)), then validated using data from CAGE (34) and RAMPAGE (35) experimental studies obtained from FANTOM 5 (36), UCSC, and ENCODE.
Peak calling, clustering, and filtering based on relative expression were applied to identify the most expressed promoters and those present in the largest number of samples. The GTEx gene expression from RNA-seq track (Expression Group) for hg38 and hg19 was updated to reflect the final data release (V8) from the project. This release is based on data from 17,382 tissue samples obtained from 948 adult post-mortem individuals, reflecting a near doubling of samples and donors from the previous (V6, midpoint) release. The GTEx project final reporting is featured in the September 11 special issue of Science (37). Several software improvements for track visualization and overall usability of the Genome Browser have been made in the last year. The capability of the RESTful API has been expanded. We have also added new display modes for data types not previously supported, such as chromatin conformation data and phased trio haplotypes. During the year, NCBI modified the format of their dbSNP download files. This required re-engineering the pipeline for display in the Browser, which additionally provided the opportunity for improvements in functionality. Hide empty subtracks. Motivated by new large composite ENCODE track hubs (>1200 tracks of transcription factor ChIP-seq peaks), a new feature named 'hide empty subtracks' allows users to configure a composite to display only those subtracks containing data in the current viewing region. This feature is demonstrated in the new Problematic Regions track for the hg19 assembly, as shown in Figure 3. The track highlights regions known to cause issues in short read alignment, variant calling, or peak calling. Currently, the feature is limited to bigBed tracks inside of a multi-view composite track. Collapse track items. Another track visualization improvement influenced by the large amount of data available for ENCODE tracks is the ability to merge track items that span the genomic region of the viewing window.
If an item neither begins nor ends within the viewing window, then this track setting suppresses display of the item. This is useful when visualizing large chromosome imbalances in tracks such as DECIPHER (38) or ClinVar CNVs (5), which have data across many megabases of DNA. A click on the merged track items restores the default view and all items are shown again. An example of collapsed track items is showcased in Figure 4 with the ClinVar CNVs track for hg38. RESTful API changes. The RESTful API was described in the 2020 Genome Browser update (2) and offers an easy method to extract and download annotations, chromosome lists, DNA sequences and other data from the Browser. In the original release, the tool was limited to nine track types. In the past year, support for other track types was added, including: altGraphX, barChart, chain, ctgPos, expRatio, factorSource, gvf, interact, netAlign, peptideMapping and pgSnp.
Figure 3. The Problematic Regions track is shown in both images. In (A), the original Browser display is shown for the NEB gene, which has an internal duplication of some exons; short read sequencing mapping algorithms do not work well in this region. In (B), the 'hide empty subtracks' feature is used and is hiding four subtracks from the Browser display. (C) shows the track configuration settings used in (B).
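To make the RESTful API description more concrete, the sketch below builds request URLs for two of its endpoint functions, `/getData/track` and `/list/schema`. It only assembles URLs and performs no network access. The host name `api.genome.ucsc.edu`, the helper names, and the track name `gtexGeneV8` used in the usage lines are assumptions for illustration; the API documentation is the authority on exact parameter names and coordinate conventions.

```python
from urllib.parse import urlencode

API_BASE = "https://api.genome.ucsc.edu"  # assumed public API host


def track_data_url(genome, track, chrom=None, start=None, end=None):
    """URL requesting items of one track, optionally limited to a region."""
    params = {"genome": genome, "track": track}
    if chrom is not None:
        params["chrom"] = chrom
        # A start/end pair is only meaningful together with a chromosome.
        if start is not None and end is not None:
            if not 0 <= start < end:
                raise ValueError("expected 0 <= start < end")
            params["start"] = start
            params["end"] = end
    return f"{API_BASE}/getData/track?{urlencode(params)}"


def schema_url(genome, track):
    """URL requesting the schema and configuration of one track."""
    return f"{API_BASE}/list/schema?{urlencode({'genome': genome, 'track': track})}"


if __name__ == "__main__":
    # gtexGeneV8 is an assumed name for a barChart-type track on hg38.
    print(track_data_url("hg38", "gtexGeneV8", "chr1", 1000, 2000))
    print(schema_url("hg38", "gtexGeneV8"))
```

The resulting URLs can be passed to any HTTP client; the responses are JSON documents.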
With the original release of the REST API, there were seven endpoint functions. This past year, a new endpoint function, /list/schema/, was added. This function returns the data format or track schema and configuration parameters in JSON format for a data track in a specified hub or native genome assembly. Hi-C display. The increasing availability of chromatin conformation data, particularly since the release of the in situ Hi-C protocol published in 2014 (39), stimulated the development of a display mode to visualize these data. Several tools already existed to view Hi-C data (JuiceBox (40), HiGlass (41), the 3D Genome Browser (42), and the WashU Epigenome Browser (43)), but there was no fully integrated solution to view the data in tandem with Genome Browser tracks or sessions. Two new Hi-C heatmap tracks are available for human (hg38 and hg19) assemblies that utilize the new display mode. Heatmaps can be configured as squares, triangles, or arcs showing interaction scores, which could indicate enhancer-promoter interactions.
An example of the traditional Hi-C heatmap available on the Genome Browser is shown in Figure 6. High interaction scores indicate that more linkages were formed in the chromatin experiments and are shown with an increase in color intensity. Hi-C data from custom tracks and track hubs can also be visualized (40, 44, 45). Personal genomics trio display. A new track type, vcfPhasedTrio, allows for the visualization of phased personal genomics data, generally a trio consisting of a child and two parents. Two lines are drawn per sample in the underlying VCF, illustrating the haplotypes of each person's diploid genome. Variants are then drawn as tick marks on the haplotype line to which they belong, such that variants on the same line were likely inherited together. The child haplotypes are drawn in the center of each group, flanked above and below by the parent haplotypes. Haplotypes are sorted to show the transmitted alleles and ticks are colored in a variety of user-configurable settings, such as by inconsistent phasing information or predicted functional effect. The track type also allows custom sorting of the individuals, for instance, showing the child haplotypes below the parent haplotypes. The vcfPhasedTrio track type is available for both custom tracks and track hubs. The 1000 Genomes Project Family VCF trios track in hg38 utilizes this track type and is shown in Figure 7. New dbSNP display and JSON format. The addition of the short genetic variants from dbSNP release b153 for the human assemblies (hg19 and hg38) introduces a new pipeline and display for the dataset. dbSNP (46) has seen exponential growth in recent releases, from roughly 324 million variants in build b150 to >700 million variants in the latest build b153. To continue providing efficient access to the data, dbSNP has redesigned its architecture and data flow.
At the same time, they have made an important change to the representation of insertion/deletion variants (indels) in repetitive regions. Rather than annotating the minimal representation of the indel on the genome, which requires a choice of left-most, right-most, or arbitrary placement within the repetitive region, dbSNP now expands the reference and alternative alleles to cover the entire repetitive region on the genome. The change in dbSNP's data format, and indel representation in particular, has led to the redesign of the dbSNP import pipeline and data representation at UCSC. A new track type, bigDbSnp, was created that uses thin and thick lines to indicate the region of uncertain placement of indels and the minimal size. An example of an arbitrarily placed variant is shown in Figure 8. The dbSNP b153 track is composed of five subtracks, four of which correspond to previously released SNP tracks (All, Common, Flagged, and Mult subsets) [...] hg19 and hg38. While processing the information downloaded from dbSNP, UCSC annotates some properties of interest. These are noted on the variant's details page, and the track can be filtered to include or exclude affected variants. The SNP tracks were previously based on related MySQL database tables, but with the release of dbSNP b153, the bigDbSnp format is a bigBed file with extra columns that contain all the necessary information to display a variant. An accompanying dbSnpDetails file includes additional data displayed on the details page for an item. With this bigDbSnp format change, the data will no longer be available as database table dumps. Instead, the bigDbSnp file for each subtrack and the shared dbSnpDetails file may be downloaded for hg19 and hg38.
Figure 6. The heatmap for hg38 shows a traditional Hi-C display in 'square' mode (other viewing modes are 'triangle' and 'arc'). Scores for this type of display correspond to how close two genomic regions are in 3-D space, with greater color intensity indicating a higher-scoring interaction. The upper-left corner of the square corresponds to the left-most position of the current window, while the bottom-right corner corresponds to the right-most position of the window.
Figure 7. The haplotypes for the mother and father are displayed above and below the child's haplotypes. Variants predicted to be de novo mutations in the child are shown in red.
Figure 8. This image shows a variant in the new dbSNP b153 track inside a repetitive region. There is a deletion of one base in a range of nine identical bases, so a thin rectangle is drawn over the first eight bases to show that there is uncertain placement, and a thick rectangle is placed over the last base to show that one base is deleted from the range.
In the past year, the Genome Browser's training team provided 20 in-person seminars and workshops and three webinar-style presentations to help users take advantage of the latest features, including appearances at several national and international meetings. Due to the coronavirus pandemic, more than a dozen appearances were canceled, postponed, or converted to virtual presentations. Outreach is supported by updates to the training documentation (https://genome.ucsc.edu/training/) with links to videos and in-depth descriptions of new Browser features. The training page also includes information on how to submit a request for a workshop and where future workshops are scheduled. Seven new videos have been added to the Genome Browser YouTube channel (https://bit.ly/ucscVideos) in the past year.
Previous videos highlighted specific problems or features of the Browser that are not obvious to the casual user or had been requested by users on our mailing lists. A new, three-part series, 'UCSC Genome Browser Basics,' is designed to help new users gain familiarity with the Browser and its many features. Another video, 'UCSC Genome Browser: Coronavirus Browser SARS-CoV-2,' is designed to introduce workers in virology to the Browser and the virus. Finally, a three-part series, 'Making Links to the UCSC Genome Browser,' is designed to help programmers, bioinformaticians, scientists using spreadsheets, and anyone wishing to share stable, customizable links to the Genome Browser. General contact information for the UCSC Genome Browser can be found at https://genome.ucsc.edu/contacts.html, including information on accessing our email support list and an archive of previously answered mailing list questions. UCSC also maintains mirrors in Germany and Japan with the gracious assistance of Bielefeld University, Germany, and RIKEN, Japan. These sites can be found at https://genome-euro.ucsc.edu and https://genome-asia.ucsc.edu. The coming year will bring more data and tools to the UCSC Genome Browser. Additional features will be added to the phased haplotypes display. More track filtering options will be added, and the 'hide empty subtracks' feature will include a display of the number of tracks that are hidden in the viewing window. New features for composite tracks are in development, such as faceted search controls to configure complex composite tracks. Support for single-cell sequencing will continue to be developed in the coming year. We will continue to incorporate COVID-19 human annotations and SARS-CoV-2 viral genome annotations as they become available.
The Human Genome Browser at UCSC
UCSC Genome Browser enters 20th year
The UCSC Table Browser data retrieval tool
The UCSC SARS-CoV-2 Genome Browser
ClinVar: improvements to accessing data
The Database of Genomic Variants: a curated collection of structural variation in the human genome
LOVD v.2.0: the next generation in gene variant databases
The mutational constraint spectrum quantified from variation in 141,456 humans
DbVar and DGVa: public archives for genomic structural variation
AVADA: toward automated pathogenic variant evidence retrieval directly from the full-text literature
ClinGen -- The Clinical Genome Resource
The Protein Data Bank
Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families
Development of an epitope conservancy analysis tool to facilitate the design of epitope-based diagnostics and vaccines
GISAID: Global initiative on sharing all influenza data - from vision to reality
UniProt: a worldwide hub of protein knowledge
Updated analysis with data from 12th
The COVID-19 Host Genetics Initiative, a global initiative to elucidate the role of host genetic factors in susceptibility and severity of the SARS-CoV-2 virus pandemic
Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA
Fast and accurate long-read alignment with Burrows-Wheeler transform
Fast and accurate short read alignment with Burrows-Wheeler transform
Fast gapped-read alignment with Bowtie 2
HISAT: a fast spliced aligner with low memory requirements
An integrated encyclopedia of DNA elements in the human genome
Integrative analysis of 111 reference human epigenomes
Expanded encyclopaedias of DNA elements in the human and mouse genomes
The Eukaryotic Promoter Database: expansion of EPDnew and new promoter analysis tools
The eukaryotic promoter database in its 30th year: focus on non-vertebrate organisms
Genenames.org: the HGNC and VGNC resources in 2019
GENCODE reference annotation for the human and mouse genomes
Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation
A promoter-level mammalian expression atlas
High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression
An atlas of active enhancers across human cell types and tissues
The GTEx Consortium (2020) The GTEx Consortium atlas of genetic regulatory effects across human tissues
DECIPHER: Database of chromosomal imbalance and phenotype in humans using Ensembl resources
A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping
Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom
HiGlass: web-based visual exploration and analysis of genome interaction maps
The 3D Genome Browser: a web-based browser for visualizing 3D genome organization and long-range chromatin interactions
WashU Epigenome Browser update 2019
Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments
Ultrastructural details of mammalian chromosome architecture
dbSNP: the NCBI database of genetic variation