key: cord-0763561-djol4x04
authors: nan
title: Database Resources of the National Genomics Data Center, China National Center for Bioinformation in 2022
date: 2021-10-28
journal: Nucleic Acids Res
DOI: 10.1093/nar/gkab951
sha: 5ea10b7ab5a1e1f1e26c6964e4a70f532cc48a2b
doc_id: 763561
cord_uid: djol4x04

The National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB), provides a family of database resources to support global research in both academia and industry. With multi-omics data accumulating explosively at ever-faster rates, CNCB-NGDC is constantly scaling up and updating its core database resources through big data archive, curation, integration and analysis. In the past year, efforts have been made to synthesize the growing data and knowledge, particularly in single-cell omics and precision medicine research, and a series of resources have been newly developed, updated and enhanced. Moreover, CNCB-NGDC has continued to update SARS-CoV-2 genome sequences, variants, haplotypes and literature daily. In particular, OpenLB, an open library of bioscience, has been established to provide easy and open access to a substantial number of abstract texts from PubMed, bioRxiv and medRxiv. In addition, Database Commons has been significantly updated by cataloguing a full list of global databases, and BLAST tools have been newly deployed to provide online sequence search services. All these resources, along with their services, are publicly accessible at https://ngdc.cncb.ac.cn.

The National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB), was officially founded in 2019. Since then, CNCB-NGDC has been built through the joint efforts and collaboration of three institutions of the Chinese Academy of Sciences, namely the Beijing Institute of Genomics, the Institute of Biophysics and the Shanghai Institute of Nutrition and Health, together with several partners (https://ngdc.cncb.ac.cn/partners).
In the past several years, an increasing number of large-scale high-throughput sequencing projects have been carried out in biomedical research worldwide, resulting in vast amounts of multi-omics data that are continually generated at ever-growing rates and scales. Therefore, CNCB-NGDC is devoted to accelerating progress in life and health sciences by providing open access to a suite of database resources through big data archive, curation, integration and analysis (1) (2) (3) (4) (5). Rapid advances in single-cell sequencing technologies have opened a new era for biomedical research, paving the way to delineate cellular composition diversity and elucidate complex mechanisms of organ development and diseases at single-cell resolution (6, 7). In addition, large-scale cohort-based precision medicine studies have identified new biomarkers and drug targets, greatly promoting the development of more effective means for disease diagnosis, molecular subtyping and medical treatment (8). To synthesize such growing data and knowledge, CNCB-NGDC has made considerable efforts in the past year by developing new resources and updating relevant resources. In particular, because the coronavirus disease pandemic remains a global threat to human health, CNCB-NGDC has continued to put enormous effort into the daily update of SARS-CoV-2 genome sequences, variants, haplotypes and literature (https://ngdc.cncb.ac.cn/ncov) (9, 10). Moreover, Database Commons has been significantly updated to provide open access to a full list of worldwide biological databases, and BLAST tools have been newly deployed to provide online sequence search services.

CeDR Atlas (https://ngdc.cncb.ac.cn/cedr, detailed in (12) in this issue) is a knowledge base reporting computational inference of cellular drug response for hundreds of cell types from various tissues.
By collecting the fast-growing single-cell transcriptome profiles generated by multiple international consortiums and other available labeled datasets, tissue- and cell type-specific drug response analysis was conducted to provide direct references for cellular drug response profiles, including not only disease cell types but also normal cell types. Currently, CeDR Atlas maintains the results of 582 single-cell data objects for human, mouse and cell lines. Specifically, it hosts 188 157 significant cell type-drug associations for human, 42 660 for mouse and 10 299 for cell lines.

Cell Taxonomy (https://ngdc.cncb.ac.cn/celltaxonomy) is a curated repository of cell types and cell markers covering a wide range of species, tissues and conditions. Based on manual curation of 3402 publications, it presents a standardized and well-structured taxonomy for 2650 cell types and collects 25 087 associated cell markers in 157 conditions and 296 tissues across 21 species. In addition, Cell Taxonomy incorporates 564 single-cell RNA-seq datasets and provides multifaceted characterization of cell types and cell markers by enrichment analysis, cellular component similarity estimation and quality assessment of cell markers and cell clusters. Taken together, Cell Taxonomy is of great utility for cell type characterization and accurate selection of cell markers and reference datasets, functioning as a fundamental reference cellular resource for a wide range of single-cell research.

CompoDynamics (https://ngdc.cncb.ac.cn/compodynamics, detailed in (13) in this issue) is a comprehensive database of sequence compositions of coding sequences (CDSs) and genomes for a wide range of species. CompoDynamics characterizes rich sequence compositions (nucleotide content, codon usage and amino acid usage) and derived molecular features (coding potential, physicochemical property and phase separation) for 118 689 747 high-quality CDSs and 34 562 genomes across 24 995 species.
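To make the three sequence compositions named above concrete, the following minimal Python sketch computes nucleotide content, codon usage and amino acid usage for a single CDS. It is an illustration only and is not the CompoDynamics pipeline: the function names and the example sequence are hypothetical, and the standard genetic code (NCBI translation table 1) is assumed for every species.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1); codons are enumerated
# with bases in the order T, C, A, G, third position varying fastest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def nucleotide_content(seq: str) -> dict:
    """Return the fraction of each of A, C, G and T in the sequence."""
    seq = seq.upper()
    if not seq:
        return {base: 0.0 for base in "ACGT"}
    return {base: seq.count(base) / len(seq) for base in "ACGT"}


def gc_content(seq: str) -> float:
    """Return the combined fraction of G and C."""
    content = nucleotide_content(seq)
    return content["G"] + content["C"]


def codon_usage(cds: str) -> Counter:
    """Count in-frame codons; a trailing partial codon is ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def amino_acid_usage(cds: str) -> Counter:
    """Count encoded amino acids, skipping stop codons and unknown codons."""
    usage = Counter()
    for codon, count in codon_usage(cds).items():
        aa = CODON_TABLE.get(codon)
        if aa is not None and aa != "*":
            usage[aa] += count
    return usage


# Hypothetical toy CDS: start codon, two alanine codons, stop codon.
example_cds = "ATGGCCGCCTAA"
print(gc_content(example_cds))
print(codon_usage(example_cds))
print(amino_acid_usage(example_cds))
```

Counting codons with a `Counter` keeps the raw counts available, so relative frequencies or per-amino-acid normalizations can be derived later without rescanning the sequence.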
In addition, multiple tools are provided to enable comparative analyses of sequence compositions and features across different species and gene groups. Collectively, CompoDynamics has great potential to help reveal sequence composition dynamics across genes and genomes, providing a fundamental resource for a broad spectrum of biological studies.

The Open Archive for Miscellaneous Data (OMIX; https://ngdc.cncb.ac.cn/omix), a new member of the GSA family, aims to meet users' needs for archiving miscellaneous data that are unsuitable for storage in GSA/GSA-Human. It allows different data types (e.g. microarray and genotype), accepts various omics data (e.g. lipidome, metabolome and proteome) and houses analyzed results and related research data (e.g. clinical information, demographic data and questionnaires). OMIX features straightforward submission interfaces and offers open-access and controlled-access data management strategies. As of September 2021, OMIX has archived 269 data submissions totaling 13.3 Terabytes (TB), among which 115 have controlled access.

The Open Library of Bioscience (OpenLB; https://ngdc.cncb.ac.cn/openlb) provides easy and open access to a large body of biological literature. In the current version, it contains ∼33 million abstract texts from PubMed (14), bioRxiv and medRxiv. OpenLB provides both simple keyword query and advanced search functionalities to help users search publications in a convenient and customized manner. In addition, OpenLB aims to provide seamless links with CNCB-NGDC database resources, associating scientific literature with omics data and curated information where available, so that users can easily find both publications and their related data/information. Ongoing efforts of OpenLB include the integration of more literature types, deployment of a named entity recognition tool and development of a manuscript submission service.
The Registry and Database of Bioparts for Synthetic Biology (RDBSB; https://www.biosino.org/rdbsb) is a finely curated resource for catalytic bioparts, incorporating comprehensive information on biopart sequences and functions (including catalytic processes, qualitative and quantitative parameters and biopart expression). RDBSB collects 366 045 catalytic bioparts, of which 72 180 are manually curated with experimental evidence from literature mining. In addition, RDBSB collects relevant experimental conditions, such as pH, temperature and chassis, which are crucial for pathway design in a given chassis.

Regeneration Roadmap (https://ngdc.cncb.ac.cn/regeneration, detailed in (15) in this issue) is a comprehensive database collecting and standardizing experimental data generated in regeneration research. In the current version, Regeneration Roadmap systematically collects regeneration-related information comprising over 1.96 million data entries across 10 species and 34 tissues, including regeneration-related genes, bulk and single-cell transcriptomics, epigenomics and pharmacogenomics data. In this database, users can easily explore regulatory and expression changes of regeneration-associated genes in different species or tissues. Together, Regeneration Roadmap provides the research community with a long-awaited and valuable data resource featuring convenient computing and visualization tools.

BioProject (https://ngdc.cncb.ac.cn/bioproject) and BioSample (https://ngdc.cncb.ac.cn/biosample) are two public repositories of biological research projects and samples, respectively. They collect descriptive metadata on biological projects and samples investigated in experiments and provide centralized access to all public projects and samples as well as cross links to their related data resources. BioProject organizes a huge volume of projects, involving multi-omics sequencing efforts, genome-wide association studies and variation analyses.
BioSample supports a wide scope of sample types, including human, plant, animal, microbe, virus, pathogen and metagenome. Up to September 2021, a total of 4514 biological projects and 482 577 samples have been submitted by 2538 users from 514 organizations (Figure 2A), a rapid increase over the 2288 projects and 176 288 samples in August 2020.

The Genome Sequence Archive (GSA; https://ngdc.cncb.ac.cn/gsa) (16,17) is a public data repository for archiving raw sequence reads. GSA accepts worldwide data submissions, performs data curation and quality control, and provides free open access to all publicly available data without restrictions. In addition, GSA for Human (GSA-Human; https://ngdc.cncb.ac.cn/gsa-human) (17), serving as an important partner database of GSA, features controlled-access and security services for human genetics-related data and accepts data submissions of various studies, including disease, cohort, cell line, clinical pathogen and human-associated metagenome. As of September 2021, GSA together with GSA-Human has reached a milestone of over 10 PB of raw sequencing data archived, as well as 398 322 experiments and 465 245 runs (Figure 2B and C), more than double the volume of the previous release last August (∼4.6 PB). In particular, GSA-Human has accommodated 5.6 PB of raw sequence data since its inception in 2018, demonstrating that human genetic data are growing at an unprecedented rate and scale.

The Genome Warehouse (GWH; https://ngdc.cncb.ac.cn/gwh) is a public repository of genome-scale data for a wide range of species (18). By September 2021, GWH has housed a total of 20 606 submitted genome assemblies covering 1251 species (Figure 2D), more than double the previous release (9337 assemblies in 2020). Among them, 9886 genome assemblies have been publicly released and reported in 97 articles of 47 journals.
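The growth figures quoted above for BioProject, BioSample, GSA and GWH can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: the numbers are copied from the text, the GSA values are approximate because the text reports "over 10 PB" and "∼4.6 PB", and the dictionary labels and function name are assumptions introduced here.

```python
# (previous release, current release) as reported in the text.
reported = {
    "BioProject projects": (2288, 4514),
    "BioSample samples": (176_288, 482_577),
    "GSA + GSA-Human raw data (PB, approximate)": (4.6, 10.0),
    "GWH genome assemblies": (9337, 20_606),
}


def fold_change(previous: float, current: float) -> float:
    """Return current / previous, i.e. how many times larger the new value is."""
    if previous <= 0:
        raise ValueError("previous value must be positive")
    return current / previous


for name, (previous, current) in reported.items():
    print(f"{name}: {fold_change(previous, current):.2f}-fold")
```

On these inputs the GSA and GWH ratios both exceed 2, consistent with the doubling described in the text, while BioProject falls just short of a twofold increase and BioSample grows by more than 2.7-fold.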
In particular, GWH has received the submission of 1660 SARS-CoV-2 genome assemblies, which were further integrated into the 2019 novel coronavirus resources (2019nCoVR) (9, 10). Moreover, compared with the previous release, GWH has been significantly upgraded by providing a sequence alignment service via BLAST (19) and supplying encrypted links for reviewing unreleased data. Collectively, GWH serves as an important resource for genome assembly data to support genomic research throughout the world.

The Gene Expression Nebulas (GEN; https://ngdc.cncb.ac.cn/gen) is a data portal integrating transcriptomic profiles at both bulk and single-cell levels in various conditions across multiple species (detailed in (20) 21), editome-disease associations from Editome-Disease Knowledgebase (EDK) (22) and RT-qPCR reference genes from Internal Control Genes (ICG) (23) are also interconnected to expand the scope of knowledge for corresponding genes.

The Methylation Bank (MethBank; https://ngdc.cncb.ac.cn/methbank) (24,25) is a comprehensive database of DNA methylation data. The current version of MethBank incorporates 855 single-base resolution methylomes (SRMs), 93 936 775 methylation profiles of genes, 6 945 524 methylated CpG Islands and 304 884 differentially methylated promoters based on whole-genome bisulfite sequencing data, exhibiting significant updates relative to the previous version in August 2020 (394 SRMs, 19 701 343 methylation profiles, 1 258 420 methylated CpG Islands and 304 884 differentially methylated promoters). Based on 4577 450K DNA methylation samples from normal peripheral blood, MethBank also offers 692 methylation sites closely associated with age, 2335 sites with constant methylation levels across different ages, 53 211 age-specific differentially methylated cytosines and 1899 age-specific differentially methylated regions.
The single-cell methylation bank (scMethBank; https://ngdc.cncb.ac.cn/methbank/scm) is a public data portal that integrates a comprehensive collection of single-cell DNA methylation data (detailed in (26) in this issue). In the past year, scMethBank has rapidly grown from 3166 samples in August 2020 to 8328 samples currently, involving 29 cell types and 67 619 genes with curated metadata in human and mouse. Based on uniform data processing, it presents whole-genome DNA methylation profiles at single-nucleotide resolution in various biological contexts and developmental stages. Accordingly, user-friendly web interfaces for data search, download and visualization, as well as online tools for downstream analysis, are implemented in scMethBank.

LncRNAWiki (https://ngdc.cncb.ac.cn/lncrnawiki) is a wiki-based database for community curation of human long non-coding RNAs (lncRNAs) (27, 28). The current version, LncRNAWiki 2.0, is significantly updated by (i) providing a new curation model with more informative and essential annotation items, (ii) developing a new web system based on MySQL/Java (instead of MediaWiki) that is capable of organizing all contents in a structured manner, (iii) improving the community-annotation submission functionality and providing more user-friendly web interfaces and (iv) adding online tools for ID conversion and functional prediction. Consequently, LncRNAWiki 2.0 incorporates 2512 lncRNAs and their annotations, compared to 2056 featured lncRNAs in LncRNAWiki 1.0 in 2020, thus providing an up-to-date picture of experimentally validated and functionally annotated lncRNAs in human.

The updated version of piRBase v3.0 (http://bigdata.ibp.ac.cn/piRBase) (29) is a comprehensive database of piRNA sequences. In the current release of piRBase, the number of non-redundant piRNA sequences has increased from 173 million last August to 181 million, and the number of species has reached 44, compared to 21 in August 2020.
In view of the huge number of piRNAs, it provides users with gold standard piRNA sequence sets. To further expand research on piRNA function, potential information on splicing-junction piRNAs and piRNA variants is also included in piRBase, offering an alternative explanation for possible mechanisms of piRNAs. In addition, it integrates piRNA-related information on a variety of diseases, such as cancers, cardiovascular diseases, stroke and Alzheimer's disease. piRBase also presents the regulatory network of piRNAs in a visualized manner and provides the expression of piRNAs in different tissues and cell lines.

The EWAS Open Platform (https://ngdc.cncb.ac.cn/ewas) is an open platform for epigenome-wide association studies (EWAS), including EWAS Atlas, EWAS Data Hub and EWAS Toolkit (detailed in (30) in this issue). As an EWAS knowledgebase, EWAS Atlas (https://ngdc.cncb.ac.cn/ewas/atlas) has grown from 577 267 associations in August 2020 to 617 018 associations curated from 910 publications, covering 618 traits and 3382 cohorts in September 2021 (31). As the data portal of EWAS Open Platform, EWAS Data Hub (https://ngdc.cncb.ac.cn/ewas/datahub) integrates 115 852 samples (in contrast to 95 783 samples in August 2020) of standardized DNA methylation array data (450K and EPIC/850k) (32) and the corresponding metadata involving 925 tissues/cells and 528 diseases (33). EWAS Toolkit (https://ngdc.cncb.ac.cn/ewas/toolkit) is newly developed to provide downstream analysis and network visualization, such as trait enrichment, genomic location enrichment, GO and KEGG enrichment, chromatin state and histone modification enrichment, tissue methylation, expression regulation, motif enrichment and EWAS knowledge graph.

GWAS Atlas (https://ngdc.cncb.ac.cn/gwas) (34) is a curated resource of genome-wide variant-trait associations in plants and animals.
In contrast to 78 950 associations in August 2020, the current version of GWAS Atlas has archived a total of 96 141 associations across seven cultivated plants and five domesticated animals, manually curated from 1350 studies in 367 publications. As a result, a total of 23 880 genes and 862 traits were annotated and presented based on a set of ontologies. Together, GWAS Atlas provides high-quality curated GWAS associations for plants and animals, and accordingly serves as a valuable resource for genetic research of important traits and breeding application.

BrainBase (https://ngdc.cncb.ac.cn/brainbase, detailed in (35) in this issue) is a curated knowledgebase for brain diseases that aims to provide a whole picture of brain diseases and associated genes. Compared to the previous version, which contained 4248 associations and 3996 genes in August 2020, the current version houses 7175 disease-gene associations, spanning a total of 123 brain diseases and linking with 5662 genes. It also integrates 16 591 drug-target interactions covering 2118 drugs/chemicals and 623 genes, and presents specific genes in light of expression specificity in brain tissue/regions/cerebrospinal fluid/cells. In addition, BrainBase incorporates multi-omics datasets to identify glioma featured genes with potential clinical significance.

The database of Differentially Expressed MicroRNAs in human Cancers (dbDEMC, https://www.biosino.org/dbDEMC) is an integrated database for storing and annotating potential cancer-related microRNAs (miRNAs), retrieved by analyzing large numbers of miRNA expression profiling studies. Compared with the previous version (2224 differentially expressed miRNAs [DEMs] in 36 cancer types from 209 expression profiling data sets), dbDEMC version 3.0 integrates more data entries, containing 3268 DEMs in 40 cancer types curated from 807 experiments in human, mouse and rat.
It is also updated by enhancing the visualization functionalities for expression heatmap, regulatory network, gene ontology, KEGG pathway map and miRNA expression boxplot. In addition, dbDEMC incorporates experimentally validated targets for the DEMs. Therefore, dbDEMC will play an important role in characterizing molecular functions and regulatory mechanisms of DEMs in human cancers.

The 2019 Novel Coronavirus Resources (2019nCoVR; https://bigd.big.ac.cn/ncov) (36,37) contains a comprehensive collection of all publicly available SARS-CoV-2 genome sequences with quality evaluation and value-added manual annotations. Consequently, it houses a global landscape of genomic variants and haplotypes, visualizes the spatiotemporal change for each variant and constructs haplotype network maps for the course of the outbreak. More importantly, it provides a hierarchical epidemiological lineage browser to easily capture the leading edge of pandemic transmission (38). In addition, 2019nCoVR offers a set of online tools for SARS-CoV-2 genome assembly and annotation, variant identification and effect annotation, genome tracing and haplotype construction, as well as a full collection of literature on COVID-19 (9). Notably, all SARS-CoV-2 genome sequences, variants, haplotypes and literature have been updated daily since January 2020. Meanwhile, a patient-centric resource named integrative CT images and clinical features for COVID-19 (iCTCF) has been developed to archive chest CT images and 130 types of clinical features as well as laboratory-confirmed SARS-CoV-2 clinical status, providing a useful tool for improving diagnosis and treatment of COVID-19 patients (39).

iDog (https://ngdc.cncb.ac.cn/idog) is an integrated omics data resource for domestic dog (Canis lupus familiaris) and wild canids (40). In the current version, iDog is updated by integrating 27 ancient dog samples with 6 544 496 unique SNPs and including 26 cell clusters with 105 057 single cells for dog brain tissue.
As a result, a total of 71 050 194 unique SNPs in 722 samples, 481 breeds, 806 diseases and 1170 genotype-to-phenotype pairs from 1192 experiments and 62 high-quality RNA-seq projects are integrated, a dramatic increase from 42 871 184 SNPs and 594 genotype-to-phenotype pairs in August 2020. Additionally, iDog provides an online classification tool that predicts dog breed using a deep learning method. As a data resource of the Dog 10K Genomes Project (http://dog10k.big.ac.cn), iDog provides free browsing, search and download services to users worldwide.

NGDC Education (https://ngdc.cncb.ac.cn/education) is an open education resource that provides a series of educational materials. This past year, two courses, viz., Bioinformatics and Genomics Data Analysis, were newly added by courtesy of Prof. Yu Xue from Huazhong University of Science and Technology and Prof. Cheng Li from Peking University, respectively. In addition, biographies of the late Profs. Xiaocheng Gu of Peking University and Bailin Hao of Fudan University were added. Early in the 1990s, Prof. Gu established the Center for Bioinformatics at Peking University to provide bioinformatics resources and services for domestic and international users. Prof. Hao made great contributions to bioinformatics research, particularly through his CVTree algorithm for bacterial genome classification (43, 44), and had advocated the establishment of the CNCB since the 1990s. Their personal profiles, articles and videos (if available) can be found at NGDC Education. In addition, in coordination with the Global Biodiversity and Health Big Data (BHBD) Alliance, we promote open sharing of educational materials as well as multi-omics data throughout the world.

Users' needs for sequence search and comparison are growing with the expansion of the various database resources in CNCB-NGDC.
BLAST tools (https://ngdc.cncb.ac.cn/blast) are newly deployed, providing online services for the different sequence alignment types developed by the National Center for Biotechnology Information (NCBI) (45), together with featured databases such as GWH transcripts, LncBook human lncRNA sequences, 10K protist species genomes and SARS-CoV-2 genome sequences. In particular, to support worldwide studies on SARS-CoV-2, a series of genomic analysis tools for coronaviruses have also been established (https://ngdc.cncb.ac.cn/ncov/online/tools) (37), covering sequencing quality control, de novo assembly and variant calling, haplotype network construction, genome tracing and lineage identification. In addition, a tool for computational identification of long non-coding RNAs (https://ngdc.cncb.ac.cn/lgc) (46) and the EWAS Toolkit for functional enrichment and network visualization (https://ngdc.cncb.ac.cn/ewas/toolkit) (47) are also provided. Furthermore, BIG Search, a distributed and scalable search engine, has been updated by including standardized data indexes from all resources in CNCB-NGDC and 39 partner resources (see details at https://ngdc.cncb.ac.cn/partners), as well as European Bioinformatics Institute (EBI) resources based on the EBI Search RESTful API (48), NCBI resources powered by NCBI Entrez (49) and the AlphaFold Protein Structure Database (50).

This year, several core resources of CNCB-NGDC have been listed as recommended repositories (e.g. for nucleic acid sequences and genetic variations) by major publishers such as Cell Press, Elsevier and Springer Nature, greatly accelerating the deposition and public sharing of biomedical big data at a global scale. Additionally, we continue working to build close collaborations with the INSDC (International Nucleotide Sequence Database Collaboration) (51), as evidenced by the open sharing of SARS-CoV-2 genome data with NCBI. Importantly, 2019nCoVR has been significantly updated through frequent data integration and web interface improvement.
Meanwhile, to deal with the explosive growth of multi-omics data, CNCB-NGDC provides a suite of database resources, which are newly developed and frequently updated, to accept worldwide data submissions and provide value-added annotations and curated knowledge. Ongoing efforts include, but are not limited to, optimization and automation of data submission, curation and analysis procedures, infrastructure upgrades for big data storage and transfer, and development of new tools and pipelines to support worldwide genetic and genomic research. As one of the major global centers, CNCB-NGDC will continue to expand and offer a series of data resources and services to benefit a wide range of research in life and health sciences. All the resources can be accessed at https://ngdc.cncb.ac.cn.

We thank our users for submitting data, sending suggestions, reporting bugs and getting involved in community curation. CNCB-NGDC is indebted to its funders, including the Ministry of Science and Technology and the Ministry of Finance of the People's Republic of China as well as the Chinese Academy of Sciences.
1. Database resources of the National Genomics Data Center, China National Center for Bioinformation in 2021
2. Database resources of the National Genomics Data Center in 2020
3. Database resources of the BIG Data Center in 2019
4. Database resources of the BIG Data Center
5. The BIG Data Center: from deposition to integration to translation
6. Advances and applications of single-cell sequencing technologies
7. Single-cell RNA sequencing for the study of development, physiology and disease
8. Optimizing precision medicine for public health
9. The 2019 novel coronavirus resource
10. The global landscape of SARS-CoV-2 genomes, variants, and haplotypes in 2019nCoVR
11. CancerSCEM: a database of single-cell expression map across various human cancers
12. CeDR Atlas: a knowledgebase of cellular drug response
13. CompoDynamics: a comprehensive database for characterizing sequence composition dynamics
15. Regeneration Roadmap: database resources for regenerative biology
16. GSA: genome sequence archive
17. The genome sequence archive family: toward explosive data growth and diverse data types
18. Genome warehouse: a public repository housing genome-scale data
19. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
20. Gene Expression Nebulas (GEN): a comprehensive data portal integrating transcriptomic profiles across multiple species at both bulk and single-cell levels
21. Plant editosome database: a curated database of RNA editosome in plants
22. Editome Disease Knowledgebase (EDK): a curated knowledgebase of editome-disease associations in human
23. ICG: a wiki-driven knowledgebase of internal control genes for RT-qPCR normalization
24. MethBank 3.0: a database of DNA methylomes across a variety of species
25. MethBank: a database integrating next-generation sequencing single-base-resolution DNA methylation programming data
26. scMethBank: a database for single-cell whole genome DNA methylation maps
27. LncRNAWiki: harnessing community knowledge in collaborative curation of human long non-coding RNAs
28. LncRNAWiki 2.0: a knowledgebase of human long non-coding RNAs with enhanced curation model and database system
29. piRBase: a comprehensive database of piRNA sequences
30. EWAS Open Platform: integrated data, knowledge and toolkit for epigenome-wide association study
31. EWAS Atlas: a curated knowledgebase of epigenome-wide association studies
32. GMQN: a reference-based method for correcting batch effects as well as probes bias in HumanMethylation BeadChip
33. EWAS Data Hub: a resource of DNA methylation array data and metadata
34. GWAS Atlas: a curated resource of genome-wide variant-trait associations in plants and animals
35. BrainBase: a curated knowledgebase for brain diseases
36. The Global Landscape of SARS
37. An online coronavirus analysis platform from the National Genomics Data Center
38. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology
39. Open resource of clinical data from patients with pneumonia for the prediction of COVID-19 outcomes via deep learning
40. iDog: an integrated resource for domestic dogs and wild canids
41. iSheep: an integrated resource for sheep genome, variant and phenotype
42. SorGSD: updating and expanding the sorghum genome science database with new contents and tools
43. CVTree: a phylogenetic tree reconstruction tool based on whole genomes
44. CVTree update: a newly designed phylogenetic study platform using composition vectors and whole genomes
45. Basic local alignment search tool
46. Characterization and identification of long non-coding RNAs based on feature relationship
47. EWAS Atlas: a curated knowledgebase of epigenome-wide association studies
48. The EMBL-EBI search and sequence analysis tools APIs in 2019
49. Searching NCBI databases using Entrez
50. Highly accurate protein structure prediction with AlphaFold
51. The international nucleotide sequence database collaboration

Qiancheng Chen 1,2, Xiaoyu Yang 1,2, Xin Zhang 1,2, Zhengqi Sang Shuang Zhai 1,2, Huanxin Chen 1,2, Wenming Zhao 1,2,3, Jingfa Xiao 1,2,3, Yiming Bao 1,2,3, Lili Hao 1,2,# (TL)
MethBank: Mochen Zhang 1 Rongqin Zhang 1,2,3, Dong Zou 1,2, Lina Ma 1,2,# (TL)
dbDEMC: Feng Xu 19,#, Yifan Wang 5,#, Yunchao Ling 5 Shuhui Song 1,2,3, Zhang Zhang 1,2,3, Mingkun Li 2,9
Listed in alphabetical order by database names)
BBCancer: Zhixiang Zuo 27 Fangqing Zhao 29
CirFunBase: Xianwen Meng 30 Anyuan Guo 21
lnCAR: Yubin Xie 27, Jian Ren 27
MCA: Yincong Zhou 30, Ming Chen 30, Guoji Guo 36
MiCroKiTS: Chenwei Wang 21
PceRBase: Chunhui Yuan 30, Ming Chen 30
PlantRegMap: Feng Tian 39, Dechang Yang 39, Ge Gao 39
PLMD: Dachao Tang 21, Yu Xue 21
PncStres: Wenyi Wu 30, Ming Chen 30
PTMD: Yujie Gou 21

Tel: +86 10 84097636; Email: zhaowm@big.ac.cn
Correspondence may also be addressed to Jingfa Xiao. Tel: +86 10 84097443; Email: xiaojingfa@big.ac.cn
Correspondence may also be addressed to Shunmin He. Tel: +86 10 64807279; Email: heshunmin@ibp.ac.cn
Correspondence may also be addressed to Guoqing Zhang
Correspondence may also be addressed to Guoping Zhao. Tel: +86 21 54924000; Email: gpzhao@sibs.ac.cn
Correspondence may also be addressed to Runsheng Chen. Tel: +86 10 64888543; Email: crs@ibp.ac.cn
# The authors wish it to be known that, in their opinion, these authors should be regarded as Joint First Authors.