key: cord-0891668-2rtbp9fi
authors: Sahoo, Susrita; Mahapatra, Soumya R.; Parida, Bikram K.; Rath, Satyajit; Dehury, Bhudeswar; Raina, Vishakha; Mahakud, Nirmal Kumar; Misra, Namrata; Suar, Mrutyunjay
title: DBCOVP: A Database of Coronavirus Virulent Glycoproteins
date: 2020-11-21
journal: Comput Biol Med
DOI: 10.1016/j.compbiomed.2020.104131
sha: f59e0abc8b8b97712db2e751cfc938dc61573d47
doc_id: 891668
cord_uid: 2rtbp9fi

Since the emergence of SARS-CoV-1 (2002), novel coronaviruses have emerged periodically, such as MERS-CoV (2012) and now SARS-CoV-2, whose outbreak has posed a global threat to public health. Although this is the third zoonotic coronavirus outbreak within the last two decades, there are only a few platforms that provide information about coronavirus genomes. None of them is specific to the virulence glycoproteins, and complete sequence-structural features of these virulence factors across the betacoronaviruses, including SARS-CoV-2 strains, are lacking. Against this backdrop, we present DBCOVP (http://covp.immt.res.in/), the first manually curated, web-based resource to provide extensive information on the complete repertoire of structural virulent glycoproteins from coronavirus genomes belonging to the betacoronavirus genus. The database provides various sequence-structural properties, which users can browse and analyze in different ways. Furthermore, many conserved T-cell and B-cell epitopes predicted for each protein are provided; these may play a significant role in eliciting the humoral and cellular immune response. The tertiary structure of the epitopes together with the docked epitope-HLA binding complex is made available to facilitate further analysis. DBCOVP presents an easy-to-use interface with built-in tools for similarity search, cross-genome comparison, phylogenetic analysis, and multiple sequence alignment. DBCOVP will certainly be an important resource for experimental biologists engaged in coronavirus research studies and will aid in vaccine development.

Coronaviruses, which belong to the Coronaviridae family, are the causative agents of neurologic, enteric, hepatic, and upper respiratory tract diseases in a wide range of hosts including humans, cattle, camels, swine, bats, cats, dogs, rabbits, snakes, and several other wild animal and avian host species [1]. The genome is a single positive-stranded RNA molecule, ranging from 26 to 32 kilobases in length, with G+C content varying from 32 to 43% [1, 2].
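The G+C content quoted above is a simple base-composition ratio. As a minimal sketch (the function name and the toy fragment are illustrative assumptions, not part of DBCOVP), it could be computed as follows:

```python
def gc_content(sequence: str) -> float:
    """Return the G+C percentage of an RNA or DNA sequence."""
    seq = sequence.upper()
    # Count only unambiguous bases so that N or gap characters do not skew the ratio.
    counted = [base for base in seq if base in "ACGTU"]
    if not counted:
        raise ValueError("sequence contains no unambiguous nucleotides")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


# Hypothetical toy fragment for illustration only, not a real coronavirus sequence.
toy_fragment = "AUGGCUAUUAACGGA"
print(f"{gc_content(toy_fragment):.1f}% G+C")
```

Ambiguous symbols are excluded from the denominator here; other conventions are possible and would give slightly different percentages for sequences with many unresolved bases.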

Among the various coronaviruses that infect humans, the majority are associated with mild clinical symptoms, unlike the Severe Acute Respiratory Syndrome (SARS) coronavirus (SARS-CoV-1) and Middle East Respiratory Syndrome (MERS) coronavirus (MERS-CoV) [3], which cause high morbidity and mortality in human populations. SARS-CoV-1 was initially reported in November 2002 in Guangdong, Southern China, and resulted in around 8000 cases of human infection with 744 deaths, a mortality rate of around 9.5% [4, 5]. Later, a similar epidemic caused by MERS-CoV was first detected in Saudi Arabia in September 2012 and resulted in a higher mortality rate [6] [7] [8]. Recently, in late December 2019, patients with viral pneumonia symptoms of unidentified etiology were first reported in Wuhan City, China [9]. A novel coronavirus was later identified as the causative pathogen; it was provisionally named 2019-nCoV and later renamed SARS-CoV-2, and the outbreak was declared a Public Health Emergency of International Concern by the World Health Organization (WHO) on 30 January 2020 [9, 10]. As of 1 August 2020, the virus has spread worldwide, affecting 213 countries with more than 6 million cases of infected patients. According to comparative genomic analysis, SARS-CoV-2 shares 79.5% nucleotide identity with SARS-CoV-1 and 96% identity with bat-CoV-RaTG13. Therefore, SARS-CoV-2 is considered a SARS-related coronavirus, with bats as the most probable source of infection [11]. SARS-CoV-2, SARS-CoV-1, and MERS-CoV show several similarities in clinical presentation, with pneumonia-like symptoms, evidence of zoonotic transmission as the route of disease origin, and human-to-human transmission [12].

Furthermore, all three coronaviruses belong to the genus betacoronavirus, which is further classified into five subgenera, namely Sarbecovirus, Embecovirus, Hibecovirus, Merbecovirus, and Nobecovirus. SARS-CoV-2 and SARS-CoV-1 belong to the Sarbecovirus subgenus [9].

Despite the great threat to public health around the world and the global effort to combat the spread of the ongoing outbreak, to date there are no clinically approved vaccines available for SARS-CoV-2, SARS, or MERS. Further research is therefore imperative to identify appropriate therapeutic targets for the development of safe, stable vaccines against human coronavirus infections [12, 13].

Advances in molecular biology and the use of bioinformatics resources, particularly the immunoinformatics approach, have resulted in a deluge of genomic data that can provide prior information on the efficacy of potential vaccine targets worthy of subsequent validation through wet-lab experiments, thus saving considerable time and effort in the vaccine discovery process [13, 14]. The prediction and characterization of immunogenic epitopes that can induce antibody production from B-cells, and cellular response and cytokine secretion from T-cells, is a critical step in the in silico identification and assessment of potential vaccine targets. The epitope-driven vaccine concept has already been successfully employed against many infectious diseases in recent years [15] [16] [17]. As a first step in this direction, it is essential to find proteins that play a definite role in the pathogenesis of the virus. The primary goal of any viral infection is to pinpoint a receptor on the host cell surface for effective binding, which paves the way for entry of the virus into the host cell. In most cases, glycoproteins are involved in host binding and subsequent virus-host membrane fusion to establish the pathogenesis of the virus [18]. The four important glycoproteins that majorly contribute to the structure of all coronaviruses are the spike protein (S), small envelope protein (E), membrane protein (M), and nucleocapsid (N) protein [13]. The S protein mediates receptor binding and membrane fusion and is vital for determining host tropism and transmission capacity [19] [20] [21]. Mutations in the gene encoding the spike protein have resulted in altered pathogenesis and virulence in other coronaviruses [22].
It is believed that three molecules of the spike protein form the characteristic 'spikes', the crown-like appearance specific to this virus family [13]. The majority of the candidate vaccines being developed against coronaviruses target the spike protein, as it is the major inducer of neutralizing antibodies [23, 24]. The association of the spike with the membrane protein is crucial in the formation of the viral envelope and the accumulation of both glycoproteins at the site of virus assembly [22]. The gene encoding the nucleocapsid protein in the SARS-CoV-1 virus is believed to possess a novel nuclear function, which could play a role in pathogenesis. Additionally, the basic nature of this protein implies that it may assist in RNA binding [22, 23]. Lastly, the envelope protein has been shown to play an important role in the assembly of the virion and its replication [25, 26]. These structural proteins have diverse functional roles in viral pathogenesis; therefore, a dedicated database on all four of the major structural glycoproteins discussed will provide the scientific community with a timely and valuable source of detailed sequence-structural properties of these virulence factors, which will aid in the development of vaccines against coronavirus.

Despite the constant emergence and re-emergence of deadly coronaviruses over the last two decades, to date there are only a few dedicated web resources exclusively available to study coronavirus genes and proteins. For instance, the Comprehensive Database for Comparative Analysis of Coronavirus Genes and Genomes (CoVDB) performs fast and precise batch sequence retrieval, the basis for comparative gene or genome analysis [27]. CoVDB has not been updated since 2007 and provides limited annotation features including cleavage sites, genome information, tandem repeat sequences, transcription regulatory sequences, and RNA structures. The Virus Pathogen Database and Analysis Resource (ViPR) covers a wide range of human pathogenic viruses, but its content is limited to sequence records, a few genome and protein annotations, tertiary protein structures, immune epitopes, surveillance data, and clinical metadata derived from comparative genomics analysis [28]. Although very useful, ViPR does not hold any information specific to the virulence glycoproteins and further lacks details on secondary structure properties, subcellular location, molecular function, biological process, domain, cluster, superfamily, physicochemical properties, epitope conservancy, allergenicity, antigenicity, toxicity, 3D epitope structure, and population coverage analysis. Similarly, ViralZone [...] Table 1 (table is attached). Furthermore, computational identification of antigenic epitopes requires a complex analysis that combines several different tools and is a time-consuming process.
Therefore, to enable researchers to better understand the immunological properties of coronaviruses and identify suitable vaccine candidates, we have mapped the potential conserved T-cell and B-cell epitopes on all the antigenic protein sequences, along with information on the conservancy of the epitopes, potential immunogenicity, allergenicity, and toxicity. Since HLA allele distribution differs among geographic regions and ethnic groups around the world, population coverage analysis is an important factor in vaccine development. Thus, the cumulative percentage of population coverage across the world was estimated for the predicted epitopes, and these results are freely available in the database.
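To illustrate what a population coverage estimate involves, the following is a minimal sketch under simplifying assumptions: Hardy-Weinberg proportions within each HLA locus and independence between loci. It is not the authors' pipeline or any specific tool's implementation, and the allele frequencies are invented placeholder numbers chosen only to make the arithmetic visible.

```python
from typing import Dict, Iterable


def locus_coverage(allele_freqs: Dict[str, float], restricting: Iterable[str]) -> float:
    """Fraction of individuals carrying at least one restricting allele at one locus.

    Assumes Hardy-Weinberg proportions: an individual carries two alleles drawn
    independently, so the chance of carrying none is (1 - p) ** 2.
    """
    p = sum(allele_freqs.get(allele, 0.0) for allele in set(restricting))
    p = min(p, 1.0)  # guard against frequency tables that sum slightly above 1
    return 1.0 - (1.0 - p) ** 2


def combined_coverage(per_locus: Iterable[float]) -> float:
    """Combine per-locus coverages, assuming the loci are independent."""
    missed = 1.0
    for coverage in per_locus:
        missed *= 1.0 - coverage
    return 1.0 - missed


# Placeholder frequencies for illustration only; real values come from population databases.
hla_a = {"HLA-A*02:01": 0.25, "HLA-A*24:02": 0.15}
hla_b = {"HLA-B*07:02": 0.10}

cov_a = locus_coverage(hla_a, ["HLA-A*02:01", "HLA-A*24:02"])
cov_b = locus_coverage(hla_b, ["HLA-B*07:02"])
print(f"combined coverage: {combined_coverage([cov_a, cov_b]):.2%}")
```

Because coverage depends on allele frequencies, the same epitope set can cover very different fractions of different regional populations, which is why the estimate is reported cumulatively across the world.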

In addition, we determined the 3D structure of the epitopes and their binding interaction with the HLA molecules using in silico docking techniques. To our knowledge, DBCOVP is the first database with a special focus on SARS and MERS betacoronavirus virulence proteins, containing detailed physicochemical and structural information on the spike, envelope, membrane, and nucleocapsid protein sequences derived from 137 strains belonging to diverse host organisms. Most importantly, it is the only database to provide computed high-confidence complete immunological data of the coronavirus antigenic proteins in one platform. The annotation data were manually curated from public databases and published literature and were also computationally predicted using various bioinformatics tools and databases for complete functional annotation of each protein. Additionally, to facilitate further comparative data analysis, DBCOVP supports multiple search and browsing options, with integrated tools for multiple sequence alignment, phylogenetic tree construction, local BLAST alignment search, and an in-house compare tool for comparative genomic analysis. To promote its usability, 'Exclusive Entries for COVID-19' has been included, which consists of proteomic, genomic, and immunoinformatics details of virulent glycoproteins specific to SARS-CoV-2. Moreover, DBCOVP maintains a 'Data Submission Form' that enables users to submit a protein sequence in FASTA format for sequence-structure analysis. With the rapidly increasing global demand for a vaccine against SARS-CoV-2, this database will certainly act as a one-stop resource for virologists and vaccinologists for understanding the pathogenesis of this epidemic disease and for accelerating rational vaccine design through subsequent in vitro and in vivo experimental validation of the identified promiscuous vaccine targets.

Each protein entry in the database has five important annotation components, as discussed below. The detailed annotation was curated and predicted using various tools and databases, as described in Supplementary Table 1.

d. Epitopes: Each spike, membrane, envelope, and nucleocapsid protein sequence was analyzed to identify the most immunogenic and antigenic T-cell epitopes along with B-cell epitopes. We have also predicted the binding Class I and Class II HLA alleles, conservancy score, allergenicity, antigenicity, toxicity, hydropathicity, hydrophilicity, charge, and molecular weight of the predicted peptides. In addition, the population coverage analysis of the promiscuous epitopes is available in the database. Furthermore, the 3D structures of the epitopes, along with the docked complexes of each epitope and its binding HLA, have been generated, and users can download the structures for further analysis. The detailed immunogenic results obtained from the epitope analysis are described in the next section (Figure 4d).

For each protein sequence, the most promiscuous T-cell and B-cell epitopes were selected: those recognized by a considerable number of HLA alleles, with the highest immunogenicity and antigenicity values, and nontoxic to humans. These were considered the epitopes with the greatest potential to induce a strong immune response. Furthermore, the epitopes were selected based on the consensus of the results from all the employed tools. HLA allele distribution differs among geographic regions and ethnic groups around the world, so population coverage analysis of the epitopes must be taken into consideration during the development of an effective vaccine. Therefore, for all the predicted epitopes, the cumulative percentage of population coverage across the world was measured, and the results are displayed in a graphical format as shown in Figure 2d.
The results indicate that all the predicted epitopes and their binding HLA alleles covered more than 80% of the world's population, which is a very important factor for a vaccine candidate since the emerging SARS-CoV-2 strain has affected the human population across the world. Besides, the three-dimensional structure of each of the predicted epitopes was determined and the binding interaction with the most conserved HLA allele was studied using the docking technique. The PDB structures are available for download. The ribbon representation of the structures was prepared and visualized by the PyMOL molecular graphics system.
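The selection criteria described above (recognition by many HLA alleles, high immunogenicity and antigenicity, no predicted toxicity) can be sketched as a filter followed by a ranking. This is only an illustration: the `Epitope` record, the function name, the thresholds, the peptide strings, and the allele labels are all hypothetical, and the source does not specify the cut-offs that were actually used.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Epitope:
    peptide: str                  # placeholder peptide string
    hla_alleles: Tuple[str, ...]  # alleles predicted to bind the peptide
    antigenicity: float           # score from an antigenicity predictor
    immunogenicity: float         # score from an immunogenicity predictor
    toxic: bool                   # True if predicted toxic


def select_promiscuous(
    epitopes: List[Epitope],
    min_alleles: int = 3,           # hypothetical threshold
    min_antigenicity: float = 0.5,  # hypothetical threshold
) -> List[Epitope]:
    """Keep nontoxic, antigenic epitopes bound by many alleles, best first."""
    kept = [
        e for e in epitopes
        if not e.toxic
        and len(set(e.hla_alleles)) >= min_alleles
        and e.antigenicity >= min_antigenicity
    ]
    # Rank by allele count first, then immunogenicity, then antigenicity.
    return sorted(
        kept,
        key=lambda e: (len(set(e.hla_alleles)), e.immunogenicity, e.antigenicity),
        reverse=True,
    )


# Invented records for illustration only.
candidates = [
    Epitope("PEPTIDEA", ("A1", "A2", "B1"), 0.90, 0.40, False),
    Epitope("PEPTIDEC", ("A1",), 0.95, 0.90, False),             # too few alleles
    Epitope("PEPTIDEK", ("A1", "A2", "B1", "B2"), 0.70, 0.20, True),  # predicted toxic
    Epitope("PEPTIDEM", ("A1", "A2", "B1", "B2"), 0.60, 0.10, False),
]
for epitope in select_promiscuous(candidates):
    print(epitope.peptide, len(epitope.hla_alleles))
```

Ranking by allele count before the score columns reflects the emphasis on promiscuity; a different ordering of the sort keys would favor immunogenicity instead.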

To facilitate further in-depth analysis of virulence proteins from coronaviruses, four analysis tools have been integrated. Sequence similarity search of both nucleotide and amino acid sequences can be performed using the basic local alignment search tool (BLAST) algorithm through an integrated BLAST module within the database. The BLAST interface allows alignment of a user-provided sequence against a customized BLAST library containing all sequences present in the DBCOVP database. This helps to identify the similarity of any unknown sequence to known annotated proteins. The user may specify BLAST parameters and upload or paste the query sequences. The output is given in the standard format with the BLAST score and is ordered by ascending e-value. Each hit is hyperlinked to that entry's browser page. The analysis of variability in virulence proteins is important for understanding the emergence of novel strains and for deciphering sequence-level variations that lead to changes in pathogenicity. Therefore, to facilitate cross-genome comparative analysis, a COMPARE tool has been integrated, with which users can analyze the variations in targeted sequences across multiple strains belonging to the same or different host species. Additionally, multiple sequence alignments and phylogenetic trees can be constructed using the embedded MUSCLE and PhyML tools, respectively.
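As a sketch of how hits ordered by ascending e-value could be handled downstream, the snippet below parses BLAST results in the standard 12-column tabular layout and sorts them. Whether DBCOVP offers its results in this layout is an assumption, and the identifiers and scores in the example rows are invented.

```python
from typing import List, NamedTuple


class BlastHit(NamedTuple):
    subject_id: str
    percent_identity: float
    evalue: float
    bit_score: float


def parse_tabular_hits(text: str) -> List[BlastHit]:
    """Parse 12-column tabular BLAST output and order hits by ascending e-value.

    Column order assumed: qseqid, sseqid, pident, length, mismatch, gapopen,
    qstart, qend, sstart, send, evalue, bitscore.
    """
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        hits.append(
            BlastHit(
                subject_id=fields[1],
                percent_identity=float(fields[2]),
                evalue=float(fields[10]),
                bit_score=float(fields[11]),
            )
        )
    # Ties on e-value are broken by the higher bit score.
    return sorted(hits, key=lambda h: (h.evalue, -h.bit_score))


# Invented rows for illustration only.
rows = [
    ["query1", "entry_B", "88.5", "200", "23", "0", "1", "200", "1", "200", "3e-40", "150"],
    ["query1", "entry_A", "99.0", "200", "2", "0", "1", "200", "1", "200", "1e-120", "410"],
]
example = "\n".join("\t".join(row) for row in rows)
for hit in parse_tabular_hits(example):
    print(hit.subject_id, hit.evalue)
```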

The COVID-19 pandemic has resulted in an exponential increase in the number of novel SARS-CoV-2 genomes being sequenced. Therefore, computational methods and databases are needed to organize, explore, and analyze large volumes of biological data, to aid in understanding the mechanisms of disease pathogenesis and, most importantly, to speed up the vaccine development process by providing adequate information on the efficacy and immunogenicity of potential molecular targets critical for subsequent clinical validation. A growing number of studies have shown that the four major structural glycoproteins, namely the spike, envelope, membrane, and nucleocapsid proteins, have important functions and play vital roles in viral infection. The spike protein in particular has been shown to elicit T-cell responses, suggesting it as a potential vaccine candidate against SARS infection [39].

In this study, we developed DBCOVP, the first manually curated database to provide comprehensive information on the entire repertoire of structural glycoproteins from coronavirus genomes of the betacoronavirus genus, including the newly sequenced SARS-CoV-2 strains, which are majorly responsible for the atypical severe acute respiratory syndrome. Compared to the few existing databases for coronavirus research, DBCOVP is a specialized database focused on coronavirus spike, envelope, membrane, and nucleocapsid proteins and excels in the following aspects: (i) Substantially extended data volume consisting of a total of 185 structural proteins from 137 strains, including sequences from the recently deposited SARS-CoV-2 strains in NCBI. (ii) Complete functional annotation of the proteins highlighting 14 sequence-structural properties, which are only partially addressed in some of the existing coronavirus sequence data resources. Basic information about each protein includes manually curated information from known databases, while more specific and source-dependent annotation features have been computationally predicted using various bioinformatics tools and methods. (iii) The major purpose of the database is to enable users to perform knowledge discovery from coronavirus antigen data, with particular emphasis on applications in immunology and vaccinology. Each spike, membrane, envelope, and nucleocapsid protein sequence has been mapped to highlight the most promiscuous epitopic regions (T-cell and B-cell), along with conservancy score, allergenicity, antigenicity, toxicity, hydropathicity, hydrophilicity, charge, molecular weight, and population coverage analysis of the predicted peptides. In addition, the 3D structure of the epitopes along with the docked epitope-HLA binding complex is available for further analysis.
This is the first database containing the aforementioned immunogenic data specific for [...] Research on viable therapeutics and vaccine targets against human coronavirus infection is probably only beginning to unfold. In the future, we will continue to update the database, include sequences from other coronavirus strains, and integrate additional resources.

Furthermore, we will also try to combine all the complex steps and tools employed in this study for epitope analysis into one automated tool which would be particularly useful for researchers with little knowledge in bioinformatics to rapidly analyze the immunogenic properties of uncharacterized sequences in one platform without moving data between different analysis tools. DBCOVP will certainly be an important resource when prioritizing vaccine candidates against coronavirus infection. 

Emerging novel coronavirus (SARS-CoV-2): current scenario, evolutionary perspective based on genome analysis and recent developments

Epidemiology, genetic recombination, and pathogenesis of coronaviruses

Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding

Severe acute respiratory syndrome

Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia

A dynamic compartmental model for the Middle East respiratory syndrome outbreak in the Republic of Korea: a retrospective analysis on control interventions and superspreading events

The clinical and virological features of the first imported case causing MERS-CoV outbreak in South Korea

A new coronavirus associated with human respiratory disease in China

The epidemic of 2019-novel-coronavirus (SARS-CoV-2) pneumonia and insights for emerging infectious diseases in the future

A pneumonia outbreak associated with a new coronavirus of probable bat origin

Overlapping and discrete aspects of the pathology and pathogenesis of the emerging human pathogenic coronaviruses SARS-CoV, MERS-CoV, and SARS-CoV-2

Coronavirus infections and immune responses

Coronavirus genomics and bioinformatics analysis

A highly immunogenic trivalent T cell receptor peptide vaccine for multiple sclerosis

A synthetic malaria vaccine elicits a potent CD8(+) and CD4(+) T lymphocyte immune response in humans. Implications for vaccination strategies

Immunization with a HER-2/neu helper peptide vaccine generates HER-2/neu CD8 T-cell immunity in cancer patients

Viral glycoproteins: biological role and application in diagnosis

Structure, function, and evolution of coronavirus spike proteins

Bat-to-human: spike features determining 'host jump' of coronaviruses SARS-CoV, MERS-CoV, and beyond

MERS-CoV spike protein: targets for vaccines and therapeutics

The Genome sequence of the SARS-associated coronavirus

SARS vaccine development

SARS-CoV-2 SPIKE PROTEIN: an optimal immunological target for vaccines

Coronavirus envelope protein: current knowledge

The coronavirus E protein: assembly and beyond

CoVDB: a comprehensive database for comparative analysis of coronavirus genes and genomes

Virus pathogen database and analysis resource (ViPR): a comprehensive bioinformatics database and analysis resource for the coronavirus research community

ViralZone: a knowledge resource to understand virus diversity

VADR: validation and annotation of virus sequence submissions to GenBank

Bioinformatic Approaches for Comparative Analysis of Viruses

Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families

CORDITE: the curated CORona drug InTERactions database for SARS-CoV-2

Exploring the SARS-CoV-2 virus-host-drug interactome for drug repurposing

A database resource for Genome-wide dynamics analysis of Coronaviruses on a historical and global scale

A database for potential immune epitopes of coronaviruses

COVIDep: a web-based platform for real-time reporting of vaccine target recommendations for SARS-CoV-2

A web-based platform on COVID-19 to maintain Predicted Diagnostic, Drug and Vaccine candidates

Priming with SARS CoV S DNA and boosting with SARS CoV S epitopes specific for CD4+ and CD8+ T cells promote cellular immune responses