key: cord-0908906-m3k8o573 authors: Ghosh, Ambarnil; Nandy, Ashesh title: Graphical representation and mathematical characterization of protein sequences and applications to viral proteins date: 2011-05-12 journal: Adv Protein Chem Struct Biol DOI: 10.1016/b978-0-12-381262-9.00001-x sha: 5e1532b05d5b91301c284337ba1a4d0fab32ab28 doc_id: 908906 cord_uid: m3k8o573 Graphical representation and numerical characterization (GRANCH) of nucleotide and protein sequences is a new field that is showing a lot of promise in analysis of such sequences. While formulation and applications of GRANCH techniques for DNA/RNA sequences started just over a decade ago, analyses of protein sequences by these techniques are of more recent origin. The emphasis is still on developing the underlying technique, but significant results have been achieved in using these methods for protein phylogeny, mass spectral data of proteins and protein serum profiles in parasites, toxicoproteomics, determination of different indices for use in QSAR studies, among others. We briefly mention these in this chapter, with some details on protein phylogeny and viral diseases. In particular, we cover a systematic method developed in GRANCH to determine conserved surface exposed peptide segments in selected viral proteins that can be used for drug and vaccine targeting. The new GRANCH techniques and applications for DNAs and proteins are covered briefly to provide an overview to this nascent field. the environment too (Makowski et al., 2008) . 
Protein structure is often referred to in terms of four aspects: the primary structure consisting of the amino acid chain, the secondary structure which contains regularly repeating structures like alpha helices and beta sheets stabilized by hydrogen bonds, the tertiary structure which is the final folded structure incorporating the various secondary structures, and a quaternary structure where several proteins are bound together to form one protein complex such as are found in the neuraminidase body of an influenza virion (Russell et al., 2006) or the VP7 of a rotavirus particle (Li et al., 2009b). The tertiary and quaternary structures of a large number of proteins have become available through X-ray crystallography and NMR spectroscopy studies and the data are available in Protein Data Bases (PDB) such as World Wide Protein Data Bank (WWPDB; Berman et al., 2007), RCSB Protein Data Bank (RCSB-PDB; Deshpande et al., 2005; Dutta et al., 2007), Protein Data Bank Europe (PDBe; Velankar et al., 2010), Protein Data Bank Japan (PDBj; Nakamura et al., 2002; Kinjo et al., 2010), and Biological Magnetic Resonance Databank (BMRB; Markley et al., 2008). The difficulty of crystallizing proteins has restricted the number of proteins whose structures are sufficiently well known (Chayen, 2004, 2009; Chayen and Saridakis, 2008). However, taking the protein primary structure as the source material for all subsequent structures, structural genomics and protein structure prediction methods theoretically predict protein secondary and tertiary structures based on known structures (Baker and Sali, 2001).
The importance of proteins in biological function has led to wide-ranging studies to understand how proteins fold (Dobson, 2004; Dill et al., 2007; Ghosh et al., 2007), interact with other proteins to regulate enzyme activity (Frieden, 1971), oligomerize to form fibrils (Powers and Powers, 2008), aggregate into protein complexes that lead to conformational changes, and enable signaling networks. These interactions are mediated by the chief characteristic of a protein: the ability to bind other molecules specifically and tightly. The specificity arises from unique shapes in the tertiary structure of the protein surface (Roach et al., 2005, 2006) where, for example, a depression acts as a binding site or pocket, and from the chemical nature of the side chains of the neighboring amino acids. It also means that binding can be lost entirely when changes in the amino acid composition alter the conformation of the binding site (Moscona, 2005). Such changes arising out of mutations in the amino acid chains are among the main factors responsible for development of drug resistance in bacterial and viral diseases (Moscona, 2004). The enzymatic role of proteins helps catalyze metabolic reactions, but only a small region of the protein consisting of a few amino acids is active in the catalysis; noncatalytic examples of proteins include the antibodies that are part of the adaptive immune system and bind to antigens to mark them for destruction (MacCallum et al., 1996). Ligand-binding proteins such as hemoglobin bind specific small molecules to transport them to other locations in the body of a multicellular organism (Baldwin and Chothia, 1979). Structural proteins such as actin and tubulin confer stiffness and rigidity to the cytoskeleton (Doherty and McMahon, 2008); other structural proteins such as myosin and kinesin generate mechanical forces and are responsible for the motility of many single cell organisms (Rayment, 1996).
Thus, there are numerous processes, and there are numerous proteins that take part in them. These processes and the functions of the proteins are studied through in vivo and in vitro analysis. In vitro analysis helps understand how a protein functions, in vivo analysis often helps in understanding its functional location and related parameters in the living system; however, the specifics of how a protein targets particular organelles or cellular structures are often unclear (Bejarano and Gonzalez, 1999) . Site-directed mutagenesis techniques (Ruvkun and Ausubel, 1981) that alter the protein sequence and hence its structure and cellular location/function that help to identify susceptibility to regulation provide guidelines to rational drug design or development of new proteins with novel properties. Among the simplest of biological entities, and of particular interest for this chapter, is the virus. A virus particle like the influenza or rotavirus contains about 8-11 protein-coding genes in a multiprotein coat that protects the RNA or DNA of the virus and also enables the proteins and genetic materials to enter and leave cells. A great range of variability in amino acid composition is observed for these viral proteins (Reid et al., 2000; Ghosh et al., 2009) , specifically the surface situated ones like NA (neuraminidase; Ghosh et al., 2010) , HA (hemagglutinin), VP4 (variable protein), VP7 (Gunn et al., 1985) , and gp120 (of the HIV) but the functional impact remains the same. Often, a single change in the side chain of a single amino acid is enough for producing a new mutant (Lopez et al., 2005) . Viruses use this highly mutable property for escaping the host defense mechanism and they are also frequently found to generate escape mutants against a naturally occurring immunity or artificially designed drug or vaccine (Air and Laver, 1989 ). Proteins are involved by function or malfunction, in diseases of organism. 
Bacterial, viral, and other pathogens disrupt the normal protein functions and thereby destabilize the infected host organism (Goldsby et al., 2000). While immunological defenses are called into action by the infection, these are often inadequate by themselves and have to be supported by drugs, vaccines, and other therapeutic regimes. Design of drugs and study of their actions have therefore been an important area of research. Drugs can act through formation of drug-DNA complexes (Chaires, 1997, 1998) or protein-drug complexes (Chicault et al., 1981). Major trends of research into drug-DNA relationships have been recently reviewed (Nandy and Basak, 2010). Stated simply, DNA drugs and vaccines are made of plasmids designed to carry a selected gene into cells where it is translated into a protein. In the case of an antiviral DNA vaccine, for example, plasmids are created for producing the selected viral protein in the cell, and the immune system is expected to act to prevent future infections from the virus (Ulmer et al., 1996a,b; Gurunathan et al., 2000). Advanced techniques such as codon optimization (Deml et al., 2001) are enhancing the protein production from the plasmids, and others such as adjuvant incorporation are enhancing the immune response, leading to more effective vaccines and therapies. Several of these are already available for treatment of specified animals afflicted with the West Nile virus (Kramer et al., 2007), melanoma, and fetal loss, while applications for humans for treating HIV, influenza, hepatitis C, and other diseases are under trial (Morrow and Weiner, 2010). Pharmaceutical proteins effective against a wide range of bacterial infections can be traced to penicillin, and developed into a new class of drugs referred to as antibiotics. Conventional production processes for antibiotics are expensive and face many regulatory issues.
Vaccines that enhance the body's immune system consist of attenuated viruses but can, in rare cases, harm the host with a full-blown viral occupancy (Ball et al., 1998; Colgrove and Bayer, 2005) . Since viruses use the host's cells to replicate, designing safe and effective antiviral drugs is difficult and also makes it difficult to find targets for the drugs that would interfere with the virus without also harming the host organism's cells. But almost all antimicrobials, including antivirals, are subject to drug resistance as the pathogens mutate over time (Gold and Moellering, 1996) , becoming less susceptible to the treatment. Small molecules are often used as drugs, but the new technology of recombinant proteins (Geigert, 1989; Dingermann, 2008) , commonly produced using bacteria or yeast in a bioreactor, potentially provide greater efficacy and fewer side effects because their action can be more precisely targeted toward the cause of a disease rather than treatment of symptoms, is yet to gain wide acceptance. Peptide-based drugs operate by stimulating the immune response to the peptide and thereby to the invading pathogen. Peptides play an important role in modulating many physiological processes in our body. Use of peptides as drugs have the benefit that they are small, easily optimized, and can be quickly investigated for therapeutic potential. However, peptide drug screening process (Otvos, 2008) , although a well-established approach, is long and arduous resulting in high manufacturing costs, and the fact that they have short half-life, and limited in vivo bioavailability hampers their effectiveness; new approaches have been proposed to overcome the difficulty of generating sufficient amount of the required tRNAs (Owens, 2004) . The peptides can be naturally derived or chemically synthesized, with the latter method being more prevalent. Novel peptide analogs (Lee et al., 2002) are also being synthesized to create more potent drugs. 
In practice, protein and peptide drugs are finding increasing acceptance in therapeutics. A drug's efficiency is related to the degree of its binding with the proteins in blood serum (Meyer and Guttman, 1968; Koch-Weser and Sellers, 1976) : The less bound a drug is, the more efficiently it can diffuse through cell membrane. Common drug-binding proteins in plasma are human serum albumin, lipoprotein, glycoprotein, etc. It is the unbound fraction of the drug-protein complex that exhibits therapeutic effect and excessive binding may mitigate against rapid action of the drug. However, the same effect can be used for long-lasting dosage by designing drugs that bind to the protein and act as a reservoir so that the unbound fraction is released slowly. But degradation of the proteins during storage and drug administration routes remains a challenging problem (Frokjaer and Otzen, 2005) . These issues of stability of therapeutic proteins toward aggregation and misfolding in long-term storage as well as means of efficacious delivery that avoid adverse immunogenic side effects are engaging the attention of the pharmaceutical industry (Frokjaer and Otzen, 2005) . While invasive routes such as subcutaneous injections are often used, oral delivery faces difficulties in poor permeability across biological membranes due to the hydrophobic nature and large molecular size, susceptibility to enzymatic attack, among others. Formulation strategies for protein therapeutics thus continue to remain a challenging problem. The complexities of protein function and structure have necessitated the development of computational techniques to analyze available data and help in formulating novel ways to predict structure, function, and interaction of proteins. Especially, in view of the requirements of new approaches to drug development through recombinant proteins, synthesizing new peptides, and investigating drugs-DNA complexes, use of computational methods is now of vital importance. 
The increased availability and accessibility of genomic and protein sequence data have opened up new possibilities in the search for target proteins, and the success of protein and peptide therapeutics is revolutionizing the biotech and pharmaceutical market, spurring the creation of next-generation products with reduced immunogenicity (Schellekens, 2002; Tangri et al., 2005), improved safety, and greater effectiveness. The protein engineering market is expected to cross $100 billion in sales in 2010 from about $36 billion 4 years ago. The top-selling therapeutic protein is reported to be Amgen's Aranesp (Locatelli and Vecchio, 2001), a reengineered variant of the company's first-generation product Epogen (recombinant human erythropoietin). A number of such products have been launched by Genentech and others, and nonparenteral delivery systems, alongside parenteral protein and peptide drug delivery systems, have also been approved (Packhaeuser et al., 2004). Progress in bioinformatics and computational biology as well as new techniques in protein engineering (recombinant proteins through site-directed mutagenesis and posttranslational modifications) are aiding the development of reengineered, improved, whole antibody, and antibody fragment-based products, reducing immunogenicity by using fully human recombinant antibodies or human antibodies derived from transgenic mice, and allowing biosimilar products to be differentiated on the basis of superior characteristics. Screening experiments for appropriate molecules rely critically on bioinformatics support for design of experiments and for interpreting the generated data, for example, to identify interesting differentially expressed genes and to predict the function and structure of putative target proteins (Lengauer and Zimmer, 2000). Protein characterization and in silico protein design and structure analyses form an integral part of these developments.
Phylogenetic analyses based on primary sequences have been used to group related proteins and understand their evolutionary history, algorithms have been developed to predict protein secondary structures, and web-accessible systems are available to suggest possible folding patterns (Shen and Chou, 2009). A number of epitope prediction tools have been devised with varying degrees of success to aid in drug design; one area of nascent research is concerned with understanding allosteric conformations that may help or hinder protein interactions (Teague, 2003). More broadly, computational biology has already proved itself to be a powerful tool for handling large genomic databases. The basic applications involve core tools such as sequence alignment, phylogenetic tree drawing, and sequence comparison. In silico motif search algorithms on protein primary structure can be applied to obtain structural information, for example through signal sequence prediction (Menne et al., 2000), cleavage site prediction (Chou, 2001), glycosylation prediction (Blom et al., 2004), and prediction of posttranslational modifications. Large datasets are frequently used in predicting higher levels of protein structure from the primary structure. Software such as Modeller (Eswar et al., 2007, 2008) and Discovery Studio can predict the 3D structure of a protein from a database of known crystallized proteins (comparative protein modeling). Many approaches have been developed in this area, but they are often ineffective for a completely new protein that has no appropriate template in the existing database. Another very important application of data mining is the use of computational power in handling proteomics data. In proteomics, proteins are identified by matching a part of the protein against the whole existing protein database in mass spectrometry software (Perkins et al., 1999).
The basis of all these data mining and related computational techniques is mathematics and statistics. Methods such as the dot-matrix algorithm (Gibbs and McIntyre, 1970), the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970), the Smith-Waterman algorithm (Smith and Waterman, 1981; Smith et al., 1985), hidden Markov models (Eddy, 1996), and the Chou-Fasman algorithm (Chou and Fasman, 1974a,b) are widely used. Some models or algorithms rely on statistics and probability, while others depend on the visual interpretation of genomic data by different techniques (Nandy, 1994; Randic, 2004, 2006; Randic et al., 2005, 2006; Nandy et al., 2007; Basak and Gute, 2008; Gonzalez-Diaz et al., 2008a). To aid in protein characterization, ideas of graphical representation and numerical characterization (GRANCH) have been carried over from their success in DNA sequence analysis, although the task is complicated by the fact that protein sequences are composed of 20 amino acids whereas a DNA sequence has only four nucleotide building blocks. While some standard procedures such as the dot-matrix plot have been used for a long time, several ingenious schemes have been developed recently that have achieved significant success in this nascent field, as we show in the next few sections. Coronavirus phylogeny, studies of H5N1 neuraminidase protein mutations, and identification of highly conserved peptide stretches on influenza virus and rotavirus proteins that could potentially aid in the development of new drugs and vaccines are some of the significant results of these novel techniques. We provide a brief review of these studies in Section III. Graphical methods to display sequences have the advantage of giving visual indications of trends and inherent features. The familiar dot-matrix type of graph has been widely used to determine systematics in nucleotide and amino acid sequences.
The dots plotted on a 2D grid with the sequence running along the positive x- and negative y-axes produce a pattern (Gibbs and McIntyre, 1970) that is useful in determining sequence similarity, direct repeats, inverted repeats, etc. Such plots have also been used in RNA secondary structure prediction, for example, to find complementary sequences in an RNA structure in the dot-matrix analysis of nucleotide sequences of potato tuber spindle viroid (Fig. 1). In the case of proteins, one of the more widely used molecular graphs is the hydrophobicity-polarity lattice graph, used to model structure-activity relationships and folding dynamics in 2D/3D spaces (Jiang and Zhu, 2005; Chikenji et al., 2006). In continuing developments in the field, a new pseudofolding molecular graph or network-type representation has been proposed recently. Much of the recent interest in graphical methods arose from their applications in analysis of DNA and RNA sequences. Representation of the sequence of bases in a DNA or RNA strand using graphical methods was initiated several years ago with a 3D model proposed by Hamori and Ruskin (1983), followed up subsequently by Gates (1986), Nandy (1994), and Leong and Morgenthaler (1995) with 2D representations, while Peng et al. (1992) and Jeffrey (1990) represented sequence data graphically in more abstract forms. The plot of purine-pyrimidines against base numbers devised by Peng et al. (1992) demonstrated the presence of long-range correlations in DNA sequences, while Jeffrey's Chaos Game Representation (CGR) method showed visually for the first time the fractal nature embedded in these sequences (Fig. 2), as well as the different patterns for mammalian, bacterial, and phage sequences reflecting the inherent differences in their base organization. The utility of the graphical approach has led to many new techniques of GRANCH of DNA and RNA sequences (see review Nandy et al., 2006).
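As an illustration of the Chaos Game Representation mentioned above, the following is a minimal Python sketch, not the original implementation of Jeffrey (1990). It assumes the usual construction in which each base owns one corner of the unit square and each new point is the midpoint between the previous point and the corner of the current base. The corner assignment, the starting point at the center of the square, and the function name `chaos_game` are assumptions made for this example.

```python
# Corner assignment is an assumption for this sketch; other layouts are possible.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def chaos_game(sequence, start=(0.5, 0.5)):
    """Return CGR points for a DNA sequence.

    Each point is the midpoint between the previous point and the corner
    assigned to the current base. Symbols without a corner are skipped.
    """
    x, y = start
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # e.g. ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in chaos_game("ATGAATCCTA"):
        print(point)
```

Plotting the returned points for a long sequence gives the characteristic CGR picture; any plotting library could be used for that step.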
The basic approach can be most simply described in the 2D representation where the four cardinal directions are associated with the four bases. Nandy (1994) associated adenine with the negative x-axis, cytosine with the +y-axis, guanine with the +x-axis, and thymine with the -y-axis, and plotted a sequence starting from the origin and moving, for each base in the sequence, one step at a time in the designated direction until the entire sequence is plotted. Figure 3 follows this convention for the first 10 nucleotides of neuraminidase RNA (c-DNA) and generates a series of points like a Markov chain that reflects the sequence and distribution of bases in the chosen representation. However, this simple approach has the disadvantage of allowing reentry in the random walk path; for example, a sequence like AGAGAG traces only one unit path in the Nandy representation. Several other schemes have been formulated that minimize or eliminate this problem, but with reduced visual appeal (Nandy et al., 2006). Randic and his coworkers, for example, proposed various representations such as the ''worm'' curve (Randic et al., 2003c; Randic, 2004), the ''four horizontal line'' curve (Randic et al., 2003a,b), four-color maps (Randic et al., 2005), and ''spectrum-like'' figures (Randic, 2006), among others, to reduce or eliminate the degeneracy inherent in the 2D approach. Yau et al. (2003) proposed a 2D graphical representation where the purines (A, G) and pyrimidines (T, C) are plotted on two quadrants of the Cartesian coordinate system at fixed angles to the x-axis; such a system has no degeneracy. A sequence is plotted as a progression of points counted along the x-axis but rising or falling with the nature of the base, thus tracing a pattern that is unique for the particular sequence. Among other proposals, mention may be made of the recent works of Todeschini et al. (2006, 2008), who use partial ordering ideas to compare the first exons of eight beta-globin sequences, and Liu and Wang (2010), who used an 8D representation of DNA sequences for comparison of similarities/dissimilarities of over 40 viral, lipase, phage, and other genes; both methods dispense with visual rendering in favor of more rigorous mathematical approaches.
FIG. 3. Graphical representation (according to Nandy, 1994) of first 10 nucleotides of H5N1 neuraminidase RNA (or c-DNA). Sequence: ATGAATCCTA (first 10 bases). Source: A/duck/Guangdong/07/2000 (H5N1).
To obtain a quantitative measure of the graphical representations, different techniques have been devised to convert these representations into numbers or vectors that are expected to be characteristic of each sequence. A simple geometrical technique for the 2D graph of a DNA sequence determines the weighted center of mass of a plot ($\mu_x$, $\mu_y$) and a graph radius ($g_R$), and therefrom the distance ($\Delta g_R$) of two sequences, using Euclidean measures (Raychaudhury and Nandy, 1999):
$$\mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \mu_y = \frac{1}{N}\sum_{i=1}^{N} y_i, \qquad g_R = \sqrt{\mu_x^2 + \mu_y^2}, \qquad \Delta g_R = \sqrt{\left(\mu_x^{(1)} - \mu_x^{(2)}\right)^2 + \left(\mu_y^{(1)} - \mu_y^{(2)}\right)^2}$$
where the $(x_i, y_i)$ represent the coordinates of each point on the plot, $N$ is the total number of bases in the segment, and the superscripts (1) and (2) on $\mu$ refer to two different DNA sequences. The $g_R$ here represents a base distribution index that is critically dependent upon the position of each base in the sequence and, together with $\mu_x$ and $\mu_y$, forms a set of biodescriptors for the sequence. The $g_R$ and the $\Delta g_R$ have been found to be very sensitive measures of the sequence composition and distribution (Nandy and Nandy, 2003). The difference index, $\Delta g_R$, provides a quantitative comparison between the sequences: the smaller the $\Delta g_R$, the more similar are the underlying DNA sequences, and the higher the $\Delta g_R$, the more dissimilar are the sequences. A matrix method of determining numerical indexes for DNA sequences was proposed by Randic et al. 
(2000) in a 3D graphical representation in which the position of every base of a sequence was related to all other bases through a Euclidean and a graph-theoretic distance. The ratios of these distances, D_E/D_G, formed the elements of a D/D matrix. Since matrices are well-known objects with well-defined properties, the leading eigenvalues of a D/D matrix are considered to be characteristic, or invariants, of the matrix and, by association, to be descriptors of the DNA sequence itself. The authors calculated the leading eigenvalues of the first exon sequence of the beta-globin gene of eight species and determined the similarity/dissimilarity between the various sequences. This was followed by successive proposals for different graphical representations that similarly used matrix methods to determine invariants to characterize each sequence and form vectors of such invariants to estimate the degrees of similarities and dissimilarities between members of a family of DNA sequences (Nandy et al., 2006). The works of Randic, Todeschini, and Wang referred to earlier use these techniques of D/D matrices or Hasse matrices to compute the distances between species from their gene sequences. The new GRANCH techniques gave a rich view of the complexities of DNA sequences. Among the first applications of these techniques to human diseases, Liao et al. (2006) showed, by studying the severe acute respiratory syndrome (SARS) coronavirus, that mathematical techniques can be used to analyze the underlying DNA/RNA sequences and, separately, that GRANCH techniques could do away with multiple alignment requirements to study gene families. Larionov et al. (2008) broadened the usage by showing that plots of human and mouse chromosomal sequences in a graphical representation were able to reveal long-range palindromes.
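The 2D walk of Nandy (1994) and the center-of-mass descriptors discussed above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code. It assumes that the center of mass is the plain mean of the walk coordinates, that the graph radius is the distance of that mean point from the origin, and that the origin itself is not counted as a plotted point. The function names are hypothetical.

```python
import math

# Directions as stated in the text: A -> -x, C -> +y, G -> +x, T -> -y.
STEPS = {"A": (-1, 0), "C": (0, 1), "G": (1, 0), "T": (0, -1)}


def nandy_walk(sequence):
    """Return the list of (x, y) points visited, one per base, starting from the origin."""
    x = y = 0
    points = []
    for base in sequence.upper().replace("U", "T"):
        dx, dy = STEPS[base]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def descriptors(sequence):
    """Return (mu_x, mu_y, g_R): mean coordinates and their distance from the origin.

    Assumption: the origin is excluded and every point has equal weight.
    """
    points = nandy_walk(sequence)
    n = len(points)
    mu_x = sum(px for px, _ in points) / n
    mu_y = sum(py for _, py in points) / n
    return mu_x, mu_y, math.hypot(mu_x, mu_y)


def delta_g_r(seq1, seq2):
    """Euclidean distance between the mean points of two sequences."""
    mx1, my1, _ = descriptors(seq1)
    mx2, my2, _ = descriptors(seq2)
    return math.hypot(mx1 - mx2, my1 - my2)


if __name__ == "__main__":
    print(nandy_walk("ATGAATCCTA"))
    print(descriptors("ATGAATCCTA"))
    # The degeneracy noted in the text: AGAGAG only ever visits two points.
    print(sorted(set(nandy_walk("AGAGAG"))))
```

The last line illustrates why alternative representations were proposed: the walk for AGAGAG retraces a single unit segment, so the plot alone cannot distinguish it from a shorter sequence.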
The 3D and 2D graphical representations visually highlighted the base preferences along a DNA sequence (Hamori and Ruskin, 1983; Nandy, 1994), while 2D representations showed long runs of duplications of a motif as simple runs on the graphs (Nandy and Nandy, 1995). Gates had remarked on large-scale complex repeats that were revealed by 2D graphs (Gates, 1986); Nandy showed that conserved genes have shapes on the 2D maps that are similar across species (Nandy, 1994), a visual rendition no doubt of homology. Viewing a number of maps of the H5N1 neuraminidase gene revealed a conserved region (Nandy et al., 2007), and numerical characterization of the maps, in the whole RNA sequence and in segments, has allowed reconstruction of the wide dissemination and possible recombination of segments of the gene not reported heretofore. In a novel application, using a variation of the 2D graphical representation, Wiesner and Wiesnerova (2010) studied multiallelic marker loci from Begonia × tuberhybrida. They found significant correlation of graph invariants to genetic diversity of the marker loci and suggested that the DNA walk representation may predict allele-rich loci solely from their primary sequences, which improves the current design of new DNA germplasm identifiers. Recently, Nandy has shown from inspection of conserved gene representations on 2D maps (Nandy, 2009) that effects of point mutations in gene sequences over evolutionary time scales indicate a polynomial relationship between the intrapurine and intrapyrimidine differences on each strand of a DNA sequence. The experience with GRANCH techniques for DNA and RNA sequences led to many proposals for GRANCH methods for protein sequences, although these are complicated by the necessity of accommodating 20 residues for proteins compared with four bases for the nucleotide sequences. One of the earliest attempts is the dot-matrix plot for protein sequences, but other techniques were also developed.
Among the pioneering works for representing the chemical information in a protein graphically is the representation of protein bonds through the Ramachandran plot (Ramachandran et al., 1963; Ramachandran and Sasisekharan, 1968) and, for protein primary sequences, the hydropathy plot (Engelman et al., 1986). The former can extract the secondary structural information from a protein's bond angles, while the latter draws the graph from thermodynamic and chemical properties of amino acids. The DNA graphical representation methodology led Randic to propose a Magic Circle representation, where the whole protein sequence is represented in a unit circle: the graph starts from the center and follows the sequence by moving halfway toward the corresponding amino acids, which are positioned equally spaced on the circumference. Running the complete protein sequence within the circle produces a graph typical of a particular protein (Fig. 4), although for large protein sequences the visual benefit is often reduced. Li et al. used a reduction model that abstracts the protein sequence into a five-letter code (Wang and Wang, 1999; Li et al., 2008), each letter representing a specific group of amino acids, and generated a 2D graph by plotting the reduced sequence on the x-axis and all five group representatives horizontally at equal intervals along the y-axis, resulting in a zig-zag like graphical representation of the sequence (Fig. 5). 2D graphical representations based on nucleotide triplet codons (Bai and Wang, 2006) have been proposed for sequence comparison and start-stop sign of a coding region. Liao et al. (2006) used this approach to study 24 coronavirus genomic sequences, which have approximately 29,000 bases each.
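The Magic Circle construction described above (start at the center, move halfway toward the amino acid on the circumference) can be sketched as follows. This is a minimal illustration, not Randic's implementation: the alphabetical ordering of the 20 amino acids around the circle and the function name `magic_circle_points` are assumptions made for this example.

```python
import math

# Ordering of residues around the circle is an assumption (alphabetical one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def magic_circle_points(protein):
    """Return the points traced inside the unit circle for a protein sequence.

    The walk starts at the center; for each residue it moves halfway from the
    current point toward that residue's position on the circumference.
    """
    n = len(AMINO_ACIDS)
    vertex = {
        aa: (math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
        for k, aa in enumerate(AMINO_ACIDS)
    }
    x = y = 0.0
    points = []
    for aa in protein.upper():
        vx, vy = vertex[aa]
        x, y = (x + vx) / 2.0, (y + vy) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in magic_circle_points("MNPNQK"):
        print(point)
```

Because every step halves the distance to a point on the circumference, all plotted points stay strictly inside the unit circle, which is consistent with the remark that long sequences crowd the picture.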
Liao et al. classified the 20 amino acids of a protein sequence into four separate groups according to the chemistry of their R groups: amino acids A, V, F, P, M, I, L belong to the hydrophobic chemical group; amino acids D, E, K, R belong to the charged chemical group; amino acids S, T, Y, H, C, N, Q, W belong to the polar chemical group; the unique G amino acid belongs to the glycine chemical group. Starting with the nucleotide sequence, this enabled them to construct three 2D graphs (one for each reading frame) for each gene sequence and compute a distance matrix between the 24 coronaviruses, from which they could generate a phylogenetic tree relating all the sequences without the need for any multiple alignments. Gonzalez-Diaz and his coworkers have used 2D lattice graphs for proteins (Aguero-Chapin et al., 2006; Gonzalez-Diaz et al., 2008a), constructed in a similar way to the DNA representations of Nandy and adapted to proteins according to a proposed protocol (Estrada, 2002), and extended to other graph representations such as spiral and star networks (Aguero-Chapin et al., 2008a,b; Dea-Ayuela et al., 2008; Vilar et al., 2008; Munteanu et al., 2009); for example, for a star network, starting from the beginning of the sequence, the amino acids are placed in the corresponding branches, transforming the protein sequence into a modified branch connectivity graph from which connectivity indices (CIs) can be derived. Other authors have proposed higher dimensional representations such as the 3D model of Bai and Wang (2006), who embedded a dodecahedron in 3D space where each corner represented one of the 20 amino acids and thus generated a walk for the protein sequence, and the 20D representation of Novic and Randic (2008).
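The four-group classification of amino acids listed above can be written down directly as a lookup table. The sketch below only covers the grouping step; it does not reproduce the codon-based 2D graphs or the distance matrix of the cited study, and the names `GROUPS`, `classify`, and `group_composition` are hypothetical.

```python
# Group membership exactly as listed in the text.
GROUPS = {
    "hydrophobic": "AVFPMIL",
    "charged": "DEKR",
    "polar": "STYHCNQW",
    "glycine": "G",
}

# Reverse lookup: one-letter amino acid code -> group name.
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def classify(protein):
    """Map each residue of a protein sequence to its chemical group."""
    return [GROUP_OF[aa] for aa in protein.upper()]


def group_composition(protein):
    """Count how many residues fall in each of the four groups."""
    counts = dict.fromkeys(GROUPS, 0)
    for group in classify(protein):
        counts[group] += 1
    return counts


if __name__ == "__main__":
    print(classify("MNPNQK"))
    print(group_composition("MNPNQK"))
```

Reducing a 20-letter sequence to four symbols in this way is what makes a 2D, four-direction plot possible for protein-coding sequences.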
The present authors have proposed an alternate 20D representation in Euclidean space, where each amino acid is assigned to one axis in the 20D space and the sequence is plotted using algorithms similar to the random walk model for 2D graphical representation of DNA sequences. This procedure generates a graph in the abstract 20D space from which numerical descriptors can be calculated to characterize proteins and quantify similarities and dissimilarities. Analogously to the GRANCH techniques of DNA sequences, to obtain quantitative measures for protein sequences, Randic, Li, Gonzalez-Diaz, and several other authors have extended the methodologies of numerical descriptors for DNA sequences and of topological indices (TIs) used in QSAR studies to analysis of protein sequences, viral surfaces, and RNA secondary structures (Estrada and Uriarte, 2001; Randic et al., 2004, 2008; Gonzalez-Diaz et al., 2007b; Li et al., 2009a), leading to more general biological applications.
FIG. 5. Zig-zag curve generated from the 2D graphical representation of five-letter coded amino acids IKKIIIIIIIIGIIGAIKIGKEIAKIKKAA. Reproduced with permission from BMB reports, Web: http://www.bmbreports.org. Source: Li et al., 2008.
González-Díaz and collaborators have done extensive work on extension of these representations to the study of protein sequences (Aguero-Chapin et al., 2009) and applied them to mass spectral data of proteins and protein serum profiles in parasites (Gonzalez-Diaz et al., 2008b), toxicoproteomics, and diagnosis of cancer patients (Cruz-Monteagudo et al., 2008; Gonzalez-Diaz et al., 2008a). Their group has used mathematical biodescriptors derived from toxicoproteomics maps in conjunction with chemodescriptors of toxic molecules to predict their toxicity (Hawkins et al., 2006).
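To make the 20D walk idea concrete, the sketch below assigns one axis per amino acid and steps one unit along the corresponding axis for each residue. The descriptors are defined here by analogy with the 2D DNA case (mean of the visited points, its norm, and the Euclidean distance between two mean points); whether these match the exact definitions used for the 20D method is an assumption, as are the axis order and all function names.

```python
import math

# One axis per amino acid; the axis order is arbitrary and chosen here for illustration.
AXES = "ACDEFGHIKLMNPQRSTVWY"
AXIS_INDEX = {aa: k for k, aa in enumerate(AXES)}


def walk_20d(protein):
    """Return the 20D points visited: one unit step along the residue's axis per residue."""
    position = [0] * len(AXES)
    points = []
    for aa in protein.upper():
        position[AXIS_INDEX[aa]] += 1
        points.append(tuple(position))
    return points


def centroid(protein):
    """Mean of the visited points (assumed analog of the 2D center of mass)."""
    points = walk_20d(protein)
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(AXES))]


def radius(protein):
    """Distance of the centroid from the origin (assumed analog of the graph radius)."""
    return math.sqrt(sum(m * m for m in centroid(protein)))


def distance(protein1, protein2):
    """Euclidean distance between the centroids of two protein walks."""
    c1, c2 = centroid(protein1), centroid(protein2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))


if __name__ == "__main__":
    print(radius("MNPNQK"))
    # Same composition, different order: the centroids differ.
    print(distance("AC", "CA"))
```

Unlike the 2D DNA walk, a walk with one axis per symbol can never retrace its path, and because the centroid averages over all intermediate points it depends on residue order as well as composition.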
Integrated QSARs developed using chemodescriptors for ligands and biodescriptors of a molecular entity, for example, connect structural information of drug molecules, DNA and RNA sequences, or RNA secondary and protein tertiary structures and may be used to predict parameters for new entities (Gonzalez-Diaz et al., 2008a). It has been found that using different types of numerical indices derived from the protein 2D molecular graphics to perform QSAR studies is simpler than having to work with the protein 3D structures (Vilar and González-Díaz, 2010). These indices describe graph/network topology, connectivity, or branching, and are often referred to as graph TIs or network CIs; they are used to determine structure-function relationships in cellular biochemistry (Chou and Cai, 2003) and have been applied in theoretical biology and bioinformatics to small molecules, macromolecules, proteome mass spectra, and protein interaction networks (Aguero-Chapin et al., 2006; Gonzalez-Diaz et al., 2007a, 2008a). In a pathbreaking work using a new differential QSAR approach to study dihydrofolate reductases (DHFR) from multiple strains of Plasmodium falciparum, Basak et al. (2011) have shown that DHFR from the wild strain is substantially different from the four mutant strains of their study, and they remark that the protocols indicated in the paper can be used for the development of drugs to combat the drug-resistant pathogens that arise continuously in nature due to mutations. A numerical characterization of protein sequences through the nucleotide triplet codons has also been proposed, using a 2D graphical representation system similar to that of Yau et al. (2003) to generate protein descriptors for the Homo sapiens X-linked nuclear protein (ATRX).
An intuitively simpler indexation scheme based on the 20D graphical representation of protein sequences proposed by the present authors and described below has been found useful in generating phylogenetic relationships between sequences without the necessity of multiple alignments and for determining conserved surface exposed stretches on viral proteins that could be useful in drug and vaccine design (Ghosh et al., 2010). Numerical characterization of sequences has also been targeted at the challenging problem of determining evolutionary relationships in protein families; for example, the growth in the number of voltage-gated sodium channel proteins from one in bacteria (e.g., Bacillus halodurans) to 10 in humans, the development of the globin genes, and the growth of differences in the highly conserved histones. Popular software packages such as PHYLIP (Retief, 2000) and MEGA (Tamura et al., 2007; Kumar et al., 2008) are available for phylogenetic analysis, based generally on complex multiple sequence alignment (MSA) algorithms. Graphical methods like the k-tuple and dot-matrix methods (Gibbs and McIntyre, 1970) form an integral part of MSA algorithms, and other graphical methods assess the extent of similarity/dissimilarity between protein sequences and serve as inputs to the software packages that generate the phylogenetic trees. Bai and Wang (2006) derived the phylogenetic relationships for selected proteins using their 3D graphical representation, where the amino acids are plotted on the corners of a dodecahedron. From the curve of the protein sequence obtained as a walk within this 3D space (Fig. 6), they derive a quotient matrix similar to the DD matrix discussed earlier for the DNA plots (Li and Wang, 2005), from which they can calculate the distance matrix between a set of protein sequences. The application of a similar procedure to a set of nine nerve genes from various organisms led to the generation of phylogenetic trees (Fig. 7).
While the usual methods of generating such trees are difficult to apply because of the varying lengths of the sequences, the matrix method with leading eigenvalues does not have such problems and generates fairly acceptable relationships, although some of the details show, as the authors point out, that the method requires further refinement. The method is also useful in that it allows visual inspection of protein sequence characteristics and thus is good for comparative study of proteins too. Li et al. (2009a) proposed a 3D graphical representation of protein sequences where the amino acids were classified into five separate groups based on their interactions. Thus, in terms of the one-letter code of amino acids, Group 1 consisted of the amino acids C, M, F, I, L, V, W, Y; Group 2 of A, T, H; Group 3 of G, P; Group 4 of D, E; and the last, Group 5, of S, N, Q, R, K. Representing each group by one amino acid, a protein sequence can be reduced to a sequence of five letters only, which can then be used to generate a random walk in a 3D Euclidean space where the steps are in designated cardinal directions (Fig. 8). Taking a cue from the work by Gonzalez-Diaz et al. (2005), they use the charge information of the amino acids and the number of amino acids at each node of the walk to define four charge coupling numbers for each sequence from which, after some combinatorics, they generate a 60-component vector for each sequence. Applying this technique to beta-globin protein sequences from 15 species, they were able to quantitatively assess the similarities and dissimilarities between the proteins by comparing the sequence vectors. This also led to the generation of a distance matrix which, though not explicitly shown by the authors, can be used to draw the phylogenetic tree for this protein family. The results obtained by the authors' prescription agree with established data.
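The reduction of a protein sequence to a five-letter alphabet described above can be illustrated with a short Python sketch. The group memberships are those listed in the text; the choice of I, A, G, E, and K as the representative letters of Groups 1 to 5 is an assumption inferred from the five-letter strings in the captions of Figs. 5 and 8, and the function name is hypothetical. The sketch covers only the alphabet reduction, not the 3D walk or the charge coupling numbers.

```python
# Five groups as listed in the text for Li et al. (2009a).
# Keys are the representative letters (assumed from the figure captions).
GROUPS = {
    "I": "CMFILVWY",  # Group 1
    "A": "ATH",       # Group 2
    "G": "GP",        # Group 3
    "E": "DE",        # Group 4
    "K": "SNQRK",     # Group 5
}

# Invert the table: every one-letter amino acid code maps to its representative.
TO_REPRESENTATIVE = {
    aa: rep for rep, members in GROUPS.items() for aa in members
}


def reduce_to_five_letters(sequence: str) -> str:
    """Replace each residue by the representative letter of its group."""
    return "".join(TO_REPRESENTATIVE[aa] for aa in sequence.upper())


print(reduce_to_five_letters("MVHLTPEEKS"))  # IIAIAGEEKK
```

For the ten-residue example MVHLTPEEKS this gives IIAIAGEEKK, which coincides with the first ten letters of the five-letter string quoted in the Fig. 8 caption.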
Thus, GRANCH methods are seen to be useful techniques for representing protein characteristics that can be easily computed while avoiding complications arising out of the need for multiple alignments (Altschul, 1989; Gotoh, 1993) and other modeling assumptions. The usefulness of these approaches and the reasonable agreement that we observe with standard results provide a good basis for investigating new phenomena such as viral issues, which are the subject of the next section. Viruses, the smallest biological entities, possess distinct groups of proteins with a number of unique properties such as high adaptability, high mutation rate, high structural flexibility, loose packing of the core, and a high proportion of disordered segments, among others (Koonin et al., 2009; Tokuriki et al., 2009; Kristensen et al., 2010). At the genome level, a very specific example of viral uniqueness resides in the existence of virus hallmark genes (Koonin et al., 2006), which play a central role in viral replication and structure and are shared by a broad variety of viruses. In contrast to thermostable proteins like the heat-shock protein of Thermotoga maritima (Tokuriki et al., 2009), which have specialized characteristics like high contact density, highly stable sequence composition, and a highly compact structural scaffold, viral proteins necessarily have to be more complex to retain their functional characteristics in spite of the high variability. Another remarkable feature of viruses is the diversity in their genetic cycle. Altogether, the variety of genetic strategies, genomic complexity, and global ecology of viral evolution lead to the formation of an infective, long-existing noncellular life form.

FIG. 8. 3D graph of the five-letter sequence of first 31 residues of Gorilla betaprotein IIAIAGEEKKAIAAIIGKIKIEEIGGEAIGK; each node may contain more than one amino acid. Reproduced with permission from Elsevier provided by Copyright Clearance Center (CCC).
Web: http://www.sciencedirect.com/science/journal/03784371. Source: Li et al., 2009a.

Currently, prevention and treatment of viral diseases such as influenza rely on inactivated vaccines and antiviral drugs. The impact of mutational changes in amino acid residues on the stability, activity, and sensitivity of the target protein is a widely studied topic in antiviral drug design and in the search for adequate remedies. The general causes involved in the generation of antiviral-resistant strains, such as high mutability, altered specificity, and environmental adaptability, have been the main target of most of this research. Several investigations have focused upon phylogenetic relationships in viral evolution and transmission (Vijaykrishna et al., 2010) and reassortment (Lam et al., 2008; Owoade et al., 2008). Other researchers are trying to correlate the evolution of the genomic varieties of influenza that affect humans with those that infect other life forms, for example, avian populations, with the view that characterization of the causative proteins and determination or isolation of the conserved parts are useful approaches to combating viral diseases. Application of GRANCH techniques to viral proteins indicates one path to achieve this goal. The smallest unit that makes a particular protein identifiable is an eight to nine amino acid long peptide segment. This fundamental unit is frequently used in wet lab and dry lab research involving protein mass spectrometry data analysis, sequence alignment and phylogenetic algorithms, protein database handling software (Perkins et al., 1999), structure-based drug design, etc. Comparison of a large group of protein sequences often involves comparing the basic units of the proteins (from single amino acids to more complex structural levels like peptides, secondary structures, domains, etc.) and their organization.
GRANCH provides novel ways of identifying sequences or peptides by generating an identifier intended to uniquely specify a protein and its compositional information. GRANCH techniques for protein sequences have emerged recently with promising applications to studies of coronavirus and the avian and swine flu viruses. We briefly describe the characteristics of the viruses, cover the GRANCH methods used in these studies, and state the significant results.

The H5N1 avian flu erupted in Hong Kong in 1997 (Hatta et al., 2001) and was carried by migratory birds from its place of origin in South Central China to the rest of Asia and to Europe and Africa. The existence of the virus gene pool in China and continuous mutations among the virus strains have led to continued rapid spread worldwide by different carriers, with sudden outbreaks erupting at different locations among aquatic birds, poultry, and farm animals, and also infecting humans, resulting in over 300 deaths out of 505 confirmed cases (from World Health Organization; updated August 31, 2010). The H5N1 virus, like all other influenza viruses, is an enveloped virus with an eight-segment single-stranded RNA in the core and two surface proteins on the envelope, the hemagglutinin and the neuraminidase, which are responsible for the glycosylation necessary for cell entry and exit. Although the number of fatalities in humans from this virus appears small, the rapid mutations that can occur in the RNA genome, and the possibility of whole genes or gene fragments shuffling between avian and mammalian hosts (Wu et al., 2008), are considered to carry the potential to cause a pandemic. Since the inhibitors of this influenza virus, principally oseltamivir and zanamivir, act on the neuraminidase component of the H5N1 protein, continuous monitoring of the mutational changes in this gene assumes significance.
The H1N1 swine flu outbreak of 2009, often referred to as Mexican flu or just swine flu, though less severe pathogenically than the H5N1 avian flu, infected humans and spread worldwide rapidly enough to lead the WHO to declare it a pandemic. The genomic structure closely parallels the H5N1 genome except for the important difference in the hemagglutinin subtype, and the virus has responded well to oseltamivir therapy, implying again the importance of the neuraminidase in the control and remedy of these forms of influenza (Moscona, 2005). However, an escape route (Moscona, 2004, 2009) from this standard treatment through genetic mutations remains highly probable and provides ample impetus for continued research into the development of alternate therapeutic strategies.

SARS erupted on the world stage in 2003 (Gorbalenya et al., 2004) from its origins in South East Asia and was established to have been caused by a novel form of the coronavirus. Coronaviruses also are enveloped viruses with a single-stranded multisegment RNA genome, but ranging in size from 16 to 31 kb (Lai, 1990). The virus primarily infects the upper respiratory and gastrointestinal tracts of mammals and birds, but the human SARS coronavirus also affects the lower respiratory tract. Experimental studies are complicated by the fact that the human coronaviruses are difficult to grow in the laboratory. While earlier only two human coronaviruses were known, HCoV-229E and HCoV-OC43 (Gorbalenya et al., 2004), after the SARS epidemic three more coronaviruses had been identified by 2005, namely SARS-CoV, NL63 (van der Hoek et al., 2004), and HKU1, leading to interest in the evolutionary history of this virus.

1. The 2D Method of Liao et al. (2006)

As mentioned earlier (Section II), Liao et al. (2006) constructed 2D graphs with the four R-groups of the amino acids at predetermined angles on either side of the x-axis.
For each nucleotide sequence, they constructed three separate graphs for the three reading frames of the gene sequence. For each graph, they defined a geometric center of mass (x_0, y_0) and a covariance matrix CM of the plotted points, with the summations in these definitions running over the subscript i from 1 to N, the length of the sequence. The covariance matrix CM is a 2 × 2 square matrix with a leading eigenvalue λ. Thus, for the three graphs of each sequence there will be a set of three geometric centers of mass and three leading eigenvalues. From these eigenvalues, they defined a distance measure for two sequences i and j, which can be used for studies of evolutionary relationships between species without having to make any evolutionary model assumptions or multiple alignments of the sequences.

2. The 2D Method of Li et al. (2008)

Another 2D graphical method has been described by Li et al. (2008), in which they ascribe a 60-component vector to each of the proteins and construct a distance matrix from the vector components, where i and j refer to two different sequences and the component index r = 1, 2, 3, ..., 60. This structure allows them to generate a phylogenetic tree in similar fashion to Liao et al.

Taking a cue from the graphical representations of DNA sequences, Nandy et al. (2009) proposed an abstract 20D Cartesian coordinate system to generate a protein sequence walk by plotting one point for each amino acid in the sequence along a designated axis for that acid as shown in Table I; the choice of association is equivalent for all residues and can be arbitrarily assigned, but once assigned it remains fixed for the duration of the computation. The walk as per the sequence results in a series of points in the abstract 20D space generating a curve, each point on the walk being specified by 20 coordinate values. For example, for a protein sequence like MVHLTPEEKS the coordinates of the end point will be (0,0,0,2,0,0,1,0,1,1,1,0,1,0,0,1,1,1,0,0), and the exercise can likewise be performed for any protein sequence.
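The 20D walk just described can be sketched in a few lines of Python. The axis assignment used here (axes 1 to 20 in alphabetical order of the one-letter codes) is an assumption: it is consistent with the end point quoted above for MVHLTPEEKS and with the Thr, Val, Trp, Tyr entries of Table I, but any fixed assignment would serve. The names `AXIS_ORDER` and `walk_20d` are hypothetical.

```python
# Assumed axis assignment: axis 1 = A, axis 2 = C, ..., axis 20 = Y.
AXIS_ORDER = "ACDEFGHIKLMNPQRSTVWY"
AXIS_INDEX = {aa: k for k, aa in enumerate(AXIS_ORDER)}


def walk_20d(sequence):
    """Return the list of 20D points visited, one point per residue."""
    point = [0] * 20
    points = []
    for aa in sequence.upper():
        point[AXIS_INDEX[aa]] += 1    # one unit step along that residue's axis
        points.append(tuple(point))   # record a snapshot of the current position
    return points


points = walk_20d("MVHLTPEEKS")
print(points[-1])
# (0, 0, 0, 2, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0)
```

The end point simply counts the occurrences of each residue; the order of the residues is carried by the intermediate points of the walk.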
Unlike some of the 2D graphical representations of DNA and protein sequences, there are no degeneracies or path retracements in this representation and all amino acids are represented on an equal footing. While the clear disadvantage of this method is that the graph cannot be visualized, numerical characterization of the sequences can be easily computed as described below and used for comparison between sequences irrespective of sequence lengths. The quantification procedure in this representation characterizes a sequence by a weighted center of mass approach first used for DNA sequences, with the CM coordinates μ_i computed from the x_i, the coordinate values of each point on the abstract curve, and with N, the normalization factor for the μ_i, being the number of amino acids in the protein chain. Using these weighted averages, the procedure defines a protein graph vector p_R(μ_1, μ_2, ..., μ_20) and a protein graph radius.

17 Threonine Thr T; 18 Valine Val V; 19 Tryptophan Trp W; 20 Tyrosine Tyr Y (axis number, amino acid, three-letter code, one-letter code; final rows of Table I).

Again, a distance between two sequences i and j can be defined, with the sum taken over all 20 coordinates. Obtaining a distance matrix from comparison of a family of sequences enables the generation of a phylogenetic tree to study evolutionary relationships, again, as in all GRANCH methods, without having to introduce multiple alignments or any other model dependencies. Nandy et al. (2009) have successfully applied this algorithm in tree construction for human globin variants and for voltage-gated sodium channel isoforms. It is to be noted that this numerical characterization method refers strictly to the identities of the amino acids and is blind to their chemical properties; that is, no distinctions are made between residues that are mutationally conservative or nonconservative, between polar and nonpolar residues, between basic and acidic residues, etc., and all residues are treated on a par.
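A minimal sketch of the quantification step is given below. The exact formulas are assumptions based on the description above: each μ_k is taken as the sum of the k-th coordinate over all points of the walk divided by N, the graph radius as the Euclidean length of the μ vector, and the distance between two sequences as the Euclidean distance between their μ vectors. The axis order and all function names are likewise hypothetical.

```python
import math

# Assumed axis assignment: alphabetical order of the one-letter codes.
AXIS_ORDER = "ACDEFGHIKLMNPQRSTVWY"


def weighted_center(sequence):
    """Assumed form: mu_k = (1/N) * sum over walk points of coordinate k."""
    n = len(sequence)
    current = [0] * 20   # current point on the walk
    totals = [0] * 20    # coordinate sums over all points visited so far
    for aa in sequence.upper():
        current[AXIS_ORDER.index(aa)] += 1
        for k in range(20):
            totals[k] += current[k]
    return [t / n for t in totals]


def graph_radius(sequence):
    """Assumed form: Euclidean length of the mu vector."""
    return math.sqrt(sum(m * m for m in weighted_center(sequence)))


def distance(seq_a, seq_b):
    """Assumed form: Euclidean distance between the two mu vectors."""
    mu_a, mu_b = weighted_center(seq_a), weighted_center(seq_b)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_a, mu_b)))


print(graph_radius("AA"))                     # 1.5
print(distance("MVHLTPEEKS", "MVHLTPEEKS"))   # 0.0
print(distance("MVHLTPEEKS", "MVHLTPEEKSAV"))
```

Because every μ_k is normalized by the sequence length, the two sequences in the last call can be compared directly even though their lengths differ, with no alignment step. Empty sequences are not handled in this sketch.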
As in the case of the g_R for the DNA sequences, the p_R values are also found to be sensitive to changes in the amino acid sequences (Ghosh et al., 2010), and equal values of the p_R imply exact duplication of the amino acid composition and distribution along the sequences. For the 24 coronavirus genomes selected for their study, Liao et al. constructed a 24 × 24 distance matrix (Liao et al., 2006) from which they were able to generate a phylogenetic tree of the whole genome of the virus for different species using their 2D GRANCH technique. MSA, the popular basis for phylogenetic tree generation, does not work properly for whole genomes, and the evolutionary model used may produce a wrong interpretation (Liao et al., 2006). Here, the phylogenetic tree (Fig. 9) obtained by their method clearly defined the evolutionary relationships between the whole genomes of 24 different species of the coronaviruses. Liu and Wang (2010), using an L-tuple-based DNA representation, constructed a set of L × L matrices whose mathematical characterization led to a characterization of the DNA sequences. Obtaining the distance matrices between a set of eight H5N1 avian flu genomes, they were able to generate a phylogenetic tree in which the evolutionary relationships between the various strains of the virus were clearly identified. Nandy et al. (2007) and Ghosh et al. (2009) used the 2D GRANCH techniques for DNA sequences and the 20D methods for protein sequences to analyze the global characteristics of over 680 H5N1 neuraminidase sequences and to determine any systematic and exceptional behavior that may have arisen from mutational changes.
They found from detailed comparison of the g_R and p_R values that, at the protein level, only about 62% unique strains are observed, whereas for the nucleotide sequences the percentage of unique strains is considerably higher at 80%, implying that about 22% (percentage of synonymous sequences to uniques) of the purportedly new strains of the neuraminidase gene have synonymous mutations. Considering the neuraminidase's segmented structure of transmembrane, stalk, and body regions, it was found that the body region appears proportionately less stable than the transmembrane or the stalk regions. In contrast, a 50-base segment at the 5′-end of the gene is found to be highly stable, mutations there being observed in less than 4.5% of the sequences at the RNA level and about 1.9% at the protein level, raising the possibility of investigating this region as potentially useful for designing novel neuraminidase inhibitors. The duplicate sequences identified by the p_R analysis showed sequence duplication across species and distributed over substantial distances in space and time. While localized or cosynchronous distributions can be expected to occur due to rapid dissemination of specific strains through viral shedding as one mechanism, the appearance of identical strains in geographically widely separated locations several thousand kilometers apart, or after a lapse of 2 years or more, is puzzling since viral genes are known to mutate rapidly in replication. The authors hypothesized that this may arise from virus shed in aquatic and nonaquatic habitats being subsequently spread across wide regions by migratory or local birds, which might not themselves be infected but act merely as carrier agents. The p_R analysis from this technique also showed for the first time that recombinations between segments in the neuraminidase gene may have been taking place.
Thus, analysis of sequence similarities and dissimilarities, done comparatively easily through the numerical descriptors, can reveal many interesting features of viral spread and mutational changes.

3. Conserved and surface exposed peptide stretch identification

In nature, viruses are found to carry a great quantity of sequence variation at both the RNA and the protein level. These variations in viral sequences (Phillips et al., 1991; Chen and Deng, 2009) generally come from spontaneous mutation, adaptive forces, various mutagenic effects, sequence recombination, etc. Such mutations are observed more in the parts in contact with the environment, which therefore readily develop resistance to drugs. Conserved regions in such parts of the protein, when determined, can be used for many purposes such as structure-based drug design, determination of viral protein activity, and vaccine design. Ghosh et al. have applied the methods of GRANCH to determine just such regions in the H5N1 avian flu neuraminidase protein (Ghosh et al., 2010). Using the 20D similarity/dissimilarity technique through comparisons of p_R values, all the proteins in the dataset were scanned with window sizes of 6-14 amino acids and the p_R values compared to find regions of least variability. This variability profile is then compared with a solvent accessibility profile to determine regions of low variability and high solvent accessibility, implying that these identified regions would be accessible to drugs and vaccines and would also offer target sites over many cycles of mutations. A view of the 3D structure also ensures that highly surface exposed regions are actually selected. The authors determined six such regions on the neuraminidase protein (Fig. 10), of which the most promising appears to be the 50-base (16 amino acid) stretch at the 5′-end of the gene mentioned earlier. A special feature of this 16 amino acid long peptide is its location on the dimeric interface (indicated by blue color in Fig. 10) of the quaternary structure of the neuraminidase protein, implying that any disruption of this stretch could interfere with the stability of the protein itself.

FIG. 10. Conserved surface exposed regions are shown in different colors in the cyan colored monomer of neuraminidase (other monomers are colored in magenta, green, and yellow). Here the six conserved regions are shown in six different colors. The conserved C-terminal portion is shown in blue.

Nandy (2010) has reported on similar work done on the rotavirus in association with several others. There, seven such distinct conserved surface exposed regions have been identified with this procedure. The four most promising regions have tested positive on epitope prediction servers (Peters et al., 2005; Vita et al., 2010) and reportedly hold promise for peptide drug and vaccine development.

Thus, the GRANCH techniques for protein sequences are turning out to be quite useful novel methods for the analysis of proteins. Extension of these techniques to applications of different measurements, as espoused by Gonzalez-Diaz et al. and others, is opening up new methods to visualize and analyze experimental data and provide new insights. Applications by various authors to viral issues have generated new model-independent ways to establish evolutionary relationships. In particular, the GRANCH techniques have provided for the first time a systematic method to determine conserved surface exposed peptide stretches on viral proteins that could be potentially very useful for drug and vaccine development.

Comparative study of topological indices of macro/supramolecular RNA complex networks MMM-QSAR recognition of ribonucleases without alignment: comparison with an HMM model and isolation from Schizosaccharomyces pombe, prediction, and experimental assay of a new sequence Novel 2D maps and coupling numbers for protein sequences.
The first QSAR study of polygalacturonases; isolation and prediction of a novel sequence from Psidium guajava L Alignment-free prediction of polygalacturonases with pseudofolding topological indices: experimental isolation from Coffea arabica and prediction of a new sequence The neuraminidase of influenza virus Gap costs for multiple sequence alignment A 2-D graphical representation of protein sequences based on nucleotide triplet codons On graphical and numerical representation of protein sequences Analysis of similarity between RNA secondary structures Protein structure prediction and structural genomics Haemoglobin: the structural changes related to ligand binding and its allosteric mechanism Risky business: challenges in vaccine risk communication Mathematical biodescriptors of proteomics maps: background and applications Predicting pharmacological and toxicological activity of heterocyclic compounds using QSAR and molecular modeling Characterization of dihydrofolate reductases from multiple strains of Plasmodium falciparum using mathematical descriptors of their inhibitors Motif trap: a rapid method to clone motifs that can target proteins to defined subcellular localisations The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence Energetics of drug-DNA interactions Drug-DNA interactions Turning protein crystallisation from an art into a science High-throughput protein crystallization Protein crystallization: from purified protein to diffraction-quality crystal Influenza virus antigenic variation, host antibody production and new approach to control epidemics Drug protein interactions Shaping up the protein folding funnel by local interaction: lesson from a structure prediction study Prediction of protein signal sequences and their cleavage sites Prediction and classification of protein subcellular location-sequence-order 
effect and pseudo amino acid composition Conformational parameters for amino acids in helical, beta-sheet, and random coil regions calculated from proteins Prediction of protein conformation Could it happen here? Vaccine risk controversies and the specter of derailment 3D-MEDNEs: an alternative ''in silico'' technique for chemical research in toxicology. 2. quantitative proteome-toxicity relationships (QPTR) based on mass spectrum spiral entropy HP-Lattice QSAR for dynein proteins: experimental proteomics (2D-electrophoresis, mass spectrometry) and theoretic study of a Leishmania infantum sequence Multiple effects of codon usage optimization on expression and immunogenicity of DNA candidate vaccines encoding the human immunodeficiency virus type 1 Gag protein The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema The protein folding problem The protein folding problem: when will it be solved? Recombinant therapeutic proteins: production platforms and challenges Principles of protein folding, misfolding and aggregation Mediation, modulation, and consequences of membrane-cytoskeleton interactions Using the tools and resources of the RCSB protein data bank Hidden Markov models Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins Characterization of the folding degree of proteins Recent advances on the role of topological indices in drug discovery research Protein structure modeling with MODELLER Comparative protein structure modeling using MODELLER Protein conformational prediction Probabilistic sampling of protein conformations: new hope for brute force? 
Classification of conformational stability of protein mutants from 3D pseudo-folding graph representation of protein sequences using support vector machines Protein-protein interaction and enzymatic activity Protein drug stability: a formulation challenge A simple way to look at DNA Overview of the stability and handling of recombinant protein drugs Computational analysis and determination of a highly conserved surface exposed segment in H5N1 avian flu and H1N1 swine flu neuraminidase Computational study of dispersion and extent of mutated and duplicated sequences of the H5N1 influenza neuraminidase over the period 1997-2008 The ultimate speed limit to protein folding is conformational searching The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences Antimicrobial-drug resistance Overview of Immune System Recognition of stable protein mutants with 3D stochastic average electrostatic potentials Generalized lattice graphs for 2D-visualization of biological information Predicting antimicrobial drugs and targets with the MARCH-INSIDE approach Computational chemistry approach to protein kinase recognition using 3D stochastic van der Waals spectral moments Medicinal chemistry and bioinformatics-current trends in drugs discovery with networks topological indices Severe acute respiratory syndrome coronavirus phylogeny: toward consensus Optimal alignment between groups of sequences and its application to multiple sequence alignment Rotavirus neutralizing protein VP7: antigenic determinants investigated by sequence analysis and peptide synthesis DNA vaccines: immunology, application, and optimization H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences Molecular basis for high virulence of Hong Kong H5N1 influenza A viruses Combining chemodescriptors and biodescriptors in quantitative structure-activity relationship modeling Chaos game representation of gene structure Protein folding 