key: cord-0776943-uo1d8xka authors: Singh, Dadabhai T; Trehan, Rahul; Schmidt, Bertil; Bretschneider, Timo title: Comparative phyloinformatics of virus genes at micro and macro levels in a distributed computing environment date: 2008-02-13 journal: BMC Bioinformatics DOI: 10.1186/1471-2105-9-s1-s23 sha: 111968f7fb102882c193488965e41d5d0f0103dc doc_id: 776943 cord_uid: uo1d8xka BACKGROUND: Preparedness for a possible global pandemic caused by viruses such as the highly pathogenic influenza A subtype H5N1 has become a global priority. In particular, it is critical to monitor the appearance of any new emerging subtypes. Comparative phyloinformatics can be used to monitor, analyze, and possibly predict the evolution of viruses. However, in order to utilize the full functionality of available analysis packages for large-scale phyloinformatics studies, a team of computer scientists, biostatisticians and virologists is needed – a requirement which cannot be fulfilled in many cases. Furthermore, the time complexities of many algorithms involved leads to prohibitive runtimes on sequential computer platforms. This has so far hindered the use of comparative phyloinformatics as a commonly applied tool in this area. RESULTS: In this paper the graphical-oriented workflow design system called Quascade and its efficient usage for comparative phyloinformatics are presented. In particular, we focus on how this task can be effectively performed in a distributed computing environment. As a proof of concept, the designed workflows are used for the phylogenetic analysis of neuraminidase of H5N1 isolates (micro level) and influenza viruses (macro level). The results of this paper are hence twofold. Firstly, this paper demonstrates the usefulness of a graphical user interface system to design and execute complex distributed workflows for large-scale phyloinformatics studies of virus genes. Secondly, the analysis of neuraminidase on different levels of complexity provides valuable insights of this virus's tendency for geographical based clustering in the phylogenetic tree and also shows the importance of glycan sites in its molecular evolution. CONCLUSION: The current study demonstrates the efficiency and utility of workflow systems providing a biologist friendly approach to complex biological dataset analysis using high performance computing. In particular, the utility of the platform Quascade for deploying distributed and parallelized versions of a variety of computationally intensive phylogenetic algorithms has been shown. Secondly, the analysis of the utilized H5N1 neuraminidase datasets at macro and micro levels has clearly indicated a pattern of spatial clustering of the H5N1 viral isolates based on geographical distribution rather than temporal or host range based clustering. Recent occurrences of pandemics like the Severe Acute Respiratory Syndrome (SARS) or Avian Influenza clearly underscore the threat and seriousness of global diseases. The steadily growing globalization makes it difficult to contain pandemics to a certain region. Therefore, pandemic control is of highest importance to human health. Unfortunately, the segmented nature of the genome of influenza viruses is very conducive for genetic shifts and their rapid spread across various genera augments genetic drift. For example, the human influenza pandemics in 1957 and 1968 were suggested to have been caused by reassorted influenza viruses [1] . Moreover, the H5N1 outbreak in Hong Kong in 1997 has convincingly demonstrated the ability of an avian virus to make the transition from birds to humans directly without going through a perceived "permissible host". In particular, the H5N1 virus is believed to have acquired the hemagglutinin (HA) gene from A/goose/Guangdong/1/96 H5N1 and A/teal/ Hong Kong/W312/97 H6N1, while the internal genes were received from A/quail/Hong Kong/G1/97 H2N2 or A/teal/Hong Kong/W312/97 H6N1, respectively [2] , [3] . Even though this particular strain was eliminated by culling millions of chicken, its ancestors remains circulating in aquatic birds. This paper proposes a new approach to pandemic control by constantly monitoring molecular evolution at both macro level (within the group of viruses) and micro level (within the group of strains) using comparative phyloinformatics. This can facilitate prediction of how these viruses are evolving in terms of spatial, temporal, and host dimensions, and therefore, allows for faster responses to new outbreaks as well as their diagnosis. However, corresponding phylogenetic tree construction algorithms suffer from long runtimes due to their high degrees of computational complexity as well as the large datasets involved. Therefore, it is necessary to develop informatics based solutions that use suitable algorithms and take advantage of distributed computing technologies to make such studies feasible in a reasonable amount of time. Furthermore, these solutions need to be integrated in a framework for pandemic control that is biologist-friendly. As a result, effective vaccines and antivirals can be designed more easily and within a shorter response time, since specific local strains can be targeted directly. In this paper it is demonstrated how such a system can be developed using a new distributed workflow design system called Quascade. As a proof of concept, results for an actual systematic phyloinformatics analysis of neuraminidase (NA) genes of H5N1 isolates and influenza viruses are presented. Various phylogenetic algorithms; i.e. UPGMA, Neighbor-Joining, Maximum Parsimony, Maximum Likelihood and Mr. Bayes [4] [5] [6] [7] , are compared with respect to their efficiency and accuracy in deriving biologically meaningful results. In particular, the presented study illustrates how these algorithms can be integrated in a unique user-friendly workflow in order to enhance the efficiency of a comparative phyloinformatic analysis. The selection of the receptor destroyer NA for this study is motivated by its following properties. NA belongs to the glycosyl hydrolase family of proteins and catalyzes the cleavage of sialic acid residues from viral and cellular glycoproteins. Its most important function is to remove the terminal sialic acids from HA in order to enable the virus budding. Moreover, it helps the virus spreading through the system by additionally removing sialic acids from cell surfaces. In particular, since variations in the NA gene can lead to the evolution of more potent strains, studying the molecular evolution of this gene using the latest algorithms and computing power is essential. Several isolated studies have been conducted since the H5N1 outbreak in late 2003 to monitor the molecular evolution at gene and genome level [8] , [9] . All these studies involve a phylogenetic analysis at one level or the other. However, no systematic efforts have been undertaken so far to evaluate the suitability of the used phylogenetic algorithms in yielding biologically meaningful results for the analysis of HA and NA gene products of H5N1 viruses in particular and influenza viruses in general. We have used the new Quascade system to implement distributed workflows for comparative phyloinformatics using different algorithms. Figures 1, 2 , and 3 show these workflows for distance-based algorithms, maximum parsimony algorithms and maximum likelihood algorithms, respectively. The workflows use ClustalW [10] to compute multiple sequence alignments and different phylogenetic algorithms from the PHYLIP package [4] . The most compute-intensive parts of each workflow (i.e. ProtDist in Figure 1 , ProtPars in Figure 2 , and ProtML in Figure 3 ) are executed in a distributed computing environment by simply multiplexing the data and using several instances of the respective programs. Figures 4 and 5 show execution times of the workflows for varying numbers of processors and sequences. In summary it can be seen that the system can achieve linear speedups. We have used the designed workflows for the phyloinformatics analysis of NA in different populations of H5N1 in particular and in influenza viruses in general. The reason for choosing a protein-based phyloinformatics approach instead of a gene-based approach is to gain a better understanding of the molecular evolution of the gene product that makes this deadly virus spreading across different hosts. To perform this study, three protein data sets have been collected as follows. A medium sized dataset has been obtained comprising 581 entries pertaining to H5N1 NA that were mined from the Uniprot database [11] . 4 . A macro dataset of 909 entries of NA from all Influenza A viruses has been obtained from Uniprot. These three datasets are shown in the additional files 1, 2, and 3. Phylograms obtained for the core group from the character based algorithms ProtPars [4] and ProtML [4] are shown in the additional files 4 and 5, while phylograms obtained from the two distance-based algorithms are shown in the additional files 6 and 7. The phylogram obtained by Mr. Bayes [7] is displayed in Figure 6 . P18269Sial, a trypanosoan sialdase and Q05JH9H9N2, a NA of the distantly related H9N2 virus have been used as members of an outlier group for the phyloinformatics analysis. Our phyloinformatics analysis with the core set has revealed a clear pattern: spatial clustering of the strains based on the particular geographical region rather than temporal clustering based on time scale or according to host range. The obtained trees for the medium-size data set are shown in the additional files 8, 9, and 10. The algorithms ProtPars, NJ, and UPGMA were used. All these algorithms have been distributed using the above Quascade workflows and deployed on a cluster of PCs. The utilized cluster consists of 16 nodes comprising 32 CPUs. Its detailed specification and architecture are shown in Table 1 and Figure 7 , respectively. However, the algorithms ML and Mr. Bayes could not be run on the existing system with the medium dataset since it became apparent that these algorithms require further optimization in terms of distribution and/or more processing resources. Finally, in order to provide a global perspective of the molecular evolution of H5N1_NA, the dataset of 909 sequences of Influenza A viruses rather than H5N1 alone has been utilized. The objective of this macro level analysis is to signify the fact that this approach may lead to the nearest neighbors of some of the lethal clades which may be kept under observation for further evolution. Again, the large dataset was analyzed using the three algorithms ProtPars, NJ and UPGMA. The results obtained are shown in the additional files 11, 12, and 13. As per the core and medium sized datasets, the macro dataset also conforms Distributed ML Workflow Distance-based Parsimony to the clustering of H5N1_NA along the spatial lines rather than temporal or host range lines. Keeping the Thai clade isolates in focus while analyzing the large dataset, the results obtained with ProtPars agree with the trend obtained so far. All twelve core sequences are mapped to the Thai clade. Besides these twelve, ten more sequences were found in this clade, whereby eight of them are fragments. The remaining two are actually complete sequences from Thailand (chicken isolates). However, they were annotated as Neuramindase instead of Neuraminidase [12], [13] which explains why these two entries are missing in the core set. However, querying of Uniprot with "Influenza virus" and "NA" as query terms has returned these two entries which are clearly captured and validated in our phyloinformatics analysis. Zooming out of the Thai clade has shown the Vietnam clade as its nearest neighbor, followed further by the Indian clade with Vietnam strains as their neighbors. On the next higher level Indonesian, Malaysian and Chinese clades are indicated as neighbors. It would be interesting to further study the Indian clade and verify how closely the Indian strains and Vietnamese strains are related with respect to other genes and also at genome level. Such an analysis may lead to the identification of the origin for Bird Flu in India. Further analysis of the large dataset indicates H9N2 as the nearest clade to H5N1 and non-structural protein as nearest to NA within the H5N1_NA group. The overall tree includes isolates from H3N2 to H15N9. A detailed analysis of each clade and sub-clade within this tree and with reference to the other two trees could lead to an understanding of H5N1's molecular evolution at global level. The core set includes samples isolated over a time frame from 1959 to 2006 and represents globally distributed localities from Thailand to Bavaria, Germany. All the phylograms have essentially yielded geographic based clustering rather than time based clustering. The seven distinct clades that were clearly demarcated are: Thailand clade I (Q2L700; Q2LDC8; Q2LDC0; Q6PUP7; Q307U7; Q45ZM8), Thailand clade II (Q5SDA6; Q307V5; Q6PUP6; Q5MD56), Thailand clade III (Q6B518; Q4PKD4), Indian clade (Q0PEF9; Q0PEG0) Scotland clade (Q0A2H3; Q710U6) Bavarian clade (A1EHP1; Workflow Scalability for large data sets A quick perusal of the results obtained with the three algorithms (ProtPars, NJ and UPGMA) has confirmed the pattern obtained from the core set. The strains of H5N1_NA cluster spatially rather than temporally or according to host. However, there are subtle differences amongst the outputs of the three algorithms with respect to their resolution. The clustering obtained using NJ seems to be better resolved than the other two in terms of branch length and sub-speciation. A detailed analysis of the Thai clade obtained by ProtPars also has revealed that all the sequences from the core set have been represented in this tree as well. The additional isolates are mostly fragments obtained through PCR amplification and uploaded to Uniprot. An interesting aspect of this clade is the finding that the entry Q5EP24 (Chicken isolate from Vietnam) is placed almost as an outlier in this group. It would be interesting to analyze the genealogies of Vietnam and Thailand clades and to verify whether there are any "bridging" isolates such as Q5EP24 that may play an important role in the spread of this deadly disease across the globe. We have analyzed our medium dataset to confirm the pattern we obtained with the core set. A detailed analysis of each geographic clade with respect to the "bridging" isolates such as Q5EP24 may reveal a global pattern of H5N1 spread. Global preparedness for H5N1 pandemic has been declared as the top priority of global health agencies such as CDC. In view of the ever escalating threat of the virus mutating to a human lethal form, it has become essential to constantly monitor the molecular evolution of this virus utilizing the latest and best phylogenetic algorithms. The current study was undertaken with two objectives: One is to demonstrate the utility of a biologist's friendly high performance computing workflow system in analyzing large and complex biological datasets by deploying compute intensive algorithms in their parallelized ver-sion. The second objective is to analyze H5N1_NA data, as a proof of concept, in such a workflow to understand the molecular evolution of this rapidly evolving virus. We have demonstrated the utility of workflow systems by designing a pipeline that starts with data input by the biologist. Once the relevant data is uploaded into the workflow the remaining steps are all automated. In particular these steps comprise starting with multiple alignments of sequences to the distribution of the output to various algorithms, distributing the relevant outputs to different servers in a distributed environment which is grid compatible, and finally visualizing the output. Such a workflow system saves a significant amount of time and eliminates possible human errors in analyzing critical data. In any phylogenetic analysis involving pathogenic viruses there are three clear possibilities of clustering behavior: spatial, temporal and host based. Our analysis of H5N1 NA data utilizing different phylogenetic algorithms have indicated a spatial clustering of this virus based on geographical distribution rather than temporal or host. Of course, single gene product analysis is insufficient to arrive at any biologically relevant conclusion and, hence, we propose to do multiple gene and genome level analysis as our future work. However, even for a single gene analysis the computing power required is formidable and we have used this approach as a proof of the concept that high performance computing is a must for any meaningful phyloinformatics study. Even with the limited amount of data, we have been able to find a clear pattern (geographical clustering) and the importance of glycon sites. However, further detailed studies in conjunction with other proteins such as HA, Polymerase etc., and also at gene and genome level, are required to draw firm conclusions. It is also true that interpretation becomes more difficult if the size of the trees becomes very large. In our experience, zooming onto a cluster of interest (e.g. Thailand and China clades) with relative ease and quickness, using the Quascade middleware is an attractive feature of our study. Furthermore, the comparison of different algorithms simultaneously with the same input file is another attractive attribute. The exponential growth in the size of biological databases has established the need for high performance computing (HPC) in bioinformatics. Typically, an HPC setup operates in a cluster computing environment consisting of multiple computers that communicate over fast switches. For popular database scanning applications such as BLAST and HMMER the benefits of clusters are immediate and linear speedups can be easily achieved. However, the evolving challenges in life sciences cannot be all addressed by off-the-shelf bioinformatics applications. Life scientists need to analyze their data using novel approaches published in recent journals or based on their own hypotheses and assumptions. Quascade-MP2 has been developed to address this need. It is a visual prototyping tool created especially for data-driven, high performance scientific applications. The complexity involved in using traditional technologies and tools often proves to be over-whelming and counterproductive. Most of these tools require an understanding of programming or scripting languages, such as Perl, Python, Java, and UNIX-scripts. Recent examples of such systems in bioinformatics include Biopipe [15] , BioWBI [16] , Taverna [17] , Wildfire [18] , KDE Biosciences [19] , gRNA [20] , Biowep [21] , and BioWMS [22] . Each deployment is therefore subject to its own code development and testing. Although programming languages offer sophisticated control over the intended procedure, the learning curve and overheads in terms of human resources are difficult to justify. Quascade has been developed with this problem in mind. Quascade has been designed for research applications characterized by strict high performance requirements. It provides a graphical, drag-and-drop interface to allow users to design and execute ad-hoc workflows. Workflows are constructed by an end-user by inter-connecting functional blocks, called components. A component is a piece of independent and self-contained code performing a given functionality on its input and generates an output. Data paths among components can be specified by drawing lines between/among the output(s) and input(s) of different components, respectively. Thereby, various combinations of these components (such as one output to multiple inputs, multiple outputs to multiple inputs) can be used to create any workflow depicting the flow of both data and logic. Individual components in a workflow may be designated to run on different computing servers in a cluster by simply specifying a corresponding condition in Quascade. This provides a straightforward approach of constructing workflows that execute in a high-performance environment. In turn, a component running on a given server may use several computing nodes to execute its program, thereby providing a two-tiered mechanism of distributed processing. As an example, Figure 8 shows a physical workflow and its graphical counterpart. The workflow consists of a sample generator component, which generates/collects data, an analyzer component which performs the actual processing, and a sample sink component which 'consumes' the processed data in terms of not forwarding the data further to another component. While the genera-tor and sink components can be connected to external entities, i.e. physical devices, the analyzer component may be a simple single-node program or a complex, parallel multi-node application. The user creates a workflow by selecting workflow components from a predefined list of deployed components on the cluster. Each component is configured to run on a particular cluster server or automatically assigned to a server if no explicit configuration is desired. At run-time Quascade performs a remote invocation to the selected servers and creates local instances of the used components. Hence, client-server communication takes place between a Quascade client and one or more instances of MP 2 servers forming the cluster. Complementary, server-server communication takes place between two or more servers or between a server and its compute nodes. Two different types of communication can be differentiated: explicit and implicit. They distinguish between the workload distribution by a parallel component to multiple compute nodes and the communication between two of more servers exchanging data, respectively. Serverserver communication uses raw sockets to transfer data from one machine to the other. This solution provides a performance benefit over more sophisticated alternatives such as RMI or CORBA, but represents the lowest level of abstraction of the underlying network. As an answer to this trade-off, MP 2 overcomes the limitations of development complexity and inflexibility by providing an abstraction at the component level supported by a fixed number of network level operations. Thus, a developer has a high degree of flexibility regarding the operation of a component as long as the implementation adheres to the I/O scheme provided by the communication layer. Implicit communication refers to the communication that takes place between the output of one component and the input of another component, i.e. the selected data paths among components. The underlying transfer mechanism is provided by the middleware and comprises input buffers in order to enable asynchronous communication. Input buffers are continuously polled by the corresponding component and processing continued once data is available. An adjustable buffer length according to load expectation helps to prevent overflows. For example our workflow component for ClustalW shows all the options that the original ClustalW program presents to users. The options are presented in a parameter output panel and translated into command line string for calling the underlying ClustalW program upon execution of the workflow. After the completion of ClustalW, an output file is produced on the local hard disk at a specified location containing the aligned data set. This file name and its location is the input for the next component (e.g. Seqboot). The remaining components, e.g. Seqboot with its underlying PHYLIP operate in the same way. For instance, Figure 9 shows a screenshot of the implemented sequential workflow for a distance based phylogenetic analysis. The runtime of this workflow has been profiled on a single server for an input of 42 sequences and N = 100 replica. As can be seen form Figure 10 , the Protdist component clearly dominates the overall execution time and requires parallelization. The Seqboot component is modified to break up its output file's data into several data sets, which are written to a buffered output port. Distribution of files is then implemented by using a demultiplexer (Demux) component and a result combining module (Concat). More technical details about the file distribution and concatenation as well as other workflow parameters can be found elsewhere [23] . The model, design and implementation used to distribute the distance-based workflow can also be applied to parallelize the parsimony and the ML based workflows. The usage of the model and design developed for the Protdist component has therefore been used for the Protpars and the ProtML components. The corresponding workflows are shown in Figure 1 , 2, and 3. Figure 9 Distance based Quascade workflow. Screenshot of sequential distance based phylogenetic analysis workflow in Quascade. The pathogenesis of influenza in humans Molecular aspects of avian influenza (H5N1) viruses isolated from humans The continued pandemic threat posed by avian influenza viruses in Hong Kong The neighbor-joining method: A new method for reconstructing phylogenetic trees Phylogeny estimation: Traditional and Bayesian approaches MrBayes 3: Bayesian phylogenetic inference under mixed models Sawanpanyalert P: Molecular characterization of the complete genome of human influenza H5N1 virus isolates from Thailand World Health Organization International Avian Influenza Investigative Team, Farrar J: Avian influenza A (H5N1) in 10 patients in Vietnam CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice Evolutionary interactions between N-linked glycosylation sites in the HIV envelope Biopipe: A flexible frameyours -you keep the copyright Submit your manuscript here work for protocol-based bioinformatics analysis BioWBI: An integrated tool for building and executing Bioinformatic analysis workflows Taverna: A tool for the composition and enactment of bioinformatics workflows Wildfire: Distributed, Grid-enabled workflow construction and execution Platform for bioinformatics analysis workflows Development of distributed bioinformatics applications with GMP. Concurrency and Computation: Practice and Experience Biowep: A workflow enactment portal for bioinformatics applications BioWMS: A webbased workflow management system for bioinformatics Phylogenetic analysis of neuraminidase genes of H5N1 isolates using HPC technologies This article has been published as part of BMC Bioinformatics Volume 9 Supplement 1, 2008: Asia Pacific Bioinformatics Network (APBioNet) Sixth International Conference on Bioinformatics (InCoB2007). The full contents of the supplement are available online at http://www.biomedcentral.com/ 1471-2105/9?issue=S1. The authors declare that they have no competing interests. All sequence in the core data set. Click here for file [http://www.biomedcentral.com/content/supplementary/1471-2105-9-S1-S23-S1.txt] All sequence in the medium data set. Click here for file [http://www.biomedcentral.com/content/supplementary/1471-2105-9-S1-S23-S2.txt] All sequence in the macro data set. Click here for file [http://www.biomedcentral.com/content/supplementary/1471-2105-9-S1-S23-S3.txt] Phylograms obtained from the Parsimony workflow for the dataset H5N1_NA_24.txt. Click here for file [http://www.biomedcentral.com/content/supplementary/1471-2105-9-S1-S23-S4.pdf] Phylograms obtained from the ML workflow for the dataset H5N1_NA_24.txt. Click here for file [http://www.biomedcentral.com/content/supplementary/1471-2105-9-S1-S23-S5.pdf] Multiple Sequence Alignment of all sequences in H5N1_NA_24.txt except the two outliers.Click here for file [http://www.biomedcentral.com/content/supplementary/1471-2105-9-S1-S23-S15.pdf]