key: cord-1018841-zamrnne9 authors: Pérez-Montoto, Lázaro Guillermo; Dea-Ayuela, María Auxiliadora; Prado-Prado, Francisco J.; Bolas-Fernández, Francisco; Ubeira, Florencio M.; González-Díaz, Humberto title: Study of peptide fingerprints of parasite proteins and drug–DNA interactions with Markov-Mean-Energy invariants of biopolymer molecular-dynamic lattice networks date: 2009-07-17 journal: Polymer (Guildf) DOI: 10.1016/j.polymer.2009.05.055 sha: b1577fbdc56e73d9a509bf83ada71cae3c142ee3 doc_id: 1018841 cord_uid: zamrnne9 Since the advent of Molecular Dynamics (MD) in biopolymers science with the study by Karplus et al. on protein dynamics, MD has become the by foremost well established, computational technique to investigate structure and function of biomolecules and their respective complexes and interactions. The analysis of the MD trajectories (MDTs) remains, however, the greatest challenge and requires a great deal of insight, experience, and effort. Here, we introduce a new class of invariants for MDTs based on the spatial distribution of Mean-Energy values ξ(k)(L) on a 2D Euclidean space representation of the MDTs. The procedure forces one MD trajectory to fold into a 2D Cartesian coordinates system using a step-by-step procedure driven by simple rules. The ξ(k)(L) values are invariants of a Markov matrix ((1)Π), which describes the probabilities of transition between two states in the new 2D space; which is associated to a graph representation of MDTs similar to the lattice networks (LNs) of DNA and protein sequences. We also introduce a new algorithm to perform phylogenetic analysis of peptides based on MDTs instead of the sequence of the polypeptide. In a first experiment, we illustrate this algorithm for 35 peptides present on the Peptide Mass Fingerprint (PMF) of a new protein of Leishmania infantum studied in this work. We report, by the first time, 2D Electrophoresis isolation, MALDI TOF Mass Spectroscopy characterization, and MASCOT search results for this PMF. In a second experiment, we construct the LNs for 422 MDTs obtained in DNA–Drug Docking simulations of the interaction of 57 anticancer furocoumarins with a DNA oligonucleotide. We calculated the respective ξ(k)(L) values for all these LNs and used them as inputs to train a new classifier with Accuracy = 85.44% and 84.91% in training and validation respectively. The new model can be used as scoring function to guide DNA–Drug Docking studies in drug design of new coumarins for PUVA therapy. The new phylogenetics analysis algorithms encode information different from sequence similarity and may be used to analyze MDTs obtained in Docking or modeling experiments for any classes of biopolymers. The work opens new perspective on the analysis and applications of MD in polymer sciences. Computational approaches can timely provide very useful information and insights for both basic proteome research and drug design. Many line evidences of evidences such as structural bioinformatics [1] , molecular docking [2] , molecular packing [3] , pharmacophore modeling [4] , Monte Carlo simulated annealing approach [5] , diffusion-controlled reaction simulation. In addition, Quantitative Structure-Activity Relationships (QSARs) [6, 7] , protein sub-cellular location prediction [8] [9] [10] , protein structural class prediction [11] , identification of membrane proteins [12] , identification of enzymes and their functional classes [13, 14] , identification of GPCR [15] , identification of proteases, protein cleavage site prediction [16] , and signal peptide prediction [17] have indicated that they are widely welcome by science community. In this context, after a pioneer paper entitled 'The Biological Functions of Low-Frequency Phonons' [18] published in 1977, a series of investigations into biopolymers from dynamic point of view have been stimulated. These studies have suggested that lowfrequency (or terahertz frequency) collective motions do exist in proteins and DNA that hold a very high potential to reveal the profound dynamic mechanisms of many marvelous biological functions in biological systems (see, e.g. [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] and a comprehensive review [33] ). This kind of inferences has been later observed by NMR [34] , and been further used for medical treatments [35, 36] . In view of this, to understand really the interaction mechanism of drugs with proteins or DNA, we should consider not only the static structures concerned but also the dynamical information obtained by simulating their interactions through a dynamic process. The present study was attempted to address this problem from the angle of Molecular Dynamics (MD). In fact, since the advent of MD with the work of Karplus et al. [37] [38] [39] [40] [41] [42] , MD has become a computational technique to investigate structure and function of biomolecules and their respective complexes and interactions. It is also of high relevance taking into account that the previous structure of the polymeric double helix of DNA as well as the non-covalent binding (in dark) between DNA and drug has a strong influence on the subsequent photoreaction and therefore on their biological activity [43, 44] . Consequently, MD studies of the biopolymers including polypeptides or polynucleotide DNA-Drug complexes are of the major relevance too [32, 45] . In general, the analysis of the MD trajectories (MDTs) resulting from the integration of the equations of motions in MD remains, however, the greatest challenge and requires a great deal of insight, experience, and effort. In a recent and very important work, Hamacher [46] proposed a new, theoretical sound, and versatile analysis procedure that provide scientists with a semi-quantitative invariant measures to compare various scenarios of their respective simulations. On the other side, using graphic approaches to study biological systems can provide useful insights. As indicated by many previous studies graph have been used on a series of important biological topics, such as enzyme-catalyzed reactions [47] [48] [49] , protein folding kinetics [50] , inhibition kinetics of processive nucleic acid polymerases and nucleases [51] , analysis of codon usage [52] , base frequencies in the anti-sense strands [53] , analysis of DNA sequence [54] . Moreover, graphical methods have been introduced for QSAR study [55] as well as utilized to deal with complicated network systems [56, 57] . Recently, the ''cellular automaton image'' [58] has also been applied to study hepatitis B viral infections [59] , HBV virus gene miss-sense mutation [60] , and visual analysis of SARS-CoV [61] , as well as representing complicated biological sequences [62] and helping to identify protein attributes [63] . In this sense, several authors have used pseudo-folding lattice Hydrophobicity-Polarity (HP) models to simulate polymer folding making simulations to optimize the lattice structure and resemble real folding [64] [65] [66] [67] [68] [69] [70] [71] . However, we can choose notably simpler polymer chain pseudo-folding rules to avoid optimization procedures and speed up notably the construction of the lattice. In this sense, useful graph representations of DNA, RNA and/or protein sequences have been introduced by Gates [72] , Nandy [73] , Leong [74] , Randic et al. [75] based on 2D coordinate systems. We call these graph representations as polymer sequence pseudo-folding Lattice Networks (LNs) because they look like lattice structures and in fact we are forcing a sequence to fold in a way that not necessarily occurs in nature. In general, these LNs (as for other polymer graph representations) can be numerically characterized with Topological Indices (TIs), see for instance the previous paper series published by our group on Polymers [76] [77] [78] [79] . These TIs describe the distribution of amino acids or nucleotides along the polymeric chain but also encode higher-order information or other type of information on polymer mixtures. Thus lattice pseudo-folding TIs can be used in protein Quantitative Structure-Property Relationships (QSARs) [7, 81] to connect polymeric structure with physicochemical or biological properties. Our group has used the approach called MARCH-INSIDE to calculate these TIs of pseudo-folding lattice-like networks to predict diverse protein or DNA/RNA functions. For instance, we have used stochastic pseudo-folding spectral moments to predict Ribonucleases [82] and Dyneins [83] . The MARCH-INSIDE pseudo-folding TIs can be calculated when we sum the respective indices for each node of the graph. All the abovementioned values were used recently to predict microbacterial promoters and compare entropies, spectral moments, and pseudofolding electrostatic potentials [84] . The readers may see three recent reviews discussing the applications ranging from graph of small molecules to graph or network representation of protein sequences and 3D structure, DNA sequences, RNA secondary structure, or human blood proteome mass spectroscopy outcomes [7, 81, 85] . In any case, if we understand sequence as a type of input data we have not to limit the applications of the pseudo-folding lattice network method to proteins, DNA or RNA sequences. Elaborating this line of thinking we have proposed pseudo-folding lattice network representations of Mass Spectroscopy outcomes typical of blood Proteome samples containing many proteins. For instance we have constructed lattice network representations for mass spectroscopy results obtained from blood proteome samples typical of drugs causing cardiotoxicity [86] . After calculation of the sum of the TIs of each sample we used them to seek a new type of classifier. The model connects TIs values of the Mass Spectra of the blood proteome with the probability of appearance of drug cardiotoxicity [79, 87] . We have used these lattice network TIs also to predict human prostate cancer [88] . The success of this strategy encouraged us to consider other classes of sequence data and solve different problems. For instance, the MDTs referred in previous paragraphs are time series obtained from simulation runs that constitute another type of sequential data. Considering that the Mean Values of a Markov Chain associated to LN are also sequence invariants we decided to explore here the use of these indices to describe MDTs. In the present paper, we adapted LN representations for the study of MDTs obtained in both DNA-Drug Docking and Peptide structure optimization experiments. In this sense, we report two different experiments: in one we report a new phylogenetic analysis for MDTs of Peptides (Experiment 1) in the other we obtain a new scoring function for DNA-Drug Docking studies (Experiment 2). In Experiment 1: we shall deal with the following questions: (a) adaptation of LN to represent MDTs obtained in peptide optimization procedures; (b) calculation of a new class of TIs for peptide MDT networks; and (c) introduction of a new phylogenetic approach to compare the MDTs of different peptides found in the Peptide Mass Fingerprint (PMF) of protein. For it, we are going to use as example a real experiment we describe here by the first time. Here we isolate with 2D gel Electrophoresis and characterize with MALDI TPF MS all the peptides found on the PMF of a protein expressed on Leishmania infantum. The study of PMFs of new proteins may become an interesting source to fish new peptides with potential use as drug, in vaccine design, or as disease biomarkers. In particular, Leishmania parasitic species are the causal agents of Leishmaniosis one of the most important parasitic diseases [89, 90] . The toxicity and inefficacy of actual organic drugs against Leishmaniosis justify research projects to find new drugs or drug molecular targets in Leishmania species including L. infantum (L. infantum) and Leishmania major (L. major), both important pathogens [83, [91] [92] [93] . In this sense the bioinformatics studies of Leishmania gene and proteins become a significant goal [94, 95] . One possibility to accomplish the former goal is the use of proteome research techniques. For instance, in proteome research authors often use a combination of 2D Electrophoresis (2DE) and Mass Spectroscopy (MS) to isolate and characterize new sequences from biological samples [96] . Obtaining the PMF of the protein is a very useful procedure in this sense [97] . In these cases, we employ informatics tools, such as Sequest or MASCOT, to have the MS outcomes for some of the more important peptides of the more similar proteins [98, 99] . It means that, for instance, MASCOT may give the collection of MS signals and the corresponding sequence of peptides present in known proteins that match with our MS input. In order to rank and select the best protein/peptide candidates MASCOT use the Mowse score (M s ) [100] . If a template protein in the database has a high M s (>51), this protein has a PMF very similar to the PMF of our query proteins and we can detect a high sequence homology and perform function annotation. However, there is still another situation that often appears in proteome research and do not coincide exactly with the two situations above-mentioned in the first paragraph. We refer to the case when you identify a new protein, perform the MS analysis of PMF; introduce it in MASCOT (or other MS and sequence database) and the software identify some template candidates with an important M s that is not sufficiently high to accurately annotate the query protein (>40). In an excellent work have been reported an alternative to M s and discussed the limits of accurate scoring [101] . Nevertheless, if this kind of situation persists you have neither the sequence of the query protein nor the sequence of a template protein with high homology but you have the PMFs of both the query and the template. We call this situation here as: the query sequence missing and Low-Mowse scoring case. Independently from the possibility of function annotation of Low-Mowse proteins this kind of PMFs are, in our opinion, ideal sources to fish interesting peptides with bioinformatics methods. Anyhow, from these facts we can conclude that the method used to compare different peptides in this search is very important. Bioinformatics methods based on Sequence alignment and similarity measures are very useful to perform sequence function annotation. Some authors have referred however that an alignment procedure may fail in cases of low sequence homology between the query and the template sequences deposited in the database. Alignment techniques are also useless if there is high query-template homology but we do not know the function of the template sequence deposited in the database [102] . On the other hand, some authors mentioned in the previous paragraphs have introduced 2D or higher dimension graph representations of sequences prior to the calculation of TIs. These representations are associated to algebraic structures that have been extended also to genetic codes [103] [104] [105] . This constitutes an important step in order to uncover useful higher-order information not encoded by 1D sequence parameters [73, [106] [107] [108] [109] [110] [111] [112] [113] [114] [115] [116] [117] [118] . In any case, we can use either the sequence directly or the graph parameter to develop phylogenetic trees in order to compare different peptides. Phylogenetic analysis often relays on tree graph construction. Applications of graphs and networks in phylogenetics are too broad a topic for detailed treatment here. The reader can consult reviews and compilations on this topic for an overview of this area [120] . Phylogenetics is commonly used to predict which amino acid residues are critical for the function of a given protein [121] . However, such approaches do have inherent limitations, such as the requirement for the identification of multiple homologs of the protein under consideration. Thibert and Bredesen [122] reported a study of cancer proteins in an extensive human PIN constructed by computational methods. They compared a couple of phylogenetic approaches to several different network-based methods. Another interesting direction is the use of TIs of the DNA, RNA and/or protein graph representations described above to construct phylogenetic trees in an alignment-independent way. For instance, Zupan and Randič [123] studied Spectrum-like and Zig-Zag representations of the beta-globin gene for different species and also obtained phylogenetic trees without alignment. In another paper, Liao proposed a 2D graphical representation of a DNA sequence [124] . Liao et al. [125] used this representation as a basis to compute the similarities between 11 mitochondrial sequences belonging to different species and used the elements of the similarity matrix to construct the phylogenic tree. Among all above-mentioned, Liao, Randic, Basak, Vackro, Nandy and Wang [108, 118] associated a DNA sequence having n bases with n  n non-negative real symmetric matrix A with elements a ij and use its leading eigenvalue to characterize the DNA sequence in phylogenetic studies. These matrices have been derived from 2DD representations and different TIs calculated [126] . On the other hand, Zhang et al. [127] very recently introduced TIs referred to as Zinv for 3DD curves and used them to analyze the phylogenetic relationships for the seven HA (H5N1) sequences of avian influenza virus. The equations used in these methods to calculate the indices Inv(A) and Zinv(A) as well as the phylogenetic distances between two peptides p and q are given as follows [127] , see Equations (1)-(4). In closing, in the Experiment 2 we recognize the necessity of new peptide phylogenetic methods, the success of TIs of LN to encode sequence information, and the importance of MDT studies. Consequently, based on these facts we introduce a new phylogenetic method based on x k (L) values. The type of parameter obtained depends on the type of systems under study (molecules, proteins, Mass Spectra), the parts of the system (atoms, amino acids, MS signals), and the property used to describe these parts (electronegativities, electrostatic charge, signal intensity). For instance, in other works, we used Markov chain pseudo-folding electrostatic potentials to found models that predict Polygalacturonases [117] or human colon and breast cancer biomarkers [129] . In this experiment we found a model that can be used as scoring function to evaluate DNA-Drug Docking search. These models belong to a general class of methods known as QSARs are devoted to unravel structural and physicochemical requirements for biological activity in a great variety of compounds [130] . The classic QSAR studies connect information of the chemical structure of the molecule, expressed by means of numbers, with the biological activity [55] . However, QSARlike procedures are not restricted to drugs and biological activity but other systems and properties, such as proteins or DNA/RNA may be predicted [132] [133] [134] [135] . One special class of indices used in QSAR are the TIs of molecular graphs; which indicates the presence of vertices or nodes (atoms) and connections or edges between nodes (chemical bonds) [136] [137] [138] [139] . Nevertheless, we can use TIs of different types of graph representations or networks may be used. In these networks, amino acids, nucleotides, enzymes, microorganisms, cerebral cortex regions, web pages, social groups, etc., may play the role of nodes and electrostatic interactions, mutations, metabolic reactions, host-parasite relationships, brain region co-activations, links, diseases propagations, etc. may play the role of edges [7, [140] [141] [142] [143] [144] [145] . Many authors prefer to use the term Quantitative Structure-Binding affinity Relationship (QSBR) when one use QSAR-like procedures to predict drug-target binding affinity and 3D structural information [146] . In any case, both approaches QSAR and QSBR diverge in some degree on the type of measure (activity or binding) and sometimes on how detailed we need to know the chemical structure (2D or 3D) but both use essentially the same algorithm. In addition to predicting drug activity we can use 3D drug-target QSAR/ QSBR models as scoring function to guide the search of optimal drug-target interaction geometries in drug-target Docking studies [147] [148] [149] . Almost all QSAR/QSBR or other types of Docking-scoring functions are aimed to predict protein-drug interactions. For instance, Wang et al. [150] reported a comparative study of eleven whereas Ferrara et al. [151] studied nine different Docking-scoring functions all for Protein-drug interactions. Conversely, DNA-Drug and RNA-Drug Docking are generally less investigated. In particular, we did not found a QSBR scoring function for DNA-Furocoumarin Docking. The furocoumarins are a class of natural or synthetic compounds with very interesting pharmacological properties [152] . Commonly used in the treatment of skin diseases such as psoriasis and mycosis fungoides [153] . This treatment called PUVA consists in a therapy that combines the use of both chemicals and long-wave ultraviolet light (UV-A) [154] . The molecular base of PUVA is connected with the highly specific photo damage in DNA of epidermal cells. This damage interferes with the DNA replication, producing an inhibition of DNA synthesis which reduces or blocks the cell duplication [155] . Although the lineal furocoumarins (psoralens) are able to form the three adduct types, the geometry of the angular ones (angelicins) only allows them to form monoadducts with the DNA. On the other hand, it is well known that the side effects observed in PUVA therapy, such as skin phototoxicity and risk of skin cancer are strictly connected with the bi-functional lesions in DNA [156] . These facts points to the stability DNA-Drug complex as a central factor in the activity of anticancer drugs in general including furocoumarins. In closing, in the Experiment 1 we recognize the necessity of new scoring functions for DNA-Drug Docking methods, the importance of furocoumarins in PUVA therapy, the success of TIs of LN to encode sequence information, and the importance of MDT studies. Consequently, based on these facts we introduce a new DNA-Drug scoring function for furocoumarins based on x k (L) values. The MARCH-INSIDE approach is extended here to the study of LN representations for MDTs obtained in DNA-Drug Docking studies. In Secondly, the method uses the matrix 1 P, which is a squared matrix to characterize the MDT embedded into the LN. Please, note that the number of nodes (n) in the graph may be equal or even smaller than the number of steps given to obtain the MD profile. The same happens for amino acids or DNA bases in the polymeric chain. Accordingly, the matrix 1 P contains the probabilities 1 p ij to reach a node n i moving throughout a walk of length k ¼ 1 from other node n j [83, 129] : where, d E j is the sum of all energy values of the steps d E s that overlap on the same node j. The parameter a ij equals to 1 if the nodes n i and n j are adjacent in the graph and equals to 0 otherwise. The value D 0j gives the geometric location of the node and represents the Euclidean distance between the node and the center of coordinates. Let be, the vector of initial , the calculation of x k (L) values is straightforward to realize be means of Chapman-Kolgomorov equations; these indices can be interpreted as Mean-Energy values for on the 2D Euclidean space representation for MDTs: Promastigotes of the Leishmania strain LEM75 were grown in medium Schneider supplemented to a final concentration of 0.4 g/L NaHCO 3 , 4 g/L HEPES,100 mg/L penicillin and 100 mg/L streptomycin and 10% fetal bovine serum (Gibco), pH 6.8 and 26 C [83] . Mid-log promastigotes were recovered on the seventh day postinoculum (p.i.) and the parasites were centrifuged at 3000 rpm for 10 min at 4 C. The resulting pellet was washed five times with Tris-HCl pH 7.8, and resuspended in 0.1 mL of this same buffer. The sample was sonicated for 10 s with a Virsonic 5 (Virtis, NY, USA) set at 70% output power in ice bath. The homogenate was extracted in 5 mM Tris-HCl buffer pH 7.8 containing 1 mM phenylmethylsulfonyl fluoride (PMSF) as a protease inhibitor, at 4 C overnight and, subsequently centrifuged at 10,000g for 1 h at 4 C (Biofuge 17RS: Heraeus Sepatech, GmbH, Osterode, Denmark). The supernatant was dialysed overnight at 4 C in 0.5 mM Tris-HCl buffer. Proteins were precipitated with 20% TCA (trichloroacetic acid) in acetone with 20 mM DTT for 1 h at À20 C, added 1:1 to the homogenated. Then, the sample was centrifuged at 10,000g for 15 min and the pellet was washed with cold acetone containing 20 mM DTT. Residual acetone was removed by air-drying. In order to achieve a well-focused first-dimension separation, sample proteins must be completely disaggregated and fully solubilized, in a sample buffer containing 7 M Urea, 2 M Thiourea, 4% CHAPS, Destreak buffer (Amersham Biosciences), 5 mM CO 3 K 2 , 2% IPG buffer (Amersham Biosciences) and incubated at room temperature for 30 min. Following clarification by centrifugation at room temperature (12,000g, 10 min) the supernatant was stored frozen at À20 C [83]. In total 340 mL of rehydration buffer were added to promastigotes solubilized extracts (7 M urea, 2 M thiourea, 2% CHAPS, 0.75% IPG buffer pH 4-7, bromophenol blue) and were immediately adsorbed onto 18 cm immobilized pH 4-7 gradient (IPG) strips (Amersham Biosciences) [157] . Optimal IEF was carried out at 20 C, with an active rehydration step of 12 h (50 V), and then focused on an IPGphor IEF unit (Amersham Biosciences) by using the following program: 150 V for 2 h, 500 V for 1 h, 1000 V for 1 h, 1000-2000 V for 1 h and 8000 V for 6 h. After focusing, IPG strips were equilibrated for 15 min in 10 mL of 50 mM Tris-HCl, pH 8.8, 6 M urea, 30% v/v glycerol, 2% w/v SDS, traces of bromophenol blue, and 100 mg of DTT. Then, the strips were incubated for 25 min in the same buffer but replacing DTT by 300 mg of iodoacetamide. After equilibration, the IPG strips were placed onto 12.5% SDS-polyacrylamide gels and sealed with 0.5% (w/v) agarose. SDS-PAGE was run at 15 mA/gel. 2D gels were stained with silver staining mass spectrometry compatible. Briefly, the gels were fixed in 40% ethanol (v/v), 10% (v/v) acetic acid overnight, then were sensitized with sodium acetate 0.68% (w/v) and 0.05% sodium thiosulfate for 30 min and washed with deionized water 3 times for 5 min. The gels were incubated in 0.25% (w/v) silver nitrate for 30 min. After incubation, it was rinsed with deionized water 2 times for 50 s followed by adding the developing solution, which contained 2.5% (w/v) sodium carbonate with 0.04% (v/v) formaldehyde until the desired intensity range. Development was finished by adding 1.5% (w/v) EDTA. Spots of interest were manually excised from silver stained 2-DE gels after being de-stained, as described by Gharahdaghi et al. [158] . Then, gel pieces were incubated with 12.5 ng/ml sequencing grade trypsin (Roche Molecular Biochemicals) in 25 mM AmBic, overnight, at 37 C. After digestion, the supernatants (crude extracts) were separated. Peptides were extracted from the gel pieces first into 50% ACN, 1% trifluoroacetic acid and then into 100% ACN. Then, one microliter of each sample and 0.4 mL of 3 mg/mL a-cyano-4-hydroxycinnamic acid matrix (Sigma) in 50% ACN, 0.01% trifluoroacetic acid were spotted onto a MALDI target. MALDI-TOF MS analyses were performed on a Voyager-DE STR mass spectrometer (PerSeptive Biosystems, Framingham, MA, USA). The following parameters were used: cysteine as s-carbamidomethyl derivative and methionine in oxidized form. Spectra were acquired over the m/z range of 700-4500 Da. The PMF data, obtained from MALDI-TOF MS analyses, were used to search for protein candidates in two sequence databases: SWISS-PROT/TrEMBL non-redundant protein database (www.expasy.ch/ sprot) and a complete genomic database from the related species L. major, namely ftp://ftp.sanger.ac.uk/pub/databases/L.major_ sequences/LEISHPEP/, using MASCOT software program (www. matrixscience.com). The MASCOT search parameters were adjusted according to the MS experiment carried out and to the above description as follows: Type of search: Sequence Query; Enzyme: Trypsin; The MDTs or energetic profiles of all the starting structure of peptides were also obtained by means of the MC method, using the HyperChem package [159, 160] . In this sense, the force field AMBER94 [161] of molecular mechanics was used with distantdependent dielectric constant (scale factor 1), electrostatic and Van der Waals values by default and cutoffs shifted with outer radius of 14 Å (see Fig. 2 ). All the components of the force field were included and the atom type was recalculated keeping their current charges. Previous to Monte Carlo simulation the geometry of all the structures of peptides where optimized with this same force field. Finally, the simulation was executed in the vacuum at 300 K and 100 optimization steps obtaining MDTs with 100 potential energy d E j (j ¼ 1, 2, 3,.100) values each one. We obtained 35 MDTs for 35 peptides. In order to obtain realistic MDTs there is an additional parameter we monitor in MD algorithms; which is known as the acceptance ratio (ACCR). It appears as ACCR on the list of possible selections in the MC Averages dialog box of HyperChem (see Fig. 2 ). The acceptance ratio is a running average of the ratio of the number of accepted moves to attempted moves. Optimal values are close to 0.5. Varying the step size can have a large effect on the acceptance ratio. The step size, Dr, is the maximum allowed atomic displacement used in the generation of trial configurations. The default value of r in HyperChem is 0.05 Å [159] . For most organic molecules, this will result in an acceptance ratio of about 0.5 Å, which means that about 50% of all moves are accepted. Increasing the size of the trial displacements may lead to more complete searching of configuration space, but the acceptance ratio will, in general, decrease. Smaller displacements generally lead to higher acceptance ratios but result in more limited sampling. There has been little research to date on what the optimum value of the acceptance ratio should be. Most researchers tend to try for an average value around 0.5; smaller values may be appropriate when longer runs are acceptable and more extensive sampling is necessary [159] . The DNA-Drug Docking MDTs or energetic profiles of all the starting intercalation complexes were obtained also by means of the MC using the same parameters than for PMFs. We obtained 21 MDTs for psoralens and 154 MDTs for the 36 different angelicins. We also analyzed 36 averaged MDTs for each angelicin taking the average energy d E j (avg) for all the initial positions of one compounds at each one of the 100 steps. All these MDTs form a total of 21 þ154 þ 36 ¼ 211 MDTs. In addition, we analyzed other 211 MDTs (decoy trajectories) obtained as a random deviation from each one of the previous 211 MDTs. These random MDTs contain 100 energy values d E j (rnd) obtained with the random generator of Excel by adding a random deviation term to each d E j within the max-min limits of d E j for all the previous MDTs. The utility of these decoy trajectories is to test the robustness of the method to deviations of the MDTs selected. In total we studied 422 MDTs. The information about all these 422 MDTs including x k (L) values relevant to this work was recorded on the online Supplementary material. Using the vector of initial the vector of x k (L) values for Dist. a À0.5 À0.5 À1 À1 1 Ang For our study we used the decanucleotide of sequence d(CCGCTAGCGG) and the software application HyperChem [160, 162] , a fragment of DNA with double Helix in B form and sugars in 2 0 endo form. This decanucleotide sequence has been used in different studies concerning psoralens intercalation [160, 163] . The structure of all the compounds selected for DNA-Drug interaction studies were optimized using the interactive model building package of HyperChem [162] . The optimization of their geometries was carried out by the Semi-empirical Quantum Mechanics calculations with method PM3 [164] using the Polak-Ribiere algorithm and the options implemented by default in the mentioned package. Thus, the minimized molecular structures were intercalated by hand approach in the DNA fragment, using the HyperChem package and taking into account the following experimentally demonstrated statements: 1. In the dark, the poly[dA-dT] poly[dA-dT] sequence in DNA is the most favorable site for intercalation since the further photoreaction takes place mainly on the 5,6 double bond of the thymine [165] . So, the optimized molecules were inserted among the thymine units in a parallel plane to the bases and, according to our decision, in a halfway position (Fig. 3, left) . 2. The furocoumarins have two reactive sites, but after photoreaction, different types of cycloadducts can be formed: mono (furan-side or pyrone-side) and di-adducts (the cross-link) [166] . Although psoralens are able to form all the cycloadduct types, angelicins forms only monoadducts owing to their angular molecular structure. Keeping this in mind, for each lineal molecule we modeled only one starting conformation, for which the cycloadduct formation by either one or other reactive site (furan or pyrone-side) is equally feasible from a geometric point of view. For each angular molecule we decided to model two starting conformations, one for each monoadduct formation (for the furan-side that we named as j-conformation and for the pyrone-side that we named as c-conformation). 3. The stereochemistry of the furocoumarins adducts is cis-syn [167, 168] . Consequently, the molecules were oriented in such a way that the intercalation complex favors mainly the formation of cycloadducts with this stereochemistry. In the case of the furan-side, the stereochemistry syn means that the furan O 1 0 and the pyrimidine N 1 are going to be on the adjacent corners of the future cyclobutane ring. For the pyroneside, the stereochemistry syn is defined as having the carbonylcarbon of the pyrone ring and the N 1 of the pyrimidine on the adjacent corners of the future cyclobutane ring (Fig. 3, right) . On the other hand, some of the studied angular molecules present ramifications in the C3 carbon that hindered us to model appropriately their j-conformation, due to steric problems with the thymine ring. We also found steric impediments in the backbone of the DNA when these ramifications are much bigger. In all these cases we decided to model several alternative starting conformations for which the steric effects were eliminated. For the majority of the cases we just varied the insertion degree of molecule in the DNA; in the most critical cases we also had to rotate the molecule clockwise, see Figs. 3 and 4, and Table 1 for details. Both, the displacement outwards DNA and the molecule rotation were carried out in the halfway and parallel plane to the nitrogen bases. In this sense, the geometric criterion used was the relative distance (in the plane projection) between the geometric centers of the double bonds (j or c bond for furocoumarins and 5,6 bond for the thymine) that will take part in the photoaddition and the relative angle between them. In both Table 1 and Fig. 2 , the variations of these geometric parameters used to model the j-conformations are represented in a simplified way. Taking this aspects into consideration the notation of a MD trajectory is given here as follows. We used the notation: m-[Bond/Dist./Ang.]; where: m is the number of the compound in Table 2 or Table 3 , Bond ¼ j, c, or j&c are the chemical bonds susceptible of photoaddition in this position; whereas Dist. and Ang. are the distance and angular intercalation parameters, respectively (see also Table 1 ). In this study we selected different furocoumarins and some of their aza-analogous, whose antiproliferative activities in Ehrlich Ascites tumor cells have been determined (Tables 2 and 3) . We obtained in total 422 MDTs for these compounds. We constructed 422 LNs (one for each MD trajectory) transformed them in a vector of 11 p k (d) values for the compound and 11 x k (L) values for the MDt (see previous sections). Were grouped all these 422 MDTs on two sets composed by MDTs of complexes between DNA and active compounds and other composed by trajectories of active compounds with no-optimal MDTs and/or trajectories of nonactive compounds. In general, compounds as 4 0 -MAP and the 4-MBAP, with activities (ID 50 relative to 8-MOP) of 0.13 and 0.14 are considered as poorly active [169, 170] . The biological activity of these compounds is normally studied by evaluating of their capacity of forming an intercalated complex with DNA and their ability of photo-binding through mono-or bi-functional addition to the same macromolecule [171] . A traditional procedure to determine the photobiological and antiproliferative activity of furocoumarins is based on ID 50 , the UVA dose that reduces to 50% of the DNA synthesis in Ehrlich Ascites tumor cells (EATC) in presence of tested compound at certain concentration (18-20 mM). The protocols used in the activity determination are heterogeneous, however it is very common the use of the 8-MOP as reference to express the activity [169, 170, 172] . Keeping in mind all the above-mentioned aspects, we classified the 57 compounds, compiled for our dataset in two observed activity groups: 0 for the inactive compounds (LD 50 0.1) and 1 for the active ones (LD 50 > 0.1). QSBR studies were carried out to obtain models that allow us to classify the furocoumarins derivatives in one of these two activity groups. We selected Linear Discriminant Analysis (LDA) [173, 174] to fit the discriminant function as implemented in the LDA module of the STATISTICA 6.0 software package [175] . Forward-stepwise algorithm was used for variable selection [176] [177] [178] . The statistical significance of the LDA model was determined with Fisher ratio (F) and the respective p-level (p). All the variables included in the model were standardized in order to bring it into the same scale. Subsequently, a standardized linear discriminant equation that allows to compare their coefficients is obtained [179] . We also inspected the Accuracy, Sensitivity, and Specificity of the model for both training and external validation series. Last, cases/adjustable parameters ratios (r), and number of variables to be explored to avoid over-fitting or chance correlation [176, 177] . The general form of this model is the following, where MD score is the real valued variable (output of the model) that scores the goodness of fit to guide DNA-Drug Docking. MD score ¼ thus we refer to these references by reasons of space [180, 181] . we illustrate an overall view of the 2DE map obtained from the L. infantum promastigote homogenate. In this figure we have done a zooming in the left-to-down corner to highlight an area of high density of spots, which apparently corresponds to protein fragments of low MW and low pI. Our interest in this area derived from the fact that these spots remained unchanged from gel to gel repetitions and might correspond to relevant proteins of this parasite. To start investigation on the nature of these proteins initially we the spot marked with an arrow and encircled in the zoom image for this area, see Fig. 5 . The protein contained in each spot was submitted to in-gel trypsin digestion and the mass of the resulting peptides was obtained from MALDI-TOF MS analysis. However, we focus our attention in this study on the protein corresponding to spot #3. Once we have obtained the data from MALDI-TOF MS analysis of spot #3, the more relevant MS signals were introduced into the MASCOT search engine [182, 183] . Due to the fact that the MASCOT collection of annotated databases does not contained data about L. infantum proteome, we chose the L. major database of annotated proteins with MS recorded because of its similarity to L. Infantum [184] . Even being a protein fragment of low MW, the MASCOT search of MS signals found one hit with an M s higher than 51 (p < 0.05) for spot #3 (see Table 4 ). The top Mowse score found was 61, correspondent to the protein LmjF36.6010 of L. major with mass 73 697 but without know function annotation. The second highest Mowse score of 42 correspond to the protein LmjF31.2850c assigned as one ribosomal protein of L. major specie with mass 22 350. The other proteins to complete a total of 20 with similar Mowse scores were summarized in Table 4 . All these proteins have Mowse values lower that the threshold value of 51 used to identify proteins with significant similarity. In any case, many of them have been also recorded with unknown function or as hypothetical proteins. Taking into consideration we can consider this protein as low-Mowse score case (because no protein in MASCOT search with known function has a high score). As we referred in introduction precisely the PMF of this type of protein may be of high interest. In Table 5 we give detailed information on the results of the MS analysis of the PMF of the new protein using MALDI-TOF technique and MASCOT search engine. Similar combination have been successfully used in the past to study Trichinella antigens [157] and possible Leishmania dynein proteins [83] . In this table we have shown only the 35 more interesting peptides matching with the MS of other proteins on the MASCOT search. Considering the high importance of phylogenetic analysis in the next section we propose a new algorithm for phylogenetic tree construction based MDTs and using this set of peptides as case study. After MS characterization of the PMF of the new protein, we decided to use the x k (L) values as inputs to construct a new type of phylogenetic tree. In so doing, we obtained firstly the MDTs for the more interesting 35 peptides (see Fig. 2 ). In Table 6 we have summarized the results of MD simulation of these peptides. In this table we reported the initial energy (E 0 ) and energy gradient (d 0 ) based on the starting structure constructed with standard parameters for a-helixes (bond distances, angles, and dihedral angles) set as default on the sequence editor of Hyperchem [159, 160] . We also reported the (E 1 ) and energy gradient (d 1 ) obtained after optimization of the structure with AMBER force field as well as the last energy value (E 100 ) obtained on the MC exploration of the MDT. Last, we report in Table 6 the ACCR values for the MDT of the 35 peptides; which are all lower than 0.5. In consequence, we can accept the MD results and use them to construct a phylogenetic tree. Using information about the distribution of monomers (amino acids or nucleotides) throughout the biopolymer chain have been the major tendency on phylogenetic analysis [185] . In the Introduction, we discussed the importance of new molecular phylogenetic approaches for polypeptide chains based on other sources of information such as MDTs. In Materials and method we outlined the possibility of construction of a phylogenetic tree for the PMFs of the new protein described above using Equation (9) . For it, we have calculated first the x k (L) for the 35 more relevant peptides (see Table 7 ) and later the peptide-peptide distance using Equation (9) . In Fig. 6 we illustrate that there are notable differences on the grouping of the 35 peptides if we use the traditional sequence similarity method or alternatively the present approach. This results show that in principle the distance D pq (x) between a peptide p and other q based on x k (L) values of MDTs codify information essentially different to sequence similarity. In this sense, the present molecular phylogenetic algorithm may become an alternative to traditional methods. Using the new x k (L) values as inputs we can obtain a classifier to discriminate DNA-Drug complexes of the two classes defined in materials and methods. The best model we found was: MD score ¼ À2:24  x 0 ðLÞ À 1:59  c 2 ðDÞ À 2:11 n ¼ 316 F ¼ 120:99 p < 0:01 r ¼ 70:33 (11) The output of the model, MD score, is a real value variable that scores the predicted goodness of fit for one MD trajectory. After the forward-stepwise selection of x k (L) and c k (L) values the model retained only two parameters x 0 (L) and c 2 (L). These values can be interpreted as the average or mean value of energy for states on the 2D space (nodes on the LN) and mean electronegativity value for atoms placed at distance k ¼ 2 on the drug, respectively. The model was trained with a training series and later validated with and external validation series. In training series the model correctly classifies 78 out of 80 (specificity ¼ 97.50%) optimal and 192 out of 236 (sensitivity ¼ 81.36%) no-optimal MDTs. In statistical prediction, the following three cross-validation methods are often used to examine a classifier for its effectiveness in practical application: independent dataset test, sub-sampling test, and jackknife test [186] . However, as elucidated by [8] and demonstrated in [188] , among the three cross-validation methods, the jackknife test is deemed the most objective that can always yield a unique result for a given benchmark dataset, and hence has been increasingly used by investigators to examine the accuracy of various predictors [189] [190] [191] [192] [193] [194] [195] [196] [197] . In the current study, for simplifying the demonstration, we just used the independent dada for validation. In external validation series the model correctly classifies 25 out of 26 (specificity ¼ 96.15%) optimal and 65 out of 80 (sensitivity ¼ 81.25%) no-optimal MDTs. This results represent total accuracy ¼ 85.44% and 84.91% in training and validation respectively. This result indicates significant goodness of fit for this linear classifier based on the results reported before for LDA-QSAR classifiers. See for instance the LDA models used to predict anti-Leishmania, and in general other antiparasitic or anti-microbial drugs or other classes of activities by Galvez, García-Domenech, Marrero-Ponce, Castillo-Garit, Casañ ola-Martin, and other authors [91, [198] [199] [200] [201] [202] [203] [204] [205] [206] [207] [208] . MDTs of biopolymers can be numerically described with a new class of invariants x k (L) representing spatial distribution of Mean-Energy values on a 2D Euclidean space. The procedure forces one MD trajectory to fold into a 2D Cartesian coordinates system using a step-by-step procedure driven by simple rules. The graphical representation of this space has the form of a lattice network symbolized as LN. We can use x k (L) values of LN to develop new algorithms to perform molecular Phylogenetic analysis of peptides based on MDTs instead of the sequence of the polypeptide. The new procedure combined with 2D Electrophoresis and MALDI-TOF MS can be applied to analyze Peptide Mass Fingerprints of new proteins. We can use the same idea to seek scoring functions for DNA-Drug Docking simulations. The work opens new perspective on the analysis and applications MD on biopolymers sciences. H. acknowledges financial support of Program Isidro Parga Pondal funded by Xunta de Galicia and European Social Fund (E.S.F.). Supplementary data associated with this article can be found in the online version, at doi:10.1016/j.polymer.2009.05.055. A new kind of science Bolas-Ferná ndez F Handbook of graphs and complex networks: from the genome to the internet Brooks 3rd CL The science of photomedicine Bolá s-Ferná ndez F Hypercube Inc. Hyperchem software. Release 7.5 for windows, Molecular Modeling System A handbook of computational chemistry Discriminant analysis for activity prediction Pattern recognition in chemistry Chemometric methods in molecular design Standardized multiple regression model. Applied linear statistical models The authors sincerely thank the kind attention and comments of two unknown referees as well as Prof. J.E. Mark; editor of Polymer for Computational & Theoretical Polymer Sciences. Gonzá lez-Díaz