Submitted 25 March 2016
Accepted 9 May 2017
Published 5 June 2017

Corresponding author
Mary J. O’Connell,
m.oconnell@leeds.ac.uk

Academic editor
James Procter

Additional Information and
Declarations can be found on
page 13

DOI 10.7717/peerj-cs.118

Copyright
2017 Webb et al.

Distributed under
Creative Commons CC-BY 4.0

OPEN ACCESS

VESPA: Very large-scale Evolutionary and
Selective Pressure Analyses
Andrew E. Webb1, Thomas A. Walsh1 and Mary J. O’Connell1,2

1 Bioinformatics and Molecular Evolution Group, School of Biotechnology, Faculty of Science and Health,
Dublin City University, Dublin, Ireland

2 Computational and Molecular Evolutionary Biology Group, School of Biology, Faculty of Biological Sciences,
The University of Leeds, Leeds, United Kingdom

ABSTRACT
Background. Large-scale molecular evolutionary analyses of protein coding sequences
requires a number of preparatory inter-related steps from finding gene families, to
generating alignments and phylogenetic trees and assessing selective pressure variation.
Each phase of these analyses can represent significant challenges, particularly when
working with entire proteomes (all protein coding sequences in a genome) from a large
number of species.
Methods. We present VESPA, software capable of automating a selective pressure
analysis using codeML in addition to the preparatory analyses and summary statistics.
VESPA is written in python and Perl and is designed to run within a UNIX environment.
Results. We have benchmarked VESPA and our results show that the method is
consistent, performs well on both large scale and smaller scale datasets, and produces
results in line with previously published datasets.
Discussion. Large-scale gene family identification, sequence alignment, and phylogeny
reconstruction are all important aspects of large-scale molecular evolutionary analyses.
VESPA provides flexible software for simplifying these processes along with down-
stream selective pressure variation analyses. The software automatically interprets
results from codeML and produces simplified summary files to assist the user in
better understanding the results. VESPA may be found at the following website:
http://www.mol-evol.org/VESPA.

Subjects Bioinformatics, Computational Biology
Keywords Selective pressure analysis, Protein molecular evolution, Large-scale comparative
genomics, Gene family evolution, Positive selection

INTRODUCTION
Estimating selective pressure variation across homologous protein-coding genes from
different species is typically done by assessing the ratio of dN/dS, i.e., the number of
non-synonymous substitutions per non-synonymous site (dN) as a function of the
number of synonymous substitutions per synonymous site (dS). The ratio of dN/dS is
commonly referred to as omega (ω), and is routinely used to assess selective pressure
variation or constraints across protein families or protein-interaction networks (Kim,
Korbel & Gerstein, 2007; Kosiol et al., 2008; Alvarez-Ponce, Aguade & Rozas, 2009). These
calculations of selective pressure variation are performed on alignments of protein coding
sequences (and not on other data types such as raw reads from NGS experiments). Codeml

How to cite this article Webb et al. (2017), VESPA: Very large-scale Evolutionary and Selective Pressure Analyses. PeerJ Comput. Sci.
3:e118; DOI 10.7717/peerj-cs.118

https://peerj.com
mailto:m.oconnell@leeds.ac.uk
https://peerj.com/academic-boards/editors/
https://peerj.com/academic-boards/editors/
http://dx.doi.org/10.7717/peerj-cs.118
http://creativecommons.org/licenses/by/4.0/
http://creativecommons.org/licenses/by/4.0/
http://www.mol-evol.org/VESPA
http://dx.doi.org/10.7717/peerj-cs.118


is part of the PAML package for the analyses of selective pressure variation in nucleotide
sequence data in a maximum likelihood framework (Yang, 2007). The models available
in PAML for assessing selective pressure variation can simultaneously compare variation
across sites and across lineages in the homologous protein coding gene dataset. In this way
the ‘‘foreground lineage’’ is compared to all other lineages in the dataset in an attempt
to determine lineage specific selective pressure variation. Some well-known examples of
selective pressure variation on foreground lineages include the identification of positive
selection in reproductive proteins that contribute to species divergence in mammals
(Swanson et al., 2001), and the identification of molecular signatures of positive selection
that govern protein functional divergence in a group of mammal enzymes (Loughran et
al., 2012). A number of software packages estimate selective pressure variation (Pond,
Frost & Muse, 2005; Yang, 2007; Delport et al., 2010). One of the most popular methods is
codeML from the PAML software package (Yang, 2007). The strength of this approach is
the application of flexible codon-based models capable of assessing variation in selective
pressures at two levels: (i) across sites in an alignment and (ii) across sites in a predefined,
or ‘‘foreground’’ lineage on a phylogenetic tree (Yang & Dos Reis, 2011).

Operating codeML requires a complex file structure to compute the parameters under
multiple nested models. Associated likelihood ratio tests (LRTs) must also be performed in
the identification of the model of best fit. These complexities are often compounded by the
size of study, which increasingly are genomic in scale (Liu et al., 2014; Keane et al., 2015;
Webb et al., 2015). Other approaches to streamline the process of applying codon-based
models of evolution to homologous sequences sets focus on site-specific models such as
POTION (Hongo et al., 2015).

To address these issues we have designed VESPA (Very large-scale Evolutionary and
Selective Pressure Analyses). VESPA automates selective pressure analyses and associated
prerequisite analyses and post-analysis summary statistics. VESPA can perform both
lineage-site specific and site-specific analyses whereas POTION presently performs
the site-specific analyses. Therefore, VESPA is unique in its capacity to perform the
complex set of tasks involved in assessing lineage specific selective pressure variation
across homologous gene families and across lineages. VESPA minimizes the majority of
data manipulation requirements for standard molecular evolutionary analyses and also
automatically implements and analyzes selective pressure variation analyses using codeML
(Yang, 2007). In addition, VESPA supplies an assessment of potential false positives and
produces summary files of the results that are easy to interpret. VESPA allows the user
to take advantage of the wealth of publically available genomic data from model and
non-model organisms to perform large-scale analyses of homology searching, alignment,
phylogeny reconstruction and selective pressure variation. All that VESPA requires is the
protein coding DNA sequences, which it will translate with the standard genetic code
and use to search and construct gene family alignments. This flexible toolkit can permit
large-scale analyses to be performed in an efficient manner and with fewer errors.

Webb et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.118 2/16

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.118


Table 1 Overview of the 5 Phases in the VESPA software package.

Phase Purpose Supported input type Supported file formats

1 Data preparation Protein coding DNA
sequencesa

FASTA

2 Homology searching BLAST/HMMER output
files

BLAST tabular, HMMER
standard

3 Alignment assessment and
phylogeny reconstruction

Multiple sequence
alignmentsa

FASTA, NEXUS, PHYLIP

4 Selective pressure analysis
preparation

Gene phylogenies (or
species phylogenies) with
corresponding Multiple
sequence alignmentsa

Trees: Newick, NEXUS;
Alignments: See above

5 Selective pressure analysis
assessment

Standard codeML output
files generated directly by
the software

VESPA formatted codeML
output

Notes.
aIndicates phases of VESPA that incorporate third-party programs.
The file formats required as input for each phase of VESPA are detailed. The numbering scheme is consistent with the num-
bering scheme for the phases as displayed in Figs. 1 and 2.

METHODS
VESPA helps automation by preparing input data files and processing results but program
executions are initiated by the user (e.g., via submission to an HPC queuing system.
VESPA has five major phases (Table 1 and Fig. 1), each of which is composed of a number
of functions. VESPA was developed as a toolkit of various independent functions within
each phase and the primary goal is to simplify the procedures involved in large-scale
selective pressure variation analyses. Each function either completes a specific phase of
the analysis (e.g., homologous gene identification) or facilitates/automates the use of
third-party software packages to complete more specialized procedures. The majority of
functions are written in Python 2.7 and are designed to operate on a UNIX command-line.
Functions are categorized as either basic or advanced, e.g., the function to identify single
gene orthologs is a basic function whereas confirming both SGOs and multi-gene families
(MGFs) is an advanced function (Fig. 2). This structure also provides users with a flexible
and adaptable framework for more specialist tasks. For in-depth description and format
requirements, please see the program manual, tutorials and documentation published on
the program website (http://www.mol-evol.org/VESPA).

Depending on the phase of VESPA, input is accepted from any program capable of
producing the supported file format(s) or a selected collection of third-party programs
(Table 1). For example, Phase 2 (the homology searching phase) currently parses the
output of BLAST (Altschul et al., 1997) or HMMER (Eddy, 1998), whereas the alignment
assessment and phylogeny reconstruction phase is limited only by file format requirements
(e.g., FASTA, NEXUS, PHYLIP). Functions in VESPA are invoked following the program
call (i.e., vespa.py) along with arguments to indicate the phase-relevant input data and
function-specific optional arguments. Depending on the function, optional arguments
enable users to modify parameter values (e.g., BLAST search thresholds, phylogenetic
reconstruction settings) or alter command-specific settings.

Webb et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.118 3/16

https://peerj.com
http://www.mol-evol.org/VESPA
http://dx.doi.org/10.7717/peerj-cs.118


Figure 1 Overview of the 5 Phases implemented in VESPA. The five phases of VESPA are listed from
‘‘Data Preparation’’ to ‘‘Selective Pressure Analysis Assessment’’. Underneath each is a grey box enclosing
some representative commands from that phase. Each phase concludes with a black box indicating the use
of a third-party program to perform the necessary task (e.g., sequence alignment or phylogenetic recon-
struction). The output of the first four phases is then used as the input of the next phase. The final phase is
written in Perl and concludes with the creation of summary files that contain all the relevant information
from the selective pressure analyses.

Webb et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.118 4/16

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.118


Figure 2 Overview of the options available in the VESPA package. An overview of the basic and ad-
vanced analysis options at each of the five phases of VESPA. The functions of each phase are shown as
white boxes, and are invoked in the order shown (Note: that not all functions are shown here; a complete
set of VESPA functions are available in the manual). Within each phase the available alternatives for pro-
cessing the data are given on the left and right hand columns. These only vary for Phase 3 and 4. The left
most column may represent the processing of single gene orthologs that you wish to impose the species
tree onto. If this is the case then VESPA will allow you to generate the phylogenies from the species tree (as
shown on the left most side of (C)). (continued on next page...)

Webb et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.118 5/16

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.118


Figure 2 (...continued)
However, you may wish to generate gene trees directly from the data for either multigene families or for
single gene families of uncertain orthology (this option is shown on the right most column in (C) and in-
volves selecting the substitution model of best fit and reconstructing the phylogeny). In addition to the
functions, the input and output of each phase are shown in dark grey boxes and if a third-party program
is required to analyze the output of the phase, the program is specified below the phase in a light grey box.
For three of the five phases (data preparation, homology searching, and selective pressure analysis assess-
ment) the functions invoked in both the basic and advanced options are identical. The primary difference
between the analyses (basic/advanced) is found in the alignment assessment and phylogeny reconstruction
phase. The advanced option uses ProtTest (Darriba et al., 2011) for substitution model selection and Mr-
Bayes (Ronquist & Huelsenbeck, 2003) for phylogenetic reconstruction. Beyond this major difference, the
selective pressure analysis preparation simply requires a function to import the output of MrBayes.

Functions in VESPA complete by producing the relevant output files without modifying
the original input files. While this design results in the generation of a number of
intermediate files (especially in the later stages of selection analysis), it enables users
to easily keep track of all data modifications. Each phase of VESPA’s analysis produces
the necessary data files for conducting a specialized analysis using third-party software
(Fig. 1). Homology searches in particular are not fully automated by VESPA for two reasons:
(i) they are best run as individual serial tasks on large high-end computing clusters, or (ii)
the submission processes differ across compute clusters. However, VESPA can generate the
BLAST formatted database for subsequent homology searches. All VESPA commands are
issued on the command line and are readily executable via a cluster scheduling system.

RESULTS
As detailed above, VESPA incorporates two analyses, a basic analysis for analyzing SGOs
and an advanced analysis for analyzing both SGOs and MGFs (Fig. 2). Here we provide
an example of an application of the basic analysis using ten genes from eleven species as a
small test dataset.

As seen in Fig. 2, the process begins with the user supplying transcript data for the
data preparation phase. The first phase begins with the clean function, a basic quality
control (QC) filtering step, followed by translate, to translate the filtered transcripts.
VESPA then proceeds to the make_database function to create a sequence database for
homology searching with either BLAST (Altschul et al., 1997) or HMMER3 (Eddy, 1998).
Upon completion of homology searching, the function reciprocal_groups is used to identify
proteins that share reciprocal similarity. Then, files containing these families of sequences
are produced. This function is highly configurable by optional arguments so that users can
evaluate various different similarity scenarios (i.e., different e-value cutoffs) with only a
single output file. The produced sequences files are then aligned using any multiple sequence
alignment (MSA) method that can produce a supported file format (e.g., programs such
as MUSCLE (Edgar, 2004) and PRANK (Loytynoja & Goldman, 2005) are supported). It
is advisable to explore a variety of MSA methods for every gene family (Muller et al.,
2010), and VESPA facilitates this the user to compare these different approaches. The
metal_compare function (within the Alignment Assessment and Phylogeny Reconstruction
phase) in VESPA allows alignment approaches to be compared and a single MSA of

Webb et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.118 6/16

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.118


best fit is retained for each gene family (i.e., Gene Family A may have an MSA from
MUSCLE while Gene Family B may have an MSA from PRANK). MSAs for each SGO
family can then used in combination with the user-defined species phylogeny to create
gene phylogenies using the function infer_genetrees or gene trees can be generated directly
from the MSAs The MSAs and gene phylogenies are then used for the selective pressure
analysis preparation phase. The function create_branch can be used to specifically label
internal nodes as ancestral lineages that the user may wish to explore. The MSAs and
gene phylogenies are then used by the function setup_codeml to automatically create the
complex codeML file structure and a task file for automating codeML (Yang, 2007). Upon
completion of codeML the codeml_reader function is used to automate the interpretation
of the results and producing summary files of the results. VESPA creates a number of
output files for each homologous gene family detailing the results of the codeML analysis.
There are two primary types of output files for each gene family tested: (1) a single csv
formatted summary text file (shown in Table 2) for one gene family, and (2), a multiple
sequence alignment for each model tested detailing the sites (protein/codon) proposed to
be under positive selection (this is also provided in html format so it can be viewed with
colour coding for ease of interpretation).

To compare the results of VESPA to other pipelines and predictions of positive selection,
we used a dataset of 18 gene families from the Selectome database (Moretti et al., 2014).
A tarball of the data and results of the VESPA analysis of 18 gene families (i.e., input
sequences, alignments, trees, codeml output, VESPA summary at each phase) have been
provided in File S1. Selectome is a publicly available database of genes under positive
selection. It was chosen to provide this benchmark dataset as it permits direct access to all
relevant files used in its calculations, and it also uses the codon based models of evolution
implemented in codeML and integrated in VESPA facilitating a direct comparison of
results.

We carried out two tests with the dataset from Selectome (Release 6) (Moretti et al.,
2014). Firstly we wished to assess if the way in which VESPA automatically sets up all
codon models, labeling of foreground lineages and LRTs produced comparable results with
those from the Selectome pipeline. To this end, we ran VESPA using the masked DNA
alignments and gene trees for each of the 18 homologous gene families from the Selectome
database. VESPA produced identical results for these input alignments. This demonstrates
that VESPA’s automation of the process is reliable and robust.

One challenge in performing these analyses is that different alignment methods applied to
the same gene families can produce different results. VESPA provides the ‘‘metal_compare’’
function, which allows alignments for a gene family to be compared. To highlight its utility,
we performed an additional test on this dataset of 18 homologous gene families from
Selectome. We used the unmasked sequences and generated a set of alternative alignments
through MUSCLE (Edgar, 2004) and PRANK (Loytynoja & Goldman, 2005) and using the
metal_compare function of VESPA we identified the best (i.e., most statistically significant)
alignment for each gene family. The VESPA functions ‘setup_codeml’ and ‘codeml_reader’
were then used to automate the codeML set up and analysis. The VESPA pipeline was able
to replicate the lineages identified as under positive selection in the Selectome database.

Webb et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.118 7/16

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.118#supp-2
http://dx.doi.org/10.7717/peerj-cs.118


Table 2 Sample of a summary output file created by ‘codeml_reader’ in Phase 5 of the VESPA package.

Model Tree Model
type

p W
(t =0)

lnL LRT
result

Parameter
estimates

Positive
selection

Positively
selected
sites in 8×45
alignment
(P(w > 1) > 0.5)

m0 Sample_MSA Homogeneous 1 2 −572.969394 N/A w =0.61924 No

m1Neutral Sample_MSA Site-specific 1 2 −568.572319 N/A p0=0.35065
p1=0.64935
w0=0.08403
w1=1.00000

Not
allowed

m2Selection Sample_MSA Site-specific 2 2 −568.521978 m1Neutral p0=0.37699
p1=0.00000
p2=0.62301
w0=0.10174
w1=1.00000
w2=1.11245

No

m3Discrtk2 Sample_MSA Site-specific 3 2 −568.521978 m3Discrtk2 p0=0.37699
p1=0.62301
w0=0.10174
w1=1.11245

Yes Alignment (28
NEB sites): 2 3 4 5
6 8 12 13 15 17 19
20 22 23 24 25 26
27 30 31 32 34 35
39 40 41 43 44

m3Discrtk3 Sample_MSA Site-specific 5 2 −568.521978 m3Discrtk2 p0=0.37699
p1=0.53897
p2=0.08404
w0=0.10174
w1=1.11245
w2=1.11246

No

m7 Sample_MSA Site-specific 2 2 −568.764172 N/A p=0.22135
q=0.11441

Not
allowed

m8 Sample_MSA Site-specific 4 2 −568.526486 m7, m8 p=11.60971
p0=0.37916
p1=0.62084
q=99.00000
w =1.11431

No

m8a Sample_MSA Site-specific 4 1 −568.578069 N/A p=9.38528
p0=0.35203
p1=0.64797
q=99.00000
w =1.00000

Not
allowed

(continued on next page)

W
ebb

etal.
(2017),P

eerJ
C

om
put.S

ci.,D
O

I10.7717/peerj-cs.118
8/16

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.118


Table 2 (continued)

Model Tree Model
type

p W
(t =0)

lnL LRT
result

Parameter
estimates

Positive
selection

Positively
selected
sites in 8×45
alignment
(P(w > 1) > 0.5)

modelA Sample_MSA_Primates Lineage-site 3 2 −568.572319 m1Neutral,
modelAnll

p0=0.35065
p1=0.64935
p2=0.00000
p3=0.00000
w0=0.08403
w1=1.00000
w2=1.00000

No

modelAnull Sample_MSA_Primates Lineage-site 3 1 −568.572319 N/A p0=0.35065
p1=0.64935
p2=0.00000
p3=0.00000
w0=0.08403
w1=1.00000
w2=1.00000

modelA Sample_MSA_Chimp Lineage-site 3 2 −557.052657 modelA p0=0.33452
p1=0.59186
p2=0.02659
p3=0.04704
w0=0.1099
w1=1.00000
w2=999.000

Yes Alignment
(3 BEB sites): 24 25
32

modelAnull Sample_MSA_Chimp Lineage-site 3 1 −568.572319 N/A p0=0.35065
p1=0.64935
p2=0.00000
p3=0.00000
w0=0.08403
w1=1.0000
w2=1.00000

Not
allowed

modelA Sample_MSA_Human Lineage-site 3 2 −568.572319 m1Neutral,
modelAnull

p0=0.35065
p1=0.64935
p2=0.00000
p3=0.00000
w0=0.08403
w1=1.0000
w2=1.00000

No

(continued on next page)

W
ebb

etal.
(2017),P

eerJ
C

om
put.S

ci.,D
O

I10.7717/peerj-cs.118
9/16

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.118


Table 2 (continued)

Model Tree Model
type

p W
(t =0)

lnL LRT
result

Parameter
estimates

Positive
selection

Positively
selected
sites in 8×45
alignment
(P(w > 1) > 0.5)

modelAnull Sample_MSA_Human Lineage-site 3 1 −568.572319 N/A p0=0.35065
p1=0.64935
p2=0.00000
p3=0.00000
w0=0.08403
w1=1.0000
w2=1.00000

No

Notes.
The following information is provided for each model tested: the lineage (internal or terminal branches) tested as foreground; the type of model (i.e. site-specific or branch-specific) being tested; number
of free parameters in the ÏĽ distribution that are estimate by codeML, the initial ÏĽ value used by codeML (each run within VESPA has multiple starting values to minimise the risk of reporting from a lo-
cal minimum on the likelihood plane); the resulting log likelihood (lnL) of the analysis; the resulting model of the likelihood ratio test (LRT); the parameter estimates of codeML; if positive selection was
detected (yes/no); and the alignment coordinates for any positively selected sites. NEB, NaÃŕve Empirical Bayes estimate and BEB, Bayes Empirical Bayes estimate.

W
ebb

etal.
(2017),P

eerJ
C

om
put.S

ci.,D
O

I10.7717/peerj-cs.118
10/16

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.118


Table 3 Comparison of the results from 18 gene families from the Selectome database analysed in VESPA.

Original masked alignment Alternative alignment

Positive selection (ModelA) Positively selected sites Postive selection (ModelA) Positively selected sites

ATG16L1 Yes (HPGPoNM) Matched Yes (HPGPoNM , HP*, and HPG*) Matched
ATP5SL Yes (HP) Matched Yes (HP) Matched
AZGP1 Yes (HPGPoNM) Matched Yes (HPGPoNM) Matched 9/10 (107)
C12orf43 Yes (HPG) Matched Yes (HPG) Matched
CACNA1I Yes (HP) Matched Yes (HP) Matched 2/3 (1,095)
CASP1 Yes (HPG) Matched Yes (HPG) Matched
CD151 Yes (HPG) Matched Yes (HPG) Matched
COBLL1 Yes (HPGPoN) Matched Yes (HPGPoN) Matched
HUS1 Yes (HPGPoNM) Matched Yes (HPGPoNM) Matched
IDH3B Yes (HP) Matched Yes (HP) Matched
IFIT2 Yes (HPGPoNM) Matched Yes (HPGPoNM) Matched
INTS7 Yes (HPGPoNM) Matched Yes (HPGPoNM) Matched
ODC1 Yes (HPGPoN) Matched Yes (HPGPoN) Matched
PASK Yes (HP) Matched Yes (HP) Matched 2/3 (434)
RRP8 Yes (HPGPoN) Matched Yes (HPGPoN) Matched
RTP2 Yes (HP) Matched Yes (HP) Matched
SLC8A1 Yes (PG) Matched Yes (PG) Matched
TRIM40 Yes (HPGPo) Matched Yes (HPGPo) Matched 2/4 (140, 256)

Notes.
*False positives.
The 18 homologous families chosen from the Selectome database are given their HUGO identifier on the left. The results of analysis in VESPA as compared to Selectome re-
sults using the same masked alignments are shown in columns 2 and 3. The results from VESPA using the alternative alignment method as compared to the original alignments
is shown in columns 4 and 5. For both alignments (original and alternative) it is indicated if ModelA positive selection is identified in the same lineages, and if the sites identi-
fied as positively selected match. Using the alternative alignments, four cases where the sites identified as positively selected did not completely match the position in the original
alignment are indicated in parenthesis. The extant lineages with evidence of positive selection throughout are shown in parentheses and are abbreviated as follows: Human (H),
Chimp (P), Gorilla (G), Orangutan (Po), Gibbon (N) and Macaque (M). Ancestral nodes are denoted as a combination of the abbreviations for the extant lineages that the node
includes, e.g. the ancestral node joining Human (H) and Chimp (P) is denoted as HP.

However, the results presented in Table 3 include small differences in the number and
position of sites identified as positively selected. The ATP5SL alternative alignment was
found to have evidence of positive selection on two additional branches as compared
to the original result, however, closer inspection of the alternative alignments found the
additional branches to be false positives due to a poorly aligned segment of the MSA. Other
slight variations in results between the original and alternative alignments included three
instances where gene families (C12orf43, CACNA1I, and PASK) had a difference of one
positively selected site and one instance (TRIM40) of differences in two positively selected
sites reported in the original alignment that was not reported in the alternative alignment
for the same family. Finally, a single gene (SLC8A1) contained additional sites under
positive selection following VESPA analysis using the alternative alignments, it should be
noted that these sites are within a poorly conserved span of the protein. The remaining 13
genes were found to replicate the positively selected sites reported in Selectome.

To demonstrate the application of VESPA to very large datasets we have assembled a
novel dataset of 7,918 homologous gene families (each containing ≥ 8 sequences) from
the ENSEMBL 2016 genome database (release 82) (Yates et al., 2016). Using the VESPA

Webb et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.118 11/16

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.118


‘‘clean_ensembl’’ command we: (1) restricted the protein coding sequences per genome
to the longest transcript for each gene, and (2) logged and discarded sequences containing
internal stop codons or incomplete codons. Multiple sequence alignments were made
for all gene families identified and phylogenies for each homologous gene families were
inferred by VESPA from the topology of the Ensembl Compara species tree demonstrating
the flexibility of the VESPA package (Phase 3). Finally, the selective pressure analyses were
carried out in VESPA Phase 5 with human and then mouse labeled as the foreground
lineages of interest. Of the 7,918 genes analyzed by VESPA 1998 showed evidence of
site-specific positive selection (model ‘‘m8’’), 223 genes showed evidence of mouse-specific
positive selection (‘‘model A’’), and finally 80 showed evidence of human-specific positive
selection (‘‘model A’’); File S1 contains the full set of results.

For the dataset of 7,918 homologous gene families the majority of VESPA functions
completed in under four minutes (on the following system: Intel Xeon CPU E5420
(2.50 GHz), 16 GB RAM, using Ubuntu OS 16.04). The exceptions were the ’codeml_setup’
function which took approximately 45 min for this dataset, this function is slightly slower
as it is creating the complex directory/file structure for codeML. The second exception was
the ’codeml reader’ function which took approximately 6 h (this function is to analyze the
large number of files created by codeML and produce the output and summary files). The
codeML component of the analyses took 36,180 CPU hours in total to complete. (Note:
these time estimates are not inclusive of Phase 2—homology search).

DISCUSSION
One of the primary goals of VESPA is to simplify and streamline basic comparative
genomic procedures such as filtering poor quality protein coding sequences and generating
the most appropriate alignment for each gene family. VESPA also simplifies and streamlines
codeML-based selective pressure analyses. From our analysis of 7,918 homologous gene
families we found that the majority of tasks could be completed within minutes using
VESPA. However, the codeML-related functions for creating the input file structure and
examining the output files takes considerably longer to complete. As these functions are
an essential aspect of the pipeline, decreasing their execution time will be a primary goal in
future updates to VESPA. Two possible approaches that will be explored are: (i) increasing
the overall efficiency of the functions and (ii) developing a version of VESPA capable
running these scripts (and possibly others) in parallel on multi-core processors.

The modular nature of VESPA enables the pipeline (or specific functions) to be used in
conjunction with existing workflow systems. The only requirements would be having the
necessary data (MSAs, protein-coding transcripts, etc.) in a supported format. For example,
VESPA could be used to perform a selective pressure analysis on protein-coding transcripts
obtained from RNA-Seq. Also, it is possible to skip specific stages of the VESPA pipeline
for a preferred approach/software package, e.g., it is possible to use an alternative approach
for phylogenetic reconstruction and employ the resulting tree in a VESPA analysis. It
should also be possible to integrate the majority of VESPA’s functions (with the exception
of the built-in tree pruning function) into workflow systems such as GALAXY (Afgan et

Webb et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.118 12/16

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.118#supp-2
http://dx.doi.org/10.7717/peerj-cs.118


al., 2016). This would allow VESPA to operate in conjunction with the scripts and tools
already available on GALAXY, enabling greater freedom for the user. This integration will
be implemented in future releases of VESPA.

We also employed VESPA in the analysis of 18 gene families from the Selectome
database (Moretti et al., 2014). Our initial comparison used both the alignment (masked)
and tree provided by Selectome. VESPA was able to confirm the findings reported within
the database. Secondly, we re-aligned the 18 gene families using two methods (MUSCLE
and PRANK) and then again performed the selective pressure analysis using VESPA.
The analysis of these alternative alignments revealed minor differences in the reported
positively selected sites of five genes (C12orf43, CACNA1I, PASK, SLC8A1, and TRIM40).
These differences illustrate that the input alignment may have an impact on the genes
and sites identified as positively selected as in Blackburne & Whelan, (2013). To avoid
biased results due to choice of alignment method and we highly recommend the use of the
‘‘metal_compare’’ function or programs such as AQUA (Muller et al., 2010) to select the
best alignment for each gene family.

CONCLUSION
VESPA provides a flexible software package designed to simplify large-scale comparative
genomic analyses and specifically selective pressure variation analysis implemented in
codeML (Yang, 2007). VESPA automates the entire comparative genomic process from
data quality checks and homology searching to phylogeny reconstruction and selective
pressure analyses, and it produces simple summary files for the user. VESPA offers users
various functions that automate many of the required prerequisite analyses and removes
error-prone data manipulation steps. Lastly, it is important to note that the processes
implemented in the 5 Phases of VESPA facilitates those working on de novo sequence data
or non-model organisms to perform large-scale comparative genomic analyses without
having pre-processed gene families. All that is required by VESPA is that the protein coding
DNA sequences are available.

ACKNOWLEDGEMENTS
We would also like to thank Louisse Mirabueno (funded by the Wellcome Trust Vacation
Scholarship programme) and other members of the community for their help in trouble-
shooting, testing and providing feedback on the VESPA software package and associated
manual and tutorials.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
This work was undertaken on MARC1, part of the High Performance Computing facilities
at the University of Leeds. The research was supported by Science Foundation Ireland
Research Frontiers Programme grant (SFI RFP EOB2673) to MJO’C, and DCU Pierse
Trust Award and SoBT awards (to TAW and MJO’C) and the University of Leeds 250

Webb et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.118 13/16

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.118


Great Minds Academic Fellowship to MJO’C. The funders had no role in study design,
data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:
Science Foundation Ireland Research Frontiers Programme: SFI RFP EOB2673.

Competing Interests
The authors have no competing interests.

Author Contributions
• Andrew E. Webb performed the experiments, analyzed the data, contributed
reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables,
performed the computation work, reviewed drafts of the paper.
• Thomas A. Walsh performed the experiments, contributed reagents/materials/analysis
tools, wrote the paper, performed the computation work, reviewed drafts of the paper.
• Mary J. O’Connell conceived and designed the experiments, analyzed the data,
contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or
tables, performed the computation work, reviewed drafts of the paper.

Data Availability
The following information was supplied regarding data availability:

GitHub: https://github.com/aewebb80/VESPA.
Supplementary File 1 (GitHub):

https://github.com/aewebb80/VESPA/blob/master/SupplementaryFile1.tar.bz2.

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/
peerj-cs.118#supplemental-information.

REFERENCES
Afgan E, Baker D, van den Beek M, Blankenberg D, Bouvier D, Čech M, Chilton J,

Clements D, Coraor N, Eberhard C, Grüning B, Guerler A, Hillman-Jackson J, Von
Kuster G, Rasche E, Soranzo N, Turaga N, Taylor J, Nekrutenko A, Goecks J. 2016.
The Galaxy platform for accessible, reproducible and collaborative biomedical analy-
ses: 2016 update. Nucleic Acids Research 44(W1):W3–W10 DOI 10.1093/nar/gkw343.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997.
Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Research 25(17):3389–3402 DOI 10.1093/nar/25.17.3389.

Alvarez-Ponce D, Aguade M, Rozas J. 2009. Network-level molecular evolutionary
analysis of the insulin/TOR signal transduction pathway across 12 Drosophila
genomes. Genome Research 19(2):234–242 DOI 10.1101/gr.084038.108.

Webb et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.118 14/16

https://peerj.com
https://github.com/aewebb80/VESPA
https://github.com/aewebb80/VESPA/blob/master/SupplementaryFile1.tar.bz2
http://dx.doi.org/10.7717/peerj-cs.118#supplemental-information
http://dx.doi.org/10.7717/peerj-cs.118#supplemental-information
http://dx.doi.org/10.1093/nar/gkw343
http://dx.doi.org/10.1093/nar/25.17.3389
http://dx.doi.org/10.1101/gr.084038.108
http://dx.doi.org/10.7717/peerj-cs.118


Blackburne BP, Whelan S. 2013. Class of multiple sequence alignment algorithm
affects genomic analysis. Molecular Biology and Evolution 30(3):642–653
DOI 10.1093/molbev/mss256.

Darriba D, Taboada GL, Doallo R, Posada D. 2011. ProtTest 3: fast selection
of best-fit models of protein evolution. Bioinformatics 27(8):1164–1165
DOI 10.1093/bioinformatics/btr088.

Delport W, Poon AF, Frost SD, Kosakovsky Pond SL. 2010. Datamonkey 2010:
a suite of phylogenetic analysis tools for evolutionary biology. Bioinformatics
26(19):2455–2457 DOI 10.1093/bioinformatics/btq429.

Eddy SR. 1998. Profile hidden Markov models. Bioinformatics 14(9):755–763
DOI 10.1093/bioinformatics/14.9.755.

Edgar RC. 2004. MUSCLE: a multiple sequence alignment method with reduced time
and space complexity. BMC Bioinformatics 5:113 DOI 10.1186/1471-2105-5-113.

Hongo JA, De Castro GM, Cintra LC, Zerlotini A, Lobo FP. 2015. POTION: an end-
to-end pipeline for positive Darwinian selection detection in genome-scale data
through phylogenetic comparison of protein-coding genes. BMC Genomics 16:567
DOI 10.1186/s12864-015-1765-0.

Keane M, Semeiks J, Webb AE, Li YI, Quesada V, Craig T, Madsen LB, Van Dam
S, Brawand D, Marques PI, Michalak P, Kang L, Bhak J, Yim HS, Grishin NV,
Nielsen NH, Heide-Jorgensen MP, Oziolor EM, Matson CW, Church GM, Stuart
GW, Patton JC, George JC, Suydam R, Larsen K, Lopez-Otin C, O’Connell
MJ, Bickham JW, Thomsen B, De Magalhaes JP. 2015. Insights into the evolu-
tion of longevity from the bowhead whale genome. Cell Reports 10(1):112–122
DOI 10.1016/j.celrep.2014.12.008.

Kim PM, Korbel JO, Gerstein MB. 2007. Positive selection at the protein network
periphery: evaluation in terms of structural constraints and cellular context.
Proceedings of the National Academy of Sciences of the United States of America
104(51):20274–20279 DOI 10.1073/pnas.0710183104.

Kosiol C, Vinar T, Da Fonseca RR, Hubisz MJ, Bustamante CD, Nielsen R, Siepel A.
2008. Patterns of positive selection in six Mammalian genomes. PLOS Genetics
4(8):e1000144 DOI 10.1371/journal.pgen.1000144.

Liu S, Lorenzen ED, Fumagalli M, Li B, Harris K, Xiong Z, Zhou L, Korneliussen TS,
Somel M, Babbitt C, Wray G, Li J, He W, Wang Z, Fu W, Xiang X, Morgan CC,
Doherty A, O’Connell MJ, McInerney JO, Born EW, Dalen L, Dietz R, Orlando
L, Sonne C, Zhang G, Nielsen R, Willerslev E, Wang J. 2014. Population genomics
reveal recent speciation and rapid evolutionary adaptation in polar bears. Cell
157(4):785–794 DOI 10.1016/j.cell.2014.03.054.

Loughran NB, Hinde S, McCormick-Hill S, Leidal KG, Bloomberg S, Loughran
ST, O’Connor B, O’Fagain C, Nauseef WA, O’Connell MJ. 2012. Functional
consequence of positive selection revealed through rational mutagenesis of
human myeloperoxidase. Molecular Biology and Evolution 29(8):2039–2046
DOI 10.1093/molbev/mss073.

Webb et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.118 15/16

https://peerj.com
http://dx.doi.org/10.1093/molbev/mss256
http://dx.doi.org/10.1093/bioinformatics/btr088
http://dx.doi.org/10.1093/bioinformatics/btq429
http://dx.doi.org/10.1093/bioinformatics/14.9.755
http://dx.doi.org/10.1186/1471-2105-5-113
http://dx.doi.org/10.1186/s12864-015-1765-0
http://dx.doi.org/10.1016/j.celrep.2014.12.008
http://dx.doi.org/10.1073/pnas.0710183104
http://dx.doi.org/10.1371/journal.pgen.1000144
http://dx.doi.org/10.1016/j.cell.2014.03.054
http://dx.doi.org/10.1093/molbev/mss073
http://dx.doi.org/10.7717/peerj-cs.118


Loytynoja A, Goldman N. 2005. An algorithm for progressive multiple alignment of
sequences with insertions. Proceedings of the National Academy of Sciences of the
United States of America 102(30):10557–10562 DOI 10.1073/pnas.0409137102.

Moretti S, Laurenczy B, Gharib WH, Castella B, Kuzniar A, Schabauer H, Studer RA,
Valle M, Salamin N, Stockinger H, Robinson-Rechavi M. 2014. Selectome update:
quality control and computational improvements to a database of positive selection.
Nucleic Acids Research 42(Database issue):D917–D921 DOI 10.1093/nar/gkt1065.

Muller J, Creevey CJ, Thompson JD, Arendt D, Bork P. 2010. AQUA: automated quality
improvement for multiple sequence alignments. Bioinformatics 26(2):263–265
DOI 10.1093/bioinformatics/btp651.

Pond SL, Frost SD, Muse SV. 2005. HyPhy: hypothesis testing using phylogenies.
Bioinformatics 21(5):676–679 DOI 10.1093/bioinformatics/bti079.

Ronquist F, Huelsenbeck JP. 2003. MrBayes 3: bayesian phylogenetic inference under
mixed models. Bioinformatics 19(12):1572–1574. DOI 10.1093/bioinformatics/btg180.

Swanson WJ, Yang Z, Wolfner MF, Aquadro CF. 2001. Positive Darwinian selection
drives the evolution of several female reproductive proteins in mammals. Proceedings
of the National Academy of Sciences of the United States of America 98(5):2509–2514
DOI 10.1073/pnas.051605998.

Webb AE, Gerek ZN, Morgan CC, Walsh TA, Loscher CE, Edwards SV, O’Connell MJ.
2015. Adaptive evolution as a predictor of species-specific innate immune response.
Molecular Biology and Evolution 32(7):1717–1729 DOI 10.1093/molbev/msv051.

Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Molecular Biology
and Evolution 24(8):1586–1591 DOI 10.1093/molbev/msm088.

Yang Z, Dos Reis M. 2011. Statistical properties of the branch-site test of positive selec-
tion. Molecular Biology and Evolution 28(3):1217–1228 DOI 10.1093/molbev/msq303.

Yates A, Akanni W, Amode MR, Barrell D, Billis K, Carvalho-Silva D, Cummins C,
Clapham P, Fitzgerald S, Gil L, Giron CG, Gordon L, Hourlier T, Hunt SE, Janacek
SH, Johnson N, Juettemann T, Keenan S, Lavidas I, Martin FJ, Maurel T, McLaren
W, Murphy DN, Nag R, Nuhn M, Parker A, Patricio M, Pignatelli M, Rahtz M, Riat
HS, Sheppard D, Taylor K, Thormann A, Vullo A, Wilder SP, Zadissa A, Birney E,
Harrow J, Muffato M, Perry E, Ruffier M, Spudich G, Trevanion SJ, Cunningham
F, Aken BL, Zerbino DR, Flicek P. 2016. Ensembl 2016. Nucleic Acids Research
44(D1):D710–D716 DOI 10.1093/nar/gkv1157.

Webb et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.118 16/16

https://peerj.com
http://dx.doi.org/10.1073/pnas.0409137102
http://dx.doi.org/10.1093/nar/gkt1065
http://dx.doi.org/10.1093/bioinformatics/btp651
http://dx.doi.org/10.1093/bioinformatics/bti079
http://dx.doi.org/10.1093/bioinformatics/btg180
http://dx.doi.org/10.1073/pnas.051605998
http://dx.doi.org/10.1093/molbev/msv051
http://dx.doi.org/10.1093/molbev/msm088
http://dx.doi.org/10.1093/molbev/msq303
http://dx.doi.org/10.1093/nar/gkv1157
http://dx.doi.org/10.7717/peerj-cs.118