Submitted 12 May 2019. Accepted 15 April 2020. Published 1 June 2020.
Corresponding author: Binti Solihah, binti.solihah@mail.ugm.ac.id
Academic editor: Sebastian Ventura
DOI 10.7717/peerj-cs.275
Copyright 2020 Solihah et al. Distributed under Creative Commons CC-BY 4.0. OPEN ACCESS

Enhancement of conformational B-cell epitope prediction using CluSMOTE

Binti Solihah (1,2), Azhari Azhari (1) and Aina Musdholifah (1)
1 Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
2 Department of Informatics Engineering, Universitas Trisakti, Grogol, Jakarta Barat, Indonesia

ABSTRACT
Background. A conformational B-cell epitope is one of the main components of vaccine design. It consists of separate segments of the antigen sequence that are spatially close in the antigen chain. The availability of Ag-Ab complex data in the Protein Data Bank allows the development of predictive methods. Several epitope prediction models have been developed, including learning-based methods; however, their performance is still not optimal. The main problem in learning-based prediction models is class imbalance.
Methods. This study proposes CluSMOTE, a combination of a cluster-based undersampling method and the Synthetic Minority Oversampling Technique (SMOTE). The approach generates additional sample data to ensure that the conformational epitope dataset is balanced. The Hierarchical DBSCAN algorithm is performed to identify clusters in the majority class. Some randomly selected data are taken from each cluster, considering the oversampling degree, and combined with the minority class data. The balanced data are used as the training dataset to develop a conformational epitope prediction model.
Furthermore, two binary classification methods, Support Vector Machine and Decision Tree, are separately used to develop prediction models and to evaluate the performance of CluSMOTE in predicting conformational B-cell epitopes. The experiment focuses on determining the best parameters for optimal CluSMOTE. Two independent datasets are used to compare the proposed prediction model with state-of-the-art methods; the first and the second datasets represent general protein and glycoprotein antigens, respectively.
Result. The experimental results show that CluSMOTE Decision Tree outperformed the Support Vector Machine in terms of AUC and Gmean as performance measurements. The mean AUC of CluSMOTE Decision Tree on the Kringelum and the SEPPA 3 test sets is 0.83 and 0.766, respectively. This shows that CluSMOTE Decision Tree is better than other methods on the general protein antigen, though comparable with SEPPA 3 on the glycoprotein antigen.

Subjects: Bioinformatics, Data Mining and Machine Learning
Keywords: Cluster-based undersampling, SMOTE, Class imbalance, Hybrid sampling, Hierarchical DBSCAN, Vaccine design

How to cite this article: Solihah B, Azhari A, Musdholifah A. 2020. Enhancement of conformational B-cell epitope prediction using CluSMOTE. PeerJ Comput. Sci. 6:e275 http://doi.org/10.7717/peerj-cs.275

INTRODUCTION
A B-cell epitope is among the main components of peptide-based vaccines (Andersen, Nielsen & Lund, 2006; Zhang et al., 2011; Ren et al., 2015). It can be utilized in immunodetection or immunotherapy to induce an immune response (Rubinstein et al., 2008).
Many B-cell epitopes are conformational and originate from separate segments of an antigen sequence, forming a spatial neighborhood in the antigen-antibody (Ag–Ab) complex. Identifying epitopes through experiments is tedious and expensive work that carries a high risk of failure. Current progress in bioinformatics makes it possible to create vaccine designs through 3D visualization of protein antigens. Many characteristics, including composition, cooperativeness, hydrophobicity, and secondary structure, are considered in identifying potential epitope substances (Kringelum et al., 2013). Since no dominant characteristic helps experts easily distinguish epitopes from other parts of the antigen, the risk of failure is quite high. The availability of 3D structures of Ag–Ab complexes in the public domain, together with computational resources, eases the development of predictive models using various methods, including structure-based and sequence-based approaches. However, conformational epitope prediction is still challenging. The structure-based approach can be divided into three categories: dominant-characteristic-based, graph-based, and learning-based. There are several characteristic-based approaches, including (1) CEP, which uses solvent-accessibility properties; (2) Discotope, which uses both solvent-accessibility-based properties and the epitope log odds ratio of amino acids; (3) PEPITO, which adds half-sphere exposure (HSE) to the amino acid log odds ratio of Discotope; (4) Discotope 2.0, an improved version of Discotope that defines the log odds ratios in spatial contexts and adds HSE as a feature; and (5) SEPPA, which utilizes exposed and adjacent residue characteristics to form a triangle unit patch (Kulkarni-kale, Bhosle & Kolaskar, 2005; Andersen, Nielsen & Lund, 2006; Kringelum et al., 2012; Sun et al., 2009).
The dominant-characteristic-based approach is limited by the number of features and the linear relationships between them. The graph-based method is yet another important approach, although only two examples, from the same group, were found during the literature review. Zhao et al. (2012) developed a subgraph that could represent the planar nature of the epitope; although the model is designed to identify a single epitope, it can also detect multiple ones. Zhao et al. (2014) used features extracted from both the antigen and the Ag–Ab interaction, expressed as a coupling graph and later transformed into a general graph. The learning-based approach utilizes machine learning to work with a large number of features and exploits nonlinear relationships between features to optimize model performance. Rubinstein, Mayrose & Pupko (2009) used two Naïve Bayesian classifiers to develop structure-based and sequence-based approaches. SEPPA 2.0 incorporates amino acid index (AAindex) characteristics into the SEPPA algorithm in the calculation of cluster coefficients (Qi et al., 2014; Kawashima et al., 2008); the AAindex in SEPPA 2.0 is consolidated via Artificial Neural Networks (ANN). SEPPA 3.0 adds glycosylation triangles and glycosylation-related AAindex features to SEPPA 2.0 (Zhou et al., 2019), with the glycosylation-related AAindex consolidated via ANN. Several researchers utilized the advantages of random forests (Dalkas & Rooman, 2017; Jespersen et al., 2017; Ren et al., 2014; Zhang et al., 2011). The main challenge in developing a conformational B-cell epitope prediction model is the class imbalance problem: the number of samples in the target (epitope) class is much smaller than that in the nontarget (non-epitope) class. Several methods have been proposed to handle the class imbalance problem.
However, studies that focus on handling this issue in epitope prediction models are still limited. Ren et al. (2014) and Zhang et al. (2011) used simple random undersampling to handle the class imbalance problem. Dalkas & Rooman (2017) used the Support Vector Machine (SVM) Synthetic Minority Over-sampling Technique (SMOTE) method, a variant of SMOTE. Another common approach is weighted SVM, which belongs to the cost-sensitive, algorithm-level category (Ren et al., 2015). Additionally, Zhang et al. (2014) used a cost-sensitive ensemble approach and showed that the method was superior to EasyEnsemble, BalanceCascade, and SMOTEBoost (Liu, Wu & Zhou, 2009; Chawla et al., 2003). Currently, several studies address class imbalance using approaches that are mainly divided into four categories: data level, algorithm level, cost-sensitive, and ensemble (Galar et al., 2012). In the data-level approach, a resampling method is used to ensure a balanced distribution of data (Gary, 2012); the approaches under this category include undersampling, oversampling, and a combination of both (Drummond & Holte, 2003; Estabrooks, Jo & Japkowicz, 2004; Chawla et al., 2002; Chawla et al., 2008). In the algorithm-level approach, the minority class is specifically considered: since the learning process of classifiers usually ignores the minority class, most algorithms are equipped with a search system to identify rare patterns (Gary, 2012). Specific recognition algorithms are used to detect rare patterns or to assign different misclassification weights to the minority and majority classes (Elkan, 2001; Batuwita & Palade, 2012; Japkowicz, Myers & Gluck, 1995; Raskutti & Kowalczyk, 2003). In general, adding a cost to an instance is categorized as cost-sensitive at the data level (Galar et al., 2012). The approach is also applied in the ensemble method (Blaszczynski & Stefanowski, 2014). However, the determination of the weights is carried out through trial and error.
The most common ensemble methods used to handle the class imbalance problem are bagging and boosting. In bagging, a balanced class sample is generated using the bootstrapping mechanism; the sampling methods used include random undersampling and oversampling, as well as SMOTE (Blaszczynski & Stefanowski, 2014; Galar et al., 2012). In boosting, samples are selected iteratively and their weights are calculated based on the misclassification costs. Many boosting variants have been proposed, though the most influential is AdaBoost (Freund & Schapire, 1996). Random oversampling and undersampling are the simplest sampling methods for balancing the data distribution. Handling class imbalance in data preprocessing is versatile since it does not depend on the classifier used. However, the main drawback of random oversampling is overfitting, because no new sample data are added. The SMOTE technique avoids overfitting by interpolating adjacent members of the minority class to create new sample data (Chawla et al., 2002). Furthermore, oversampling that considers certain conditions, such as the density distribution and the position of the sample point relative to the majority class, improves the performance of the classifier (He & Garcia, 2009; Han et al., 2005). Random undersampling raises the concern that information from the dataset might be lost, because pruning may considerably reduce classifier performance. To reduce the loss of information, several cluster-based methods have been used in resampling (Yen & Lee, 2009; Das, Krishnan & Cook, 2013; Sowah et al., 2016; Tsai et al., 2019). Cluster-based undersampling can be conducted by omitting class labels (Yen & Lee, 2009; Das, Krishnan & Cook, 2013).
Alternatively, it can be performed only on the negative class (Sowah et al., 2016; Lin et al., 2017; Tsai et al., 2019). Das, Krishnan & Cook (2013) discarded the negative class data that overlap the positive class in a specific cluster, based on the degree of overlap. In Yen & Lee (2009), the samples taken from the negative class are proportional to the positive class samples in a particular cluster. Sowah et al. (2016) randomly selected several samples from each cluster, while in Tsai et al. (2019) the cluster members were selected using optimization algorithms. Setting the number of negative-class clusters equal to the number of positive samples may lead to a suboptimal number of clusters for the negative class (Lin et al., 2017). In this research, the cluster-based undersampling method is combined with SMOTE to obtain a balanced dataset. The parameter r is defined to determine the proportion of majority class data sampled relative to the minority class. A classifier model is built with the decision tree (DT) and SVM algorithms to assess the performance of the proposed method.

MATERIAL AND METHODS
Dataset
This research uses Rubinstein's dataset for training (Rubinstein, Mayrose & Pupko, 2009). The formation criteria of the training dataset are explained by Rubinstein et al. (2008): (1) the Ag–Ab complex structure should contain antibodies with both heavy and light chains, (2) contact between antigen and antibody must occur in the complementarity-determining regions, (3) the number of antigen residues that bind to antibodies is large, and (4) the complex used cannot be similar to other complexes, as stated in the Structural Classification of Proteins criteria (Murzin et al., 1995). The training dataset consists of 76 antigen chains derived from 62 3D-structure Ag–Ab complexes. The chain list is shown in Table S1. The complexes were downloaded from the Protein Data Bank (PDB) (Berman et al., 2000).
Two independent test sets are used: Kringelum and SEPPA 3.0 (Kringelum et al., 2012; Zhou et al., 2019). Kringelum's test set consists of 39 antigen chains; the data were filtered from 52 antigen chains, and thirteen antigens were excluded from the list because they were used as training data by the compared methods. The excluded data are 1AFV, 1BGX, 1RVF, 2XTJ, 3FMG, 3G6J, 3GRW, 3H42, 3MJ9, 3RHW, 3RI5, 3RIA, and 3RIF. The details of this test set are presented in Table S2A. The test set represents protein antigens in the general category and is used to compare CluSMOTE DT with Discotope 1.2, Ellipro, Epitopia, EPCES, PEPITO, and Discotope 2 (Andersen, Nielsen & Lund, 2006; Ponomarenko et al., 2008; Rubinstein, Mayrose & Pupko, 2009; Liang et al., 2009; Sweredoski & Baldi, 2008; Kringelum et al., 2012). The SEPPA 3.0 test set is a glycoprotein-category test set. This dataset consists of 98 antigen chains, and eight were excluded because they contain multiple epitopes: 5KEM A1, 5KEM A2, 5T3X G1, 5T3X G2, 5TLJ X1, 5TLJ X2, 5TLK X1, and 5TLK X2. The test set was used to compare CluSMOTE DT with the SEPPA 2.0, SEPPA 3.0, PEPITO, Epitopia, Discotope 2, CBTOPE, and BepiPred 2.0 methods (Qi et al., 2014; Ansari & Raghava, 2010; Jespersen et al., 2017). The antigen list for the test set is presented in Table S2B.

Conformational B-cell epitope prediction method
Conformational epitopes are residues of exposed antigens that are spatially close, though they form separate segments when viewed from the sequence (Andersen, Nielsen & Lund, 2006). To build a conformational epitope prediction model, the steps needed, as shown in Fig.
1, include: (1) preparing the dataset, (2) balancing the dataset, and (3) creating a classification model for the prediction of potential epitope residues. The preparation step aims to build the training and testing datasets. The number of exposed residues considered as epitopes is much smaller than the number of exposed residues that are non-epitopes. Balancing the dataset is meant to overcome the class imbalance found in Step 1, while the classification model categorizes residues as members of the epitope or non-epitope class.

Data preprocessing
The creation of feature vectors and epitope annotations for the training and testing data is conducted on surface residues only. Relative solvent accessibility (RSA) is used as a parameter to distinguish surface from buried residues. Different values have been used as limits, including the 0.05, 0.01, 0.1, and 0.25 thresholds (Rubinstein, Mayrose & Pupko, 2009; Zhang et al., 2011; Kringelum et al., 2012; Ren et al., 2014; Dalkas & Rooman, 2017). This variation affects the imbalance ratio between the epitope and non-epitope classes. Although the standard burial threshold is 0.05, the value of 0.01 is used as the limit here, because the larger the surface exposure threshold, the lower the predictive performance (Basu, Bhattacharyya & Banerjee, 2011; Kringelum et al., 2012). Choosing 0.01 as the limit is consistent with the finding of Zheng et al. (2015) that all RSA values of epitopes are positive, though only slightly larger than zero. The feature vectors used include accessible surface area (ASA), RSA, depth index (DI), protrusion index (PI), contact number (CN), HSE, quadrant sphere exposure (QSE), AAindex, B factor, and log odds ratio, as shown in Table 1. ASA and RSA are the key features in determining whether a residue is likely to bind to other molecules for accessibility reasons.
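The burial filter can be sketched as follows. This is a minimal illustration: the residue records, field names, and the use of a strict inequality are assumptions of this sketch, not the paper's implementation; only the 0.01 threshold comes from the text.

```python
# Sketch of the surface-residue filter: only residues whose relative solvent
# accessibility (RSA) exceeds the 0.01 burial threshold enter the dataset.
# The record layout below is hypothetical.

RSA_THRESHOLD = 0.01  # burial cutoff used in this study

def surface_residues(residues):
    """Keep residues considered exposed under the RSA burial cutoff."""
    return [r for r in residues if r["rsa"] > RSA_THRESHOLD]

chain = [
    {"id": "A12", "rsa": 0.000},  # fully buried, excluded
    {"id": "A13", "rsa": 0.004},  # below threshold, excluded
    {"id": "A14", "rsa": 0.020},  # slightly exposed, kept
    {"id": "A15", "rsa": 0.310},  # clearly exposed, kept
]
exposed = surface_residues(chain)
```

Feature extraction and epitope annotation are then performed only on the residues that pass this filter.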
Although several programs can be used to calculate ASA, the most commonly used are NACCESS and DSSP (Hubbard & Thornton, 1993; Kabsch & Sander, 1983). DSSP only calculates the total ASA per residue, while NACCESS computes the backbone, side chain, polar, and nonpolar ASA. However, NACCESS can only process one molecular structure at a time, so users need to create additional scripts to process several molecular structures (Mihel et al., 2008). This study therefore used the PSAIA application (Mihel et al., 2008), which is not limited to one molecular structure at a time and can also calculate other features, including RSA, PI, and DI. No significant difference is observed between the ASA calculation results using NACCESS and PSAIA.

Figure 1 Development stage of conformational B-cell epitope prediction. Full-size DOI: 10.7717/peerjcs.275/fig-1

Table 1 Features for antigenic determinants and the methods used to compute them.

Category         Feature           Data source/method
Structural       ASA               PSAIA (Mihel et al., 2008)
                 RSA               PSAIA (Mihel et al., 2008)
                 Protrusion index  PSAIA (Mihel et al., 2008)
                 CN                Nishikawa & Ooi (1980)
                 HSE               Hamelryck (2005)
                 QSE               Li et al. (2011)
Physicochemical  AAindex           Kawashima et al. (2008)
                 B factor          Ren et al. (2014); Ren et al. (2015)
Statistic        Log odds ratio    Andersen, Nielsen & Lund (2006)

Notes: ASA, solvent-accessible surface area; RSA, relative solvent-accessible surface area; CN, contact number; HSE, half-sphere exposure; QSE, quadrant sphere exposure; AAindex, amino acid index.
The ASA attribute values used include the backbone, side chain, polar (oxygen, nitrogen, and phosphorus), and nonpolar (carbon) atoms. RSA is the ratio of the ASA value to the maximum ASA calculated based on the GXG tripeptide, where G is glycine and X is the residue sought (Lee & Richards, 1971). The maximum value of ASA is obtained from Tien et al. (2013), which is an improvement on Rost & Sander (1994) and Miller et al. (1987). There was no difference in the list of datasets obtained using the three methods, as presented in Appendix 1. The RSA attribute values used include the total RSA of all atoms, backbone atoms, side-chain atoms, polar atoms (oxygen, nitrogen, and phosphorus), and nonpolar atoms (carbon).

DI: The DI of the i-th atom refers to its minimum distance to the exposed atoms. The DI attribute values used include the average and standard deviation of all atoms, the average of side-chain atoms, the maximum, and the minimum.

PI: The PI is the ratio of the space of a sphere with a radius of 10 Å centered on Cα divided by the area occupied by the heavy atoms constituting the protein (Pintar, Carugo & Pongor, 2002). In this study, PI was calculated using the PSAIA software (Mihel et al., 2008). The PI attribute values used include the average and standard deviation of all atoms, the average of side-chain atoms, the maximum, and the minimum.

CN, HSE, QSE: The CN is the total number of Cα atoms adjacent to the residue measured in the microsphere environment, limited by a ball with radius r centered on Cα (Nishikawa & Ooi, 1980). In HSE, the number of Cα atoms is distributed in two areas, the upper hemisphere and the lower hemisphere (Hamelryck, 2005).
In QSE, the number of Cα atoms is distributed in eight regions of the microsphere environment (Li et al., 2011).

AAindex: The AAindex consists of 544 indices representing amino acid physicochemical and biochemical properties (Kawashima et al., 2008). The AAindex value of each residue was extracted from component I of the HDRATJCI constituents in the aaindex1.txt file. The details of component I of the AAindex file are attached as Table S3.

B factor: The B factor indicates the flexibility of an atom or a residue; an exposed residue has a larger B factor than a buried residue. The B factor for each atom is derived from the PDB data. The attribute values used include the B factor of Cα and the average over all atoms of the residue (Ren et al., 2014).

Log odds ratio: This feature is extracted from the primary protein structure and calculated according to Andersen, Nielsen & Lund (2006). A sliding window of size 9 residues was run over each antigen sequence in the dataset to form overlapping segments used to count the occurrences of individual residues. Each segment was grouped as epitope or non-epitope depending on its center. The log odds ratio was calculated at the fifth-position residue based on Nielsen, Lundegaard & Worning (2004). In this study, a segment is included in the calculation only if the fifth-position residue is exposed.

Epitope annotation on the antigen residues is carried out by analyzing the interactions in the PSAIA software (Mihel et al., 2008) using a contact criterion with a distance threshold of 4 Å and Van der Waals radii derived from the chothia.radii file. The ASA change parameters Delta ASA, Z_Slice Size, and Probe Radius have the values 1.0, 0.25, and 1.4, respectively. The interaction analyzer output is a list of all adjacent residue pairs within the allowable distance range. A procedure for selecting antigen residues that bind to antibodies is created to obtain the list of epitopes.
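As a rough sketch, the sliding-window log odds feature can be computed as below. The add-one smoothing, the toy sequence, and the labels are assumptions of this illustration; the actual computation in the paper follows Nielsen, Lundegaard & Worning (2004) and also requires the center residue to be exposed.

```python
import math
from collections import Counter

# A window of 9 residues slides over each antigen sequence; each segment is
# labelled epitope/non-epitope by its centre (5th) residue, and per-amino-acid
# log odds compare epitope vs non-epitope centre frequencies.

WINDOW = 9
CENTER = WINDOW // 2  # index 4, i.e., the 5th residue of the segment

def log_odds(sequence, epitope_flags):
    """epitope_flags[i] is True when residue i is an annotated epitope."""
    epi, non = Counter(), Counter()
    for start in range(len(sequence) - WINDOW + 1):
        centre = start + CENTER
        (epi if epitope_flags[centre] else non)[sequence[centre]] += 1
    ratios = {}
    for aa in set(epi) | set(non):
        f_epi = (epi[aa] + 1) / (sum(epi.values()) + 20)  # add-one smoothing
        f_non = (non[aa] + 1) / (sum(non.values()) + 20)  # over 20 amino acids
        ratios[aa] = math.log(f_epi / f_non)
    return ratios

# Toy example: a lysine sits at the centre of the only epitope-labelled window.
seq = "GGGGKGGGGAGGGG"
flags = [False] * len(seq)
flags[4] = True  # the K residue is the annotated epitope
ratios = log_odds(seq, flags)
```

Residues enriched at epitope window centers receive positive log odds, depleted ones negative.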
Figure 2 CluSMOTE sampling. Full-size DOI: 10.7717/peerjcs.275/fig-2

Handling class imbalance with CluSMOTE
Resampling with undersampling and oversampling has advantages and disadvantages. Cluster-based sampling was therefore conducted to minimize the loss of information caused by the pruning effect of undersampling, while oversampling with SMOTE has often proven to be reliable; merging the two increases classifier performance. A parameter stating the degree of oversampling is used to identify the optimal combination. This study proposes CluSMOTE, a cluster-based undersampling mechanism combined with SMOTE, as shown in Fig. 2. Negative class data are clustered using the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm, which identifies the optimal clusters based on stability (Campello, Moulavi & Sander, 2013). The number of clusters is smaller than the number of positive class samples, so each cluster contains several data points. The simplest sampling mechanism is random selection; to select data, the cluster size and the degree of oversampling should be considered. The proposed CluSMOTE method uses the following steps:
1. Separate the positive and the negative class data.
2. Cluster the negative class using the HDBSCAN algorithm.
3. Take a certain number of data items from each cluster, considering the ratio of the number of cluster members to the total size of the negative class. The number of samples taken from cluster Ci is defined in Eq. (1), following Sowah et al. (2016).
Size_Ci = r × MI × M_ci / MA    (1)

where MI is the number of minority class samples, MA is the total number of majority class samples, M_ci is the number of members of cluster Ci, and r is the ratio of the negative class dataset formed to the minority class. In case r = 2, the number of negative class samples formed is twice the number of positive class samples. The samples are taken from each cluster randomly.
4. Combine the positive class with all samples taken in Step 3.
5. Carry out SMOTE on the result obtained in Step 4.
The implementation was carried out in the Java programming environment with NetBeans IDE 8.2. The CluSMOTE method was implemented as a new Java class, supported by the JSAT statistics library version 0.09 (Raff, 2017).

Classification algorithm
Two classification algorithms, SVM and DT, were used to evaluate the performance of CluSMOTE. SVM is a popular learning algorithm used in previous studies of conformational epitope prediction, while DT is often used to handle the class imbalance problem and is classified as one of the top 10 data mining algorithms (Galar et al., 2012). This study uses the JSAT (Raff, 2017) software package, utilizing the Pegasos SVM with a mini-batch linear kernel (Shalev-Shwartz, Singer & Srebro, 2007). Pegasos SVM works fast since the primal update is carried out directly and no support vectors are stored. The default values used for the epoch, regularization, and batch size parameters are 5, 1e−4, and 1, respectively. The decision tree is formed by nodes built on the principle of the decision stump (Iba & Langley, 1992). The study also used bottom-up pessimistic pruning with error-based pruning from the C4.5 algorithm (Quinlan, 1993). The proportion of the dataset used for pruning is 0.1.
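The CluSMOTE steps listed above can be sketched as follows. This is a minimal illustration assuming the HDBSCAN cluster assignments are already available, and it uses a simplified single-neighbour SMOTE; the feature vectors and function names are hypothetical, and this is not the paper's Java/JSAT implementation.

```python
import math
import random

def cluster_sample_size(r, n_minority, n_majority, cluster_size):
    """Eq. (1): Size_Ci = r * MI * M_ci / MA."""
    return round(r * n_minority * cluster_size / n_majority)

def smote(minority, n_synthetic, rng):
    """Simplified SMOTE: interpolate each synthetic point between a random
    minority sample and its nearest minority neighbour."""
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        nearest = min((m for m in minority if m is not x),
                      key=lambda m: math.dist(x, m))
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nearest)))
    return synthetic

def clusmote(minority, majority_clusters, r, rng):
    """Steps 3-5: per-cluster undersampling of the majority class, then
    SMOTE on the minority class until both classes are the same size."""
    n_min = len(minority)
    n_maj = sum(len(c) for c in majority_clusters)
    negatives = []
    for cluster in majority_clusters:  # step 3: proportional random sampling
        k = cluster_sample_size(r, n_min, n_maj, len(cluster))
        negatives += rng.sample(cluster, min(k, len(cluster)))
    # steps 4-5: combine with the minority class and oversample it
    positives = minority + smote(minority, max(0, len(negatives) - n_min), rng)
    return positives, negatives

rng = random.Random(0)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]    # 3 epitope samples
clusters = [[(5.0, 5.0)] * 10, [(8.0, 1.0)] * 20]  # stand-in for HDBSCAN output
positives, negatives = clusmote(minority, clusters, r=2, rng=rng)
```

With r = 2 and three minority samples, the sketch draws six negatives in total (two from the first cluster, four from the second) and synthesizes three positives so both classes end up the same size.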
Performance measurement of the conformational epitope prediction model
A dataset used for conformational epitope prediction contains the class imbalance problem. The area under the ROC curve (AUC) is mainly used as a performance parameter; in class imbalance, the AUC is a better measure than accuracy, which is biased toward the majority class. Another performance parameter used is the F-measure, expressed in Eq. (2):

Fm = 2∗PPV∗SE / (PPV + SE) = 2∗TP / (2∗TP + FN + FP)    (2)

where PPV = TP/(TP + FP) and SE denotes sensitivity (TPR). The F-measure is not affected by imbalance conditions provided the training data used are balanced (Batuwita & Palade, 2009). Other metrics that can be used to assess performance include the Gmean and the adjusted Gmean (AGm). The Gmean is expressed in Eq. (3):

Gmean = √(SP∗SE)    (3)

where SP denotes specificity (TNR) and SE denotes sensitivity (TPR). AGm is expressed in Eq. (4):

AGm = (Gmean + SP∗Nn) / (1 + Nn)  if SE > 0
AGm = 0                           if SE = 0    (4)

where SP is specificity, SE is sensitivity, and Nn is the proportion of negative samples in the dataset. AGm is suitable when an increase in TPR must be achieved with minimal reduction in TNR. This criterion generally suits bioinformatics cases, where errors in the identification of negative classes are unwanted (Batuwita & Palade, 2009). In the case of epitope prediction, the number of false negatives is not expected to be high, because the selection of a wrong residue leads to the failure of the subsequent process.

RESULTS AND DISCUSSIONS
The complex-based leave-one-out cross-validation method is used to test the reliability of the classifier model. Each training set is built from n − 1 complexes and tested with a test set from the n-th complex. Model performance was measured using seven parameters: TPR, TNR, precision, AUC, Gmean, AGm, and F-measure.
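The performance measures of Eqs. (2)-(4) translate directly into code; the confusion-matrix counts used below are illustrative only.

```python
import math

def metrics(tp, fn, tn, fp):
    """Performance measures of Eqs. (2)-(4) from confusion-matrix counts."""
    se = tp / (tp + fn)                       # sensitivity (TPR)
    sp = tn / (tn + fp)                       # specificity (TNR)
    ppv = tp / (tp + fp) if tp + fp else 0.0  # precision
    fm = 2 * tp / (2 * tp + fn + fp)          # Eq. (2): F-measure
    gmean = math.sqrt(sp * se)                # Eq. (3)
    nn = (tn + fp) / (tp + fn + tn + fp)      # proportion of negative samples
    agm = (gmean + sp * nn) / (1 + nn) if se > 0 else 0.0  # Eq. (4)
    return {"SE": se, "SP": sp, "PPV": ppv,
            "Fm": fm, "Gmean": gmean, "AGm": agm}

m = metrics(tp=8, fn=2, tn=85, fp=5)  # illustrative counts
```

Because Nn weights the specificity term, AGm rewards models that keep the TNR high, which matches the requirement above that negative residues not be misidentified.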
Effect of the selection of the r value on model performance
In the original dataset, the imbalance ratio between the negative and positive classes is 10:1. To assess the effectiveness of sampling, this study utilized several r values, defined as the ratio of negative to positive class data. The value r = 1 indicates that only the clustering and undersampling steps are applied, while r = 2 indicates that the number of negative class samples is twice the number of positive ones. The test results obtained without a balancing mechanism show the effectiveness of the proposed resampling method. The performance of the classification model expressed by the TPR, TNR, precision, AUC, Gmean, AGm, and F-measure parameters is shown in Table 2, together with the results of internal model validation on several variations of the r value. As the r value varies from r = 1 to r = 5, in both CluSMOTE DT and CluSMOTE SVM, the TPR and FPR values tend to decrease: the larger the degree of oversampling, the smaller the TPR. The TNR and precision values tend to increase with the r value. An increase in TNR means more negative samples are recognized, which can also be interpreted as less information loss from the negative class. These two conditions indicate a trade-off between the degrees of oversampling and undersampling. Oversampling without undersampling yields TNR and precision values greater than undersampling without oversampling. Similarly, undersampling without oversampling yields TPR and FPR values greater than oversampling without undersampling. This finding indicates that the undersampling mechanism is more effective in increasing positive class recognition than oversampling, which is consistent with previous studies. Also, the resampling mechanism increases the TPR and FPR values compared to no resampling.
However, the overall performance improvement indicated by the AUC, Gmean, AGm, and F-measure is not significant. In CluSMOTE DT, the AUC and Gmean follow the same tendency; the best AUC and Gmean are 0.815 and 0.811 at r = 2, respectively. The AGm and F-measure values also follow the same tendency, though the values differ. In DT, the best AGm and F-measure are obtained using the SMOTE oversampling method. In the SVM classifier, the best AGm is obtained using the SMOTE oversampling mechanism, whereas the best F-measure is obtained using CluSMOTE at r = 5. Previous studies on class imbalance stated that hybrid resampling methods could significantly improve performance; however, this was not the case in epitope prediction using the CluSMOTE DT method. No r value significantly influenced the overall performance improvement expressed by the AUC, Gmean, AGm, and F-measure. When the TPR and TNR values are considered together, the selection of r = 2 is quite good, as shown by the AUC and Gmean values.

Table 2 Performance of the classification model with variations in the r value.

No  Resampling method   r  Classifier  TPR (recall)  TNR     Precision (PPV)  FPR     AUC     Gmean   AGm     F-measure
1   Cluster-based only  1  DT          0.855a        0.769   0.454            0.231a  0.812   0.806   0.791   0.581
2   CluSMOTE            2  DT          0.797         0.834   0.526            0.163   0.815a  0.811a  0.823   0.622
3   CluSMOTE            3  DT          0.764         0.862   0.558            0.138   0.813   0.807   0.833   0.634
4   CluSMOTE            4  DT          0.730         0.881   0.575            0.119   0.806   0.796   0.835   0.631
5   CluSMOTE            5  DT          0.724         0.880   0.591            0.120   0.802   0.794   0.834   0.641
6   SMOTE only          –  DT          0.644         0.939a  0.732a           0.061   0.791   0.771   0.848a  0.675a
7   No resampling       –  DT          0.637         0.939   0.730            0.061   0.788   0.767   0.846   0.669
8   Cluster-based only  1  SVM         0.591b        0.668   0.393            0.328b  0.629   0.579   0.620   0.388
9   CluSMOTE            2  SVM         0.577         0.746   0.441            0.254   0.661b  0.60b   0.666   0.400
10  CluSMOTE            3  SVM         0.498         0.790   0.486            0.210   0.644   0.580   0.675   0.396
11  CluSMOTE            4  SVM         0.475         0.801   0.508            0.199   0.638   0.566   0.672   0.387
12  CluSMOTE            5  SVM         0.468         0.819   0.529            0.178   0.643   0.572   0.683   0.401b
13  SMOTE only          –  SVM         0.384         0.881b  0.606b           0.119   0.632   0.532   0.688   0.368
14  No resampling       –  SVM         0.409         0.874   0.569            0.126   0.641   0.557   0.699b  0.392

Notes: TPR, true positive rate; TNR, true negative rate; AUC, area under the ROC curve; Gmean, geometric mean. a The best parameter value in the DT model. b The best parameter value in the SVM model.

The selection of r values based on the experiment shows opposing conditions between the TPR and TNR. From Table 2, the best performance measured by AUC and Gmean is fairer compared to AGm and F-score: with the best AUC and Gmean, a balanced proportion was obtained between the TPR and TNR, whereas the best AGm and F-score resulted from the lowest TPR value. Generally, the models built with DT exhibit better performance than those built with SVM. The performance of SVM is likely affected by kernel selection problems, since linear kernels cannot separate classes that are not linearly separable. Other configurations or models may be explored in future work.

Comparison of the proposed method with previous methods
CluSMOTE DT was evaluated on an independent test set from Kringelum et al.
(2012), filtering out antigens that appeared in the training data of the methods being compared. A total of 39 antigens were used in the comparison, as listed in Table 4. The test results show that CluSMOTE DT with r = 2 is superior to the other methods, with an average AUC of 0.83; the average AUC values of Discotope, Ellipro, Epitopia, EPCES, PEPITO, and Discotope 2.0 were 0.727, 0.721, 0.673, 0.697, 0.746, and 0.744, respectively. CluSMOTE DT with r = 2 was also evaluated on the independent glycoprotein antigen test set of Zhou et al. (2019). Testing with glycoprotein antigens showed that the performance of CluSMOTE DT was similar to that of SEPPA 3.0, with AUC values of 0.766 and 0.739, respectively. Both CluSMOTE DT and SEPPA 3.0 were superior to Epitopia, Discotope 2.0, PEPITO, CBTOPE, SEPPA 2.0, and BepiPred 2.0. The detailed performance of the eight methods compared is shown in Table 5. The AUC achieved by CluSMOTE DT is comparable to that of SEPPA 3.0, suggesting that the proposed method can handle glycoprotein epitope cases well. The model developed with CluSMOTE uses the dataset presented by Andersen, Nielsen & Lund (2006), which consists of 76 antigen structures; this is far fewer complex structures than the 767 antigen structures used to build SEPPA 3.0. The smaller number of antigen structures shortens the training time for model development.

CONCLUSIONS

An epitope covers only a small part of the exposed antigen surface, which creates a class imbalance problem in learning-based conformational epitope prediction. In this study, the CluSMOTE method was proposed to overcome this class imbalance problem. The study shows that CluSMOTE considerably increases the TPR compared to SMOTE alone.
The comparison of the proposed model with state-of-the-art methods on the two datasets shows that CluSMOTE DT is comparable to or better than other methods. Its mean AUC values on the Kringelum and SEPPA 3.0 test sets are 0.83 and 0.766, respectively. This result shows that CluSMOTE DT is better than other methods at classifying general protein antigens, while being comparable to SEPPA 3.0 on glycoprotein antigens.

ACKNOWLEDGEMENTS

The authors thank the Publishing and Publication Agency of Universitas Gadjah Mada for the English proofreading of this manuscript.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
This work was supported by Universitas Trisakti (Doctoral scholarship). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:
Universitas Trisakti (Doctoral scholarship).

Competing Interests
The authors declare there are no competing interests.

Author Contributions
• Binti Solihah conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.
• Azhari Azhari and Aina Musdholifah conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability
The following information was supplied regarding data availability:
The data and source code are available at GitHub:
- https://github.com/BSolihah/LIBFROMJSAT
- https://github.com/BSolihah/conformational-epitope-predictor

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj-cs.275#supplemental-information.
REFERENCES

Andersen PH, Nielsen M, Lund O. 2006. Prediction of residues in discontinuous B-cell epitopes using protein 3D structures. Protein Science 15:2558–2567 DOI 10.1110/ps.062405906.

Ansari HR, Raghava GPS. 2010. Identification of conformational B-cell epitopes in an antigen from its primary sequence. Immunome Research 6(1):1–26 DOI 10.1186/1745-7580-6-6.

Basu S, Bhattacharyya D, Banerjee R. 2011. Mapping the distribution of packing topologies within protein interiors shows predominant preference for specific packing motifs. BMC Bioinformatics 12(195):1–26 DOI 10.1186/1471-2105-12-195.

Batuwita R, Palade V. 2009. A new performance measure for class imbalance learning: application to bioinformatics problems. In: International conference on machine learning and applications. Miami Beach: IEEE Computer Society, 545–550 DOI 10.1109/ICMLA.2009.126.

Batuwita R, Palade V. 2012. Class imbalance learning methods for support vector machines. In: He H, Ma Y, eds. Imbalanced learning: foundations, algorithms, and applications. Hoboken: John Wiley & Sons, Inc, 83–99.

Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. 2000. The Protein Data Bank. Nucleic Acids Research 28(1):235–242 DOI 10.1093/nar/28.1.235.

Blaszczynski J, Stefanowski J. 2014. Neighbourhood sampling in bagging for imbalanced data. Neurocomputing 150:529–542 DOI 10.1016/j.neucom.2014.07.064.

Campello RJGB, Moulavi D, Sander J. 2013. Density-based clustering based on hierarchical density estimates. In: Pei J, Tseng VS, Cao L, Motoda H, Xu G, eds. Advances in knowledge discovery and data mining. PAKDD 2013, Part II, LNAI. Berlin: Springer, 160–172.

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. 2002. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357 DOI 10.1613/jair.953.

Chawla NV, Cieslak DA, Hall LO, Joshi A. 2008. Automatically countering imbalance and its empirical relationship to cost.
Data Mining and Knowledge Discovery 17(2):225–252 DOI 10.1007/s10618-008-0087-0.

Chawla NV, Lazarevic A, Hall LO, Bowyer KW. 2003. SMOTEBoost: improving prediction of the minority class in boosting. In: 7th European conference on principles and practice of knowledge discovery in databases. Lecture Notes in Computer Science. Berlin: Springer DOI 10.1007/978-3-540-39804-2.

Dalkas GA, Rooman M. 2017. SEPIa, a knowledge-driven algorithm for predicting conformational B-cell epitopes from the amino acid sequence. BMC Bioinformatics 18(95):1–12 DOI 10.1186/s12859-017-1528-9.

Das B, Krishnan NC, Cook DJ. 2013. Handling class overlap and imbalance to detect prompt situations in smart homes. In: IEEE 13th international conference on data mining workshops. IEEE Computer Society, 266–273 DOI 10.1109/ICDMW.2013.18.

Drummond C, Holte RC. 2003. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: ICML workshop on learning from imbalanced data sets II. Washington, D.C.

Elkan C. 2001. The foundations of cost-sensitive learning. In: Proceedings of the seventeenth international joint conference on artificial intelligence. 973–978.

Estabrooks A, Jo T, Japkowicz N. 2004.
A multiple resampling method for learning from imbalanced data sets. Computational Intelligence 20(1):18–36 DOI 10.1111/j.0824-7935.2004.t01-1-00228.x.

Freund Y, Schapire RE. 1996. Experiments with a new boosting algorithm. In: Machine learning: proceedings of the thirteenth international conference.

Galar M, Fernández A, Barrenechea E, Bustince H. 2012. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42(4):463–484 DOI 10.1109/TSMCC.2011.2161285.

Weiss GM. 2012. Foundations of imbalanced learning. In: He H, Ma Y, eds. Imbalanced learning: foundations, algorithms, and applications. Hoboken: John Wiley & Sons, Inc, 13–42.

Hamelryck T. 2005. An amino acid has two sides: a new 2D measure provides a different view of solvent exposure. Proteins: Structure, Function, and Bioinformatics 59:38–48 DOI 10.1002/prot.20379.

Han H, Wang W, Mao B. 2005. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang DS, Zhang XP, Huang GB, eds. Advances in intelligent computing. ICIC 2005. Lecture notes in computer science, vol. 3644. Berlin, Heidelberg: Springer, 878–887.

He H, Garcia EA. 2009. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284.

Hubbard SJ, Thornton JM. 1993. NACCESS. Computer program version 2.1.1. London: Department of Biochemistry and Molecular Biology, University College London.

Iba W, Langley P. 1992. Induction of one-level decision trees. In: ML92: proceedings of the ninth international conference on machine learning, Aberdeen, Scotland. San Francisco: Morgan Kaufmann, 233–240.

Japkowicz N, Myers C, Gluck M. 1995. A novelty detection approach to classification. In: Proceedings of the fourteenth international joint conference on artificial intelligence. 518–523.

Jespersen MC, Peters B, Nielsen M, Marcatili P. 2017. BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes. Nucleic Acids Research 45(W1):W24–W29 DOI 10.1093/nar/gkx346.

Kabsch W, Sander C. 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–2637 DOI 10.1002/bip.360221211.

Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M. 2008. AAindex: amino acid index database, progress report 2008. Nucleic Acids Research 36:D202–D205 DOI 10.1093/nar/gkm998.

Kringelum JV, Lundegaard C, Lund O, Nielsen M. 2012. Reliable B cell epitope predictions: impacts of method development and improved benchmarking. PLOS Computational Biology 8(12):e1002829 DOI 10.1371/journal.pcbi.1002829.

Kringelum JV, Nielsen M, Padkjaer S, Lund O. 2013. Structural analysis of B-cell epitopes in antibody:protein complexes. Molecular Immunology 53(1–2):24–34 DOI 10.1016/j.molimm.2012.06.001.

Kulkarni-Kale U, Bhosle S, Kolaskar AS. 2005. CEP: a conformational epitope prediction server. Nucleic Acids Research 33(Web Server issue):W168–W171 DOI 10.1093/nar/gki460.

Lee B, Richards FM. 1971. The interpretation of protein structures: estimation of static accessibility. Journal of Molecular Biology 55:379–400 DOI 10.1016/0022-2836(71)90324-X.

Li P, Pok G, KSJ A, Shon HS, Ryu KH. 2011. QSE: a new 3-D solvent exposure measure for the analysis of protein structure.
Proteomics 11:3793–3801 DOI 10.1002/pmic.201100189.

Liang S, Zheng D, Zhang C, Zacharias M. 2009. Prediction of antigenic epitopes on protein surfaces by consensus scoring. BMC Bioinformatics 10(302):1–10 DOI 10.1186/1471-2105-10-302.

Lin W, Tsai C, Hu Y, Jhang J. 2017. Clustering-based undersampling in class-imbalanced data. Information Sciences 409–410:17–26 DOI 10.1016/j.ins.2017.05.008.

Liu X, Wu J, Zhou Z. 2009. Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2):539–550 DOI 10.1109/TSMCB.2008.2007853.

Mihel J, Šikić M, Tomić S, Jeren B, Vlahoviček K. 2008. PSAIA—protein structure and interaction analyzer. BMC Structural Biology 8(21):1–11 DOI 10.1186/1472-6807-8-21.

Miller S, Janin J, Lesk AM, Chothia C. 1987. Interior and surface of monomeric proteins. Journal of Molecular Biology 196:641–656 DOI 10.1016/0022-2836(87)90038-6.

Murzin AG, Brenner SE, Hubbard T, Chothia C. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 247:536–540.

Nielsen M, Lundegaard C, Worning P. 2004. Improved prediction of MHC class I and class II epitopes using a novel Gibbs sampling approach. Bioinformatics 20(9):1388–1397 DOI 10.1093/bioinformatics/bth100.

Nishikawa K, Ooi T.
1980. Prediction of the surface-interior diagram of globular proteins by an empirical method. International Journal of Peptide and Protein Research 16:19–32.

Pintar A, Carugo O, Pongor S. 2002. CX, an algorithm that identifies protruding atoms in proteins. Bioinformatics 18(7):980–984 DOI 10.1093/bioinformatics/18.7.980.

Ponomarenko J, Bui H, Li W, Fusseder N, Bourne PE, Sette A, Peters B. 2008. ElliPro: a new structure-based tool for the prediction of antibody epitopes. BMC Bioinformatics 9(514):1–8 DOI 10.1186/1471-2105-9-514.

Qi T, Qiu T, Zhang Q, Tang K, Fan Y, Qiu J, Wu D, Zhang W, Chen Y, Gao J, Zhu R, Cao Z. 2014. SEPPA 2.0—more refined server to predict spatial epitope considering species of immune host and subcellular localization of protein antigen. Nucleic Acids Research 42(W1):W59–W63 DOI 10.1093/nar/gku395.

Quinlan JR. 1993. C4.5: programs for machine learning. San Mateo: Morgan Kaufmann.

Raff E. 2017. JSAT: java statistical analysis tool, a library for machine learning. Journal of Machine Learning Research 18:1–5.

Raskutti B, Kowalczyk A. 2003. Extreme re-balancing for SVMs: a case study. In: Workshop on learning from imbalanced datasets II. Washington, D.C.

Ren J, Liu Q, Ellis J, Li J. 2014. Tertiary structure-based prediction of conformational B-cell epitopes through B factors. Bioinformatics 30(12):i264–i273 DOI 10.1093/bioinformatics/btu281.

Ren J, Liu Q, Ellis J, Li J. 2015. Positive-unlabeled learning for the prediction of conformational B-cell epitopes. BMC Bioinformatics 16(Suppl 18):1–15.

Rost B, Sander C. 1994. Conservation and prediction of solvent accessibility in protein families. Proteins: Structure, Function, and Genetics 20:216–226 DOI 10.1002/prot.340200303.

Rubinstein ND, Mayrose I, Halperin D, Yekutieli D, Gershoni JM, Pupko T. 2008. Computational characterization of B-cell epitopes. Molecular Immunology 45:3477–3489 DOI 10.1016/j.molimm.2007.10.016.

Rubinstein ND, Mayrose I, Pupko T. 2009.
A machine-learning approach for predicting B-cell epitopes. Molecular Immunology 46:840–847 DOI 10.1016/j.molimm.2008.09.009.

Shalev-Shwartz S, Singer Y, Srebro N. 2007. Pegasos: primal estimated sub-gradient solver for SVM. In: International conference on machine learning (ICML). New York, 807–814.

Sowah RA, Agebure MA, Mills GA, Koumadi KM, Fiawoo SY. 2016. New cluster undersampling technique for class imbalance learning. International Journal of Machine Learning and Computing 6(3):205–214 DOI 10.18178/ijmlc.2016.6.3.599.

Sun J, Wu D, Xu T, Wang X, Xu X, Tao L, Li YX, Cao ZW. 2009. SEPPA: a computational server for spatial epitope prediction of protein antigens. Nucleic Acids Research 37:W612–W616 DOI 10.1093/nar/gkp417.

Sweredoski MJ, Baldi P. 2008. PEPITO: improved discontinuous B-cell epitope prediction using multiple distance thresholds and half sphere exposure. Bioinformatics 24(12):1459–1460 DOI 10.1093/bioinformatics/btn199.

Tien MZ, Meyer AG, Sydykova DK, Spielman SJ, Wilke CO. 2013. Maximum allowed solvent accessibilities of residues in proteins. PLOS ONE 8(11):e80635 DOI 10.1371/journal.pone.0080635.

Tsai C, Lin W, Hu Y, Yao G. 2019. Under-sampling class imbalanced datasets by combining clustering analysis and instance selection. Information Sciences 477:47–54 DOI 10.1016/j.ins.2018.10.029.

Yen S, Lee Y. 2009.
Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36:5718–5727 DOI 10.1016/j.eswa.2008.06.108.

Zhang J, Zhao X, Sun P, Gao B, Ma Z. 2014. Conformational B-cell epitopes prediction from sequences using cost-sensitive ensemble classifiers and spatial clustering. BioMed Research International 2014:1–12 DOI 10.1155/2014/689219.

Zhang W, Xiong Y, Zhao M, Zou H, Ye X, Liu J. 2011. Prediction of conformational B-cell epitopes from 3D structures by random forests with a distance-based feature. BMC Bioinformatics 12(341):1–10 DOI 10.1186/1471-2105-12-341.

Zhao L, Hoi SCH, Li Z, Wong L, Nguyen H, Li J. 2014. Coupling graphs, efficient algorithms and B-cell epitope prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 11(1):7–16 DOI 10.1109/TCBB.2013.136.

Zhao L, Wong L, Lu L, Hoi SCH, Li J. 2012. B-cell epitope prediction through a graph model. BMC Bioinformatics 13(Suppl 17):S20.

Zheng W, Ruan J, Hu G, Wang K, Hanlon M, Gao J. 2015. Analysis of conformational B-cell epitopes in the antibody-antigen complex using the depth function and the convex hull. PLOS ONE 10(8):1–16 DOI 10.1371/journal.pone.0134835.

Zhou C, Chen Z, Zhang L, Zhang L, Yan D, Mao T, Tang K, Qiu T, Cao Z. 2019. SEPPA 3.0—enhanced spatial epitope prediction enabling glycoprotein antigens. Nucleic Acids Research 47(W1):W388–W394 DOI 10.1093/nar/gkz413.