World Applied Sciences Journal 31 (Applied Research in Science, Engineering and Management): 69-75, 2014
ISSN 1818-4952
© IDOSI Publications, 2014
DOI: 10.5829/idosi.wasj.2014.31.arsem.516

Application of Classification Restricted Boltzmann Machine to Medical Domains

Jakub M. Tomczak
Institute of Computer Science, Wroclaw University of Technology, wyb. Wyspianskiego 27, 50-370, Wroclaw, Poland

Corresponding Author: Jakub M. Tomczak, Institute of Computer Science, Wroclaw University of Technology, wyb. Wyspianskiego 27, 50-370, Wroclaw, Poland

Abstract: Recent developments have demonstrated deep models to be very powerful generative models which are able to extract features automatically and obtain high predictive performance. Typically, the building block of a deep architecture is the Restricted Boltzmann Machine (RBM). In this work, we focus on a variant of the RBM adapted to the classification setting, known as the Classification Restricted Boltzmann Machine. We claim that this model should be used as a stand-alone non-linear classifier which could be extremely useful in medical domains. Additionally, we show how to obtain sparse representations in the RBM by adding a regularization term to the learning objective which enforces a sparse solution. The considered classifier is then applied to five different medical domains.

Key words: Restricted Boltzmann Machine • classification • sparse • medical domain • diabetes • oncology

INTRODUCTION

The deep learning paradigm has become a crucial part of modern machine learning methods because it allows features to be extracted automatically and yields high predictive performance [1]. A building block of a deep architecture can be a probabilistic model called the Restricted Boltzmann Machine (RBM), used to represent one layer of the deep structure. Restricted Boltzmann Machines are interesting because they are capable of learning complex features and they form a bipartite graph, which makes inference in them easy. Moreover, it has been proven that, in the sense of the Kullback-Leibler divergence, an RBM is able to approximate any distribution over binary inputs arbitrarily well for a properly chosen number of hidden units [11]. Building on this result, it has been proposed to add an additional layer to the RBM representing an output variable. In other words, it has been advocated to use the RBM as a stand-alone non-linear classifier [8].

Learning flexible classifiers, which are capable of extracting features in an automatic manner and capturing non-linear dependencies, is especially important in the medical domain [7]. It has been shown that such classifiers can be very helpful in medical decision support systems, e.g., Boosted SVM for lung cancer patients [21], a graph-based rule inducer for diabetes treatment [19], bagging of decision trees for breast cancer recurrence [17]. Recently, the RBM for classification was applied to breast cancer recurrence and, in comparison to other methods, it obtained very promising results [18]. In this paper, we aim at verifying the applicability of the RBM as a predictive model (or diagnostic tool) in five different medical domains.

The typical learning algorithm for the RBM is the Stochastic Gradient Descent (SGD) technique, which scales well to large data and ensures convergence at a rate equal to the inverse of the desired accuracy (under mild conditions) [3]. However, there can be two problems with learning an RBM using SGD. First, the model could be overcomplete [13], i.e., there are too many hidden units.
Second, the learnt features may not be discriminative enough, i.e., they are suitable for reconstruction but insufficient for prediction. A possible solution to both of these issues is the application of sparse learning. Typical sparse learning is based on adding a regularization term to the learning objective (typically the negative log-likelihood) which penalizes dense solutions. An example of such regularization is the L1 norm of the model's parameters. It has been shown that this approach allows learning with overcomplete representations [13] but also leads to worse predictive performance [14]. Another approach is sparse Bayesian learning, which aims at introducing latent variable models that enforce sparse solutions [14].

In deep learning there are several methods which introduce sparse solutions. Most of them are based on a regularization term which encourages hidden units to be active at a low rate. One approach applies the cross-entropy between the desired activation level and the estimated probability of activation [15]. A very similar solution proposes to use the L2 norm instead of the cross-entropy [10]. Another one applies the Bhattacharyya distance in order to differentiate the probability of activation among hidden units for a given input [12]. In this paper, we give a different view on how to use probability distance measures and show that for some probabilistic similarity measures (e.g. Kullback-Leibler divergence, Bhattacharyya distance, Mahalanobis distance) one obtains the same regularization term.

The paper is organized as follows. In Section II the model of the RBM for classification is outlined and its discriminative learning (Section II-B) and sparse learning (Section II-C) are described. In Section III the empirical study is carried out. We compare the RBM for classification with discriminative and sparse learning against well-known classifiers in five medical domains. At the end of the paper (Section IV) conclusions are drawn and directions of future research are indicated.

CLASSIFICATION RESTRICTED BOLTZMANN MACHINE

The Classification Restricted Boltzmann Machine (ClassRBM) [8, 9] is a three-layer undirected graphical model where the first layer consists of visible input variables x ∈ {0,1}^D, the second layer consists of hidden variables (units) h ∈ {0,1}^M and the third layer represents an observable output variable y ∈ {1,2,…,K}. We use the 1-of-K coding scheme, which results in representing the output as a binary vector of length K denoted by y, such that if the output (or class) is k, then all elements are zero except the element y_k, which takes the value 1. We allow only inter-layer connections, i.e., there are no connections within layers.

An RBM with M hidden units is a parametric model of the joint distribution of visible and hidden variables that takes the following form:

(1)

with parameters θ = {b, c, d, W1, W2}, where

(2)

is an energy function and (3) is its normalizing constant (partition function). It can be shown that the following expressions hold true for the ClassRBM [8, 9] (further in the paper we sometimes omit explicit conditioning on the parameters θ):

(4)
(5)
(6)
(7)
(8)

where sigm(⋅) is the logistic sigmoid function, W_i· is the i-th row of the weight matrix W, W_·j is the j-th column of the weight matrix W and W_ij is the (i,j)-th element of the weight matrix W.
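For reference, the standard ClassRBM formulation from [8, 9] can be sketched in the paper's notation θ = {b, c, d, W1, W2} as follows; equations (1)-(8) correspond to this family of expressions, and treating W1 (written W^1 below) as the M×D input-to-hidden matrix and W2 (written W^2) as the M×K class-to-hidden matrix is an assumption of this sketch:

\begin{align}
p(\mathbf{y},\mathbf{x},\mathbf{h}\mid\theta) &= \frac{1}{Z(\theta)}\,\exp\{-E(\mathbf{y},\mathbf{x},\mathbf{h})\}, \\
E(\mathbf{y},\mathbf{x},\mathbf{h}) &= -\mathbf{h}^{\top}W^{1}\mathbf{x}-\mathbf{h}^{\top}W^{2}\mathbf{y}-\mathbf{b}^{\top}\mathbf{x}-\mathbf{c}^{\top}\mathbf{h}-\mathbf{d}^{\top}\mathbf{y}, \\
Z(\theta) &= \sum_{\mathbf{y},\mathbf{x},\mathbf{h}}\exp\{-E(\mathbf{y},\mathbf{x},\mathbf{h})\}, \\
p(h_{j}=1\mid\mathbf{x},\mathbf{y}) &= \mathrm{sigm}\big(c_{j}+W^{1}_{j\cdot}\mathbf{x}+W^{2}_{j\cdot}\mathbf{y}\big), \\
p(x_{i}=1\mid\mathbf{h}) &= \mathrm{sigm}\big(b_{i}+\mathbf{h}^{\top}W^{1}_{\cdot i}\big), \\
p(y_{k}=1\mid\mathbf{h}) &= \frac{\exp\big(d_{k}+\mathbf{h}^{\top}W^{2}_{\cdot k}\big)}{\sum_{k^{*}}\exp\big(d_{k^{*}}+\mathbf{h}^{\top}W^{2}_{\cdot k^{*}}\big)}.
\end{align}

The conditionals factorize over units, i.e., p(h|x,y) = Π_j p(h_j|x,y) and p(x|h) = Π_i p(x_i|h), which is what makes inference in the bipartite structure easy.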
Prediction: For given parameters θ it is possible to compute the distribution p(y|x,θ) exactly, which can then be used to choose the most probable class label. This conditional distribution takes the following form [8, 9]:

(9)

Pre-computing the relevant terms allows the time needed for computing the conditional distribution to be reduced to O(MD + MK) [8, 9].

Learning: We assume that N data points D = {(x_n, y_n)} are given, where the n-th example consists of observed inputs x_n and a target class y_n. Typically, learning of a probabilistic model is based on the likelihood function [2]. However, in order to train the ClassRBM we may consider two approaches. The first one, called the generative approach, aims at maximizing the likelihood function for the joint distribution p(y, x|θ). The second one, which we refer to as the discriminative approach, considers the likelihood function for the conditional distribution p(y|x,θ). The problem with the generative approach is that it is impossible to calculate the exact gradient of the likelihood function for the joint distribution (only an approximation can be applied, e.g., Contrastive Divergence [6]). On the other hand, the latter approach allows the exact gradient to be computed [9]. Additionally, we are interested in obtaining high predictive accuracy, thus it is more advantageous to learn the ClassRBM in a discriminative manner. Therefore, to train the ClassRBM we consider minimization of the negative log-likelihood in the following form:

L(θ) = − Σ_{n=1}^{N} log p(y_n | x_n, θ)     (10)

As stated before, since the distribution p(y|x,θ) can be calculated exactly, the gradient of (10) can be computed exactly too, which yields (the function 1_{a=b} is an indicator function which returns 1 if a and b are equal and 0 otherwise):

(11)
(12)
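Both the prediction rule (9) and the exact conditional gradient rely on evaluating p(y|x,θ) efficiently. The following is a minimal NumPy sketch of this computation in O(MD + MK), assuming W1 is the M×D input-to-hidden matrix and W2 the M×K class-to-hidden matrix; the function and variable names are illustrative, not taken from the paper:

import numpy as np

def softplus(a):
    # Numerically stable log(1 + exp(a)).
    return np.logaddexp(0.0, a)

def classrbm_predict_proba(x, c, d, W1, W2):
    """Sketch of exact p(y | x, theta) for a ClassRBM (cf. eq. (9) and [8, 9]).

    x  : (D,) binary input vector
    c  : (M,) hidden biases, d : (K,) class biases
    W1 : (M, D) input-to-hidden weights, W2 : (M, K) class-to-hidden weights.
    The visible bias b cancels out in the conditional and is therefore omitted.
    """
    base = c + W1 @ x                      # c_j + W1_{j.} x, computed once: O(MD)
    act = base[:, None] + W2               # add the class column W2_{.y}: O(MK)
    log_p = d + softplus(act).sum(axis=0)  # unnormalised log p(y | x)
    log_p -= log_p.max()                   # stabilise before exponentiating
    p = np.exp(log_p)
    return p / p.sum()                     # normalise over the K classes

The most probable class label is then simply the argmax of the returned vector.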
Sparse learning: Sparse representations have been shown to be beneficial in practical applications. From the information-theoretic point of view, sparse representations obtain better generalization performance than non-sparse ones because the training examples should be encoded with as few bits as possible. On the other hand, learning sparse features helps to improve the discriminative capabilities of the features, i.e., sparse features become more class-specific. There are different approaches to introducing sparse representations (see Section I for a short review). In this paper, we follow the approach that adds a regularization term to the objective function which enforces a sparse solution. Our new objective takes the following form:

L_λ(θ) = L(θ) + λ Ω(θ)     (13)

where λ > 0 is a regularization coefficient and Ω(θ) is a regularization term. The most popular regularization term is the L2 norm of the model's parameters, known as weight decay, i.e., Ω(θ) = ||θ||². However, we will try to sparsify the expected activations of the hidden units by forcing them to be kept at a fixed small level µ. In order to obtain the desired effect, the regularization term can be thought of as a measure that compares the difference between two probability distributions. The larger the difference between the expected hidden unit activations and µ, the stronger the regularization. Hence, Ω can be, for example, the Kullback-Leibler divergence (KLdiv), the Bhattacharyya distance (Bdist) or the Mahalanobis distance (Mdist) (these three measures can be calculated analytically for normally distributed random variables, which is not true for other measures, e.g., the Kolmogorov distance [20]).

Let us assume that the hidden unit activity is normally distributed with mean equal to p(h_j|x,y) and unit variance, i.e., N(p(h_j|x,y), 1), and that the desired activation level is N(µ, 1) (for KLdiv, Bdist and Mdist the value of the variance turns out to be a scaling factor which can be included in the regularization coefficient; therefore, for simplicity, we choose unit variance). Then, it turns out that application of KLdiv, Bdist or Mdist to our problem yields the following regularization term:

Ω(θ) = Σ_{n=1}^{N} Σ_{j=1}^{M} (µ − p(h_j|x_n, y_n))²     (14)

It is interesting that we have obtained exactly the same regularization term as in [10], i.e., the squared difference between µ and p(h_j|x_n, y_n). However, our derivation of the regularization term has a formal justification, while the result in [10] follows from considerations in neuroscience and is given rather ad hoc as the L2 norm of the difference between the expected hidden activations and µ. Moreover, our proposition can be further developed by weakening the assumption about unit variance. However, we leave these investigations for further research.

The objective function for learning is the sum of the negative log-likelihood and the regularization term. Therefore, we can apply the stochastic gradient descent algorithm to the new objective (13). For the negative log-likelihood the gradient is given in (11) and (12), and for the regularization term given in (14) we get (for simplicity we denote p(h_j|x_n, y_n) by p_jn):

(15)

Further, we will refer to sparse learning of the ClassRBM as sparseClassRBM.
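To make the sparse update concrete, the sketch below differentiates the regularization term (14) for a mini-batch, using the chain rule through the standard ClassRBM conditional p_jn = sigm(c_j + W1_{j·} x_n + W2_{j·} y_n); the variable names and array shapes are assumptions of this illustration and the code is not claimed to reproduce (15) literally:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sparsity_gradients(X, Y, c, W1, W2, mu):
    """Gradient of Omega(theta) = sum_n sum_j (mu - p_jn)^2 (cf. eq. (14)).

    X : (N, D) binary inputs, Y : (N, K) one-hot targets,
    c : (M,) hidden biases, W1 : (M, D), W2 : (M, K), mu : target activation level.
    Returns the gradients with respect to c, W1 and W2.
    """
    A = X @ W1.T + Y @ W2.T + c           # pre-activations a_jn, shape (N, M)
    P = sigmoid(A)                        # p_jn = p(h_j = 1 | x_n, y_n)
    # dOmega/da_jn = -2 (mu - p_jn) * p_jn * (1 - p_jn) by the chain rule.
    dA = -2.0 * (mu - P) * P * (1.0 - P)  # shape (N, M)
    grad_c = dA.sum(axis=0)               # (M,)
    grad_W1 = dA.T @ X                    # (M, D)
    grad_W2 = dA.T @ Y                    # (M, K)
    return grad_c, grad_W1, grad_W2

In an SGD step these gradients are scaled by λ and added to the gradients of the negative log-likelihood from (11) and (12).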
EXPERIMENT

Setup: During learning of ClassRBM and sparseClassRBM for the different datasets we kept all parameters fixed. The learning rate was set to 0.001 and the number of hidden units was 100. Additionally, we used the Nesterov's Accelerated Gradient technique (with parameter value 0.5), which is a special kind of momentum term [16], and a mini-batch of size 100.

ClassRBM and sparseClassRBM were compared with the following classifiers:

• CART,
• AdaBoost,
• LogitBoost,
• Tree Bagging,
• Random Forest,
• SVM,
• Neural Network (three hidden layers).

Datasets: We evaluate ClassRBM and sparseClassRBM on the following medical datasets:

• Heart (joined Cleveland and Hungarian datasets) [5]: diagnostic problem of heart disease (less than 50% diameter narrowing in any major vessel or not),
• Diabetes [5]: classification of the patient as tested positive for diabetes or not,
• Indian liver [5]: classification of the patient as healthy or with a liver issue,
• Sick [5]: diagnostic problem of thyroid disease,
• Oncology [17]: prediction of whether there will be a recurrence of breast cancer for the patient or not.

The datasets are summarized in Table 1. The number of features and the number of examples for each dataset are given. Additionally, we provide the imbalance ratio, defined as the number of majority class examples divided by the number of minority class examples.

Table 1: The number of features, the number of examples and the imbalance ratio for the medical datasets used in the experiment

Dataset        Number of inputs   Number of examples   Imbalance ratio
Heart          46                 597                  1.45
Diabetes       20                 768                  1.87
Indian liver   48                 583                  2.49
Sick           57                 3772                 15.33
Oncology       55                 949                  1.60

RESULTS AND DISCUSSION

The results of the experiment are presented in Fig. 1-5 as boxplots. In Table 2 the best three methods for each dataset are presented. It can be noticed that ClassRBM together with sparseClassRBM were the most stable methods: in four cases out of five they were among the best three methods (in the fifth case they took fourth place). The next best models were AdaBoost (three times in the top three) and Random Forest and LogitBoost (two times each). Therefore, we claim that ClassRBM and its sparse version are strong and robust classifiers. However, they are prone to highly imbalanced data (see the poor performance on the sick dataset, Fig. 4).

Table 2: Ranking of classifiers according to Kappa for the five considered datasets

Dataset        1st place        2nd place        3rd place
Heart          sparseClassRBM   LogitBoost       ClassRBM
Diabetes       ClassRBM         sparseClassRBM   AdaBoost
Indian liver   Random Forest    sparseClassRBM   AdaBoost
Sick           CART             Random Forest    ClassRBM
Oncology       LogitBoost       AdaBoost         ClassRBM and sparseClassRBM

Fig. 1: Boxplot of Kappa values for the heart dataset
Fig. 2: Boxplot of Kappa values for the diabetes dataset
Fig. 3: Boxplot of Kappa values for the Indian liver dataset
Fig. 4: Boxplot of Kappa values for the sick dataset
Fig. 5: Boxplot of Kappa values for the oncology dataset

Comparing ClassRBM to its sparse version on the considered datasets is rather inconclusive. It is difficult to state definitively whether the application of the sparse regularization term gives better results. However, on heart, diabetes and Indian liver, sparsification performs slightly better than ClassRBM, comparably on oncology, but worse on sick. Nonetheless, we believe that for more hidden units the results would be more conclusive and the regularization term would prevent overfitting and maybe even improve predictive performance. We leave this aspect for further research.
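All results above are reported in terms of Cohen's Kappa [4], i.e., classifier-versus-truth agreement corrected for chance. As a reference point, a minimal sketch of the statistic is given below; it is written independently of the experimental code used in the study:

import numpy as np

def cohens_kappa(y_true, y_pred, num_classes):
    """Cohen's Kappa [4]: kappa = (p_o - p_e) / (1 - p_e),
    where p_o is the observed agreement and p_e is the agreement
    expected from the marginal label frequencies."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Observed agreement: fraction of identical labels.
    p_o = np.mean(y_true == y_pred)
    # Expected agreement from the marginals of both label distributions.
    p_e = 0.0
    for k in range(num_classes):
        p_e += np.mean(y_true == k) * np.mean(y_pred == k)
    return (p_o - p_e) / (1.0 - p_e)

For example, cohens_kappa([0, 1, 1, 0], [0, 1, 0, 0], 2) gives (0.75 − 0.5) / (1 − 0.5) = 0.5.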
CONCLUSION

In this paper, we have outlined a deep model called the Classification Restricted Boltzmann Machine in application to five medical domains. We follow the reasoning given in [8], which says that the ClassRBM can be used as a stand-alone non-linear classifier. Moreover, we claim that this model is very stable and should be used as a state-of-the-art classifier in any domain, and especially in the medical domain, which demands stable solutions. In the experiments, on five different medical problems, we have shown that both discriminative and sparse learning of the ClassRBM give very promising results and outperform well-known strong classifiers such as AdaBoost, LogitBoost, Tree Bagging and Random Forest.

In our study two important issues have arisen which indicate possible future research. First, the ClassRBM fails when data are highly imbalanced and thus there is a need to propose some remedy for that issue. Second, in our proposition of sparse learning we have assumed unit variance. However, such an approach is very simplistic and weakening this assumption could give very interesting solutions.

ACKNOWLEDGMENT

The research conducted by Jakub M. Tomczak has been partially co-financed by the Ministry of Science and Higher Education, Republic of Poland (grant No. B30098I32).

REFERENCES

1. Bengio, Y., 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2 (1): 1-127.
2. Bishop, C.M., 2006. Pattern recognition and machine learning. New York: Springer.
3. Bottou, L., 2012. Stochastic gradient descent tricks. In: Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg, pp: 421-436.
4. Cohen, J., 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20 (1): 37-46.
5. Frank, A. and A. Asuncion, 2010. UCI machine learning repository.
6. Hinton, G.E., 2002. Training products of experts by minimizing contrastive divergence. Neural Computation, 14 (8): 1771-1800.
7. Kononenko, I., 2001. Machine learning for medical diagnosis: History, state of the art and perspective. Artificial Intelligence in Medicine, 23 (1): 89-109.
8. Larochelle, H. and Y. Bengio, 2008. Classification using discriminative restricted Boltzmann machines. ICML.
9. Larochelle, H., M. Mandel, R. Pascanu and Y. Bengio, 2012. Learning algorithms for the classification restricted Boltzmann machine. The Journal of Machine Learning Research, 13: 643-669.
10. Lee, H., C. Ekanadham and A. Ng, 2007. Sparse deep belief net model for visual area V2. In: Advances in Neural Information Processing Systems, pp: 873-880.
11. Le Roux, N. and Y. Bengio, 2008. Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20 (6): 1631-1649.
12. Le Roux, N., H. Larochelle and Y. Bengio, 2008. Discriminative training of RBMs using Bhattacharyya distance. Learning Workshop, Cliff Lodge, Snowbird, Utah.
13. Lewicki, M.S. and T.J. Sejnowski, 1998. Learning nonlinear overcomplete representations for efficient coding. In: Advances in Neural Information Processing Systems, pp: 556-562.
14. Mohamed, S., Z. Ghahramani and K.A. Heller, 2012. Bayesian and L1 approaches for sparse unsupervised learning. In: Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp: 751-758.
15. Nair, V. and G.E. Hinton, 2009. 3D object recognition with deep belief nets. In: Advances in Neural Information Processing Systems, pp: 1339-1347.
16. Sutskever, I., J. Martens, G. Dahl and G. Hinton, 2013. On the importance of initialization and momentum in deep learning. ICML.
17. Štrumbelj, E., Z. Bosnić, I. Kononenko, B. Zakotnik and C. Grašič Kuhar, 2010. Explanation and reliability of prediction models: The case of breast cancer recurrence. Knowledge and Information Systems, 24: 305-324.
18. Tomczak, J.M., 2013. Prediction of breast cancer recurrence using Classification Restricted Boltzmann Machine with Dropping. arXiv preprint arXiv:1308.6324.
19. Tomczak, J.M. and A. Gonczarek, 2013. Decision rules extraction from data stream in the presence of changing context for diabetes treatment. Knowledge and Information Systems, 34 (3): 521-546.
20. Zhou, S.K. and R. Chellappa, 2006. From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28 (6): 917-929.
21. Zięba, M., J.M. Tomczak, M. Lubicz and J. Świątek, 2014. Boosted SVM for extracting rules from imbalanced data in application to prediction of the post-operative life expectancy in the lung cancer patients. Applied Soft Computing, 14 (A): 99-108.