Parameterizing neural networks for disease classification


Received: 17 September 2018 Revised: 22 March 2019 Accepted: 13 July 2019

DOI: 10.1111/ exsy.12465

S P E C I A L I S S U E P A P E R

Parameterizing neural networks for disease classification

Guryash Bahra1 Lena Wiese2

1 Institute of Computer Science, University of

Göttingen, Göttingen, Germany
2 L3S Research Center/ Knowledge Based

Systems Group, Leibniz University Hannover,

Hannover, Germany

Correspondence

Lena Wiese, L3S Research Center/ Knowledge

Based Systems Group, Leibniz University

Hannover, Appelstraße 4, Hannover 30167,

Germany.

Email: wiese@l3s.de

Abstract

Neural networks are one option to implement decision support systems for health care

applications. In this paper, we identify optimal settings of neural networks for medical diagnoses:

The study involves the application of supervised machine learning using an artificial neural

network to distinguish between gout and leukaemia patients. With the objective to improve

the base accuracy (calculated from the initial set-up of the neural network model), several

enhancements are analysed, such as the use of hyperbolic tangent activation function instead

of the sigmoid function, the use of two hidden layers instead of one, and transforming the

measurements with linear regression to obtain a smoothened data set. Another setting we study

is the impact on the accuracy when using a data set of reduced size but with higher data quality.

We also discuss the tradeoff between accuracy and runtime efficiency.

K E Y W O R D S

artifical neural network, disease classification, MIMIC-III, supervised machine learning

1 I N T R O D U C T I O N

Future precision medicine can vastly profit from data analytics and machine learning to obtain a patient-specific personalized diagnosis based on

various individual health indicators. Based on a body of prior experience documented in electronic medical records of other patients, machine

learning techniques can form the underpinning of medical decision support systems, ideally relying on an integration of both the capabilities of

storage and data analytics as for example in Tashkandi, Wiese, and Wiese (2018). An artificial neural network (ANN) is an established method

to perform classification (one of the supervised learning techniques) on clinical data. In medical data analysis, machine learning is a common

procedure used for classification of patients suffering from different diseases. In this paper, an ANN is used to predict and classify the diagnoses

of either gout or leukaemia based on uric acid measurements. However, we address the topic from a more technical perspective to find out

which enhanced settings (in terms of data preprocessing and neural network configuration) provide the most benefit.

1.1 Medical background

We concentrate on the differentiation of two diseases that both can be identified by measuring the concentration of uric acid. In particular, one

of the common factors in leukaemia and gout diseases is the abnormal uric acid signature in blood. Uric acid concentration in a healthy human

in developed countries ranges from 3.5 mg/ dl in infants to about 6 mg/ dl in adults (Alvarez-Lario & Macarron-Vicente, 2011; Lasko et al., 2013;

Wilcox, 1996). In patients suffering from gout or leukaemia, the uric acid concentration increases more than the normal range and is therefore

regularly monitored and treated. In case of gout, a combination of genetic mutations and environmental factors causes uric acid concentration

to increase. This further results in formation of uric acid crystals, which precipitates into joints and causes painful arthritis (Lasko et al., 2013).

As for leukaemia, turnover of white blood cells increases the uric acid concentration. The two diseases have different pathophysiology, which in

turn combined with their treatment procedures, resulting in different signatures of uric acid concentrations. Therefore, in this article, supervised

learning is performed on uric acid measurements to classify the two diseases leukaemia and gout. In this way, we can derive an automated

recommendation for medical staff regarding the likeliness of one disease or the other.

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any

medium, provided the original work is properly cited and is not used for commercial purposes.

© 2019 The Authors. Expert Systems published by John Wiley & Sons Ltd

Expert Systems. 2019;e12465. wileyonlinelibrary.com/ journal/ exsy 1 of 14
https:/ / doi.org/ 10.1111/ exsy.12465

https://doi.org/10.1111/exsy.12465
https://orcid.org/0000-0003-3515-9209


2 of 14 BAHRA AND WIESE

1.2 Machine learning techniques

Machine learning is one of the top growing fields of recent times and is being utilized in various areas for many applications, such as financial

trading, health care, recommendations, online search, and many more. Machine learning involves building of models from the given data set, which

can be utilized to make future predictions. This process is executed in two phases: (a) From the input data set, calculate unknown dependencies,

and (b) from those dependencies, predict outcomes on so far unseen data sets.

The two common types of machine learning are supervised and unsupervised learning. Supervised learning involves use of a labelled set of input

data to predict the output. In contrast, unsupervised learning involves unlabelled data. Because of this, there is no designated output such that the

learning model has to identify patterns in the input data. The work in this article addresses the supervised learning problem of classification, which

involves categorizing the data into finite classes. More precisely, uric acid concentrations are classified into the two classes: leukaemia and gout.

1.3 Objectives

The main objective of this article is to perform supervised machine learning with neural networks and assess the accuracy of the system to

distinguish between the patients with either gout or leukaemia. We verify and extend our previous preliminary analysis in Bahra and Wiese

(2018) by considering several advanced settings of the neural network; we also analyse the tradeoffs between these different settings of the

neural network model. To achieve this objective, several tasks have to be performed, and these are summarized as follows:

1. Identify files to be used from the Medical Information Mart for Intensive Care (MIMIC) data set provided by Goldberger et al. (2000) and

Johnson et al. (2016).

2. Identify data (patients with either gout or leukaemia and their uric acid signatures) required for the study.

3. Perform preprocessing on the identified data (uric acid concentrations) with linear regression.

4. Perform supervised learning with threefolds cross validation on original and linear regression transformed uric acid measurements in different

settings.

5. Improve the data quality by reducing the data set to patients with at least three uric acid measurements and observing the impact on the

accuracy.

6. Assess the impact on the runtime efficiency of these enhancements.

In more detail, our study starts with identifying the tables from the MIMIC-III database. From those tables, patients with gout and leukaemia

diseases and their corresponding uric acid measurements are identified. The data are then cleansed and are further used for supervised learning.

The neural network for the supervised learning is designed using one hidden layer and sigmoid activation function. Accuracy is then calculated

to measure the effectiveness of the model. Furthermore, to improve the initial accuracy of the model, a couple of enhancements are tried. These

enhancements include the use of the hyperbolic tangent (tanh) activation function, two hidden layers, and linear regression for transformation of

the original uric acid measurements. Lastly, the size of the data set is reduced in order to avoid a bias that results from singular uric acid values in

the measurement sequence of a patient. It is achieved by removing the sequences with less than three measurements.

1.4 Outline of the article

After discussing the background and objectives of the article in Section 1, Section 2 surveys related work in the area. Section 3 covers the detailed

description of the MIMIC database and includes tables identification and data set selection and creation. Section 4 covers the supervised learning

performed on our use case, introducing the basics of neural networks along with their implementation. Results and observations of supervised

learning are then discussed in Section 5; this section also covers various enhanced techniques that include the use of normalization, hyperbolic

tangent function as activation function, two hidden layers instead of one, and linear regression for data preprocessing or transformation. Results

with a reduced data set size are presented in Section 6. The runtime behaviour of our settings is discussed in Section 7. Section 8 summarizes

our work.

2 R E L A T E D W O R K

Several studies in health care are based on machine learning. In this section, we survey some applications of machine learning in health care in

the last two decades.

Lee, Liao, and Embrechts (2000) analysed heart disease databases with techniques like data visualization and correlation analysis to identify

important features in heart disease and to build a multivariate relationship model to visualize the relationship between any two features. They

describe two different approaches, discriminant analysis, and neural networks classification to identify high-risk patients.

García-Gómez et al. (2004) study classification of soft tissue tumors into either benign or malignant classes based on a pattern-recognition

approach. They use MR images as input data and perform machine learning with several approaches—ANN, support vector machine (SVM), and

k-nearest neighbour—for the classification task. They conclude that neural networks give relatively good results compared with the other two

techniques.


BAHRA AND WIESE 3 of 14

Joshi, Pakhomov, Pedersen, and Chute (2006) perform machine learning on electronic medical records to unambiguously expand acronyms

in clinical reports. The learning is carried out with three different algorithms, Naïve Bayes, decision tree, and SVM. They observe that accuracy

of the model is better with all the features combined as compared with when features are used individually and in pairs, independent of the

algorithm used.

Huang, McCullagh, Black, and Harper (2007) perform supervised learning on diabetes data to classify patients into diabetic and nondiabetic

classes. The study also involves identification of features that give a good accuracy for the model. They used several learning algorithms—Naïve

Bayes, IB1 (an instance-based learning algorithm), and C4.5 (a decision tree algorithm)—to achieve the objective.

Nguyen, Moore, McCowan, and Courage (2007) use SVM to classify lung cancer patients into multiple classes of T (the tumor stage that

describes size and position of the tumor) and N (the node stage that describes presence of spread into the lymph nodes) stages of lung cancer.

Juhola (2008) classifies otoneurological data into six classes: vestibular schwannoma, benign positional vertigo, Menière's disease, sudden

deafness, traumatic vertigo, and vestibular neuritis. Seven techniques—k-nearest neighbour searching, discriminant analysis, Naïve Bayesian

decision rule, k-means clustering, decision trees, multilayer perceptron neural networks, and Kohonen networks—are employed for this task. It is

observed that linear discriminant analysis performed better than the other six techniques.

Maes, Twisk, and Johnson (2012) apply supervised learning methods, such as linear discriminant analysis, pattern recognition, and factor

analysis, to differentiate between myalgic encephalomyelitis, chronic fatigue syndrome, and chronic fatigue.

Lasko, Denny, and Levy (2013) use an unsupervised learning method to identify features in gout and leukaemia diseases with the use of the uric

acid signatures. The work described in their paper is implemented on data extracted from electronic medical records Roden et al. (2008). In their

paper, it is mentioned that the data are noisy, sparse, and irregular; therefore, it is smoothened with the use of Gaussian process regression. On

this data, deep learning is performed with the use of sparse autoencoders. Furthermore, the features learned from first and second layers of sparse

autoencoders are then utilized for a supervised learning classification task (to classify between gout and leukaemia) using logistic regression.

Shouval et al. (2014) survey different supervised learning algorithms, namely, ANN, decision tree, and SVM, on a haematopoietic stem cell

transplantation database for the classification task.

Kourou, Exarchos, Exarchos, Karamouzis, and Fotiadis (2015) review supervised learning methods, namely, ANN, Bayesian Networks, SVM,

and decision trees for prognosis of cancer and its prediction. Their paper even highlights the case studies used for machine learning tools to

predict cancer susceptibility, cancer recurrence, and cancer survival.

Beaulieu-Jones and Greene (2016) develop a semisupervised learning method and use it to improve survival predictions of patients suffering

from ALS. Semisupervised learning results from a combination of supervised and unsupervised learning. The method for semisupervised learning

is developed using denoising autoencoders (an unsupervised learning approach) for phenotype stratification, in combination with random forests

(as supervised learning) for classification. In their paper, unsupervised learning is performed to reduce the amount of features of the input data

set, which is then followed by performing supervised learning on this reduced data set.

Weng, Reps, Kai,Garibaldi, and Qureshi (2017) use supervised learning for cardiovascular risk prediction. To achieve this, the following

algorithms are used: random forest, logistic regression, gradient boosting machines, and neural networks. It is concluded in their paper that neural

networks perform better than the rest.

It can be seen from this survey that classification of diseases is a popular use case and ANN is a widely used technique for this classification

task. In contrast to the surveyed approaches, in this paper, we apply supervised learning on a data set different from the ones used before. In

addition, we study the effect of several enhancements (in particular of linear regression for data smoothening). Moreover, we assess the runtime

behaviour of the different settings.

3 D A T A S E T

The MIMIC-III is the third iteration of a large clinical database called MIMIC as reported by Goldberger et al. (2000) and Johnson et al. (2016). It

comprises the medical data of patients admitted to critical care units, Coronary Care Unit, Cardiac Surgery Recovery Unit, Medical Intensive Care

Unit (ICU), Neonatal ICU, Surgical ICU, and Trauma Surgical ICU, at the hospital Beth Israel Deaconess Medical Center in Boston. According to

Goldberger et al. (2000), the current version of the database is 1.4 as of September 4, 2016, and consists of health-related records of deidentified

46,520 subjects out of which 38,645 are adults, and 7,875 are neonates. The patients are deidentified according to Health Insurance Portability

and Accountability Act (HIPAA) standards that involved removal of 18 fields, such as patients' name, telephone numbers, addresses, and so on, as

listed in HIPAA. It also involved shifting of the dates, including date of birth, by a random offset, as directed by the HIPAA. The database not only

includes the information of the vital sign measurements, medicines administered, laboratory measurements, fluid balance, imaging reports, and

out-of-hospital mortality but also includes patients' demographics, nurses' and physicians' notes, procedure and diagnostic codes, and more. Note

that the MIMIC database was prepared by compiling data from two data sources, CareVue and Metavision ICU databases, used at the hospital.

3.1 Data set identification

There are 26 comma-separated values files in the MIMIC-III data set. Out of those 26 files, the following four files are considered for our case

study:


4 of 14 BAHRA AND WIESE

TA
B

L
E

1
S

e
le

ct
io

n
o

f
h

o
sp

it
a

la
d

m
is

si
o

n
s

fo
r

a
n

a
ly

si
s

A
d

m
is

si
o

n
s

N
u

m
b

e
r

o
f

u
n

iq
u

e
a

d
m

is
si

o
n

s
N

u
m

b
e

r
o

f
a

d
m

is
si

o
n

s
le

u
k

a
e

m
ia

N
u

m
b

e
r

o
f

a
d

m
is

si
o

n
s

g
o

u
t

In
it

ia
la

d
m

is
si

o
n

s
2

,8
3

7
6

1
8

2
,2

1
9

A
d

m
is

si
o

n
s

a
ft

e
r

e
lim

in
a

ti
n

g
p

a
ti

e
n

ts
w

it
h

b
o

th
d

ia
g

n
o

se
s

2
,7

7
3

5
8

4
2

,1
8

9

A
d

m
is

si
o

n
s

a
ft

e
r

e
lim

in
a

ti
n

g
p

a
ti

e
n

ts
w

it
h

n
o

u
ri

c
a

ci
d

m
e

a
su

re
m

e
n

ts
1

,0
7

6
3

9
8

6
7

8

F
in

a
la

d
m

is
si

o
n

s
a

ft
e

r
e

lim
in

a
ti

n
g

a
d

m
is

si
o

n
s

w
it

h
n

o
u

ri
c

a
ci

d
m

e
a

su
re

m
e

n
ts

6
4

0
3

0
6

3
3

4


BAHRA AND WIESE 5 of 14

TA
B

L
E

2
S

e
le

ct
io

n
o

f
p

a
ti

e
n

ts
fo

r
a

n
a

ly
si

s

P
a

ti
e

n
ts

N
u

m
b

e
r

o
f

u
n

iq
u

e
p

a
ti

e
n

ts
N

u
m

b
e

r
o

f
p

a
ti

e
n

ts
le

u
k

a
e

m
ia

N
u

m
b

e
r

o
f

p
a

ti
e

n
ts

g
o

u
t

In
it

ia
lo

b
se

rv
a

ti
o

n
s

2
,2

5
9

4
5

4
1

,8
0

5

O
b

se
rv

a
ti

o
n

s
a

ft
e

r
e

lim
in

a
ti

n
g

p
a

ti
e

n
ts

w
it

h
b

o
th

d
ia

g
n

o
se

s
2

,2
1

5
4

3
2

1
,7

8
3

O
b

se
rv

a
ti

o
n

s
a

ft
e

r
e

lim
in

a
ti

n
g

p
a

ti
e

n
ts

w
it

h
n

o
u

ri
c

a
ci

d
m

e
a

su
re

m
e

n
ts

7
6

1
2

7
3

4
8

8

F
in

a
lp

a
ti

e
n

ts
a

ft
e

r
e

lim
in

a
ti

n
g

a
d

m
is

si
o

n
s

w
it

h
n

o
u

ri
c

a
ci

d
m

e
a

su
re

m
e

n
ts

5
6

7
2

5
6

3
1

1


6 of 14 BAHRA AND WIESE

TABLE 3 Selection of uric acid
measurements for analysis

Measurements Number of uric acid measurements

Initial observations 19,906

Observations after eliminating patients with different diagnosis 7,076

Final observations corresponding to final admission IDs 5,665

• D_ICD_DIAGNOSES gives the disease codes (ICD-9 codes) for gout and leukaemia diagnoses.

• DIAGNOSES_ICD identifies the hospital admissions and patients suffering from gout and leukaemia.

• D_LABITEMS gives the ID for uric acid signatures.

• LABEVENTS gives the data about uric acid measurements of the patients identified with gout and leukaemia.

There are 78 ICD-9 codes for leukaemia and 11 for gout identified from the D_ICD_DIAGNOSES table. Then, from the DIAGNOSES_ICD

table, a total of 2,837 hospital admissions are identified with the above diagnoses, out of which 618 hospital admissions are for leukaemia and

2,219 are for gout. These many admissions correspond to 2,259 patients or unique SUBJECT_IDs; in R, the duplicated function with logical

negation operator (!) was used to find these unique SUBJECT_IDs. Among these subject IDs, 454 patients suffered from leukaemia, and 1,805
suffered from gout. Furthermore, 22 patients are common for both the diagnoses. After removing IDs of those patients, a total of 2,773 hospital

admissions were identified for the patients suffering either leukaemia (584 HADM_IDs) or gout (2,189 HADM_IDs) but not both. These 2,773

admissions correspond to 2,215 patients, out of which 1,783 patients suffered from gout and 432 from leukaemia. Moreover, the hospital

admissions are reduced from 2,773 to 1,076 by removing records with no uric acid measurements for those patients. The number of admissions

further decreased to 640 as there are no uric acid observations corresponding to those admissions. And finally, these 640 unique admissions

correspond to 567 unique patients, out of which 311 suffered from gout and the remaining 256 patients suffered from leukaemia. Tables 1 and

2 summarize the above information.

As for the uric acid signatures, 3 IDs were identified from the D_LABITEMS table. Corresponding to those IDs, 19,906 observations representing

uric acid measurements are identified from the LABEVENTS table. These 19,906 observations reduced to 7,076, as the removed observations did

not correspond to the identified SUBJECT_IDs. Furthermore, 7,076 observations reduced to 5,665, as those observations did not correspond to

the identified hospital admission IDs. The Table 3 summarizes the above information.

3.2 Data set creation

The data, that is, uric acid concentrations (from Table 3), are then arranged into 567 sequences, grouped according to the patient IDs (from

Table 2). These sequences are then broken down to a size of 17 values per row, where the first two values are the label (1 for leukaemia and 0 for

gout) and patient ID, and the remaining 15 values are the uric acid concentrations. Note that the sizes of measurement sequences are unequal.

In order to make use of all measurements in the data set, patients with more than 15 measurements occur multiple times: each sequence of 15

consecutive measurements is treated as a new sequence. On the other hand, for sequences with less than 15 measurements, value 0 is used for

the remaining part of the sequence. This resulted in a total of 813 sequences. The sequences are then shuffled with the use of sample function

provided by R Team (2014).

4 M E T H O D

4.1 Neural networks

Neural networks are one class of several supervised learning classification techniques (as introduced in Section 1.2) and are based on the

concept of perceptrons, which was originally introduced by Rosenblatt (1958) and is now widely discussed in standard textbooks and surveys

like Kotsiantis (2007) and Nielsen (2015). In this work, a three-layered neural network is used, where the first layer, L1 , is the input layer, the

second layer, L2 , is the hidden layer, and the last layer, L3 , is the output layer. Note that the output of one layer is the input of the next layer, and

there are sl number of nodes in layer l. Figure 1 illustrates these settings.

4.1.1 Defining the layer sizes

To begin, the number of nodes sl in all the layers are defined. According to our data set (from Section 3.2), the input size is 15, that is, the number of

nodes (s1 ) in Layer 1 (L1 ) is 15. This is because, from Section 3.2, out of 17 values per row, 15 values are the uric acid measurements. The number of

output nodes is 1, as the label (leukaemia or gout) per row is single valued (from Section 3.2). The number of nodes in the hidden layer s2 is varied

in the upcoming tests in order to assess the changes (positive, negative, or no change) in the result. We chose to test s2 = 5, s2 = 10, and s2 = 25.

4.1.2 Forward propagation

As the nodes are calculated starting from the Layer L1 up to Layer L3 in the network, this step is called forward propagation. The activation unit,

a(l)
i

, is used to define the output of the ith unit in layer l. Therefore, for the L1 layer, a
(1)
i

= xi consists of one measurement value from an input


BAHRA AND WIESE 7 of 14

FIGURE 1 Our neural network settings: weight matrix W and bias b are
recalibrated in the learning step; weight decay parameter 𝜆 and the amount of
nodes in Layer 2 s2 are varied in our tests

sequence; as for the Layers L2 and L3 , the nodes are computational, and therefore, activations are calculated as a function of an input vector a
(l)

(the activations of the previous layer), a weights matrix W and a bias vector b:

1. For Layer l = 2: a(2) = f
(

W(1) ∗ a(1)
)
+ b(1),

note that a(1) is an input data sequence and f is the sigmoid function (see below).

2. For Layer l = 3: a(3) = f(W(2) ∗ a(2)) + b(2), this is the final output of the neural network denoting the classification into either gout or leukaemia.
Here, W(l−1)

ij
is the weight or the parameter of the connection between the jth unit in Layer l − 1 and the ith unit in Layer l, bias unit bl

i

corresponds to the ith unit in Layer l.

4.1.3 Initialization of weights W and biases b and activation function

The weights in matrix W are to be initialized to a value close to 0 and therefore are randomly initialized from the interval [−0.5, 0.5] (as in Nguyen
& Widrow, 1990). The biases b are initialized to 0.

Function f ∶ R → R is the activation function and is usually defined with sigmoid function as shown by Equation (1); it produces an output
ranging over (0, 1).

f(z) = 1
1 + exp(−z)

. (1)

An important identity to note is that the derivative f′(z) of sigmoid function f(z) (in Equation 1) is given by Equation (2) and is used in the
learning phase.

f′(z) = f(z)(1 − f(z)). (2)

In order to adapt the bias vector b and the weight matrix W, we apply the squared-error cost function and minimize its value by backpropagation

in the neural network.

4.1.4 Cost function

For a single training example (x, y), where x is one input sequnence of 15 uric acid measurements and y is the class label for either gout or
leukaemia, the squared-error cost function is given in Equation (3).

J(W, b; x, y) = 1
2
||hW,b(x) − y||2, (3)

where, hW,b(x) = a
(3)
1

is the final activation of the neural network; the size of hW,b(x) is 1 × 1, as there is one output node and one training example.
More generally, for a given training set

{
(x(1), y(1)), .., (x(r), y(r)), .., (x(m), y(m))

}
of m training examples, the cost function is given as Equation (4),

J(W, b) =

[
1
m

m∑
r=1

J
(

W, b; x(r), y(r)
)]

+ 𝜆
2

nl−1∑
l=1

sl∑
i=1

sl+1∑
j=1

(
W(l)

ij

)2
, (4)

where the first term is the sum of squares error term averaged over all input sequnences and the second term is the regularization term (or the

weight decay term) that decreases the value of the weights and avoids overfitting. Parameter 𝜆 in Equation (4) is the weight decay parameter.

To minimize the cost function J(W, b) as a function of weights matrix W and bias vector b, every parameter W(l)
ij

and b(l)
i

is initialized to a random

value near to 0. Then, an optimization algorithm is applied for minimization. We chose the limited-memory Broyden-Fletcher-Goldfarb-Shanno

(L-BFGS) method introduced by Liu and Nocedal (1989) for the minimization of the cost function because it usually faster than a basic gradient

descent algorithm. The L-BFGS algorithm is employed with the use of function minFunc (see Schmidt, 2005).


8 of 14 BAHRA AND WIESE

4.1.5 Define 𝜆, and update parameters W and b

The weight decay parameter 𝜆 (in Equation 4) needs to be defined. In the upcoming tests, the value of weight decay parameter 𝜆 is varied in order

to assess the changes (positive, negative, or no change) in the result. We chose to compare three different settings for 𝜆, namely, 𝜆 = 0.0001,
𝜆 = 0.00001, and 𝜆 = 0.000001.

The weights matrix W and the bias vector b for Layers L1 and L2 , which minimize the cost function J(W, b), are recalibrated after every iteration
of the L-BFGS algorithm. In other words, W and b are equal to final values of the partial derivatives of the overall cost function, 𝜕

𝜕W(l)
J(W, b) and

𝜕

𝜕b(l)
J(W, b), respectively.

4.2 Accuracy calculation

To calculate the accuracy of the machine learning model on the training and the testing sets, forward propagation (as described in Section 4.1.2)

is performed such that for Layers l = 2, 3: a(l) = f(W(l−1) ∗ a(l−1)) + b(l−1). Note that the activation matrix of Layer l = 1, a(1) = data (either training or
testing), f is the sigmoid activation function, and W and b are calculated using the L-BFGS algorithm.

Then, the activation vector of the output layer, a(3), is used to assign labels to the data (either training set or testing set). If the value of

each element in a(3) is greater or equal to 0.5, then Label 1 (corresponding to leukaemia) is assigned to the element, else 0 (corresponding

to gout) is assigned. These assigned labels then form the prediction vector of size 542 × 1 in case of training set and of 271 × 1 in case of
testing set.

Each element in prediction vector is then compared with the corresponding actual label of the data. If the labels are the same, 1 is assigned

to a comparison vector; if labels are not the same, then 0 is assigned to the comparison vector. Furthermore, the average is calculated for the

comparison vector, which is of size 542 × 1 in case of training set and of 271 × 1 in case of testing set. The average multiplied with 100 gives the
accuracy of the model in percent.

4.3 K-fold cross validation

To create training and testing sets used for the learning, the k-fold cross-validation method is employed. In k-fold cross validation, the data are

randomly divided into k subsets of equal size, and a single subset is referred to as fold. Of the k folds, k − 1 folds are combined to form the training
set and the remaining fold is used as the testing set, and the accuracy is calculated for the training and the testing sets, which describes the

stability of the model. This is then repeated for k iterations, and for every iteration, the testing set comprises a fold, used exactly once. Moreover,

to use the model for new predictions and to estimate the overall accuracy of the model, consider the classifier for which the highest accuracy is

achieved for the testing set.

We applied threefold cross validation as well as 10-fold cross validation. As described in the previous paragraph, first, the data are divided

into equal subsets. Therefore, 813 sequences (from Section 3.2) are divided into three equal subsets as well as 10 subsets, respectively. Then,

supervised learning is performed on these subsets for three iterations—each iteration either executing threefold or 10-fold cross validation. For

each iteration, the testing set is formed with a single subset used exactly once and the remaining subsets are used as the training set. The

accuracy (calculated as in Section 4.2) is reported as the average of all iterations.

5 R E S U L T S O N C O M P L E T E D A T A S E T

This section describes the results of supervised learning. The results represent the neural network model's ability to distinguish between patients

suffering from either gout or leukaemia based on abnormal uric acid measurements. The accuracies are determined for different cases, resulting

from the change in the values of weight decay parameter 𝜆 and the number of hidden layer nodes s2 (as in Section 4.1.1).

The accuracy is computed three times (iteration I1-I3) per case; this is because weights in W are randomly initialized (see Section 4.1.3) and

therefore give slightly different values for each iteration. For final accuracy of the case, these accuracies are averaged out. Accuracy is measured

in percent.

5.1 Results of cross validation on original data set

The results in Figure 2 are of threefold cross validation performed on the original data set with one hidden layer. The following observations

can be seen from Figure 2. All average test accuracies are very similar. The highest average test accuracy is 83% in case of five neurons in the

hidden layer—that is, with the lowest number s2 of nodes in the hidden layer—and the lowest setting for the weight decay parameter 𝜆. However,

the lower weight decay parameter incurs a much more increased runtime. The results of 10-fold cross validation in Figure 3 present a similar

picture with the highest test accuracy of roughly 83% in case of 10 neurons in the hidden layer and the lowest setting for the weight decay

parameter 𝜆.

We tested different settings with two hidden layers. In the first case, the first hidden layer was fixed to Size 5, and only the size of the second

hidden layer is increased; in the second case, both the first and second hidden layer sizes are increased. In the first case in Figure 4, adding the

second layer reduced the accuracy for the largest weight decay parameter (𝜆 = 0.01). For the other two weight decay parameters, the accuracy


BAHRA AND WIESE 9 of 14

FIGURE 2 Accuracies (in percent) and runtime (in seconds) of supervised learning with threefold cross validation on original data set using
neural network with one hidden layer

FIGURE 3 Accuracies (in percent) and runtime (in seconds) of supervised learning with 10-fold cross validation on original data set using neural
network with one hidden layer

remained in the same range as with one hidden layer. The second case in Figure 5, with both layer sizes increased, also shows a decrease in

accuracy for the case of the largest weight decay parameter and low sizes of the first layer.

When trained with 10-fold cross validation, due to the larger training set size, the decrease in accuracy did not occur (full results are not

shown here due to space restrictions). In other words, the results of 10-fold cross validation with two hidden layers are comparable with

Figure 3.

5.2 Preprocessing of the data set using linear regression

The time-series data present in MIMIC-III are incomplete, inconsistent, sparse, and noisy. In order to regularize the data, we applied linear

regression to condition the data so as to handle irregularities in the data. We expected to smoothen the data by linear regression to obtain a

general trend of each of the uric acid signatures.

Linear regression tries to model the relationship between input and output variables by fitting a linear equation to the input (or observed) data.

To perform linear regression to smoothen the data, the lm1 function provided by R, is used. The lm function is called for a single patient ID at a

time. The result of linear regression for two different patient IDs (or sequences) can be seen in Figure 6.

1 Syntax: lm(x ∼ y). In this work, x is uric acid concentrations, and y corresponds to time measurements.


10 of 14 BAHRA AND WIESE

FIGURE 4 Accuracies (in percent) and runtime (in seconds) of supervised learning with threefold cross validation on original data set using
neural network with two hidden layers and the first layer size fixed to 5

FIGURE 5 Accuracies (in percent) and runtime (in seconds) of supervised learning with threefold cross validation on original data set using
neural network with two hidden layers and varying both layer sizes

FIGURE 6 Illustration of Linear regression transformation for two different sequences


BAHRA AND WIESE 11 of 14

FIGURE 7 Accuracies (in percent) and runtime (in seconds) of supervised learning with threefold cross validation on data set with linear
regression and using neural network with one hidden layer

FIGURE 8 Accuracies (in percent) and runtime (in seconds) of supervised learning with threefold cross validation on data set with linear
regression and using neural network with two hidden layers

The results presented in the Figure 7 are of threefold cross validation performed on on the data set transformed with linear regression before

supervised learning with one hidden layer. Interestingly, for the high weight decay parameter (𝜆 = 0.01), the linear regression of the data set
increased the accuracy to 83.3%). In the other cases, the linear regression did not show any effect on the accuracy. The same observations were

made with 10-fold cross validation.

With two hidden layers, the observations with linear regression resemble the observation without linear regression: For the largest weight

decay parameter, the accuracy for low layer sizes is decreased when adding a second layer (shown in Figure 8). When fixing the first layer

size to 5, the accuracy decreased even more in this case. In all other cases, that is, for different layer sizes as well as for all 10-fold cross

validation cases, the linear regression with two hidden layers did not show any effect on the accuracy (results not shown here due to space

restrictions).

6 R E D U C E D D A T A S E T S

In this section, sequences with less than three nonzero data points are removed in order to remove biases in the result due to too short uric

acid measurement sequences. In other words, threefold cross validation is carried out on the data of the patients, which have more than two

nonzero data points (uric acid measurements) per sequence. Therefore, the number of sequences reduced to 462 (out of 813 from Section 3.2).

Subsequently, the size of a single fold for cross validation was reduced accordingly, too. As in Section 5, the accuracies are determined for

different settings; the accuracy is computed three times per setting, and for the final accuracy, these accuracies are averaged out.


12 of 14 BAHRA AND WIESE

6.1 Cross validation on reduced original data set

The results presented here are of cross validation performed on the reduced original data set. The following observations can be seen from

Figure 9 for one hidden layer with threefold cross validation. The accuracy increased when using the higher quality data set to up to 88.4% in the

case of medium weight decay parameter (𝜆 = 0.001). A similar improvement was observed with 10-fold cross validation (not shown here due to
space restrictions). In the case of two hidden layers, these improvements also manifest for the case that both layer sizes are increased. Our main

observation is that accuracies are overall better for the reduced data set (as compared with the original one).

6.2 Cross validation on reduced transformed data set

We next tested the case of one hidden layer for the data set that is both reduced to retain only sequences of length at least three as well as

transformed by linear regression. The results are shown in Figure 10. The accuracy increased to up to 89.1% for the case of 𝜆 = 0.01 with hidden
layer size 20 as well as 25. This is the best setting that could be obtained; in all other cases (two hidden layers or 10-fold cross validation), this

improvement could not be achieved.

The overall observation is hence that the best result can be achieved with the reduced and transformed data set lowest setting for the weight

decay parameter 𝜆. Transformation with linear regression in this case pays off with improved accuracies (as compared with the nontransformed

case). In particular, again we can observe that accuracies are overall better for the reduced data set (as compared with the original one).

FIGURE 9 Accuracies (in percent) and runtime (in seconds) of supervised learning with threefold cross validation on reduced data set using
neural network with one hidden layer

FIGURE 10 Accuracies (in percent) and runtime (in seconds) of supervised learning with threefold cross validation on reduced data set with
linear regression and using neural network with one hidden layer


BAHRA AND WIESE 13 of 14

7 R U N T I M E C O M P A R I S O N

We implemented the steps to carry out the supervised learning presented in the previous sections in Octave (Eaton, 2019). Data preprocessing

was done in R. We measured the runtime of the neural network model learning phases for all settings for which we reported the accuracies in

the previous sections. Executions are run on a Ubuntu 16.04.2 LTS system with 8-GB RAM and 1 TB of hard disk. The right-hand sides of all the

above figures show the runtime in seconds.

In case of the original data set, we can observe that the higher amount s2 of nodes in the hidden layer(s) has a noticeable impact on the

runtime. In those cases where the higher amount of hidden layer nodes only gives a marginal improvement of accuracy (as in Section 5.2), the

lower amount of hidden layer nodes might be preferred in terms of runtime efficiency. Notably, the lowest value for the weight decay parameter

(𝜆 = 0.001) has a very strong impact on the runtime. With this 𝜆 value the backpropagation has difficulties finding the optimization quickly.
Overall, this 𝜆 value does not pay off neither in terms of accuracy nor in terms of efficiency.

In case of the reduced data set, we can observe that not only the accuracies are better but also the overall runtime decreased. Hence, having

a smaller set of data but with a higher overall data quality can lead to better classification results.

8 D I S C U S S I O N A N D C O N C L U S I O N

In our experiments, a neural network is designed to classify gout and leukaemia patients based on their uric acid measurements. Tests showed

that using more layers only improved the accuracy insignificantly. Yet it can be observed for our use case that the lower value for the weight decay

parameter leads to a runtime increase due to a more involved optimization steps and low values for this parameters should be avoided. In our

settings, using a high weight decay parameter and 20 hidden layers turned out to be the best setting for both accuracy and efficiency. Moreover,

learning on the reduced data set (with better data quality) performed better than on the complete data set both in terms of accuracy and

efficiency. Hence, regarding the tradeoff of having higher data quality with a reduced data set size versus having overall lower data quality with

a larger data set size, we observed that a reduced data set size provided the most benefit. It can also be observed that the using linear regression

transformed data did improve the accuracy of the system best when used in combination with the reduced data set. Overall, we can conclude

that for our use case, this additional preprocessing step only provides a benefit on the reduced data set in terms of accuracy but not in terms of

performance. To sum up, we conclude that several enhancements and settings of neural networks might not lead to optimal accuracy results.

Hence, it should be carefully assessed which settings provide optimal results (both in terms of accuracy and efficiency) for the use case at hand.

In future work, our study can be extended by more features in addition to the uric acid signatures in order to improve the accuracy results. An

extension to cover other diseases than gout and leukaemia can also be a worthwhile topic of future work. Moreover, an in-depth comparison

and combination with other related approaches (in particular, the feature learning approach in Lasko et al. (2013)) can be performed in order to

assess the overall reliability of disease classification as well as quantify their runtime impact.

F U N D I N G I N F O R M A T I O N

None reported.

C O N F L I C T O F I N T E R E S T

The authors declare no potential conflict of interests.

O R C I D

Lena Wiese https:/ / orcid.org/ 0000- 0003- 3515- 9209

R E F E R E N C E S

Alvarez-Lario, B., & Macarron-Vicente, J. (2011). Is there anything good in uric acid? QJM: An International Journal of Medicine, 104, 1015–1024.

Bahra, G., & Wiese, L. (2018). Classifying leukemia and gout patients with neural networks. In International conference on database and expert systems

applications workshops, pp. 150–160.

Beaulieu-Jones, B. K., & Greene, C. S. (2016). Semi-supervised learning of the electronic health record for phenotype stratification. Journal of Biomedical

Informatics, 64, 168–178.

Eaton, J. W. (2019). GNU Octave version 5.1.0 manual: A high-level interactive language for numerical computations. https:/ / octave.org/ doc/ interpreter/

García-Gómez, J. M., Vidal, C., Martí-Bonmatí, D. L., Galant, J., Sans, N., Robles, M., & Casacuberta, F. (2004). Benign / malignant classifier of soft tissue

tumors using mr imaging. Magnetic Resonance Materials in Physics, Biology and Medicine, 16(4), 194–201.

Goldberger, A. L., Amaral, LuisA. N., Glass, L., Hausdorff, J. M., Ivanov, P. C., Mark, R. G., ... Stanley, H. E. (2000). Physiobank, physiotoolkit, and physionet:

Components of a new research resource for complex physiologic signals, Vol. 101: Circulation Electronic Pages.

https://orcid.org/0000-0003-3515-9209
https://orcid.org/0000-0003-3515-9209
https://octave.org/doc/interpreter/


14 of 14 BAHRA AND WIESE

Huang, Y., McCullagh, P., Black, N., & Harper, R. (2007). Feature selection and classification model construction on type 2 diabetic patients' data. Artificial

Intelligence in Medicine, 41(3), 251–262.

Johnson, A. E., Pollard, T. J., Shen, L., wei H. Lehman, L., Feng, M., Ghassemi, M., ... Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database:

Scientific Data.

Joshi, M., Pakhomov, S., Pedersen, T., & Chute, C. G. (2006). A comparative study of supervised learning as applied to acronym expansion in clinical reports.

AMIA Annual Symposium Proceedings, 399–403.

Juhola, M. (2008). On machine learning classification of otoneurological data. In ehealth beyond the horizon - get it there, IOS Press, pp. 211–216.

Kotsiantis, S. B. (2007). Supervised machine learning: A review of classification techniques. In Maglogiannis, I. G., Karpouzis, K., & Wallace, M. (Eds.),

Emerging artificial intelligence applications in computer engineering: Real word ai systems with applications in eHealth, HCI, information retrieval and pervasive

technologies (pp. 3–24). IOS Press.

Kourou, K., Exarchos, T. P., Exarchos, K. P., Karamouzis, M. V., & Fotiadis, D. I. (2015). Machine learning applications in cancer prognosis and prediction.

Computational and Structural Biotechnology Journal, 13, 8–17.

Lasko, T. A., Denny, J. C., & Levy, M. A. (2013). Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical

data: PLOS ONE 8(8).

Lee, I.-N., Liao, S.-C., & Embrechts, M. (2000). Data mining techniques applied to medical information. Medical Informatics and the Internet in Medicine, 25(2),

81–102.

Liu, D. C., & Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3), 503–528.

Maes, M., Twisk, FrankN. M., & Johnson, C. (2012). Myalgic encephalomyelitis (ME), chronic fatigue syndrome (CFS), and chronic fatigue (CF) are

distinguished accurately: Results of supervised learning techniques applied on clinical and inflammatory data. Psychiatry Research, 200(2), 754–760.

Nguyen, A., Moore, D., McCowan, I., & Courage, M. J. (2007). Multi-class classification of cancer stages from free-text histology reports using support

vector machines. In 29th Annual International Conference of The IEEE Engineering in Medicine and Biology Society, pp. 5140–5143.

Nguyen, D., & Widrow, B. (1990). Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights. In Ijcnn

International Joint Conference on Neural Networks (Vol.3).

Nielsen, M. A. (2015). Neural networks and deep learning: Determination Press. http:/ / neuralnetworksanddeeplearning.com/

Roden, D. M., Pulley, J. M., Basford, M. A., Bernard, G. R., Clayton, E. W., Balser, J. R., & Masys, D. R. (2008). Development of a large-scale de-identified DNA

biobank to enable personalized medicine, Vol. 84: Clinical Pharmacology and Therapeutics.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386.

Schmidt, M. (2005). minFunc: unconstrained differentiable multivariate optimization in MATLAB. https:/ / www.cs.ubc.ca/ ~schmidtm/ Software/ minFunc.html

Shouval, R, Bondi, O, Mishan, H, Shimoni, A, Unger, R, & Nagler, A (2014). Application of machine learning algorithms for clinical predictive modeling: A

data-mining approach in SCT. Bone Marrow Transplantation, 49, 332–337.

Tashkandi, A., Wiese, I., & Wiese, L. (2018). Efficient in-database patient similarity analysis for personalized medical decision support systems. Big Data

Research.

Team, R. C. (2014). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. https:/ / www.r- project.

org/

Weng, S. F., Reps, J., Kai, J., Garibaldi, J. M., & Qureshi, N. (2017). Can machine-learning improve cardiovascular risk prediction using routine clinical data?

PLOS ONE, 12(4), 1–14.

Wiese, L. (2015). Advanced data management for SQL, noSQL, cloud and distributed databases. DeGruyter.

Wilcox, W. D. (1996). Abnormal serum uric acid levels in children, Vol. 128: The Journal of Pediatrics.

A U T H O R B I O G R A P H I E S

Guryash Bahra received her B.Tech degree from Guru Tegh Bahadur Institute of Technology, India, and M.Sc. degree from University

of Göttingen, Germany, in 2011 and 2018 respectively. Her research interests include topics like machine learning, data analysis, data

mining and database systems.

Lena Wiese is a member of the L3S Research Center Hannover. She also leads the research group ‘‘Knowledge Engineering’’ (at the

Institute of Computer Science, University of Goettingen). She holds a PhD and a Master degree from TU Dortmund. After her PhD

she worked as a postdoctoral researcher at the National Institute of Informatics in Tokyo and as a visiting lecturer at the University of

Hildesheim and the University of Salzburg. Dr. Wiese is author of the text book Wiese (2015) on Advanced Data Management. Her

research interests lie in the area of efficient and secure data management and analysis. She is an active member of the German Informatics

Society (GI) and regularly acts as a reviewer for conferences and journals.

How to cite this article: Bahra G, Wiese L. Parameterizing neural networks for disease classification. Expert Systems. 2019;e12465.

https:/ / doi.org/ 10.1111/ exsy.12465

http://neuralnetworksanddeeplearning.com/
https://www.cs.ubc.ca/~schmidtm/Software/minFunc.html
https://www.r-project.org/
https://www.r-project.org/
https://doi.org/10.1111/exsy.12465

	Parameterizing neural networks for disease classification
	Abstract
	INTRODUCTION
	Medical background
	Machine learning techniques
	Objectives
	Outline of the article

	RELATED WORK
	DATA SET
	Data set identification
	Data set creation

	METHOD
	Neural networks
	Defining the layer sizes
	Forward propagation
	Initialization of weights Wand biases band activation function
	Cost function
	Define , and update parameters Wand b

	Accuracy calculation
	K-fold cross validation

	RESULTS ON COMPLETE DATA SET
	Results of cross validation on original data set
	Preprocessing of the data set using linear regression

	REDUCED DATA SETS
	Cross validation on reduced original data set
	Cross validation on reduced transformed data set

	RUNTIME COMPARISON
	DISCUSSION AND CONCLUSION
	Funding information
	Conflict of interest
	REFERENCES