Stochastic computing in convolutional neural network implementation: a review

Yang Yang Lee and Zaini Abdul Halim
School of Electrical and Electronic Engineering, Universiti Sains Malaysia, Nibong Tebal, Penang, Malaysia

Submitted 4 August 2020; Accepted 1 October 2020; Published 9 November 2020. Corresponding author: Zaini Abdul Halim, zaini@usm.my. Academic editor: Shuihua Wang. DOI 10.7717/peerj-cs.309. Copyright 2020 Lee and Abdul Halim, distributed under Creative Commons CC-BY 4.0.

ABSTRACT
Stochastic computing (SC) is an alternative computing domain to ubiquitous deterministic computing, whereby a single logic gate can perform an arithmetic operation by exploiting the nature of probability math. SC was proposed in the 1960s, when binary computing was expensive; however, it has recently regained interest following the widespread adoption of deep learning applications, specifically the convolutional neural network (CNN) algorithm, owing to its practicality in hardware implementation. Although not all computing functions can be translated to the SC domain, several useful function blocks related to the CNN algorithm have been proposed and tested by researchers. An evolution of CNN, namely the binarised neural network (BNN), has also gained attention in edge computing due to its compactness and computing efficiency. This study reviews various SC CNN hardware implementation methodologies. Firstly, we review the fundamental concepts of SC and its circuit structure, and then compare the advantages and disadvantages amongst different SC methods. Finally, we conclude the overview of SC in CNN and make suggestions for widespread implementation.

Subjects Artificial Intelligence, Computer Architecture, Data Mining and Machine Learning, Embedded Computing, Real-Time and Embedded Systems
Keywords Stochastic computing, Convolutional Neural Network, Deep learning, FPGA, IoT

INTRODUCTION
Deep learning algorithms have been widely and silently integrated into our daily life; for example, image enhancement, voice search and linguistic translation. Meanwhile, the Internet of Things (IoT) has gained industrial recognition, and many applications rely on edge computing, whereby data are processed on the fly rather than relayed to cloud computing, for reliability and security reasons (Naveen & Kounte, 2019). People have been heavily dependent on the widely accessible central processing unit (CPU) and general-purpose graphics processing unit (GPU) for deep learning research and application deployment. Although users strive to achieve real-time response by offloading many computationally intensive tasks, such as object recognition, to edge devices, those computing devices become extremely inefficient at such workloads, despite power efficiency being the utmost priority in IoT. Although the field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) could overcome the power efficiency issue, implementing deep learning hardware logic on them economically is not straightforward. Thus, researchers are exploring alternatives to conventional binary computing for this specific use case, driving the rise of stochastic computing (SC).
SC was proposed in the 1960s, when the cost of implementing binary computing was prohibitive, but it soon fell out of favour in the semiconductor industry. Unlike binary computing, SC can perform an arithmetic operation with a single logic gate. The most evident advantage of SC is its ability to reduce the area and power draw by reducing the number of active transistors (De Aguiar & Khatri, 2015). SC also has inherently progressive precision, where the output converges from the most significant figure first; thus, SC is capable of early decision termination (EDT). Power efficiency and EDT capability make SC favourable for deep learning applications (Kim et al., 2016), particularly the convolutional neural network (CNN).

CNN has received extensive development since its breakthrough in 2012 due to its unprecedented performance in object recognition. CNN model development has been trending from deep and massive (highly accurate) towards responsive (fast inference). In response to the IoT requirements of edge computing, researchers have attempted to reduce the math precision to save computing resources. With a reasonable trade-off in accuracy, an extremely simplified version of CNN, the binarised neural network (BNN), emerged with promising hardware implementation capability and computing efficiency, rivalling the SC methodology.

SC in CNN lacks widespread attention due to its cross-disciplinary nature within computer science. CNN is impactful in the field of machine learning, but the rise of IoT edge computing, which demands efficient computation, makes realistic CNN deployment difficult. While many researchers focus on innovating CNN algorithms for different use cases such as medicine and agriculture, only a few consider how to implement CNN realistically, since CNN execution is computationally intensive by itself. Given that no comprehensive and updated review exists on this specific area, this review paper attempts to investigate and survey SC implementation in the CNN application.

REVIEW METHODOLOGY
This review intends to answer the following research questions:
(1) What are the major developments of SC elements and SC CNN in recent years? Due to the narrow field of study in SC, the related studies are scattered, let alone studies of SC in CNN implementation; this impedes the development of SC CNN in the absence of a more centralised reference and increases the difficulty of identifying research trends.
(2) How exactly is the CNN computed/executed in the stochastic domain? SC is a unique computing methodology that is not often mentioned in academic studies, despite its unique advantages amid the surge of CNN applications. Thus, there is a need for a big-picture view of the SC CNN mechanism.
(3) What are the open problems and/or opportunities in implementing SC CNN? SC CNN does have its implementation hurdles. Thus, it is necessary to summarise them before moving forward in this field of research.
With these research questions in mind, we first reviewed the basic concepts of SC and CNN from a modern perspective. It is necessary to understand the background of SC and CNN because they come from vastly different fields of study.
Moreover, there is a need to aggregate the knowledge of SC elements in the face of the rising trend of SC developments. We then examined the recent developments and contributions of SC in CNN computation and compared the implementation methodologies across various recent studies. Finally, we drew conclusions and made some suggestions for the future of deep learning research in the SC domain.

Search criteria
An initial search was carried out to identify an initial set of papers with prior works on SC and CNN in hardware implementation. The search strings were then inferred and developed as follows: ('Stochastic computing') OR ('Stochastic computing deep learning') OR ('Stochastic computing convolutional neural network') OR ('Stochastic computing neural network') OR ('Stochastic computing image processing'). 'Stochastic' alone has many meanings across a wide area of study; thus, the keyword 'Stochastic computing' is necessary to narrow down the search scope. The search strings were applied to the indexed scientific databases Scopus and Web of Science (peer-reviewed). Domain-oriented databases (ACM Digital Library, IEEE Xplore Digital Library) were also referred to extensively. Finally, Google Scholar (limited to peer-reviewed papers) was used to find any omitted academic literature, especially in this multi-disciplinary search scope. Peer-reviewed articles were preferred to ensure that only confirmed works were summarised in this review paper.

Scope of review
Notably, SC is not the only method that exists for efficient CNN computing. We only cover the topics of SC and SC related to CNN computing in this review study. Many articles may not directly involve CNNs, but their novel SC elements are worthy as part of the significant SC developments and are potentially useful for future SC CNN function blocks; thus, they will be mentioned in this review. Some elemental studies on CNNs were referred to in order to better understand the nature of CNN algorithms. Some surveys on CNN implementation in FPGA barely or never discuss SC, but they share a similar concern about efficient CNN computation; thus, those surveys were also considered and referred to in this review study where relevant.

BASIC CONCEPTS
SC and CNN are different fields of study and warrant separate explanations. Thus, SC will be described first, and then CNN and BNN will be explained. Lastly, the competitive relationship between SC and BNN implementations will be discussed. SC is a unique concept of computing relative to traditional binary computing and has to be understood before the in-depth discussion of SC implementation in CNN in the next section.

SC
SC is favourable in IoT applications, where power efficiency is of utmost priority, due to the extreme simplicity of its computing elements.

Figure 1 Process of SC and its elements.
Unlike deterministic computing, which tolerates absolutely no error, SC allows errors to a certain degree, hence the name approximate computing. SC first encodes a binary number into a bitstream in such a way that the frequency of 1 bits represents the magnitude of the value. For example, the stochastic stream [0,0,0,0,0,1,1,1] is equal to 3/8 or 0.375 because it has three 1 bits. The number can then be computed in the stochastic domain with a simple logic gate instead of gate combinations in the binary domain. Finally, the stochastic stream is converted back to a binary number with a simple counter by counting the frequency of 1 bits, as shown in Fig. 1.

Figure 2 SC arithmetic operation. (A) AND gate as SC unipolar multiplier. (B) MUX as SC scaled adder. (C) Uncorrelated bit streams give accurate output. (D) Correlated bit streams give inaccurate output.

SC takes advantage of probability math to reduce the logic components required to perform an arithmetic operation. Taking Figs. 2A and 2B as examples, in the AND gate multiplication operation, the output can be defined as:

S3 = P(S3) = P(S1)P(S2) = S1 × S2. (1)

In the case of the addition operation, the output will be scaled by half with the MUX select input fed by a bitstream of value 0.5. The MUX scaled adder can be defined as:

S4 = P(S3)P(S1) + (1 − P(S3))P(S2) = (P(S1) + P(S2))/2 = (S1 + S2)/2, where P(S3) = 0.5, (2)

where P is the probability of the stochastic stream. The AND gate multiplier only applies to unipolar math, where the real number ∈ [0,1]. In the case of bipolar math, where the real number ∈ [−1,1] (a 0 bit decodes as −1), the XNOR gate can be used as a multiplier, whereas the same MUX can function as a bipolar adder.

The stochastic number generator (SNG) is the heart of SC for performing arithmetic operations in the stochastic domain. An SNG consists of a random number generator (RNG) and a comparator; both work synchronously to generate a stochastic bitstream from a given binary number. However, the RNG was the biggest challenge in previous SC circuit designs because the correlation between the operating bitstreams plays a great role in SC accuracy. An SC output will be accurate only if both working streams are uncorrelated. Taking Figs. 2C and 2D as examples, the bitstreams [0,1,1,0,1,0,0,1] and [1,1,0,0,1,0,0,1] can both represent the value of 4/8, but the output in Fig. 2D is far from accurate due to a high correlation with the opposite bitstream. The correlation index of both bitstreams can be defined as:

\sum_{i=1}^{n} S1(i)S2(i) = \frac{\sum_{i=1}^{n} S1(i) \times \sum_{i=1}^{n} S2(i)}{n}, (3)

where 'S' is the respective stochastic bitstream and 'n' is the bit length. Thus, the accuracy is highly dependent on the randomness and the length of the stochastic stream. Nevertheless, not all SC elements are sensitive to stochastic correlation, such as the MUX scaled adder (Alaghi, Qian & Hayes, 2018).
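To make these basics concrete, the following minimal Python sketch (an illustrative simulation of ours, not any published circuit) encodes values as unipolar bitstreams, multiplies them with a bitwise AND, scale-adds them with a MUX whose select stream has probability 0.5, and shows how reusing the same random source (i.e., correlated streams) degrades the product, as in Fig. 2D.

import random

def sng(value, length, rng):
    """Stochastic number generator: P(bit = 1) = value (unipolar)."""
    return [1 if rng.random() < value else 0 for _ in range(length)]

def to_value(stream):
    """Convert a bitstream back to a number by counting the 1 bits."""
    return sum(stream) / len(stream)

def sc_multiply(s1, s2):
    """AND gate acts as a unipolar multiplier, Eq. (1)."""
    return [a & b for a, b in zip(s1, s2)]

def sc_scaled_add(s1, s2, select):
    """MUX acts as a scaled adder, Eq. (2): output = (s1 + s2) / 2."""
    return [a if sel else b for a, b, sel in zip(s1, s2, select)]

N = 1024
x = sng(0.5, N, random.Random(1))     # independent streams
y = sng(0.75, N, random.Random(2))
sel = sng(0.5, N, random.Random(3))

print(to_value(sc_multiply(x, y)))           # ~0.375 (0.5 * 0.75)
print(to_value(sc_scaled_add(x, y, sel)))    # ~0.625 ((0.5 + 0.75) / 2)

# Correlated streams (generated from the same random sequence) break the multiplier.
x_c = sng(0.5, N, random.Random(7))
y_c = sng(0.75, N, random.Random(7))
print(to_value(sc_multiply(x_c, y_c)))       # ~0.5, far from 0.375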
Presently, a pseudo-random RNG called the linear-feedback shift register (LFSR) is widely accepted due to its simple design and effectiveness in lowering bitstream correlation (Alaghi & Hayes, 2013). The LFSR consists of a feedback XOR gate and a bit shift register, as shown in Fig. 3A. The register is initialised with a specific value, and it then generates pseudo-random binary values in every bit-shifting clock cycle. The binary number generated from the RNG is compared with the user-input binary number. Two circuits can be used as the comparator, namely the binary comparator and the weighted binary generator (WBG), as shown in Figs. 3B and 3C respectively. Both are capable of generating a stochastic bitstream. After the stochastic stream passes through the stochastic logic circuits, the computed stochastic streams can be converted back to the binary domain by using a simple flip-flop counter.

Figure 3 SNG components. (A) RNG with LFSR. (B) True comparator. (C) WBG.

SC research never stops improving, achieving great accuracy whilst using less area and power. The SNG is the major overhead of the SC circuit. As such, Ichihara et al. (2014) proposed a circular shifting technique to share the LFSR. Then, Kim, Lee & Choi (2016a) proposed a method very similar to memoisation computing to reduce the number of LFSRs in large-scale SC. Xie et al. (2017) attempted to share the LFSR with a wire exchange technique plus additional random bit flips, whereas Joe & Kim (2019) proposed a symmetrical exchange of odd and even wires. Even better, Salehi (2020) showed that simple wire permutation paired with the WBG could deliver the lowest correlation index, thus achieving great accuracy whilst requiring fewer logic gates. Interestingly, Chen, Ting & Hayes (2018) replaced the LFSR with an up-counter in conjunction with the WBG to take advantage of the WBG's binary weighting characteristics and assure SC progressive precision behaviour. As such, zero-error EDT is achievable without extra hardware cost. The WBG could also be shared partially because some WBG logic can be redundant (Yang et al., 2018).

More advanced operations, such as the square, division and non-linear functions, have also gained attention and innovation to fit modern applications. The stochastic square is already in its simplest form, as shown in Fig. 4A: squaring a stochastic stream can be conducted by delaying the input stream with a D flip-flop before multiplying it by itself. In the case of a non-linear function, such as the hyperbolic tangent (TanH), stochastic TanH (Stanh) uses a k-state finite state machine (FSM), such as that in Fig. 4B. An FSM is a class of logic circuits that produce a specific logical output pattern only if the input reaches a designated sequential threshold. The Stanh function can be described as:

Stanh(K, x) = tanh(Kx/2), (4)

where K is the number of states (which must be a multiple of 2) and x is the input stream. Brown & Card (2001) showed that the Stanh function responds closely to the true TanH function with K = 16. However, too many states will result in random walk behaviour (Kim et al., 2016); thus, an optimum number of states exists in the stochastic domain for an accurate reproduction of the TanH function. An improvement in the FSM design can also emulate linear and exponential functions (Najafi et al., 2017).

Figure 4 SC with advanced arithmetic operations. (A) Stochastic squaring with D flip-flop. (B) K-state FSM for the Stanh function, which will be widely utilised in SC CNN.
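A behavioural Python sketch of the k-state Stanh FSM of Fig. 4B follows (a simulation of Eq. (4), with the state layout assumed from the figure: the lower half of the states outputs 0, the upper half outputs 1, and each input bit moves the saturating state counter up or down).

import math, random

def bipolar_sng(x, n, rng):
    """Bipolar encoding: P(1) = (x + 1) / 2."""
    return [1 if rng.random() < (x + 1) / 2 else 0 for _ in range(n)]

def bipolar_value(stream):
    return 2 * sum(stream) / len(stream) - 1

def stanh(stream, k):
    """k-state FSM emulating tanh(k * x / 2) on a bipolar stochastic stream."""
    state, out = k // 2, []
    for bit in stream:
        state = min(k - 1, state + 1) if bit else max(0, state - 1)
        out.append(1 if state >= k // 2 else 0)   # upper half of states outputs 1
    return out

x = 0.4
stream = bipolar_sng(x, 4096, random.Random(0))
print(bipolar_value(stanh(stream, 16)))   # close to tanh(16 * 0.4 / 2)
print(math.tanh(16 * x / 2))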
The real challenge in SC (also the missing part of SC) is the stochastic divider design. Stochastic division traditionally used an FSM with extra SNG components in a gradient descent approach, as shown in Figs. 5A and 5B, but the gradient descent convergence time incurred inaccuracy in the output. The newer SC divider from Chen & Hayes (2016) exploits the stochastic correlation properties to perform stochastic division without using expensive SNGs, as shown in Figs. 5C and 5D. This is possible because, if the stochastic streams p(x) and p(y) are perfectly correlated and p(x) > p(y), the output will be clipped to a value of 1. Chu (2020) improved the circuit by using a JK flip-flop, but only for unipolar SC division.
The overall structure of SC is thus explained. Other than the benefit of power efficiency, SC is also inherently error resilient: accidental bit flips will not noticeably affect the overall operation of the stochastic circuits, whereas they could be catastrophic in deterministic computing. Secondly, SC has inherently progressive precision, whereby the output value converges from the most significant figures. For instance, if the output is 0.1234, then '0.1...' will appear first in the stochastic compute cycles, instead of '...4' as in conventional binary. This characteristic is useful in specific applications, such as weather forecasting, where only the most significant figure matters in decision making. Thus, performing EDT in SC without waiting for the complete computation is possible.
Figure 5 SC divider circuits. (A) Former gradient descent unipolar divider. (B) Former SC bipolar divider. (C) Newer SC unipolar divider by exploiting correlation. (D) Newer SC bipolar divider by adding sign information.
With that said, its simplicity does come at a cost. Increasing math precision in SC requires exponentially longer bit lengths, increasing computing latency by 2^n-fold. For instance, doubling the numerical precision from 4 to 8 bits requires increasing the bit length from 2^4 = 16 bits to 2^8 = 256 bits, a 2^4 = 16-fold increase in computing latency.
SC became unfavourable for modern computation needs due to the ever-increasing efficiency of binary computing. Nevertheless, certain niche applications can still benefit from the SC topology, such as low-density parity-check decoders in noisy data-transmission environments; very robust image processing tasks, such as gamma correction and Sobel edge detection (Joe & Kim, 2019); and, recently, the CNN algorithm.
CNN
With the advancement of computing technology, many applications are getting highly
reliant on probabilistic computation. Deep neural network (DNN) is a widely accepted
class of machine learning algorithms to process complex information, such as images and
videos. The nature of DNN consists of layers of addition and multiplication of numerical
weights that end up computing the overall dimensionless probability values of an output
class, which in turn allows the computer to decide based on the output value. Many DNN
algorithm variations exist, each for a particular purpose, such as CNN for image processing
and long short-term memory for natural language processing. CNN, for example, can
reduce multidimensional images into simple classes; thus, CNN is very popular in image
classification and object recognition.
The most distinctive component that discriminates CNN from other DNN algorithms is
its convolution layer. CNN can reduce large matrices into a single value representation, as
shown in Fig. 6A, which explains its superior capability of dimensional reduction in image processing.
Figure 6 CNN's convolution and activation. (A) Matrix convolution. (B) Neural network model after the convolution. (C) Architecture of the classical LeNet-5 CNN.
The convolution process can be generalised as:

y_j^l = f(x_j^l) = f\left( \sum_{i=1}^{n} \left( x_i^{l-1} \times w_{ij}^{l-1} \right) + b_j^l \right), (7)

where x_j^l is the convolved feature of the next layer, x_i^{l-1} is the feature from the previous layer, w_{ij}^{l-1} is the kernel weight matrix, and b_j^l is the bias. 'l' is the layer number, 'i' denotes the scan window number, 'n' is the total number of scan windows, and 'j' is the depth of the next feature map. After the convolution, the activation function f(x_j^l) is applied, which can be a linear or non-linear function; the rectified linear unit (ReLU) and TanH are a few of the popular activation functions. The final product y_j^l will be aggregated, and the process repeats, depending on the structure of the CNN model.
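As a plain binary-domain reference for Eq. (7), the short NumPy sketch below (illustrative only; the variable names are ours) convolves one scan-window position with a kernel, adds a bias and applies a TanH activation.

import numpy as np

def conv_at(window, kernel, bias, activation=np.tanh):
    """One output value of Eq. (7): y = f(sum_i(x_i * w_i) + b)."""
    return activation(np.sum(window * kernel) + bias)

# A 5x5 scan window taken from the previous feature map, and a 5x5 kernel.
rng = np.random.default_rng(0)
window = rng.random((5, 5))           # x^{l-1}
kernel = rng.standard_normal((5, 5))  # w^{l-1}
bias = 0.1                            # b^l

print(conv_at(window, kernel, bias))  # one element of the next feature map y^l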
The convolution and activation layers are fundamental in CNN, albeit other optional
layers exist, such as normalisation layer (to reduce output variance) (Ioffe & Szegedy, 2015),
pooling layer (to save memory), and dropout layer (to prevent overfitting) (Hinton et al.,
2012). At the end of convolution, the convolved pixel matrix will be flattened into a single
list of data. Those data will then be fed to a traditional, biologically inspired neuron model, the so-called fully connected neuron or dense layer, as shown in Fig. 6B. Multiplication and addition then repeat until the model converges to the size of the desired output class. A simple LeNet-5 model (Liew et al., 2016), as depicted in Fig. 6C, shows the
end-to-end structure of a typical CNN, from the input image to output class.
Its cascaded arithmetic operations are what stress the modern processor during CNN execution: it either spends too much processor time serialising the process or takes up considerable hardware resources for parallelisation. The convolution arithmetic performs multiplication and addition exhaustively; if the matrices of the scanning windows are large or the network is deep/wide, then the computational demands are high. As the multiplication and accumulation operations increase, memory access bottlenecking becomes the major limitation for DNNs (Capra et al., 2020). Traditional computing also uses floating-point units (FPUs), which take a large area with high power consumption due to the colossal number of logic gates involved. As edge computing gains interest as the future trend of computation, energy efficiency has become a major concern for CNN development, urging researchers to rethink how to process the information efficiently. Most modern FPUs use 32-bit floating point (full precision); thus, reducing the precision to 16 bits (half precision) or lower is one way to improve CNN computation efficiency without much accuracy degradation.
BNN
In an extreme case, the parameters are reduced to only a single bit representation. This
radical simplification of CNN is called BNN and gained attention among researchers
in the industry due to its compactness in memory usage and practicality (Simons &
Lee, 2019). In BNN, the parameters can only have two possible values, that is, −1 or 1.
Despite some considerable degree of accuracy degradation, BNN does have several unique
advantages. First is its model size; for instance, 64 MB of parameter data can be reduced to 2 MB, thus allowing deployment on small embedded systems. Its small memory footprint also allows memory-centric computing, where the parameters can be stored directly beside the processing elements, thereby speeding up the computation by eliminating data communication cost (Angizi & Fan, 2017). Second is its hardware implementation capability. BNN requires some arithmetic logic units (ALUs) to process fixed-point image data at the frontend (which still cost less than FPUs). However, the multiplication in the hidden layers can be simply an array of XNOR gates, because multiplication of −1 and 1 follows bipolar math. The high hardware utilisation of BNN in FPGA results in high throughput, while being an order of magnitude (if not more) more energy-efficient than CPUs and GPUs despite the lower clock speed (Nurvitadhi et al., 2017). Another unique advantage is that BNN is less susceptible to adversarial attacks, owing to the stochastic weight quantisation in the hidden layers (Galloway, Taylor & Moussa, 2018). An adversarial attack is where data, for example an image, are injected with noise that adversely affects the output class decision of a fine-tuned CNN model, albeit the doctored image has no perceptual difference to human eyes.
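The XNOR-based hidden-layer multiplication mentioned above can be illustrated with a small Python sketch (a behavioural model under the usual {−1, +1} ↔ {0, 1} mapping; it is not tied to any particular BNN implementation): the dot product of two binarised vectors reduces to an XNOR followed by a popcount.

def binarise(x):
    """Map a real value to a single bit: +1 -> 1, -1 -> 0."""
    return 1 if x >= 0 else 0

def bnn_dot(a_bits, w_bits):
    """Dot product of {-1, +1} vectors using XNOR + popcount."""
    n = len(a_bits)
    matches = sum(1 ^ (a ^ w) for a, w in zip(a_bits, w_bits))  # XNOR per position
    return 2 * matches - n        # map the popcount back to a signed integer

activations = [0.3, -1.2, 0.7, -0.1]
weights = [0.9, -0.4, -0.6, 0.2]
a_bits = [binarise(v) for v in activations]
w_bits = [binarise(v) for v in weights]
print(bnn_dot(a_bits, w_bits))    # equals sum(sign(a_i) * sign(w_i)) = 0 here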
However, non-linear functions become useless due to the extreme information loss
by the parameter quantisation. Instead, a threshold function can simply replace the
normalisation and activation functions (Simons & Lee, 2019). BNNs also suffer accuracy degradation on highly challenging datasets, such as ImageNet classification, due to the extreme information loss. As many studies search for better BNN optimiser algorithms, Zhu, Dong & Su (2019) found that training optimisers might not help much due to the BNN's insensitivity to small changes in value. Instead, BNNs combined with an ensemble technique (multiple trained BNNs in parallel with a final majority-vote decision) are a perfect fit, improving the overall BNN accuracy on large image classification datasets.
SC CNN vs BNN
The evolution of CNN to BNN challenged the idea of SC due to the competitiveness in
hardware implementation capability. SC implementation is technically more challenging
than BNN due to various custom logics and substantial uncertainty in future community
support. After all, SC is still at its infancy in the CNN domain. Regardless of the different
intentions and directions of development of SC and BNN, both studies try to explore
alternatives for a highly efficient computing paradigm in the future of the IoT edge
computing. With the massive exploitation and integration of DNN algorithms into small
or remote devices, such as a smartwatch or surveillance camera, both fields of studies will
contribute to the development of a highly realistic edge computing ecosystem.
SC implementation in CNNs
SC is considered the next frontier in energy-efficient edge computing (Jayakumar et al.,
2016) due to its energy-efficient operation and ability to tolerate errors in domains of
recognition, vision, data mining and so on. Meanwhile, many applications attempt to
offload challenging workloads from cloud computing to edge devices. Thus, SC has become a hotspot of research interest.
Integral SC: a radical change in SC methodology for the sake of CNN
CNN is very popular in vision applications due to its simplicity and accuracy. However, SC does not provide an out-of-the-box experience, as SC is not yet customised and explicitly optimised for the CNN algorithm. Hence, Ardakani et al. (2017) proposed the radical idea of using an integer stream instead of the traditional bitstream. The stochastic 'byte' takes values ∈ [0, 1, 2, ...], so that a simple binary multiplier and a bitwise AND, as shown in Figs. 7B and 7C, can be repurposed to process integer numbers in the stochastic domain, hence 'integral SC'. The idea is to preserve information across different precisions within a limited stochastic length.
The effect of information loss becomes apparent when many MUXes half-add many stochastic streams together. The resultant precision requirements only increase and demand long overall bitstreams to preserve the precision of the half-added stochastic number. For example, a value of 0.5625 (9/16) requires a 16-bit stochastic stream, whereas 0.875 (7/8) only requires an 8-bit one. Although 0.875 can also be expressed in a 16-bit stream, half-adding both numbers results in 0.71875 (23/32), which needs at least a 32-bit stream to preserve the output precision in the stochastic domain. If so, the overall stochastic bit length will need to be extended to 32 bits.
Figure 7 Integral SC methodology. (A) A high-precision stochastic number can be represented with a shorter stream length using integer values. (B) Binary radix multiplier as integral SC scaled adder. (C) Modified MUX as integral SC multiplier. (D) Integer SC neuron block.
Cascading MUX adders in the CNN convolution stage will drive up the bit length requirements drastically, thus incurring considerable computational latency. The same problem also applies to the multiplier.
Then, the integral SC comes into play. Take Fig. 7A as an example. A value of 0.5625
can be effectively represented in the same length as the 8-bit length value of 0.875. Given
that integral SC can preserve the stochastic information in an integer value, the final
batch adding operation in CNN can be processed with a tree adder as shown in Fig. 7D, eliminating the parallel counter. The integer stream also allows a short stochastic stream length, thus speeding up the SC computation time. They also proposed an integer version of the TanH k-state FSM, because the traditional stochastic TanH (Stanh) FSM only accepts stochastic bits; this led to the modern TanH FSM design. However, integral SC only solves the precision degradation issue, and many other CNN functions have yet to be translated to the SC domain. Moreover, the usage of binary adders and multipliers may not scale well when deploying large CNN models. They claimed an energy saving of 21% compared with full binary radix computing, which is still far from the power reduction expected of the SC transition.
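The information-preservation argument can be illustrated with a short Python sketch (our own toy simulation of the idea, not the circuit of Ardakani et al.): adding two bitstreams element-wise yields an integer stream whose mean carries the exact, unscaled sum within the same stream length, whereas a MUX adder keeps only half of the input bits.

import random

def exact_stream(numerator, length, rng):
    """Stream with exactly `numerator` ones, randomly placed."""
    bits = [1] * numerator + [0] * (length - numerator)
    rng.shuffle(bits)
    return bits

def integral_add(s1, s2):
    """Element-wise addition gives an integer stream in {0, 1, 2}."""
    return [a + b for a, b in zip(s1, s2)]

def mux_add(s1, s2, rng):
    """MUX scaled adder: randomly keeps one input bit per cycle."""
    return [a if rng.random() < 0.5 else b for a, b in zip(s1, s2)]

N = 16
s1 = exact_stream(9, N, random.Random(1))    # 0.5625
s2 = exact_stream(14, N, random.Random(2))   # 0.875

print(sum(integral_add(s1, s2)) / N)         # exactly 1.4375: the full sum fits in 16 slots
print(sum(mux_add(s1, s2, random.Random(3))) / N)  # a noisy estimate of the halved sum 0.71875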
Extended stochastic logic (ESL): another radical approach
If integral SC is not radical enough, ESL makes an even more extreme modification to the SC methodology. Instead of using a single stochastic bitstream for a value, ESL uses two stochastic streams such that their division ratio represents the actual value (Canals et al., 2016). ESL intends to compute the entire range of real numbers in the stochastic domain. For example, any real number x* can be expressed as x* = p*/q*, where p* and q* are the ESL stochastic pair for x*. p* and q* remain real numbers ∈ [−1,1] in the bipolar format, but their ratio x* can cover the entire range of real numbers ∈ [−∞,∞].
Figure 8 ESL arithmetic unit. (A) ESL multiplier. (B) ESL divider by crossing multiplication. (C) ESL adder and subtractor circuit.
ESL requires dedicated logic gates for the p* and q* stochastic streams. Taking Figs. 8A and 8B as examples, if x* = p*/q* and y* = r*/s*, then by probability math, the multiplication between two separable stochastic streams will be:

x* × y* = (p* × r*) / (q* × s*) = t* / u*, (8)

where t* and u* are the output pair of stochastic streams. Division can be done simply by flipping the numerator and denominator of the second stochastic pair. In the case of stochastic addition, the stochastic pair can be processed such that:

x* + y* = p*/q* + r*/s* = (p* × s* + q* × r*) / (q* × s*) = t* / u*, (9)

whereas subtraction can be done by NOT gate inversion, as shown in Fig. 8C.
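The ESL pair arithmetic of Eqs. (8) and (9) can be checked numerically with a small Python sketch (operating directly on the pair probabilities rather than on bitstreams; the helper names are ours).

def esl_value(pair):
    p, q = pair
    return p / q

def esl_mul(x, y):
    """Eq. (8): (p/q) * (r/s) = (p*r) / (q*s)."""
    (p, q), (r, s) = x, y
    return (p * r, q * s)

def esl_div(x, y):
    """Division: flip the second pair, then multiply."""
    (p, q), (r, s) = x, y
    return (p * s, q * r)

def esl_add(x, y):
    """Eq. (9): p/q + r/s = (p*s + q*r) / (q*s)."""
    (p, q), (r, s) = x, y
    return (p * s + q * r, q * s)

x = (0.6, 0.3)     # represents 2.0, with p* and q* both inside [-1, 1]
y = (-0.5, 0.25)   # represents -2.0
print(esl_value(esl_mul(x, y)))   # -4.0
print(esl_value(esl_add(x, y)))   # 0.0
print(esl_value(esl_div(x, y)))   # -1.0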
Value splitting is feasible in the stochastic domain due to the nature of probabilistic
computing. However, splitting into double stochastic streams complicates everything, including a compulsory custom bipolar divider (to convert t* and u* back to a real number representation) before the bipolar TanH function blocks. The custom block extensively uses comparators and RNGs, which raises a red flag for efficient computing. Early neural networks may have computed in the real number range ∈ [−∞,∞], but the CNN nowadays
commonly computes in bipolar math. After all, the final output class of a CNN only needs to tell the computer a probability ∈ [0,1]. ESL did provide an insight into how SC
can perform normal arithmetic full-range computation. However, verifying whether ESL
is better than other SC methods for CNN use case despite the attractive circuit simplicity in
primary arithmetic operations is hard due to the non-linear activation function complexity
in ESL implementation.
Approximate parallel counter (APC) and Btanh: a simple and
energy-efficient approach
Implementing radical changes in every SC component might not be easy. Thus, another
highly effective approach with traditional stochastic bitstream is APC. Other than the
frontend binary to stochastic conversion stage of SNGs, the final stochastic to binary
conversion stage is also equally important (Kim, Lee & Choi, 2016a; Kim, Lee & Choi,
2016b).
In the case of accumulating multiple bitstreams, the MUX adder could become inaccurate due to the loss of n−1 bits of input information (Li et al., 2017c). In this case, a parallel counter such as the one in Fig. 9B, consisting of an array of full adders (FAs), is preferred, but FAs are relatively expensive as they use binary adder logic circuits. A fully accurate parallel counter need not be used, since SC is already based on approximate computation. Thus, an APC has been proposed to reduce the FA components with a slight trade-off in accuracy whilst achieving the same counting function at lower area and power consumption, as shown in Fig. 9A. The proposed APC could save area and power by 38.3% and 49.4%, respectively. The caveat is that the output from the APC is in the binary domain, thus directly removing the stochastic stream from the stochastic-domain computation.
Although the traditional Stanh uses a single-input k-state FSM, with inspiration from the integral SC research, the binary output from the APC is cleverly reused as the input for another, modified binary-input FSM called Btanh. The TanH activation function is essential in CNN. For example, if the binary output value is 4, then the FSM will directly jump four states instead of the step-wise jumps in Stanh. A more granular TanH step function could also be achieved, which is not possible with the Stanh FSM. In the end, the binary output values from the APC will be indirectly converted back to a stochastic stream with the TanH non-linear function applied, completing the stochastic convolution computation, as depicted in Fig. 9C. Moreover, energy usage can be further reduced by 69% by sacrificing 1.53% of accuracy with EDT, that is, terminating the computation at 50% of the computing time. Their APC and Btanh components then became the foundation for other SC CNN approaches in the following years.
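A behavioural Python sketch of the accumulation-and-activation flow in Fig. 9C is given below. This is our own approximation of the idea: an exact popcount stands in for the APC, the Btanh FSM state jumps by the per-cycle signed count (ones minus zeros), and the state count of 2n is a simple heuristic rather than the published sizing rule.

import math, random

def bipolar_sng(x, n, rng):
    return [1 if rng.random() < (x + 1) / 2 else 0 for _ in range(n)]

def btanh(parallel_streams, num_states):
    """Per cycle: popcount the n input bits (the APC role), jump the FSM state
    by (ones - zeros), and output 1 while the state sits in the upper half."""
    n = len(parallel_streams)
    length = len(parallel_streams[0])
    state, out = num_states // 2, []
    for t in range(length):
        count = sum(stream[t] for stream in parallel_streams)   # APC output
        state = max(0, min(num_states - 1, state + (2 * count - n)))
        out.append(1 if state >= num_states // 2 else 0)
    return out

rng = random.Random(0)
inputs = [0.3, -0.1, 0.2, 0.1]                   # bipolar products entering the neuron
streams = [bipolar_sng(v, 4096, rng) for v in inputs]
y = btanh(streams, num_states=2 * len(inputs))
print(2 * sum(y) / len(y) - 1)                   # a rough estimate of tanh(sum(inputs))
print(math.tanh(sum(inputs)))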
Near-perfect SC implementation in CNN algorithm
Ren et al. (2016), Li et al. (2017c), Li et al. (2017b), Ren et al. (2017) and Li et al. (2018a)
proposed a complete overview of a near-perfect CNN analogy in the SC domain,
including the following: the missing pooling layer, ReLU and sigmoid activation layer,
and normalisation layer which will be discussed separately in the sub-sections below.
Figure 9 SC bitstream accumulation. (A) APC. (B) Accurate parallel counter. (C) Accumulation and Btanh activation workflow.
SC average pooling and max-pooling layers
The purpose of CNN pooling layer is to reduce memory usage and reduce model size. Ren
et al. (2016) first used cascaded MUX as the average pooling function in CNN as shown in
Fig. 10A. This solution is simple but may face the precision loss issue as described in the
Integral SC research. Average pooling may not help in CNN training convergence either.
Ren et al. (2017) proposed a max-pooling hardware equivalent to the widely adopted CNN max-pooling layer. Which stochastic stream holds the maximum value at any given time cannot be verified directly in the stochastic domain. Hence, a dedicated counter for each stochastic stream is required to sample and evaluate which stream has the maximum value. Referring to Fig. 10B, the counters sample the first few bits and compare the magnitudes at the end of the sampling window to make an early decision on which stochastic stream should be chosen to continue the path. The first few bits of information could be inaccurate; thus, this is an approximate max pooling. Nevertheless, the decision will eventually converge to
the bitstream of maximum value if the sampling continues due to the properties of SC
progressive precision. Moreover, if the bitstream is long, then less information will be lost,
thereby achieving negligible accuracy loss.
Figure 10 SC pooling function. (A) 2×2 average pooling with cascaded MUX adders. (B) Hardware-oriented approximate max pooling circuit. (C) Stochastic MAX function; cascading them creates a pure SC max-pooling block.
However, a more straightforward stochastic max-pooling block was proposed by Yu
et al. (2017). With only an XOR gate, FSM and MUX, a novel stochastic MAX block
could select whichever stream has the higher value. With the XOR gate controlling the FSM state jumping, the probability of the opposite stream can be inferred from the other bitstream by creating a condition of bit entanglement. As such, whenever the FSM samples a 0 bit from the current bitstream, it implies a 1 bit on the opposite bitstream. Thus,
whenever inequality between two bitstreams exists, the FSM state will be biased to the one
with higher magnitude, completing the MAX function with the MUX. Cascading the MAX
function block could realise the max-pooling function block as shown in Fig. 10C.
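A Python sketch of the counter-based approximate max pooling of Fig. 10B follows (a behavioural simplification of ours: each stream is sampled for a short observation window, the stream with the largest running count is then selected for the remainder of the pooling window, and an arbitrary stream is passed through while the decision is pending).

import random

def sng(value, length, rng):
    return [1 if rng.random() < value else 0 for _ in range(length)]

def approx_max_pool(streams, observe=64):
    """Count the first `observe` bits of each stream, then route the winner."""
    counts = [sum(s[:observe]) for s in streams]
    winner = counts.index(max(counts))
    # While observing, some stream (here simply the first) is still emitted,
    # which is the source of the small approximation error.
    return streams[0][:observe] + streams[winner][observe:]

rng = random.Random(0)
values = [0.3, 0.8, 0.5, 0.6]
streams = [sng(v, 1024, rng) for v in values]
pooled = approx_max_pool(streams)
print(sum(pooled) / len(pooled))   # close to 0.8 (the max), minus a small observation-window error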
SC ReLU and sigmoid activation layer
The CNN activation layer is similar to the usual neuron activation function. The ReLU function, as the name suggests, performs rectification, cutting off any negative value such that:

f(x) = max(0, x). (10)

The ReLU function is popular because of its fast computation and because it mitigates the vanishing gradient problem in backward propagation learning during the CNN training stage. However, no SC equivalent
circuit existed for that particular function; thus, Li et al. (2018a) proposed a novel SC-based
ReLU function block as depicted in Fig. 11A.
Figure 11 Other SC activation functions. (A) ReLU activation function. (B) SC sigmoid activation function with bias input.
Firstly, the ReLU amplitude will be naturally maxed out at value = 1 in the stochastic
domain, but this is not a concern as clipped ReLU has no significant accuracy degradation
(Fei-Fei, Deng & Li, 2010). Secondly, a negative value must be clipped to zero. Notably, the number of 0 bits in the bipolar stochastic stream determines the magnitude of negativity. Thus, when the accumulated value is less than the reference half value (there are more 0 bits than 1 bits) in a given sample time, the output will be forced to a 1 bit. Otherwise, the output will follow the pattern of the emulated linear function from the FSM. Although real
number convergence in the accumulator takes time, the real value information is equally
distributed in the stochastic bitstream. Hence, obtaining an accurate comparison is possible
by observing the first few bits of information; thus, inaccuracy is negligible. Moreover, the
comparison is synchronous to the input; therefore, no latency will be incurred.
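The clipping behaviour can be imitated with a short behavioural Python sketch (our simplification of the Fig. 11A idea, not the exact Li et al. circuit): the output bit follows the input, except that a 1 is forced whenever the running count of output 1s falls below half of the elapsed cycles, which pins negative bipolar values to zero.

import random

def bipolar_sng(x, n, rng):
    return [1 if rng.random() < (x + 1) / 2 else 0 for _ in range(n)]

def bipolar_value(stream):
    return 2 * sum(stream) / len(stream) - 1

def sc_relu(stream):
    """Behavioural SC ReLU: keep the output ones-count at or above half."""
    ones, out = 0, []
    for t, bit in enumerate(stream, start=1):
        y = 1 if ones < t / 2 else bit   # force a 1 when the output trends negative
        ones += y
        out.append(y)
    return out

rng = random.Random(0)
for x in (-0.6, 0.4):
    s = bipolar_sng(x, 4096, rng)
    print(x, bipolar_value(sc_relu(s)))  # ~0.0 for negative inputs, ~x for positive inputs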
In the case of larger and deeper CNN models such as VGGNet and GoogleNet, the
sigmoid function becomes more favourable as the non-linear function. As such, Li et al.
(2017a) proposed a hardware-oriented SC sigmoid approximation function as shown in
Fig. 11B. Since the output of the stochastic stream is maxed at 1, the Taylor series expanded
sigmoid function could be approximated as:
\frac{1}{1+\exp(-x)} \approx \begin{cases} 1, & x > 2 \\ \frac{1}{2} + \frac{1}{4}x, & -2 \le x \le 2 \\ 0, & x < -2 \end{cases}. (11)
By strategically partitioning the positive summation and negative summation in such a
way that:
A = \frac{1}{4}\sum_{pos} P \cdot Q + \frac{1}{2} + \frac{bias^{+}}{4}, \quad B = \frac{1}{4}\sum_{neg} P \cdot Q + \frac{bias^{-}}{4}, (12)
the approximate stochastic sigmoid activation function could then be realised by subtracting
both parts such that:
A - B = \frac{1}{2} + \frac{1}{4}\left(\sum P \cdot Q + bias\right), (13)
where ‘P’ and ‘Q’ are the weight and pixel value respectively. Therefore, by pre-scaling the
weights and bias to quarter times, the stochastic sigmoid function could be devised as a
result, with the added benefit of including bias information which is missing in the previous
SC CNN implementation. The binary adder now is the sigmoid activation function itself,
eliminating the need for extra hardware cost such as FSM. However, unlike the APC +
Btanh function block, the accurate parallel counter is needed.
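The approximation of Eq. (11) is easy to check in plain Python (a numerical reference only, not the SC circuit itself):

import math

def hard_sigmoid(x):
    """Piecewise-linear sigmoid of Eq. (11): 0.5 + x/4, clipped to [0, 1]."""
    return min(1.0, max(0.0, 0.5 + x / 4))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, hard_sigmoid(x), 1 / (1 + math.exp(-x)))   # approximation vs. true sigmoid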
The sigmoid function is not limited to the CNN algorithm; rather, it is a universal activation function in other DNN classifier algorithms such as the multilayer perceptron and the restricted Boltzmann machine. With a 1024-bit stochastic stream, the proposed SC sigmoid-activated convolution neuron block could perform as accurately as a binary-computing CNN while consuming 96.8% less area and 96.7% less power, hugely improving the
capability of SC in the DNN algorithm computation in general.
SC normalisation layer
The purpose of the normalisation layer is to reduce internal covariance, thereby improving
the overall CNN output accuracy. If the ReLU activation is applied to the previous layer,
only a simple local response normalisation function is required, which can be summarised
as:
b^{i}_{x,y} = \frac{a^{i}_{x,y}}{\left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a^{j}_{x,y} \right)^{2} \right)^{\beta}}, (14)
where the summation part accumulates all N adjacent neuron outputs around a^{i}_{x,y}. 'k', 'n', 'α' and 'β' are hyperparameters which can be determined by CNN backpropagation training. The complexity of the mathematical relationship can be decoupled into three compute components: square-and-sum (calculating the denominator components), the exponential function with 'β', and finally division. Li et al. (2017c) used the stochastic square, an FSM activation block and the traditional gradient descent SC divider to construct the SC normalisation circuit shown in Fig. 12.
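For reference, Eq. (14) in plain NumPy over the channel dimension (an illustrative implementation of ours, with the usual AlexNet-style hyperparameter values assumed as defaults; it is not the SC circuit):

import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Eq. (14): normalise each channel i by its n neighbouring channels."""
    N = a.shape[0]                        # a has shape (channels, height, width)
    b = np.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

a = np.random.default_rng(0).random((16, 10, 10))   # 16 feature maps
print(local_response_norm(a).shape)                  # (16, 10, 10)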
Figure 12 Normalisation unit in SC CNN.
The accuracy improved with the SC normalisation function, dropping by only 0.88% compared with the original binary AlexNet CNN model, while achieving six times area and five times power savings compared with the binary equivalent normalisation. However, they could have utilised the newer SC divider discussed in the basic concepts section.
Other optimisations
The dropout layer is one of the regularisers in CNN to prevent overfitting. However, the dropout layer functions only in the CNN training phase, and no custom hardware adaptation is needed at the inference stage; hence, there is no extra hardware overhead. Li et al. (2018a) optimised the APC function block by utilising inverse mirror FAs to reduce the number of transistors required for a single FA from 32 to 24. They also proposed an APC design whose input count is not a power of two by incorporating an inverse half adder. The APC optimisation further reduced the required area by at least 50%, with an average 10% improvement in energy efficiency.
In terms of SC accuracy, the bipolar format remains the major limitation, as bipolar is generally worse than unipolar in terms of SC accuracy (Ren et al., 2016). To overcome the signed-value accuracy limitation, Zhakatayev et al. (2018) decoupled the sign information from the stochastic stream and added one stochastic bit pair specifically to store the sign value. Unlike the stochastic probability value, the sign value of a stochastic stream is deterministic and thus can be processed separately from the stochastic magnitude. Although a small hardware overhead is needed to process the sign function, such as an extra XOR gate to multiply signed values, the accuracy gain is significant, 4∼9.5 times better compared with the bipolar format. With that advantage in mind, the little extra hardware
cost for sign processing is trivial.
Binary interlaced SC: two is better than one
A full-fledged SC CNN might not be feasible for the wide variety of modern complex CNN models. However, the massive multiplication parallelism of SC is still very favourable. Thus, an SC-based multiply-accumulate (MAC) unit was proposed by Sim & Lee (2017), as shown in Fig. 13A, to act as a multiplier accelerator for binary computing. The MAC leverages the parallelism of the SC multipliers, then accumulates the values with an accurate parallel counter, returning a pure binary value to the other binary computing circuits at the end of the SC cycle.
Figure 13 Binary interlaced SC, where SC is used as a MAC accelerator. (A) SC MAC unit block. (B) SC MAC optimisation by cutting off SC early with advancing weight bits. (C) Novel SNG with MUX and FSM.
This approach, while not the most energy-efficient one, achieved two times the area efficiency at very high throughput compared with binary computing. With only a single SC layer in mind, Sim et al. (2017) further leveraged the SC MAC to perform unipolar SC multiplication. All the stochastic 1 bits of the neuron weight value are pushed ahead of time by down-counting the weight value, so that the SC cycle can terminate once the tail of the weight stream ends with 0 bits, as depicted in Fig. 13B. This is possible because any section of the stream can represent the true value of the stream, owing to its probabilistic nature; it is technically feasible as long as only a single SC layer is concerned. They also proposed a novel MUX-FSM-based SNG. By predefining the MUX selection sequence in such a way that the output is the binary-weighted sum of the input bits, the binary input can be directly converted into a stochastic stream, as depicted in Fig. 13C, eliminating the need for WBGs, which can be expensive in FPGA implementation. With the strategic down-counting timing, an area-delay product reduction of 29%∼49% is achieved while being 10%∼29% more energy efficient compared with binary computing. In any case, they ignored the SNG hardware overhead in the performance comparison.
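A behavioural Python sketch of the SC MAC idea in Fig. 13A follows (our simplification, shown here with bipolar XNOR multipliers as one possible configuration: the per-cycle popcount plays the role of the parallel counter, and a binary accumulator returns the result to the binary domain).

import random

def bipolar_sng(x, n, rng):
    return [1 if rng.random() < (x + 1) / 2 else 0 for _ in range(n)]

def sc_mac(pixels, weights, length=2048, seed=0):
    """Multiply-accumulate: XNOR each pixel/weight stream pair, popcount the
    products every cycle, and accumulate the counts in binary."""
    rng = random.Random(seed)
    px = [bipolar_sng(p, length, rng) for p in pixels]
    wt = [bipolar_sng(w, length, rng) for w in weights]
    acc = 0
    for t in range(length):
        prod_bits = [1 ^ (a[t] ^ b[t]) for a, b in zip(px, wt)]  # XNOR multipliers
        acc += sum(prod_bits)                                    # parallel counter
    # Convert the accumulated popcount back to a bipolar sum of products.
    return 2 * acc / length - len(pixels)

pixels = [0.5, -0.25, 0.75]
weights = [0.2, 0.6, -0.4]
print(sc_mac(pixels, weights))   # ~ 0.5*0.2 - 0.25*0.6 + 0.75*(-0.4) = -0.35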
Considering that only a single SC layer is required, Hojabr et al. (2019) radically redesigned the MAC unit by exploiting the computing patterns of modern CNN designs and proposed the Differential MAC (DMAC). Firstly, because the CNN ReLU function always returns a positive value, in addition to the binary pixels being of positive value, an up/down counter can be used as the ReLU function. Secondly, considering that a pixel value will eventually pass through all the weights of the CNN scanning window's multiplication matrix during the convolution process, the neuron weights can be sorted offline ahead of time. In this way, the differential to the next sorted weight of higher value is guaranteed to be positive and thus can be fed to a down counter, similar to the SC MAC, to pipeline the stochastic multiplication. Since the first weight is the minimum value, which could be negative, a D flip-flop is used to hold the sign information just for the first bipolar multiplication. Thus, multiplying in SC is as simple as counting the number of bits from the MUX, ANDed with a counter 'enable' control derived from the weights, as depicted in Fig. 14. The FSM can be shared among all MUXes, ignoring the stochastic correlation issue because the multiplications are mutually independent (Yang et al., 2018). The buffered accumulated value then continues to the summation operation as the DMAC's final stage. This major circuit overhaul could deliver 1.2 times and 2.7 times gains in speed and energy efficiency respectively, relative to the former MAC, when benchmarked on more modern CNN models.
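The weight pre-processing behind DMAC can be shown in a few lines of Python (our illustration of the offline sorting/differencing step only; the counter pipeline itself is not modelled): sorting the weights makes every successive difference non-negative, so only the first (minimum) weight needs a sign.

weights = [0.3, -0.5, 0.1, 0.7, -0.2]

sorted_w = sorted(weights)                        # done offline, ahead of time
diffs = [sorted_w[0]] + [b - a for a, b in zip(sorted_w, sorted_w[1:])]

print(diffs)                                      # only the first entry may be negative
print(all(d >= 0 for d in diffs[1:]))             # True
# Each sorted weight is recovered as a running sum of the differentials,
# which is what lets a simple down counter pipeline the multiplications.
print([round(sum(diffs[:i + 1]), 3) for i in range(len(diffs))])   # recovers sorted_w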
Stochastic quantisation: SC is going asynchronous
In the face of quantised binary CNNs, whereby the arithmetic is at lower than 8-bit precision, no corresponding optimisation had been done on the SC CNN counterpart. SC can consume a lot of logic gates as well, especially in the CNN use case. Thus, Li et al. (2018b) proposed a novel multiplier with a shifted unary code (SUC) adder. Following the binary interlaced SC research, the weights do not have to follow a probability distribution as the pixel values do, as long as the next SC component is not computing in the stochastic domain. By strategically using the weight information as a timing control for the SC multiplication, the meaningful bits from each stream can be quantised and unified into a single multiply-sum-averaged stochastic stream by OR-ing the parallel bitstreams asynchronously, as depicted in Fig. 15. The SUC adder significantly reduces the requirement for the parallel counter, whose internal FAs are expensive from the perspective of SC. The area and power savings are significant as a result, as much as 45.7% and 77.9% respectively relative to the usual unipolar SC, with less than 1% accuracy loss compared with the quantised binary CNN, paving the way for a more efficient parallel counting accumulation mechanism in SC CNN.
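The shifted-unary-code idea of Fig. 15A can be imitated in Python as follows (a behavioural sketch under our reading of the scheme: each of the four inputs is encoded as a unary burst confined to its own quarter of the stream, so a bitwise OR yields a stream whose value is the scaled sum a/4 + b/4 + c/4 + d/4, with no parallel counter involved).

def shifted_unary(value, slot, num_slots, length):
    """Unary-code `value` inside time slot `slot` of an otherwise-zero stream."""
    slot_len = length // num_slots
    ones = round(value * slot_len)
    stream = [0] * length
    for i in range(slot * slot_len, slot * slot_len + ones):
        stream[i] = 1
    return stream

def suc_add(values, length=16):
    """OR the non-overlapping shifted unary streams together."""
    n = len(values)
    streams = [shifted_unary(v, i, n, length) for i, v in enumerate(values)]
    return [max(bits) for bits in zip(*streams)]      # bitwise OR across streams

a, b, c, d = 0.75, 0.5, 1.0, 0.25
out = suc_add([a, b, c, d])
print(sum(out) / len(out))        # 0.625 = (a + b + c + d) / 4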
Figure 14 Differential MAC. Major overhauling of the SC MAC with counter and differential weight control indexing to pipeline the SC MAC computation.
Figure 15 Stochastic quantisation accumulation with SUC adder. (A) The different information portions of the stochastic streams can be encoded into a single stream by OR-ing the required bitstreams asynchronously. (B) SUC paired with the SC sigmoid activation function.