key: cord-0431212-f7qbwasr
authors: Prabakaran, Bharath Srinivas; Akhtar, Asima; Rehman, Semeen; Hasan, Osman; Shafique, Muhammad
title: BioNetExplorer: Architecture-Space Exploration of Bio-Signal Processing Deep Neural Networks for Wearables
date: 2021-09-07
journal: nan
DOI: 10.1109/jiot.2021.3065815
sha: e8eccab79ea212d7fed22fc8bb55a866d035b094
doc_id: 431212
cord_uid: f7qbwasr

In this work, we propose the BioNetExplorer framework to systematically generate and explore multiple DNN architectures for bio-signal processing in wearables. Our framework adapts key neural architecture parameters to search for an embedded DNN with a low hardware overhead, which can be deployed in wearable edge devices to analyse the bio-signal data and to extract the relevant information, such as arrhythmia and seizure. Our framework also enables hardware-aware DNN architecture search using genetic algorithms by imposing user requirements and hardware constraints (storage, FLOPs, etc.) during the exploration stage, thereby limiting the number of networks explored. Moreover, BioNetExplorer can also be used to search for DNNs based on the user-required output classes; for instance, a user might require a specific output class due to genetic predisposition or a pre-existing heart condition. The use of genetic algorithms reduces the exploration time, on average, by 9x, compared to exhaustive exploration. We successfully identify Pareto-optimal designs, which can reduce the storage overhead of the DNN by ~30 MB for a quality loss of less than 0.5%. To enable low-cost embedded DNNs, BioNetExplorer also employs different model compression techniques to further reduce the storage overhead of the network by up to 53x for a quality loss of <0.2%.

IoT (Internet-of-Things) devices are expected to generate nearly 79.4 Zettabytes (×10^12 Gigabytes) of data annually [1]. This includes the data collected by autonomous cars, home-automation environments, smart wearables, etc. Recent estimates suggest that the number of smart wearables and fitness trackers in the market will rise to ~356 million by the end of 2020, with each device generating roughly 560 Megabytes of data every month (averaging ~185 Petabytes, monthly) [2]. These trends are expected to rapidly escalate due to the current Covid-19 outbreak. In such situations, low-cost wearables can be widely used for personalized monitoring and prediction to contain any current/future pandemics [3] [4]. The data collected from these smart edge-devices is typically transmitted to the cloud for filtering, processing, and mining to extract valuable information, which the end-user or the system developer can use to provide health and lifestyle recommendations that improve the users' quality-of-life. However, transmitting such a large amount of data to a fog device incurs significant time, power, and energy overheads on the wearable device [5]. Besides these overheads, transmitting data to the fog or cloud may create security and privacy concerns for users who might not wish to transfer their physiological information over untrusted networks or store their data on third-party cloud platforms and data-servers [6].

To address these issues, the Edge Computing paradigm has emerged [12], which brings the computing layer closer to the sensor-nodes, i.e., the edge devices where the data is collected, to improve response times and to reduce energy consumption. For example, such a system model is also adopted in the so-called Near-Sensor Computing paradigm [13]. However, these edge devices are typically not capable of executing compute- and memory-intensive data processing algorithms like complex Deep Neural Networks (DNNs), which are currently used for various purposes such as classification, long-/short-term predictions, and anomaly detection in bio-signal data. Fig. 1 presents an analysis of the storage overhead and number of floating-point operations required for a single inference of different state-of-the-art DNNs [7] [8] [9] [10] [11].

From these results, one can make the following observations:

• DNNs are application dependent, i.e., each application, based on its requirement, needs a specific trained DNN architecture that could be very different from the architecture used by another application. For example, DenseNet-190 is used for image classification, whereas Deep Speech-II is considered to be state-of-the-art for speech recognition;
• DNNs incur a large on-chip memory overhead for storing the network's parameters, which are required for executing the DNN;
• DNNs require a large number of floating-point operations to be executed per second to enable the system to process data in real-time.

Based on these observations, the following research challenges need to be addressed to enable embedded DNNs for resource-constrained wearables:

• Based on the number of DNN parameters and their possible values, the design space of DNNs can explode, leading to a very large number of networks in the architecture-space that need to be exhaustively trained and evaluated. This is a complex problem to solve, requiring a huge amount of resources for network search and training. These kinds of resources may only be affordable by large-scale enterprises.

The key scientific questions that we target in this paper are:

[Q1] How do we traverse the architecture-space of DNNs to effectively reduce the exploration time, while identifying DNNs that trade off output quality against hardware overhead?
[Q2] Can we effectively and simultaneously optimize both the output quality and the hardware overhead, instead of just one of these factors, while traversing the DNN architecture-space?

• Besides exploring the conventional DNN architecture-space, model compression techniques, such as Pruning and Quantization, can also be used to reduce the hardware overhead of the network. In this regard, the question is:

[Q3] How would different pruning and quantization techniques impact the quality of the bio-signal processing application and minimize the hardware overhead of the DNN?

To address these research challenges, we propose the following novel contributions:

• We propose a novel BioNetExplorer framework, which systematically varies the key neural architecture parameters of a DNN to construct an architecture-space that can be subsequently explored and evaluated to identify networks that satisfy the user requirements and hardware constraints of the target platform.
• We investigate four well-established genetic algorithms to identify near-optimal points/designs without exhaustively exploring the complete search-space and study their outcome when applied to our architecture-space.
• We propose a weighted cost function that can be used to simultaneously optimize output quality and hardware overhead. The weights allocated to these two factors will determine the primary optimization goal of the genetic algorithm-based search and its priority.
• Our BioNetExplorer framework effectively constructs a new architecture-space of the near-optimal DNN by varying the key pruning and quantization parameters. It then explores this architecture-space to identify networks that minimize the hardware overhead with minimal or no quality loss.

Open-source Contributions: The complete framework along with the Pareto-optimal networks for all the explored DNN architecture-spaces will be made open-source and available online at https://bionetexplorer.sourceforge.io. Fig. 2 illustrates an overview of the proposed contributions integrated in our novel BioNetExplorer framework. 

Wearables are highly pervasive smart electronic devices that are used to improve the users' quality-of-life. This includes a wide range of devices like smart-watches, such as the ones manufactured by Apple [14] and Samsung [15] , fitness trackers, like Fitbit [16] , sports watches, smart clothing, bio-implants, etc. One of the most important functions offered by these devices is physiological signal monitoring, which can be used to track the health of its users [17] . For example, the Apple Watch and Fitbit can be used to monitor the electrical signals associated with a heartbeat and record them in the form of an electrocardiogram (ECG). The data gathered from these devices are transmitted to the cloud for further processing and detecting anomalies such as Atrial Fibrillation [18] . There has been a wide range of research works that have focused on the adoption of wearables for monitoring the user's health [19] [20] [21] . For instance, Sun et al. [22] have proposed an optimized vector transformation approach that reduces the overhead of key generation, thereby enabling lightweight encryption and decryption in smart-health IoT infrastructure. Similarly, to enable real-time feedback to the user, a coding-free control system for an Industrial IoT (IIoT) infrastructure was presented in [23] .

There has been a wide range of works on Neural Architecture Search (NAS), from the use of basic algorithms to reinforcement learning, for the sole purpose of designing DNNs. A comprehensive survey of the neural architecture search approaches based on the search space, the strategy of search, and performance of the approach is presented in [24]. One of the most popular techniques for hyper-parameter optimization is Bayesian Optimization (BO) [25]. However, it has not yet been extensively explored for NAS. This is mainly because the BO toolboxes are based on Gaussian processes and tend to focus on low-dimensional continuous optimization problems, and are, thus, not suitable for architecture-space exploration [24]. A recurrent neural network (RNN), which can be used to generate DNN architectures for classifying objects in the CIFAR-10 image dataset, is presented in [26]. The RNN is trained using a reinforcement learning approach to maximize the accuracy of the networks generated. Liu et al. [27] presented a sequential model-based optimization strategy, which can be used to generate convolutional neural networks 5× as efficiently as the approach presented in [26]. In 2019, Liu et al. [28] proposed a network-level structure search combined with the traditional cell-level structure search to compose a hierarchical architecture space of state-of-the-art DNNs. Typically, the state-of-the-art approaches for NAS focus on image classification and image segmentation [29] without much focus on other application domains such as bio-signal processing. Recently, Tan et al. [30] proposed a platform-aware neural architecture search approach, which focuses solely on image classification and object recognition. Similarly, several other recent works presented platform-aware DNN architecture-space exploration approaches that are targeted for image classification [31] [32] [33]. These approaches do not offer the same accuracy-hardware trade-offs for other application domains, such as bio-signal processing, as illustrated in Section V-A.

Typically, Model Compression techniques such as Pruning and Quantization have been shown to be very effective in reducing the hardware overheads and computational requirements of DNNs. Han et al. [34] proposed a layer-wise approach to prune and quantize DNNs that are used for image classification. Anwar et al. [35] presented a technique that can be used to introduce structured sparsity in the network to reduce the representation effort and maximize parallel computation. As an alternative approach to structured pruning, Luo et al. [36] presented a method that eliminates complete filters based on data from subsequent layers to reduce the number of computations required for inference and accelerate DNN execution. Marchisio et al. [37] improved upon the work presented in [34] by introducing an iterative class-blind approach for pruning weights in DNNs to further reduce the hardware overhead of the network. Lin et al. [38] proposed a framework that can be used to prune a DNN dynamically at runtime, based on the input image and feature maps, to preserve the full ability of the original DNN. Similarly, there is a wide range of techniques focused on quantizing DNNs to reduce their hardware overhead [39] [40] [41] [42] [43] . Besides NAS and model compression, Kumar et al. [44] and Gural et al. [45] presented effective ML-based approaches that can be deployed on resource-constrained IoT devices to perform search predictions and image classification.

In this work, we primarily focus on the automatic generation and exploration of bio-signal processing DNNs that can be deployed on resource-constrained wearable devices. Plenty of research works have used manually-designed DNNs (i.e., not obtained through automatic neural architecture search) for classification of ECG signal components and/or arrhythmia [46] [47] [48] . The DNN architecture proposed by Hannun et al. [48] can classify 12 rhythm classes on their custom ECG dataset and is considered to be the state-of-the-art for arrhythmia classification. Our BioNetExplorer framework adopts the basic building block used in [48] to generate and explore DNN architectures for specific arrhythmia categories, as per user requirement, which can be deployed on wearables like smart-watches or fitness trackers. Note, we have to modify the input and output layers of this DNN to ensure that the network can process data from the available training dataset and classify them into the required output classes based on the target application. We have also included a Long Short-Term Memory (LSTM) [49] layer at the end of the DNN to improve accuracy in worst-case scenarios, i.e., lower number of feature extraction layers. Our ECG processing DNN architecture, modified from the DNN presented in [48] , is illustrated in Fig. 3 . Given such a DNN architecture, the challenge is to explore the following architectural parameters: (i) Number of ResNet Blocks; (ii) Number of Filters; and (iii) Number of LSTM Cells. This requires us to generate and explore different DNN variants considering these network parameters to identify a set of Pareto-optimal configurations for the DNN architecture. This provides a low-complexity solution instead of employing a full-scale NAS from scratch. 

An overview of our BioNetExplorer framework is presented in Fig. 4 . It considers the user requirements (e.g., output classes and quality) to construct and label the comprehensive bio-signal data-set, and hardware constraints (e.g., Floating-point Operations (FLOPs) and memory storage overhead (MB)) to generate and explore the architecture-space of bio-signal processing DNNs for wearables.

The BioNetExplorer framework restricts the output classes of the DNN being explored. For instance, in case the user requires another output class besides the two base cases of normal and anomalous, due to pre-existing conditions, another output class and the associated label can be included by our framework to search for an efficient DNN architecture in the new architecture-space. Note, the labeled data associated with the new output class has to be present in the dataset. We consider four metrics for evaluating the quality (Q_DNN) of DNNs, namely, Accuracy (A_DNN), Recall (R_DNN), Precision, and F1-score. Recall is the ability of the DNN under consideration to find all relevant cases of a specific class within the dataset, while Precision is the fraction of the cases assigned to that class by the DNN that are actually relevant. F1-score is defined as the harmonic mean of Precision and Recall. The user can define a minimum quality constraint using any of the aforementioned metrics to ensure that the final network achieved by the exploration satisfies this constraint.
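As an illustration of the four quality metrics and the user-defined minimum quality constraint, the following is a minimal sketch in plain Python. The function names `per_class_metrics` and `meets_quality_constraint`, and the labels "N" and "A", are hypothetical and not part of the framework.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Accuracy over all samples; precision, recall and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def meets_quality_constraint(metrics, metric_name, minimum):
    """True if the chosen metric reaches the user-defined minimum."""
    return metrics[metric_name] >= minimum


if __name__ == "__main__":
    truth = ["N", "N", "A", "A", "A", "N"]      # hypothetical labels
    predicted = ["N", "A", "A", "A", "N", "N"]  # hypothetical predictions
    m = per_class_metrics(truth, predicted, positive="A")
    print(m, meets_quality_constraint(m, "recall", 0.6))
```

Any one of the four values can serve as the constraint, which mirrors the choice left to the user in the text.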

Besides the quality constraint, a hardware-level constraint on the networks being explored ensures that these DNNs do not require compute or memory resources beyond the configuration of the target wearable platform. We can estimate the hardware requirements of a DNN using metrics such as Storage Overhead (S_DNN), based on the number of parameters (weights and biases) of the network, and Floating-point Operations (F_DNN), which determine the execution time of the DNN on the target platform. Due to the hardware constraints of the wearable device, an efficient DNN that can be executed on such devices needs to offer the best possible quality with the minimum possible S_DNN and F_DNN. This typically creates a trade-off, wherein the quality of the DNN decreases when we reduce the number of parameters to curtail the memory footprint/FLOPs, and vice-versa.
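A rough sketch of how the storage overhead can be estimated from the parameter count and checked against platform limits is given below. It assumes 32-bit parameters and 1 MB = 10^6 bytes. The function names and the limit values are illustrative assumptions, not taken from the framework.

```python
def storage_overhead_mb(num_parameters, bits_per_parameter=32):
    """Estimated storage in MB (assumption: 1 MB = 10**6 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e6


def satisfies_hardware_constraints(num_parameters, flops, max_storage_mb, max_flops):
    """True if both the storage and the FLOPs budget of the target are met."""
    return storage_overhead_mb(num_parameters) <= max_storage_mb and flops <= max_flops


if __name__ == "__main__":
    # Hypothetical network: 1M parameters, 50M FLOPs per inference.
    print(storage_overhead_mb(1_000_000))  # 4.0 MB at 32 bits per parameter
    print(satisfies_hardware_constraints(1_000_000, 50e6, max_storage_mb=8, max_flops=1e8))
```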

Our framework constructs a dataset for each specific DNN exploration by combining labels and creating the desired output classes from the primary bio-signal dataset. Each of the labels/annotations in the primary dataset should correspond to one of the output classes of the explored DNN in the final constructed dataset. For example, in our case study, we create a new dataset by grouping the signal into windows of 256 bio-signal samples and assigning a specific label to each window based on the original labels of the signals present. Each of the 41 beat and non-beat annotations in the dataset is eventually classified into one of the output labels in each DNN exploration. We use 70% of the constructed datasets for training the DNN architecture, 10% for validating the network, and the remaining 20% for testing and estimating the output quality of the DNN. We explain this scenario with the help of a few different examples in Section V.
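The dataset construction described above can be sketched as follows. The window length of 256 and the 70/10/20 split come from the text. The labeling rule (a window is "anomalous" if any sample in it carries a non-normal annotation), the annotation symbol "N", and the function names are assumptions made only for illustration.

```python
import random


def make_windows(signal, annotations, window=256):
    """Group samples into non-overlapping windows and label each window."""
    windows = []
    for start in range(0, len(signal) - window + 1, window):
        chunk = annotations[start:start + window]
        # Assumed rule: any non-normal annotation makes the window anomalous.
        label = "anomalous" if any(a != "N" for a in chunk) else "normal"
        windows.append((signal[start:start + window], label))
    return windows


def split_70_10_20(items, seed=0):
    """Shuffle and split into training, validation and test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 7 // 10
    n_val = len(items) // 10
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


if __name__ == "__main__":
    signal = [0.0] * 2560              # placeholder samples
    annotations = ["N"] * 2560
    annotations[300] = "V"             # one hypothetical abnormal beat
    train, val, test = split_70_10_20(make_windows(signal, annotations))
    print(len(train), len(val), len(test))
```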

After collecting information regarding all the relevant constraints and requirements that can be used to restrict the architecture-space, and the dataset required for the target application, we generate the set of possible DNN architectures (ψ) by varying three key neural architecture parameters, namely, (1) #ResNet Blocks, (2) #Filters, and (3) #LSTM Cells. These three parameters can potentially take up any value in the domain of ℝ⁺, thereby creating an unbounded design space that is impossible to explore. Therefore, in this work, we consider the parameters of the current state-of-the-art DNN architectures as the upper-bounds to our architecture-space exploration. Precisely, in our case-study, the ranges of these parameters are as follows: 1) The #ResNet Blocks [50] can vary from 0 to 15, with each block being composed of two 1-D convolution layers, followed by batch normalization [51], Rectified Linear Unit (ReLU) activation, and dropout [52]; 2) The #Filters (size 16) in each convolution layer is defined as 32 × 2^y, where the value of y, starting from 0, is incremented by 1 after every x ResNet Blocks, where x ∈ [1, 4]; and 3) The #LSTM cells can vary as a function of 2^z, where z ∈ [4, 8]. Due to the upper-bound on the number of parameters and storage size imposed by the state-of-the-art networks, we can reduce the number of networks to be explored to realistic limits. For example, in our case study (see Section V), the number of explored networks is reduced from 320 to 135. It is important to note that, since the exploration of the architecture-space is limited by the current generation of state-of-the-art architectures, any major upgrades in the architectures of the state-of-the-art will, in turn, affect the architecture-space of the explored DNNs.
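The parameter ranges above give 16 × 4 × 5 = 320 candidate configurations before the upper-bound filter is applied. A short sketch that enumerates them and derives the per-block filter counts is shown below. The dictionary layout and function names are hypothetical.

```python
from itertools import product

RESNET_BLOCKS = range(0, 16)    # 0 to 15 ResNet blocks
FILTER_STEP_X = range(1, 5)     # x in [1, 4]
LSTM_EXPONENT_Z = range(4, 9)   # z in [4, 8], giving 16 to 256 LSTM cells


def filters_per_block(num_blocks, x):
    """#Filters for each block: 32 * 2**y, with y incremented every x blocks."""
    return [32 * 2 ** (block // x) for block in range(num_blocks)]


def architecture_space():
    """Enumerate every (blocks, x, LSTM cells) combination."""
    return [
        {"blocks": b, "x": x, "lstm_cells": 2 ** z}
        for b, x, z in product(RESNET_BLOCKS, FILTER_STEP_X, LSTM_EXPONENT_Z)
    ]


if __name__ == "__main__":
    print(len(architecture_space()))   # 320
    print(filters_per_block(4, 2))     # [32, 32, 64, 64]
```

Filtering this list by the storage and parameter upper-bounds is what reduces the space to the 135 networks reported in the case study; that filter is not reproduced here because the bounds are not stated in this section.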

Each of the DNNs obtained from varying the architectural parameters needs to be trained using the constructed bio-signal data-set to detect anomalies and/or specific conditions. Since training is a compute-intensive task, it is important to (i) generate fewer DNN models (as discussed above), and (ii) explore the search-space fast. At this stage, there are two ways to explore the architecture-space:

(1) Exhaustive Search, which involves training all the DNNs in the architecture-space to identify the set of Pareto-optimal DNN architectures that trade off QoS for S_DNN/F_DNN and vice-versa. Training all the networks in the architecture-space requires hundreds of GPU-hours depending upon the size of the networks being explored. Due to the practical constraints imposed by hardware resources present on wearables and the upper bound imposed by state-of-the-art DNNs, we have a curtailed architecture-space, for which we can perform an exhaustive search. Therefore, for benchmarking the efficiency of the genetic algorithms and random search, we additionally implemented the exhaustive search. However, when the network complexity, number of parameters, parameter instances, and the number of hyper-parameters increase, thereby leading to exponentially large architecture-spaces, we usually rely on genetic-algorithm based architecture-search methods.

(2) Genetic Algorithm-Based Search, which involves effectively training a small subset of the networks in the architecture-space to reduce the training time to only tens of GPU-hours. Genetic algorithms have proved to be quite effective in utilizing the given cost function to obtain near-optimal solutions while reducing the time required for exploration [53] . Towards this, we leverage the Genetic Algorithms that rely on the biological principles of reproduction and evolution to create a new generation of networks that can potentially perform better than the previous generation to optimize the cost function. The scope of our work is limited to the investigation of genetic algorithms, a small subset of meta-heuristics that encompasses other techniques like Tabu Search [54] , Simulated Annealing [55] , Ant Colony Optimizations [56] , etc., for the exploration of our DNN architecture-spaces.
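Both search strategies ultimately aim at designs that are not dominated in the quality/overhead trade-off. A minimal sketch of extracting such a Pareto-optimal set from evaluated networks follows. The (quality, overhead) tuples are made-up values used only to exercise the function.

```python
def pareto_front(designs):
    """Keep designs not dominated by any other design.

    Each design is (quality, overhead): higher quality and lower overhead
    are better. A design is dominated if another one is at least as good
    in both objectives and strictly better in at least one.
    """
    front = []
    for i, (q_i, s_i) in enumerate(designs):
        dominated = any(
            q_j >= q_i and s_j <= s_i and (q_j > q_i or s_j < s_i)
            for j, (q_j, s_j) in enumerate(designs)
            if j != i
        )
        if not dominated:
            front.append((q_i, s_i))
    return front


if __name__ == "__main__":
    # Hypothetical (accuracy, storage in MB) pairs.
    evaluated = [(0.990, 40.0), (0.985, 10.0), (0.970, 12.0), (0.950, 5.0)]
    print(pareto_front(evaluated))
```

The pairwise comparison is quadratic in the number of designs, which is acceptable for architecture-spaces of a few hundred networks.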

Initially, we start with a set of random individuals/DNNs in the architecture-space, which we refer to as the Initial Population. Based on our experiments and recommendation from previous works [57] , we have identified that a population size of 30 offers the best results in terms of designs obtained for our architecture-space ψ. Each of the three key network parameters (#Filters, #ResNet Blocks, #LSTM cells) that can be varied in the DNN architecture-space is encoded as a Gene.

These genes are joined together in the form of a string to generate a Chromosome, which can be decoded to construct the DNN architecture. Based on the number of possible values for each gene, our chromosome is a binary string of length 9, as illustrated in Fig. 5. The next step is to evaluate the viability of an individual (i.e., a DNN architecture) based on its ability to compete with the other individuals. This is known as the Fitness Value, which is subsequently evaluated for each individual in the population and depends on the cost function φ. The fitness value is given by the cost function when the decoded individual (ARCH) corresponds to a DNN in the existing architecture-space ψ; otherwise, it is taken to be 0 and the individual is discarded from the search. For example, even though the #LSTM cells is encoded using 3 bits, the gene only has 5 possible values. This leads to 3×2^5 potential DNN architectures that are not present in the architecture-space ψ, which can be immediately discarded. Next, through a process called Selection, we pick two individuals, based on their fitness values, and have them pass their genes to the next generation. At this point, two important factors come into the picture, namely Crossover and Mutation. For each pair of parents selected for mating, an ordered crossover occurs with a probability of 0.4, and a crossover point in the parents' chromosomes is selected at random. The offspring are created by exchanging the genes of the parents from the start of the chromosome until the crossover point. The new offspring generated by this crossover mechanism are added to the population of the next generation. Furthermore, the offspring's genes can undergo a shuffle-index mutation with a probability of 0.11, which would flip the bit, thereby introducing "diversity" in the population and preventing premature convergence. An example illustrating the mechanism of crossover and mutation is depicted in Fig. 6. We determine a population of size 30, based on the individuals' fitness values, and re-iterate to produce 5 generations of offspring to explore and identify the individuals that offer the best fitness value. Fig. 7 presents a flow chart depicting the stages for our genetic algorithm-based DNN architecture-space exploration.
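The encoding, fitness, crossover, and mutation steps can be sketched as below. The bit layout (4 bits for #ResNet Blocks, 2 bits for x, 3 bits for the LSTM exponent) is an assumption, since Fig. 5 is not reproduced here. The sketch implements the behaviour described in the text (exchange up to a random crossover point, bit flips), with the mutation probability applied per bit as a further assumption. The probabilities 0.4 and 0.11 are the values quoted above.

```python
import random


def decode(chromosome):
    """Decode a 9-bit string; return None if it lies outside the space."""
    assert len(chromosome) == 9
    blocks = int(chromosome[0:4], 2)     # assumed gene 1: 0..15 ResNet blocks
    x = int(chromosome[4:6], 2) + 1      # assumed gene 2: x in 1..4
    z_index = int(chromosome[6:9], 2)    # assumed gene 3: 0..7, only 0..4 valid
    if z_index > 4:
        return None
    return {"blocks": blocks, "x": x, "lstm_cells": 2 ** (4 + z_index)}


def fitness(chromosome, cost):
    """Cost-function value for valid individuals, 0 for invalid ones."""
    arch = decode(chromosome)
    return 0.0 if arch is None else cost(arch)


def crossover(parent_a, parent_b, rng, p_crossover=0.4):
    """Swap the genes from the start of the chromosome up to a random point."""
    if rng.random() >= p_crossover:
        return parent_a, parent_b
    point = rng.randint(1, len(parent_a) - 1)
    return parent_b[:point] + parent_a[point:], parent_a[:point] + parent_b[point:]


def mutate(chromosome, rng, p_mutation=0.11):
    """Flip each bit independently with probability p_mutation."""
    flip = {"0": "1", "1": "0"}
    return "".join(flip[b] if rng.random() < p_mutation else b for b in chromosome)


if __name__ == "__main__":
    rng = random.Random(0)
    child_a, child_b = crossover("000000000", "111111100", rng, p_crossover=1.0)
    print(child_a, child_b, decode(mutate(child_a, rng)))
```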

Without loss of generality, in this work, we consider four well-known genetic algorithms for exploring our architecture-space, namely, Roulette Wheel [58], Tournament Selection [59], NSGA-II [60], and SPEA-2 [61]. The time complexity of our BioNetExplorer framework depends on the size of the architecture-space (N) and the computational complexity of the genetic algorithms used for exploring the architecture-space. Roulette Wheel has a time complexity of O(N log N) when using binary search in the selection process. Similarly, Tournament Selection, NSGA-II, and SPEA-2 have a time complexity of O(N), O(N^2), and O(N^2 log N), respectively. Since each of these algorithms selects and trains different DNN architectures as part of its exploration, we will also compare the efficiency of these algorithms, as illustrated in Section V. These algorithms require us to propose a cost function (φ), which the genetic algorithm can optimize (minimize/maximize) to obtain a set of near-optimal DNN architectures (ω) for the given exploration. To ensure that the obtained design offers the best quality (Q_DNN) with minimum storage overhead (S_DNN) or F_DNN, we propose the following weighted cost functions:

[Cost-function equations and Table I are not recoverable from the extracted text. Surviving algorithm fragment:]
4: S_DNN.append(StorageOverhead(ARCH));
5: else
6: ψ.remove(ARCH);
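The O(N log N) figure quoted for Roulette Wheel selection follows from building a cumulative fitness table once and then locating each of N draws by binary search. A minimal sketch of that idea is given below, using Python's standard `bisect` module rather than the DEAP implementation used in the experiments. The function name and the fitness values are illustrative.

```python
import bisect
import itertools
import random


def roulette_wheel_select(fitness_values, k, rng):
    """Pick k indices with probability proportional to fitness."""
    cumulative = list(itertools.accumulate(fitness_values))  # O(N) setup
    total = cumulative[-1]
    assert total > 0, "at least one individual needs non-zero fitness"
    picks = []
    for _ in range(k):
        r = rng.random() * total
        # Binary search for the first cumulative value above r: O(log N).
        index = bisect.bisect_right(cumulative, r)
        picks.append(min(index, len(fitness_values) - 1))
    return picks


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical fitness values; the zero entry models a discarded individual.
    print(roulette_wheel_select([0.2, 0.0, 0.5, 0.3], k=6, rng=rng))
```

Individuals with a fitness of 0 occupy an empty slice of the wheel and are therefore never selected, consistent with discarding invalid chromosomes.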

Besides the use of architecture-space exploration to identify networks with similar output quality and low hardware overhead, we include the ability to further compress the network size using model compression techniques, like Pruning and Quantization, in our framework, and study their impact on the quality of bio-signal processing applications.

Pruning: This technique involves eliminating redundant and less important weights/kernels, or at times layers in the network, to further reduce the DNN's hardware overhead, thereby increasing their deployability in wearables. Furthermore, eliminating weights and network connections results in reducing the number of floating-point operations (F ), which also decreases the execution time and energy required for inference. We retrain the network with the remaining weights to achieve an output quality similar to that of the original network obtained from the architecture-space exploration. We integrate different pruning techniques from [34] [35] [36] [37] [38] in our framework, from which the system designer can select the appropriate technique based on DNN design and application requirements. For instance, the techniques presented in [34] and [37] prune the weights based on their magnitude. The approach presented by Han et al. [34] eliminates the lowest x% of the absolute weights in each layer of the DNN and retrains the network to achieve the original network accuracy. On the other hand, the technique presented in [37] is class-blind, i.e., it eliminates x% of the lowest absolute weights in the entire DNN, irrespective of which layer the weight corresponds to, and retrains the network to improve its accuracy. We evaluate the effectiveness of these two pruning techniques in our case-study as well, see Section V. Note, other pruning techniques can also be integrated into our framework as long as they comply with the standard input/output interfaces of pruning.
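The difference between the layer-wise and the class-blind magnitude pruning described above can be sketched as follows. Layers are represented as flat lists of weights, the retraining step is omitted, and the function names and weight values are hypothetical. This is an illustration of the two selection rules, not the implementations from [34] or [37].

```python
def prune_layerwise(layers, fraction):
    """Zero the lowest-magnitude `fraction` of weights within each layer."""
    pruned = []
    for weights in layers:
        k = int(len(weights) * fraction)
        order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
        drop = set(order[:k])
        pruned.append([0.0 if i in drop else w for i, w in enumerate(weights)])
    return pruned


def prune_class_blind(layers, fraction):
    """Zero the lowest-magnitude `fraction` of weights across the whole DNN."""
    flat = [(abs(w), li, wi) for li, ws in enumerate(layers) for wi, w in enumerate(ws)]
    k = int(len(flat) * fraction)
    drop = {(li, wi) for _, li, wi in sorted(flat)[:k]}
    return [
        [0.0 if (li, wi) in drop else w for wi, w in enumerate(ws)]
        for li, ws in enumerate(layers)
    ]


if __name__ == "__main__":
    layers = [[0.9, -0.8, 0.7, -0.6], [0.05, -0.04, 0.03, 0.5]]
    print(prune_layerwise(layers, 0.5))    # half of every layer is removed
    print(prune_class_blind(layers, 0.5))  # only the globally smallest weights go
```

In this toy example the class-blind rule removes the entire second layer, which shows why retraining and a careful choice of x% matter for that variant.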

Quantization: The weights and biases in the network are typically stored as 32-bit floating-point numbers in memory, which leads to a large storage overhead, access latency, and energy consumption. Similarly, the energy required for performing a 32-bit floating-point ADD operation is 9× greater than the energy required for a 32-bit integer ADD operation [62]. Therefore, by limiting the set of possible values through a process known as quantization, we can reduce the hardware overhead of the DNN parameters from 32-bit floating-point values to 16, 8, or, as illustrated in our work, even fewer bits. The key is to analyze the impact of quantization on the quality of the resulting DNN for our targeted bio-signal processing application.
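To make the storage argument concrete, the sketch below maps the weights of one layer onto 2^q equally spaced levels between the layer's minimum and maximum, so that each weight is stored as a q-bit code. This plain min-max scheme is a simplification chosen for illustration and is not claimed to be the exact procedure of the framework. The function names are hypothetical, and the small per-layer codebook is ignored in the ratio.

```python
def quantize_uniform(weights, q):
    """Return q-bit integer codes and the codebook of 2**q level values."""
    levels = 2 ** q
    lo, hi = min(weights), max(weights)
    if hi == lo:                      # degenerate layer: a single level suffices
        return [0] * len(weights), [lo]
    step = (hi - lo) / (levels - 1)
    codes = [round((w - lo) / step) for w in weights]  # nearest level index
    codebook = [lo + i * step for i in range(levels)]
    return codes, codebook


def compression_ratio(q, original_bits=32):
    """Storage reduction when 32-bit parameters are replaced by q-bit codes."""
    return original_bits / q


if __name__ == "__main__":
    codes, codebook = quantize_uniform([0.0, 1.0, 2.0, 3.0], q=2)
    print(codes, codebook, compression_ratio(2))
```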

Based on our experiments and previous work [34], the optimal approach to achieve maximum reduction in hardware overhead of the network involves pruning the network to eliminate redundant or less important weights, followed by DNN parameter quantization. Quantization involves the formation of 2^q clusters (where q is the number of quantized bits) from the DNN parameters in each layer of the DNN using the k-means clustering algorithm. Next, we allocate equally spaced values to each cluster of weights in the layer, starting with the minimum value allocated to all zeros up to the maximum value allocated to all ones. For the purpose of uniformity, we use the same number of bits to quantize all layers (both convolutional and fully-connected) in the network. Note, different quantization techniques can be integrated into our BioNetExplorer framework if they comply with the input/output interfaces of standard quantization techniques.

IV. EXPERIMENTAL SETUP

Fig. 8 presents an overview of our experimental tool-flow for implementation and validation of the BioNetExplorer framework. The dataset used in this work is openly accessible and is made available online by the Physionet data bank [65]. Our framework uses the MIT-BIH ECG data [66] with 41 beat and non-beat annotations to construct new datasets and labels based on the application's requirement. Furthermore, to illustrate the benefits of including a unique output class, we include the CU Ventricular Arrhythmia dataset [67] that contains instances of ECG recordings with labels of ventricular tachycardia, ventricular flutter, and ventricular fibrillation. We use the Keras environment, which includes the TensorFlow machine learning platform, in Python to implement the DNN architectures that we explore in this work. The hyper-parameters are crucial for evaluating the accuracy of a DNN architecture. Therefore, we train the DNN architecture over multiple iterations for various values of the hyper-parameters in each iteration and select the hyper-parameter values that offer the best accuracy. The hyper-parameters that offer the maximum accuracy for these networks are presented in Table II. The four genetic algorithms that are used in this work are implemented using the DEAP library [68]. The parameters used by these algorithms and their values are presented in Table III. The exploration algorithm is implemented on top of the network training stage using multiple GPU servers, each with a Core i9-9900 CPU and two Nvidia RTX 2080 Ti GPUs. Furthermore, we enable the early-stopping mechanism during the network training stage to terminate execution when the loss of the DNN does not change between two consecutive epochs. The trained networks are then evaluated based on the number of parameters they require, which determines the storage overhead, and the number of floating-point operations required to perform a single inference, which, in turn, determines the execution time of the network on the target platform.

We illustrate the benefits of using the proposed BioNetExplorer framework with the help of an ECG signal processing DNN, which is used to classify the following cases: 

First, we evaluate the effectiveness of two state-of-the-art Neural Architecture Search approaches for our bio-signal processing DNN architecture-space, namely, (i) MnasNet [30] and (ii) MobileNetV3 [33]. The purpose of this experimental study is to corroborate our claims that (i) traditional NAS approaches designed for image classification applications do not work effectively for bio-signal processing applications and (ii) there is a need for, and a lack of, NAS frameworks for bio-signal processing applications, like the BioNetExplorer framework proposed in this work. Unlike image classification applications that have two-dimensional inputs, bio-signal data is typically present in a 1-dimensional format. Therefore, to accommodate this requirement, we replaced the traditional 2D normal and depth-wise convolution layers with 1D normal and separable convolution, respectively. We have explored five variants of the MnasNet discussed in [30] and the small and large variants of the MobileNetV3 discussed in [33], with various filter sizes ([3, 5, 7, 8, 16]) and depth multipliers ([1, 0.5, 0.25]).

The results of deploying these NAS techniques on the DNN 1 and DNN 4 architecture-spaces are presented in Figs. 9 and 10, respectively. The default blocks used in the MnasNet framework are better suited for images and/or 2-dimensional data, and therefore do not perform well for bio-signal processing applications, as illustrated in Figs. 9(a) and 10(a). The best performing networks obtained using the MnasNet framework (labeled A in both figures) consistently perform worse than the networks obtained by our BioNetExplorer framework, especially in cases where the number of output classes is greater than two (A in Fig. 10). The networks obtained by MobileNetV3-NAS (labeled B in both figures) perform much better than those of MnasNet for both illustrated cases, but are still sub-optimal when compared to the networks obtained by our framework. The basic ResNet block, adopted from [48] by our BioNetExplorer framework, is better suited for bio-signal processing applications than the blocks used in the state-of-the-art [30]-[33]. BioNetExplorer is successful in obtaining networks with better output quality for the same or reduced hardware overhead, and vice-versa, when compared to other existing approaches.

The approaches presented in [31] and [32] use a basic block structure that is replicated in different combinations for exploring the DNN architecture-space, which is similar to the block structure and approaches presented in MnasNet [30] and MobileNetV3 [33]. Therefore, these approaches would also behave similarly to the state-of-the-art techniques presented in Figs. 9 and 10.

Second, we illustrate the reduction in training time of the DNN architecture-space exploration when a genetic algorithm is used instead of exhaustive exploration for all five DNNs. Fig. 11 presents the time required for selecting and training the Pareto-optimal (Exhaustive Search) or near-optimal (Genetic Algorithm-Based Search) DNNs for various values of α and β. Similar to recent NAS studies [69], we also implement random search as a potential search strategy; it randomly selects and trains 10% of the architecture-space (see results in Fig. 11). Since random search has very low overhead and trains a fixed number of architectures regardless of the optimization goal, its training time stays the same for all scenarios. On average, the use of genetic algorithms reduces the exploration time by 9.03×. Note that exhaustive exploration might not be a viable option in all scenarios. For example, in applications that require the use of deeper and complex neural networks, the training time for each network would lead to an exponential increase in the exploration time while requiring hundreds of GPU-hours. Since genetic algorithms do not train all the networks in the architecture-space, the exploration time is limited to tens of GPU-hours. Furthermore, as a back-up strategy, our system supports the use of random search, which has a fixed time complexity, to provide a reasonably good solution.
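The random-search baseline can be sketched as below: enumerate the architecture-space, sample a fixed 10% of it, and keep the best candidate. The search-space dimensions (`blocks`, `filters`, `kernel`) and the `toy_score` function are hypothetical placeholders; in practice, scoring a candidate means training and evaluating the corresponding DNN.

```python
import itertools
import random


def random_search(space, evaluate, fraction=0.10, seed=0):
    """Evaluate a fixed fraction of the architecture-space and return the best candidate."""
    rng = random.Random(seed)
    candidates = [dict(zip(space, values))
                  for values in itertools.product(*space.values())]
    budget = max(1, round(fraction * len(candidates)))  # fixed training budget
    sampled = rng.sample(candidates, budget)
    scored = [(evaluate(arch), arch) for arch in sampled]
    best_score, best_arch = max(scored, key=lambda pair: pair[0])
    return best_arch, best_score, budget


# Hypothetical architecture-space with 4 * 3 * 5 = 60 candidates.
SPACE = {"blocks": [1, 2, 3, 4], "filters": [16, 32, 64], "kernel": [3, 5, 7, 8, 16]}


def toy_score(arch):
    """Placeholder for 'train the network and measure its quality'."""
    return arch["blocks"] * 0.5 + arch["filters"] * 0.01 - arch["kernel"] * 0.02


if __name__ == "__main__":
    arch, score, budget = random_search(SPACE, toy_score)
    print(f"trained {budget} of 60 candidates; best: {arch}")
```

Because the budget depends only on the size of the space, the cost of this strategy is the same for every choice of α and β.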

Networks from the DNN 1 and DNN 4 Architecture-Spaces

Next, we exhaustively explore all the architectures of DNN 1 and DNN 4, i.e., train all the architectures using the constructed datasets. We illustrate the trade-off between quality-of-service and memory storage overhead for the DNN 1 and DNN 4 architecture-spaces in Figs. 12 and 13, respectively. The Pareto-front obtained by random search (labeled B in both figures) is similar to the Pareto-front obtained using exhaustive exploration (labeled A in both figures) of the architecture-space, primarily due to the large number of inter-dependent DNN parameters and their effects on the final accuracy. The exhaustive exploration approach is able to successfully identify networks that trade off output quality against hardware overhead, as illustrated by the figures. For instance, we have successfully identified a DNN 4 network that reduces the hardware overhead by ~30 MB for a quality loss of 0.5% (labeled C in Fig. 13). The Precision, Recall, and F1-score for the Anomaly class are lower than those of the Normal class. Therefore, a quality constraint can be imposed on any of the four quality metrics based on the application requirement. The main drawback of this approach is the time required for training and evaluating all networks in the architecture-space of the DNN.
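Identifying the Pareto-front from a set of trained networks can be sketched as follows, assuming each network is summarized by an (accuracy, storage) pair where higher accuracy and lower storage are preferred. The data points are invented for illustration and are not results from the paper.

```python
def pareto_front(points):
    """Return the non-dominated (accuracy, storage_mb) points, sorted by storage.

    A point is dominated if another point is at least as good in both
    objectives and strictly better in at least one.
    """
    front = []
    for acc, mem in points:
        dominated = any(
            (a >= acc and m <= mem) and (a > acc or m < mem)
            for a, m in points
        )
        if not dominated:
            front.append((acc, mem))
    return sorted(front, key=lambda p: p[1])


if __name__ == "__main__":
    # Illustrative (accuracy %, storage MB) pairs, not measured values.
    networks = [(97.0, 40.0), (96.6, 10.0), (95.0, 12.0), (96.0, 5.0), (97.0, 45.0)]
    for acc, mem in pareto_front(networks):
        print(f"accuracy={acc}%  storage={mem} MB")
```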

Deep Neural Networks from the DNN 1 , DNN 2 , DNN 3 , and DNN 4 Architecture-Spaces

As illustrated in Fig. 11, the training-time benefits obtained by using genetic algorithms to search and select points from the architecture-space are quite significant. We use the four genetic algorithms (see Section III) instead of exhaustive exploration to search the architecture-space for all five DNNs described at the start of Section V, using equal weights (α = 0.5, β = 0.5) to construct the cost function.

We illustrate a subset of the results obtained from these experiments in Figs. 14, 15, 16, and 17. Genetic algorithms identify near-optimal designs without traversing the entire architecture-space of DNNs, thereby reducing the exploration time. It is important to note that a design selected by one algorithm might not be selected by another algorithm (for example, labels A and B in Figs. 16 and 17, respectively). Therefore, a genetic algorithm should be selected based on the requirements of the application and the system designer. This requires a preliminary analysis and evaluation of all genetic algorithms under consideration. It is also interesting to note that NSGA-II and SPEA-2 select and train a smaller subset of the networks from the architecture-space than the Roulette Wheel and Tournament Search techniques.
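To illustrate why different algorithms can end up selecting different designs, the following standard-library sketch contrasts the two simplest selection schemes mentioned above. It is an illustration only; this work uses the DEAP library, which ships its own selection operators (e.g., `deap.tools.selTournament` and `deap.tools.selRoulette`). The tournament size `k` and the toy population are assumptions.

```python
import random


def tournament_select(population, fitness, k=3, rng=random):
    """Pick k random individuals and return the fittest of them."""
    contenders = rng.sample(population, k)
    return max(contenders, key=fitness)


def roulette_select(population, fitness, rng=random):
    """Pick one individual with probability proportional to its (non-negative) fitness."""
    weights = [fitness(individual) for individual in population]
    return rng.choices(population, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy population: each individual is just its own fitness value.
    population = [0.2, 0.5, 0.9, 0.4, 0.7]
    print(tournament_select(population, fitness=lambda x: x, rng=rng))
    print(roulette_select(population, fitness=lambda x: x, rng=rng))
```

Tournament selection only compares fitness ranks within a small random group, whereas roulette-wheel selection is sensitive to the absolute fitness values, so the two schemes steer the search towards different regions of the architecture-space.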

E. Proposed Weighted Architecture Search and Analysis of Networks from DNN 3 , DNN 4 , and DNN 5 Architecture-Spaces

Next, we explore the architecture-space of all DNNs using the proposed weighted DNN architecture search approach illustrated in Algorithm 1. Although we have explored the architecture-space of all DNNs using all the genetic algorithms for various values of α and β, we illustrate a subset of these results for DNN 3, DNN 4, and DNN 5 architectures for three cases, namely, (i) focus on minimizing storage overhead (α = 0.2 and β = 0.8), (ii) simultaneously optimize for quality and storage overhead (α = 0.5 and β = 0.5), and (iii) focus on maximizing quality (α = 0.8 and β = 0.2), in Figs. 18, 19, and 20.
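The effect of the weights in these three cases can be sketched with a simple weighted cost. The exact cost function is defined with Algorithm 1 and is not reproduced here; the form below (α weighting the quality loss, β weighting the normalized storage, lower cost preferred) is an assumption that is merely consistent with the three cases described above, and the candidate networks are invented.

```python
def weighted_cost(accuracy, storage_mb, max_storage_mb, alpha=0.5, beta=0.5):
    """Assumed cost: alpha * quality loss + beta * normalized storage (lower is better)."""
    quality_loss = 1.0 - accuracy
    storage_fraction = storage_mb / max_storage_mb
    return alpha * quality_loss + beta * storage_fraction


def best_candidate(candidates, max_storage_mb, alpha, beta):
    """Return the name of the candidate with the lowest weighted cost."""
    return min(
        candidates,
        key=lambda name: weighted_cost(*candidates[name], max_storage_mb, alpha, beta),
    )


# Hypothetical networks: name -> (accuracy as a fraction, storage in MB).
CANDIDATES = {"small": (0.900, 5.0), "large": (0.975, 10.0)}
MAX_STORAGE_MB = 40.0  # hypothetical largest network in the space

if __name__ == "__main__":
    for alpha, beta in [(0.2, 0.8), (0.5, 0.5), (0.8, 0.2)]:
        choice = best_candidate(CANDIDATES, MAX_STORAGE_MB, alpha, beta)
        print(f"alpha={alpha}, beta={beta} -> {choice}")
```

With these illustrative numbers, a storage-focused weighting prefers the smaller network, while a quality-focused weighting prefers the larger one.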

The DNN architectures selected by the genetic algorithms clearly illustrate the effects of the cost function on the architecture-space exploration. For example, in Case-(i), a large number of points are selected closer to the low-memory region, because that is the focus of exploration (for example, label A in Fig. 19). Similarly, in Case-(iii), since the search focuses more on the accuracy, a large number of points with high accuracy are selected, as depicted by label B in Fig. 18. A large percentage of the DNN 5 architectures have 0% output quality (label C in Fig. 20) due to a bias against Ventricular Fibrillation in the number of samples present in the dataset.

We illustrate the benefits of pruning various percentages of DNN parameters on the above-discussed three different networks from the DNN 3 architecture-space using the technique presented in [37] (see Fig. 21). Similarly, we illustrate the reduction in storage overhead for various levels of quantization for the networks obtained from the DNN 2 architecture-space (see Fig. 22). Finally, we illustrate the benefits of combining pruning and quantization (discussed in [34] for image classification) on the networks obtained from the architecture-space of DNN 4 in Fig. 23.

Pruning, as a standalone technique, can be quite effective in reducing the hardware overhead by ~40% while improving the accuracy by up to 0.15%, as illustrated by labels A and D in Fig. 21. The labels B, C, and E denote DNN architectures that have an output quality similar to the original network while reducing the overhead by more than 50%. Networks of model M1 can endure the process of pruning to a significantly larger extent, without loss of accuracy, as compared to models M2 and M3, primarily due to the higher number of non-essential parameters in the network. Quantization is similarly useful in reducing the hardware overhead of the network by ~5× compared to the original DNNs for an output quality loss of < 0.1%, as illustrated by labels A, B, and C in Fig. 22. A combination of pruning and quantization can be used to achieve a 53× reduction in hardware overhead of the DNN for a quality loss of ~0.2%, as illustrated by A in Fig. 23. Similarly, the best networks obtained from models M2 and M3, depicted by labels B and C, have an output quality similar to that of network A.
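How pruning and quantization compound can be sketched with a small standard-library example. It uses plain magnitude-based pruning and uniform quantization as stand-ins, which are not necessarily the techniques of [37] and [34], and the compression estimate is idealized: it assumes 32-bit baseline parameters and ignores the index overhead of sparse storage.

```python
def magnitude_prune(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    n_prune = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def uniform_quantize(weights, bits):
    """Snap each weight to one of 2**bits evenly spaced levels between min and max."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    step = (hi - lo) / levels
    return [lo + round((w - lo) / step) * step for w in weights]


def ideal_compression_ratio(prune_fraction, bits, baseline_bits=32):
    """Idealized storage reduction: fewer parameters, each stored with fewer bits."""
    kept = 1.0 - prune_fraction
    return baseline_bits / (kept * bits)


if __name__ == "__main__":
    weights = [0.5, -0.1, 0.05, -0.9]
    print(magnitude_prune(weights, 0.5))
    print(uniform_quantize(weights, bits=2))
    print(ideal_compression_ratio(prune_fraction=0.5, bits=8))
```

In this idealized model the two reductions multiply, which is why combining both techniques can reach much larger factors than either one alone.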

Receiver operating characteristics (ROC) graphs are useful in visualizing and evaluating various types of classifiers. They have been widely used in evaluating the quality of DNNs and ML-based classifiers since the 1990s [70]. In this work, we use ROC graphs to analyze the effectiveness of the DNNs, obtained after pruning and quantization, that have similar output quality and hardware overhead. For this purpose, we explore the following two scenarios in each model of the DNN discussed in the previous section: (1) Maximize the DNN's accuracy with a constraint of 0.5 MB maximum hardware overhead (Fig. 24(a)); (2) Minimize the hardware overhead of the DNN with a minimum output quality of 96.7% (Fig. 24(b)). The results of this analysis are presented in Fig. 24. While model M1 has the best ROC, M2 and M3 have similar, albeit slightly attenuated, characteristics when compared to M1. We can clearly observe that the operating characteristics of the DNN obtained after pruning 90% of model M1's parameters and quantizing them with 3/4-bits have severely deteriorated. The pruned and quantized networks obtained from M2 and M3, which have similar output quality to the pruned and quantized network obtained from M1, exhibit better operating characteristics, similar to the original uncompressed networks. Therefore, these compressed DNNs are better suited for deployment in wearables.
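An ROC graph of the kind used in this analysis can be computed as sketched below for a binary (Normal vs. Anomaly) classifier. The sketch assumes distinct prediction scores, so that each sample contributes one threshold, and the labels and scores are invented for illustration.

```python
def roc_curve(labels, scores):
    """Return ROC points (false-positive rate, true-positive rate).

    labels: 1 for the positive (e.g., Anomaly) class, 0 otherwise.
    scores: classifier confidence for the positive class (assumed distinct).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _score, label in ranked:  # lower the threshold one sample at a time
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    labels = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    curve = roc_curve(labels, scores)
    print(curve)
    print(area_under_curve(curve))
```

A compressed network whose curve stays close to that of the uncompressed network retains its operating characteristics, even when the single-number accuracy of two candidates is similar.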

To further emphasize the efficacy of our framework, we have included additional results in the supplementary material.

To summarize and answer the questions raised in Section I:

[A1] We presented the BioNetExplorer framework, which can be used to systematically explore the architecture-space of Deep Neural Networks for bio-signal processing applications. Based on the required output classes, the user's quality requirements, and the hardware constraints of the target wearable device, we perform a genetic algorithm-based exploration of the architecture-space to identify (near-) optimal DNN architectures that can be deployed on the wearable, and benchmark this exploration against an exhaustive search, while reducing exploration time by 9×.

[A2] We proposed a weighted DNN architecture search algorithm with a modified cost function to simultaneously search for networks that offer high accuracy and minimum hardware overhead. Based on these explorations, we have obtained a wide range of (near-) optimal DNN architectures for all five cases that have been explored in Section V.

[A3] We have investigated the applicability of model compression techniques such as pruning and quantization in our framework to further reduce the hardware overhead of the DNN with minimal loss in output quality. The framework is successful in reducing the hardware overhead of the network by 53× for a quality loss of less than 0.2%. Furthermore, we have also illustrated that DNNs obtained through simultaneous quality and hardware overhead optimization that have been subsequently pruned and quantized exhibit better operating characteristics, when compared to DNNs that have been optimized for quality, making them more suitable for deployment in wearables.

Our framework can also be used to evaluate DNNs for various bio-signal processing use-cases besides anomaly detection in ECG signals, such as seizure detection and prediction using EEG signals, detection of anxiety attacks using a combination of ECG, blood pressure, and heart rate, detection of hypoxia using SpO2 measurements, etc. The BioNetExplorer framework and the explored DNN architectures will be open-sourced and available online at https://bionetexplorer.sourceforge.io to ensure reproducibility.

Semeen Rehman is currently with the Faculty of Electrical Engineering, Technische Universität Wien (TU Wien) as a tenure-track Assistant Professor. In October 2020, she received her habilitation in the area of Embedded Systems from TU Wien. Before that, she was a Postdoctoral Researcher with Technische Universität Dresden (TU Dresden) and Karlsruhe Institute of Technology (KIT), Germany, since 2015. In July 2015, she received her Ph.D. from Karlsruhe Institute of Technology (KIT), Germany. She has co-authored one book, multiple book chapters, and more than 50 publications in premier journals and conferences. Her main research interests include dependable systems, cross-layer design for error resiliency with a focus on run-time adaptations, emerging computing paradigms, such as approximate computing, hardware security, energy-efficient computing, embedded systems, MPSoCs, Internet

[2] Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update

[3] Coronavirus disease 2019 (COVID-19): situation report

[4] Germany Launches Smartwatch App to Monitor Coronavirus Spread

[5] Energy-efficient long-term continuous personal health monitoring

[6] Secure integration of IoT and cloud computing

[7] Deep Speech 2: End-to-end speech recognition in English and Mandarin

[8] Google's neural machine translation system: Bridging the gap between human and machine translation

[9] Harmonious attention network for person re-identification

[10] CondenseNet: An efficient DenseNet using learned group convolutions

[11] Convolutional sequence to sequence learning

[12] A survey on mobile edge computing: The communication perspective

[13] Near-sensor and in-sensor computing

[14] Apple Watch Series 5

[15] Galaxy Watch

[16] Fitness Tracker

[17] A survey of wearable devices and challenges

[18] Rationale and design of a large-scale, app-based study to identify cardiac arrhythmias using a smartwatch: The Apple Heart Study

[19] Blood cholesterol monitoring with smartphone as miniaturized electrochemical analyzer for cardiovascular disease prevention

[20] Real-time event-driven classification technique for early detection and prevention of myocardial infarction on wearable systems

[21] ECG classification algorithm based on STDP and R-STDP neural networks for real-time monitoring on ultra low-power personal wearable devices

[22] Lightweight and privacy-aware fine-grained access control for IoT-oriented smart health

[23] Wireless networked control systems with coding-free data transmission for industrial IoT

[24] Neural architecture search: A survey

[25] Practical Bayesian optimization of machine learning algorithms

[26] Neural architecture search with reinforcement learning

[27] Progressive neural architecture search

[28] Towards FLOPs-constrained face recognition

[29] ImageNet: A large-scale hierarchical image database

[30] MnasNet: Platform-aware neural architecture search for mobile

[31] FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search

[32] ChamNet: Towards efficient network design through platform-aware model adaptation

[33] Searching for MobileNetV3

[34] Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding

[35] Structured pruning of deep convolutional neural networks

[36] ThiNet: A filter level pruning method for deep neural network compression

[37] PruNet: Class-blind pruning method for deep neural networks

[38] Runtime neural pruning

[39] HAQ: Hardware-aware automated quantization with mixed precision

[40] Quantization and training of neural networks for efficient integer-arithmetic-only inference

[41] Trained ternary quantization

[42] LQ-Nets: Learned quantization for highly accurate and compact deep neural networks

[43] Compensated-DNN: Energy efficient low-precision deep neural networks by compensating quantization errors

[44] Resource-efficient machine learning in 2 KB RAM for the Internet of Things

[45] Memory-optimal direct convolutions for maximizing classification accuracy in embedded applications

[46] An automatic cardiac arrhythmia classification system with wearable electrocardiogram

[47] Arrhythmia detection using deep convolutional neural network with long duration ECG signals

[48] Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network

[49] Long short-term memory

[50] Identity mappings in deep residual networks

[51] Batch normalization: Accelerating deep network training by reducing internal covariate shift

[52] Dropout: A simple way to prevent neural networks from overfitting

[53] Genetic algorithms

[54] Tabu search

[55] Simulated annealing

[56] Ant colony optimization

[57] Genetic algorithms: Principles and perspectives: A guide to GA theory

[58] Genetic algorithms in search, optimization and machine learning

[59] Genetic algorithms, tournament selection, and the effects of noise

[60] A fast and elitist multiobjective genetic algorithm: NSGA-II

[61] SPEA2: Improving the strength Pareto evolutionary algorithm

[62] Learning both weights and connections for efficient neural network

[63] Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification

[64] Adam: A method for stochastic optimization

[65] PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals

[66] The impact of the MIT-BIH arrhythmia database

[67] CREI-GARD, a new concept in computerized arrhythmia monitoring systems

[68] DEAP: A Python framework for evolutionary algorithms

[69] Random search and reproducibility for neural architecture search

[70] Signal detection theory: Valuable tools for evaluating inductive learning

His research has a special focus on cross-layer analysis, modeling, design, and optimization of computing and memory systems. The researched technologies and tools are deployed in application use cases from Internet-of-Things (IoT), smart Cyber-Physical Systems (CPS), and ICT for Development (ICT4D) domains. Dr. Shafique has given several Keynotes, Invited Talks, and Tutorials, as well as organized many special sessions at premier venues. He has served as the PC Chair, General Chair, Track Chair, and PC member for several prestigious

Chip Technology Most Influential Scholar Award in 2020, six gold medals, and several best paper awards and nominations at prestigious conferences

This work was partially supported by Doctoral College Resilient Embedded Systems which is run jointly by TU Wien's Faculty of Informatics and FH-Technikum Wien.