Paralinguistic Privacy Protection at the Edge
Ranya Aloufi, Hamed Haddadi, David Boyle
2020-11-04

Voice user interfaces and digital assistants are rapidly entering our lives and becoming singular touch points spanning our devices. These always-on services capture and transmit our audio data to powerful cloud services for further processing and subsequent actions. Our voices and raw audio signals collected through these devices contain a host of sensitive paralinguistic information that is transmitted to service providers regardless of deliberate or false triggers. As our emotional patterns and sensitive attributes such as our identity, gender, and mental well-being are easily inferred using deep acoustic models, we encounter a new generation of privacy risks by using these services. One approach to mitigating the risk of paralinguistic-based privacy breaches is to combine cloud-based processing with privacy-preserving, on-device paralinguistic information learning and filtering before transmitting voice data. In this paper we introduce EDGY, a configurable, lightweight, disentangled representation learning framework that transforms and filters high-dimensional voice data to identify and contain sensitive attributes at the edge prior to offloading to the cloud. We evaluate EDGY's on-device performance and explore optimization techniques, including model quantization and knowledge distillation, to enable private, accurate and efficient representation learning on resource-constrained devices. Our results show that EDGY runs in tens of milliseconds with a 0.2% relative improvement in ABX score, or minimal performance penalties, in learning linguistic representations from raw voice signals, using a CPU and a single-core ARM processor without specialized hardware.

1 INTRODUCTION

Voice user interfaces (VUIs) are commonplace for interacting with consumer IoT devices and services. VUIs use speech recognition technology to enable seamless interaction between users and their devices. For example, smart assistants (e.g., Google Assistant, Amazon Echo, and Apple Siri) and voice services (e.g., Google Search) use VUIs to activate a voice assistant to trigger actions on IoT devices, or to perform tasks such as browsing the Internet, reading news, and playing music. The majority of these voice-controlled devices are triggered with a wake word or activation phrase like 'Okay, Google', 'Alexa', or 'Hey, Siri', to inform the system that speech-based data will be received. We also know that they all suffer from frequent false activations [20]. Once a voice stream is captured by a device, analysis is outsourced to the provider's cloud services, which perform automatic speech recognition (ASR), speaker verification (SV), and natural language processing (NLP). This frequently involves communicating instructions to other connected devices, appliances, and third-party systems. Finally, text-to-speech services are often employed to speak back to the user. This is shown in Figure 1 (A). While VUIs offer new levels of convenience and change the user experience, these technologies raise new and important security and privacy concerns. The voice signal contains linguistic and paralinguistic information, where the latter is rich with interpretable information such as age, gender, and health status [67].
Paralinguistic information can therefore be considered a rich source of personal and sensitive data. Our voice also contains indicators of our mood, emotions, and physical and mental well-being, and thus raises unprecedented security and privacy concerns whereby raw or inferred data may be used to manipulate us and/or shared with third parties. Various neural network architectures, such as autoencoder networks (AE) and convolutional neural networks (CNN), have been proposed to tackle a diverse set of linguistic (e.g., speech recognition [79]) and computational paralinguistic applications (e.g., speaker recognition [80], emotion recognition [43], and detecting COVID-19 symptoms [19]). For example, recent end-to-end (E2E) automatic speech recognition systems rely on an autoencoder architecture to simplify the traditional ASR system into a single neural network [13, 16, 79]. These ASR models use an encoder to encode the input acoustic feature sequence into a vector, which encapsulates the input speech information to help the decoder predict the sequence of symbols, as shown in Figure 1 (B). Cummins et al. [18] perform speech-based health analysis using Deep Learning (DL) approaches for early diagnosis of conditions including physical and cognitive load and Parkinson's disease. Although these deep models achieve performance comparable to more conventional approaches such as Hidden Markov model (HMM) based ASR, they have been designed without considering potential privacy vulnerabilities, given the need to train on real voice data.

In this paper, we present EDGY, a hybrid privacy-preservation approach incorporating on-device paralinguistic information filtering with cloud-based processing. EDGY enables primary tasks such as speech recognition while removing sensitive attributes from the raw voice data before sharing it with the service provider, and it rests on two design principles. The first is a collaborative edge-cloud architecture [57, 78], which adaptively partitions DL computation between edge devices (for privacy preservation) and the cloud server (for generous processing capabilities and storage). The second is disentangled representation learning: computational partitioning alone is not enough to satisfy privacy-preservation requirements [69], so we explicitly learn independent factors in the raw data [37]. EDGY further combines DL partitioning with optimization techniques to accelerate inference at the edge. Our prototype implementation and extensive evaluations are performed using a Raspberry Pi 4 and a MacBook Pro i7 as example edge and server devices to demonstrate EDGY's effectiveness in running in tens of milliseconds at the edge, with an upper bound of a few seconds for the overall computation and minimal performance penalties or accuracy losses in the tasks of interest. In summary, our contributions are:

• We propose EDGY, a hybrid privacy-preserving approach to delivering voice-based services that incorporates on-device paralinguistic information filtering with cloud-based processing. Filtering of the voice data is based on disentangled representation learning, building on our prior work in [3, 4]. We show that disentanglement can strengthen edge deployment and can be leveraged as a critical step in developing future applications for privacy-preserving voice analytics.
• We build EDGY as a composable system to enable configurable privacy as well as to facilitate its deployment on embedded/mobile devices.
• We demonstrate that a collaborative 'edge-cloud' architecture with DNN optimization techniques can effectively accelerate inference at the edge, running in tens of milliseconds with insignificant accuracy losses in the tasks of interest.
• We investigate the adoption of 'zero-shot' linguistic metrics to evaluate the encoding quality of linguistic units learned from raw audio, while using classification accuracy to estimate the remaining paralinguistic information.
• We experimentally evaluate the proposed framework over various datasets and run a systematic analysis of its performance at the edge under different privacy configurations. The results show its effectiveness in learning linguistic representations, with a 0.2% relative improvement in ABX score or minimal performance penalties, while confronting privacy leakage by filtering sensitive attributes, with classification accuracy dropping to 34%-58% (i.e., over multiclass/binary attributes). Our code is openly available online¹.

The paper is organized into seven sections. Following the introduction, we provide a general background on model optimization techniques for running DL at the edge and existing edge-based speech processing work in Section 2. We formulate the threat model and propose the EDGY defense framework in Section 3. Section 4 presents the experimental settings and the implemented model optimization techniques. We evaluate our experimental results in Section 5 before providing discussion and highlighting directions for future work in Section 6. We conclude the paper in Section 7.

2 BACKGROUND AND RELATED WORK

2.1 Deep Learning at the Edge

2.1.1 Edge Computing for Privacy-Preserving DL. Edge computing is increasingly adopted in IoT systems to improve user experience and individuals' privacy [57, 78]. Running deep models on edge devices in practice presents several challenges, including: (i) maintaining high prediction accuracy with low latency [27], and (ii) executing within the limits of the available resources, such as the memory and processing capacity of embedded devices, given the conventionally high computational requirements of these models. DL models are generally deployed in the cloud, while edge devices merely collect and send raw data to cloud-based services and receive the DL inference results. Cloud-only inference, however, risks privacy violations (i.e., inference of sensitive information). To address this issue, researchers have proposed edge computing techniques. Often, DL models must be further optimized to fit and run efficiently on resource-constrained edge devices, while carefully managing the trade-off between inference accuracy and execution time. Partitioning large DL models across mobile devices and cloud servers is an appealing solution that filters data locally before sending it to the cloud. This approach may be used to protect users' privacy by ensuring that sensitive data is not unnecessarily transmitted to service providers. In [57], a hybrid framework for privacy-preserving analytics is presented that splits a deep neural network into a feature extractor module on the user side and a classifier module on the cloud side. Although most existing work in the area has looked at signals from inertial measurement unit (IMU) sensors, typically recorded while the user is performing different activities [10, 46-48, 64], we demonstrate that using the encoder part of an autoencoder to sanitize data can also be used for privacy protection in the context of speech.
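To make the partitioning idea concrete, the following is a minimal sketch of an edge-cloud split in which an on-device feature extractor produces a compact representation and only that representation (never the raw waveform) is handed to a cloud-side classifier. The module names, layer sizes, and class count are illustrative assumptions, not EDGY's actual architecture.

```python
import torch
import torch.nn as nn

# On-device feature extractor (runs on the edge device); shapes are illustrative.
class EdgeEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=8, stride=4), nn.ReLU(),
        )

    def forward(self, waveform):       # waveform: (batch, 1, samples)
        return self.net(waveform)      # low-dimensional features, not raw audio

# Cloud-side classifier that only ever sees the transmitted features.
class CloudClassifier(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, n_classes)
        )

    def forward(self, features):
        return self.head(features)

waveform = torch.randn(1, 1, 16000)    # one second of 16 kHz audio
features = EdgeEncoder()(waveform)     # computed locally on the device
logits = CloudClassifier()(features)   # only `features` would be sent to the cloud
```

As the following section argues, this split alone does not guarantee privacy: the transmitted features must additionally be disentangled and filtered so that sensitive attributes are not recoverable from them.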
Since breaking the computation between the edge and cloud is not sufficient for privacy preservation purposes, we strengthen this method by learning disentangled representations in the raw data at the edge, and then filtering sensitive information before sharing it with cloud-based services. Optimizing DL Models for the Edge. DNN models are usually computationally intensive and have high memory requirements, making them difficult to deploy on many IoT devices. For example, recurrent neural networks (RNNs) which are commonly used in applications such as speech processing, time-series analysis, and natural language processing, can be large and computeintensive due to a large number of model parameters (e.g., 67 million for bidirectional RNNs) [51] . This makes it difficult to deploy these models on resource-constrained devices. Optimizing DL models by quantizing their weights can, however, reduce these resource requirements. Narang et al. [51] propose a method to reduce the model weights in RNNs to deploy these models efficiently on embedded or mobile devices. Similarly, Thakker et al. [73] significantly compress RNNs without negatively impacting task accuracy using Kronecker products (KP) to quantize the resulting models to 8-bits. Optimizing the neural network architecture using quantization and pruning can lead to significant efficiency improvement in many speech processing applications. For example, He et al. [27] propose an end-to-end speech recognizer for on-device applications such as voice commands and voice search which runs twice as fast as real-time on a Google Pixel phone. They do this by quantizing model parameters from 32-bit floating-point to 8-bit fixed-point precision to reduce memory footprint and speed up computation. Zhai et al. [82] proposed SqueezeWave, a lightweight flow-based vocoder. SqueezeWave aims to address the expensive computational cost required by the real-time speech synthesis task. The proposed vocoder translates intermediate acoustic features into an audio waveform on a Macbook Pro and Raspberry Pi 3B to generate a highquality speech. The most important challenge, however, is to ensure that there is no significant loss in terms of model accuracy after being optimized. Specifically, we analyze different optimization techniques (e.g., filter-pruning, weight-pruning, and quantization) to fulfill our goal in learning privacy-preserving representation from the raw data in near real-time with as little cost as possible to model performance. 2.2.1 Disentangled Representation. Learning speech representations that are invariant to differences in speakers, language, environments, microphones, etc., is incredibly challenging [44] . To address this challenge, numerous variants of Variational Autoencoders (VAEs) have been proposed to learn robust disentangled representations due to their generative nature and distribution learning abilities. Hsu et al. in [30] propose the Factorized Hierarchical VAE (FHVAE) model to learn hierarchical representation in sequential data such as speech at different time scales. Their model aims to separate between sequence-level and segment-level attributes to capture multi-scale factors in an unsupervised manner. There is an extended trend towards learning disentangled representations in the speech domain as they promise to enhance robustness, interpretability, and generalization to unseen examples on downstream tasks. 
The overall goal of disentangling is to improve the quality of the latent representations by explicitly separating the underlying factors of the observed data [37] . Speech signals simultaneously encode linguistically relevant information, e.g. phoneme, and linguistically irrelevant information, i.e. paralinguistic information. In the case of speech processing, an ideal disentangled representation would be able to separate fine-grained factors such as speaker identity, noise, recording channels, and prosody, as well as the linguistic content [22] . Thus, disentanglement will allow learning of salient and robust representations from the speech that are essential for applications including speech recognition [59] , prosody transfer [72, 84] , speaker verification [61] , speech synthesis [31, 72] , and voice conversion [32] , among other applications. Although the focus of these works is to raise the efficiency and effectiveness of speech processing applications (e.g. speech recognition, speaker verification, and language translation), in this paper we highlight the benefit of learning disentangled representation to learn privacy-preserving speech representations, as well as showing how disentanglement can be useful in transparently protecting user privacy. Privacy-preserving Speech Representation. Learning privacy-preserving representations in speech data is relatively unexplored [44] . Aloufi et al. [4] investigate the scenario whereby attackers can infer a significant amount of private information by observing the output of state-of-art underlying deep acoustic models for speech processing tasks. In [52] , Nautsch et al. demonstrate the importance of the development of privacy-preserving technologies to protect speech signals and highlight the importance of applying these technologies to protect speakers and speech characterization in recordings. The recent VoicePrivacy initiative [74] promotes the development of anonymization methods that aim to suppress personally identifiable information in speech (i.e., speaker identity) while leaving other attributes such as linguistic content intact. Most of the proposed works focus on protecting/anonymizing the speaker identity using voice conversion (VC) mechanisms [2, 62, 70, 71] . VoiceMask, for example, was proposed to mitigate the security and privacy risks of voice input on mobile devices by concealing voiceprints [62] . It aims to strengthen users' identity privacy by sanitizing the voice signal received from the microphone and then sending the perturbed speech to the voice input apps or the cloud. However, these VC methods aim to protect speaker identity against different leakage attacks depending on the attacker's knowledge of the anonymization method (i.e., ignorant, informed, and semiinformed) [41] . They found that when the attacker has complete knowledge of the VC scheme and target speaker mapping, none of the existing VC methods will be able to protect the speaker identity. Thus, disentangled-based VC might strengthen speaker identity protection by avoiding the leakage of private speaker attributes into the content embeddings. Similar to our work, Srivastava et al. in [70] proposed an ondevice encoder to protect the speaker identity using adversarial training to learn representations that perform well in ASR while hiding speaker identity. They conclude that the adversarial training does not immediately generalize to produce anonymous representations in speech (i.e., that could be limited by the size of the training set). 
In [23] , the authors combine different federated learning and differential privacy mechanisms to improve on-device speaker verification while protecting user privacy. Beside speaker identity, various works have been proposed to protect speaker gender [34] and emotion [3] . In [3] , an edge-based system is proposed to filter affect patterns from a user's voice before sharing it with cloud services for further analysis. Considering the 'configurable privacy' principle, we assume that privacy is subjective, with varying sensitivity between users which may even depend on the services with which these systems communicate. For example, in [83] , PDVocal is proposed as a privacypreserving and passive-sensing system to enable monitoring and estimating the risk of Parkinson's disease in daily life. Unlike other approaches, however, we seek to protect the privacy of multiple user attributes for IoT scenarios that depend on voice input or speech analysis, i.e. sanitizing the speech signal of attributes a user may not wish to share, but without adversely affecting the functionality or experience. We also emphasize the importance of learning disentangled speech representation for optimizing the privacy-utility trade-off and transparently promoting privacy. We consider an adversary with full access to user data with the aim to correctly infer sensitive attributes (e.g., gender, emotion, and health status) about users by exploiting a secondary use of the same data collected for the main task. Specifically, the attacker could be any party (e.g., a service provider, advertiser, data broker, or a surveillance agency) with interest in users' sensitive attributes. The service providers could use these attributes for targeting content, or data brokers might profit from selling these data to other parties like advertisers or insurance companies, while surveillance agencies may use these attributes to recognize and track activities and behaviors. In the settings of the current system, all VUI providers have access to raw data that contains all the paralinguistic information needed to infer myriad sensitive attributes, including emotions [50] , age, gender, personality, friendliness, mood, and mental health. For instance, Amazon has patented technology that can analyze users' voices to determine emotions and/or mental health conditions. This allows understanding speaker commands and responding according to their feeling to provide highly personalized content [35] . Our work is designed to protect the sensitive attributes contained in shared data from potential inference attacks. An Open Inference Attack Vector. It is possible to accurately infer a user's sensitive and private attributes (e.g., their gender, emotion, or health status) from deep acoustic models (e.g., Deep-Speech2 [5] ). An attacker (e.g., a 'curious' service provider) may use an acoustic model trained for speech recognition or speaker verification to learn further sensitive attributes from user input even if not present in its training data. To investigate the effectiveness of such an attack, we used the output of the DeepSpeech2 model and attach different classifiers (i.e., emotion and gender recognition) to demonstrate the potential privacy leakage caused by these deep acoustic models based on each of the datasets described in Section 4. We measured an attack's success as the increase in inference accuracy over random guessing [81] . 
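The probing setup just described can be sketched as follows: shallow scikit-learn classifiers are trained on top of frozen acoustic-model outputs and their accuracy is compared against a chance baseline. The feature matrix below is a random placeholder standing in for embeddings extracted from a model such as DeepSpeech2; it only illustrates the evaluation recipe, not the paper's exact attack code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# X: utterance-level embeddings taken from a pretrained acoustic model
# (e.g., averaged hidden states); y: a sensitive label such as gender.
# Random placeholders stand in for the real extracted features here.
X, y = np.random.randn(500, 256), np.random.randint(0, 2, 500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

probes = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100),
    "chance": DummyClassifier(strategy="most_frequent"),  # random/majority-guess baseline
}
for name, clf in probes.items():
    acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
    # Attack success is read as the accuracy gain over the chance baseline.
    print(f"{name}: {acc:.3f}")
```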
We found that a relatively weak attacker (i.e., using methods that include logistic regression, random forest, multi-layer perceptron, and support vector machine classifiers) can achieve high accuracy in inferring sensitive attributes, ranging from 40% to 99.4%, i.e., significantly better than guessing at random.

Speech communication in human interaction broadly conveys information on two layers: (1) a linguistic layer that refers to the meaningful units of information structure in the speech signal, including phonemes (i.e., the smallest speech unit that may cause a change of meaning within a language [29]), words, phrases, and sentences, and (2) a paralinguistic/extralinguistic layer that refers to non-verbal phenomena, including speaker traits and states [66]. The phonetic content affects the segment level, while the speaker characteristics affect the sequence level [30]. Thus, the speech signal can be disentangled into several independent factors, each of which carries a different type of information. Core Design: In our context, the idea is to disentangle the factors related to the task we want to compute per layer. We aim to demonstrate the effectiveness of learning disentangled representations in protecting the sensitive attributes in user data. Such disentanglement can be beneficial in enabling decentralized privacy-aware analytics and promoting transparency in protecting users' privacy. Discrete units (e.g., phonemes) highlight linguistically relevant representations of the speech signal in a highly compact format [8, 53, 65, 75], while being invariant to speaker-specific and background noise details. These representations can be used to bootstrap training in speech systems and reduce requirements on labeled data for zero-resource languages [77]. They can also enable privacy-preserving paralinguistics. Disentanglement-based learning techniques prevent speaker information from leaking into the content embeddings, either by reducing the dimension or by quantizing the content embedding as a powerful information bottleneck [17]. These techniques include: propagating a reversed gradient from the speaker classifier [15], applying instance normalization [14], and quantizing the representation [75]. Thus, to achieve our goal of learning disentangled representations for preserving privacy, we investigate clustering approaches (e.g., k-means and vector quantization (VQ)) as an information bottleneck, and propose three models, k-means and Vector-quantized Contrastive Predictive Coding (k-means/VQ-CPC) and a Vector-quantized Variational Autoencoder (VQ-VAE), to extract the phonetic content while being invariant to low-level information. One motivation for applying clustering approaches is that clustering/quantization can capture high-level semantic content from the speech signal, e.g., phonemes, due to the discrete nature of phonetic units [45]. The input speech sequence is first encoded into a sequence of frame-level continuous vectors of length $T$. The quantization layer then projects each latent representation from the encoder module onto the closest point in the codebook by selecting one entry from a fixed-size codebook $e = [e_1, e_2, \ldots, e_K]$, where $K$ is the size of the codebook. k-means/Vector-quantized CPC. CPC [56] is a self-supervised learning method that learns representations from a sequence by trying to predict future observations with a contrastive loss. Given an input signal $x$, the CPC model embeds $x$ into a sequence of embeddings $z = (z_1, \ldots, z_T)$ at a given rate using a non-linear encoder $g_{\mathrm{enc}}$.
At each time step $t$, the autoregressive model $g_{\mathrm{ar}}$ takes as input the available embeddings $z_1, \ldots, z_t$ and produces a context latent representation $c_t = g_{\mathrm{ar}}(z_1, \ldots, z_t)$. Given the context $c_t$, the CPC model tries to predict the $K$ next future embeddings $\{z_{t+k}\}_{1 \le k \le K}$ by minimizing the following contrastive loss:

$$\mathcal{L}_t = -\frac{1}{K} \sum_{k=1}^{K} \log \frac{\exp\left(z_{t+k}^{\top} W_k c_t\right)}{\sum_{\tilde{z} \in \mathcal{N}_t} \exp\left(\tilde{z}^{\top} W_k c_t\right)},$$

where $\mathcal{N}_t$ is a random subset of negative embedding samples, and $W_k$ is a linear classifier used to predict the future $k$-step observation. Rozé et al. [53] train a k-means clustering module on the outputs of the CPC model. After training the k-means clustering, each continuous feature is assigned to a cluster, and the input speech sequence can then be discretized into a sequence of discrete units corresponding to the assigned clusters. Similarly, VQ-CPC [76] incorporates vector quantization into the CPC model to discretize the continuous features and capture phonetic contrasts. Vector-quantized VAE. The VQ-VAE model [75] uses a Vector Quantization (VQ) technique to produce a discrete latent space. During the forward pass, the output of the encoder $z_e(x)$ is mapped to the closest entry in a discrete codebook $e = [e_1, e_2, \ldots, e_K]$:

$$k^{*} = \arg\min_{j} \left\lVert z_e(x) - e_j \right\rVert_2 . \qquad (1)$$

Precisely, VQ-VAE finds the nearest codebook entry using Eq. 1 and uses it as the quantized representation $z_q(x) = e_{k^{*}}$, which is passed to the decoder as content information. The transition from $z_e(x)$ to $z_q(x)$ does not allow gradient backpropagation due to the argmin function, so a straight-through estimator is used [9]. VQ-VAE is trained using a sum of three loss terms (Eq. 2): the negative log-likelihood of the reconstruction, which uses the straight-through estimator to bring the gradient from the decoder to the encoder, and two VQ-related terms, the distance from each prototype to its assigned vectors and the commitment cost [75]:

$$\mathcal{L} = -\log p\left(x \mid z_q(x)\right) + \left\lVert \mathrm{sg}\left[z_e(x)\right] - e \right\rVert_2^2 + \beta \left\lVert z_e(x) - \mathrm{sg}\left[e\right] \right\rVert_2^2 . \qquad (2)$$

Note that sg(·) denotes the stop-gradient operation that zeros the gradient with respect to its argument during the backward pass. For more details, refer to [75]. Paralinguistics Layer. Information on speaker characteristics can be useful for various paralinguistics tasks such as speaker authentication [23] or Parkinson's disease detection [83]; however, such information is often private. Thus, personalization and on-device training are increasingly important for these tasks, since performing computations locally can improve both privacy and latency [24]. Non-semantic aspects of the speech signal (e.g., speaker identity, language, and emotional state) generally change more slowly than the phonetic and lexical aspects used to explicitly convey meaning (i.e., ASR) [68]. Following [60, 68], we use the TRIpLet Loss network (TRILL) to learn speaker-specific embeddings that can be adapted to a variety of downstream paralinguistic tasks such as speech emotion recognition, speaker identification, language identification, and medical diagnosis. The TRILL representation is trained using a self-supervised approach and uses triplet loss-based metric learning, assuming that segments closer in time are also closer in the embedding space. Formally, a large collection of example triplets of the form $z = (x^{a}, x^{p}, x^{n})$ (namely anchor, positive, and negative examples) is sampled from an unlabeled speech collection represented as a sequence $X = x_1 x_2 \ldots x_T$ [68]. The distance from the baseline (anchor) example to the positive (truth) example is minimized, and the distance from the baseline (anchor) example to the negative (false) example is maximized. The loss incurred by each triplet is then given by:

$$\mathcal{L}(z) = \sum_{i} \left[ \left\lVert g(x_i^{a}) - g(x_i^{p}) \right\rVert_2^2 - \left\lVert g(x_i^{a}) - g(x_i^{n}) \right\rVert_2^2 + \delta \right]_{+},$$

where $\lVert \cdot \rVert_2$ is the $L^2$ norm, $[\cdot]_{+}$ is the standard hinge loss, and $\delta$ is a nonnegative margin hyperparameter.
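To make the quantization bottleneck concrete, the following is a minimal PyTorch sketch of a VQ layer that implements the nearest-codebook lookup of Eq. 1, the straight-through gradient, and the two VQ terms of Eq. 2. The codebook size, feature dimension, and commitment weight are assumed for illustration and are not EDGY's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Minimal VQ bottleneck: nearest-codebook lookup (Eq. 1),
    straight-through gradient, and the two VQ loss terms of Eq. 2."""
    def __init__(self, num_codes=512, dim=64, beta=0.25):  # illustrative sizes
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment cost weight

    def forward(self, z_e):                       # z_e: (batch, time, dim)
        flat = z_e.reshape(-1, z_e.size(-1))
        # Squared L2 distance to every codebook entry, then argmin (Eq. 1).
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))
        indices = dists.argmin(dim=1)
        z_q = self.codebook(indices).view_as(z_e)
        # VQ loss terms: pull codebook entries to encoder outputs + commitment cost.
        codebook_loss = F.mse_loss(z_q, z_e.detach())
        commitment_loss = self.beta * F.mse_loss(z_e, z_q.detach())
        # Straight-through estimator: copy gradients from z_q back to z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices.view(z_e.shape[:-1]), codebook_loss + commitment_loss
```

In a pipeline of this kind, the returned discrete indices are the compact, largely speaker-invariant units that could be shared with the cloud, while the quantized vectors can feed an optional decoder for reconstruction.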
To support our goal of achieving configurable privacy in processing and sharing speech data, learning these generally useful paralinguistic representations helps to enable personalization as well as serving as a first step toward achieving paralinguistic privacy in a decentralized manner, as proposed in [23]. Speech reconstruction/generation is often implemented in the form of autoencoding, where speech is first encoded into a low-dimensional space and then decoded back to speech. In the speech domain, a vocoder acts as a decoder and learns to reconstruct audio waveforms from acoustic features [55]. For example, the Wave Recurrent Neural Network (WaveRNN) [36] uses linear prediction with recurrent neural networks to synthesize neural audio. Specifically, it combines the speaker identity (i.e., the global condition) and the linguistic context (i.e., the local condition) to generate the synthesized speech, as shown in Figure 3. Thus, discovering informative discrete speech units from raw audio (i.e., in a zero-shot fashion) opens up the possibility of addressing the effect of non-linguistic variability (e.g., channel and speaker identity) on the linguistic quality of the reconstructed speech. It also allows us to evaluate speech generation (i.e., speech content) at many linguistic levels.

Inference efficiency is a significant challenge when deploying DL models at the edge, given restrictions on processing, memory, and in some cases power consumption. To address this challenge, we focus on a variety of techniques that involve reducing model parameters with pruning and/or reducing representational precision with quantization to support efficient inference at the edge. We elaborate in the following: 3.3.1 Quantization. Quantized models are those where we represent the models with lower precision [33]. The main feature of existing quantization frameworks is usually the ability to quantize the weights and/or activations of the model from 32-bit floating-point into lower bit-width representations without sacrificing much of the model accuracy. Quantization is motivated by the deployment of ML models in resource-constrained environments like mobile phones or embedded devices. For example, fixed-point quantization approaches can be applied to the weights and activations to reduce resource consumption on devices. We thus experiment with quantization of the model parameters to measure its effect on model compression and inference speed-up at the edge, with minimal detriment to prediction accuracy. 3.3.2 Knowledge Distillation. Knowledge distillation refers to the idea of model compression where a complex model (i.e., the teacher model) is used to distill its knowledge into a small model (i.e., the student model) without a significant drop in prediction accuracy [11, 28]. In this process, the teacher network or an ensemble model can extract important features from the given data and produce better predictions. The student network, under the supervision of the teacher model, can then produce comparable results. Thus, distillation compresses the knowledge in an ensemble model into a single model which is much easier to deploy on embedded devices. Following [60], we turn to knowledge distillation, where a generally useful speech representation model distills its knowledge into a small model that is fast enough for on-device applications.
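As a concrete illustration of the teacher-student objective just described, the following is a minimal sketch of a standard distillation loss that matches temperature-softened teacher logits in addition to the ground-truth labels [11, 28]. The temperature T and mixing weight alpha are illustrative hyperparameters, not values tuned for EDGY.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Soft-label knowledge distillation: match temperature-softened teacher
    probabilities plus the hard ground-truth labels. T and alpha are illustrative."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # The KL term is scaled by T^2 so its gradients stay comparable to the hard-label term.
    kd_term = F.kl_div(log_soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term

# Toy usage: 8 examples, 10 classes, with stand-in teacher and student logits.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```

Section 4 applies this idea to distill the "TRILL" embedding into a truncated MobileNet student suitable for on-device inference.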
EDGY is designed as a composable system to enable 'configurable privacy' in various deployment environments with resource-constrained devices. It currently consists of three basic modules, namely encoding, quantization, and classification, with the possibility of adding a fourth unit, decoding (if needed according to the application context), as shown in Figure 4. Assume that the processing pipeline starts with the encoding module. The encoding module consists of various models, each trained on public data using an unsupervised learning approach. The module learns information over two different layers (i.e., linguistics and paralinguistics), and after verifying the ability of the model to extract useful acoustic representations, the following are the possible application scenarios under the collaborative 'edge-cloud' paradigm: we suggest pairing the encoding module with the quantization module and/or with the classification module and deploying them on the resource-constrained edge devices, while the decoding module, if implemented, can run in the cloud. For linguistic embeddings, there have been recent attempts to recognize speech through the direct use of the quantized representations (i.e., speech-related embeddings) in NLP algorithms, without the need to decode these representations. For example, after using vector-quantization/clustering modules to quantize the dense representations from the speech segments, well-performing NLP algorithms (e.g., BERT) were then applied to these quantized representations, achieving promising state-of-the-art results in phoneme classification and speech recognition [7]. In addition, paralinguistic embeddings can be used separately for further local authentication or personalization purposes. The decoding part may be implemented for the purpose of generating voice. It is reasonable to assume that service providers may want to keep recordings in user records for the sake of transparency (e.g., to comply with the GDPR). We also assume that service providers may want to offer personalized services to their users. The proposed framework, by decomposing the processing between the edge and the cloud, can therefore help to achieve this objective in a privacy-preserving manner. More precisely, learning disentangled representations at the edge with the proposed framework allows more control over the sharing of these representations. For example, service providers may train a decoder using the speaker embedding when the user first uses the service, after which the encoder at the edge sends only the speech-related embedding to regenerate the user recordings.

In this section, we briefly describe the datasets used (LibriSpeech, VoxCeleb, CREMA-D, SAVEE, and Common Voice) and our experimental setup, highlighting baseline settings as well as the optimization techniques used to improve EDGY's performance. We use a number of real-world datasets that were recorded for various purposes, including speech recognition, speaker recognition, accents, and emotion recognition, to train EDGY and examine its effectiveness in protecting paralinguistic information. The details of each dataset are as follows: LibriSpeech. LibriSpeech [58] is a large dataset of approximately 1,000 hours of read English speech. It was derived from audiobooks from the LibriVox project, and was recorded to facilitate the development of automatic speech recognition systems. We use the train-clean-100 set and the test set. VoxCeleb.
The VoxCeleb dataset [49] contains over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.

For the linguistics embedding, the target is to learn discrete units (i.e., speaker-invariant) useful for speech recognition and phone classification. We apply three different vector-quantization-based models, namely CPC-kmean clustering [53], CPC-VQ [76], and VAE-VQ [75], to extract the phonetic content. All of these models start by encoding the audio signal with the encoder; the encoder output (latent vectors) then passes through a vector quantization layer to become a sequence of quantized representations that serves as the speech embedding. For CPC-kmean clustering, we used the implementation by [53]², which is a modified version of the CPC model. The encoder is a 5-layer 1D-convolutional network with kernel sizes of 10, 8, 4, 4, 4 and stride sizes of 5, 4, 2, 2, 2, respectively, resulting in a downsampling factor of 160, meaning that for a 16 kHz input, each feature encodes 10 ms of audio. This is followed by the autoregressive model, which is a multi-layer LSTM network with the same hidden dimension as the encoder. Then, a k-means clustering module is trained on the outputs of either the final layer or a hidden layer of the autoregressive model. The clustering is done on the collection of all the output features at every time step of all the audio files in a given training set. After training the k-means clustering, each feature is then assigned to a cluster, and each audio file can then be discretized into a sequence of discrete units corresponding to the assigned clusters. The k-means training was done on the subset of LibriSpeech containing 100 hours of clean speech. For CPC-VQ and VAE-VQ, the implementation followed [76]. The CPC-VQ encoder consists of a convolutional layer (downsampling the input by a factor of 2), followed by a stack of 4 linear layers with ReLU activations and layer normalization after each layer, while the VAE-VQ encoder is a stack of 5 convolutional layers (downsampling the input by a factor of 2). The encoder output is then projected into a sequence of continuous latent vectors which are discretized using a VQ layer with 512 codes. For CPC-VQ, the autoregressive model summarizes the discrete representations up to time $t$ into a context vector $c_t$. Using this context, the model is trained to predict future codes. For the paralinguistic embedding, the target is to learn a general (i.e., non-semantic) representation useful for personalization tasks and the medical domain. We follow the work of [68] and use the TRILL embedding³. It is based on ResNetish [60], a variant of the standard ResNet-50 architecture, followed by a $d = 512$-dimensional embedding layer. The TRILL model uses a triplet loss as its training objective, as often used in similarity learning, aiming to discriminate between same and different audio segments. Intuitively, the objective attempts to learn an embedding of the input such that positive examples end up closer to their anchors than the corresponding negatives do. The produced embedding of dimension $d = 512$ serves as the training input for downstream paralinguistics tasks. Training. For the linguistics embedding, the train-clean-100 set of LibriSpeech [58] is used as the training dataset. It has multiple speakers and was recorded at a sampling rate of 16 kHz.
Log Mel-spectrogram context windows with 80 Mel bands and 96 frames, representing 0.96 s of input audio, are computed from the speech waveform (i.e., an STFT computed with 25 ms windows and a 10 ms step) and used as the model input. For the paralinguistic embedding, a subset of AudioSet [21] is used as the training dataset; it is the largest dataset for general-purpose audio machine learning (serving as an audio equivalent of ImageNet). Log Mel-spectrogram context windows with 64 Mel bands and 96 frames, representing 0.96 s of input audio, are computed from the speech waveform (i.e., an STFT computed with 25 ms windows and a 10 ms step) and used as the model input.

To develop an edge-friendly model, we implement various optimization techniques and show their effect on the trade-off between performance and accuracy. Optimization techniques can be applied either during or after training. Upon completion of model training, we apply the optimization methods and then fine-tune the optimized models. Linguistics. To quantize the linguistics models (i.e., CPC-kmean, CPC-VQ, and VAE-VQ), we use the Neural Network Compression Framework (NNCF) [39], a framework for neural network compression with fine-tuning, to experiment with different compression techniques. It supports various compression algorithms, including quantization, pruning, and sparsity, applied during the model fine-tuning process to achieve better compression parameters and accuracy. The overall compression procedure can be summarized as loading a JSON configuration script that contains NNCF-specific parameters determining the compression to be applied to the model, and then passing the floating-point model along with the configuration script to the "nncf.create_compressed_model" function. This function returns a wrapped model ready for compression and fine-tuning, and an additional object to allow further control of the compression during the fine-tuning process, as in Figure 5. Fine-tuning is a necessary step in some cases to recover the ability to generalize, which may have been damaged by the model optimization techniques. We therefore fine-tune the model over 10 epochs after implementing the model optimization (i.e., quantization) to enhance its accuracy. Paralinguistics. To optimize the paralinguistics model, we follow the work of Peplinski et al. [60] and use knowledge distillation to distill "TRILL" into a much smaller student model based on a truncated MobileNet architecture. Knowledge is transferred from the teacher model to the student by minimizing a loss function aimed at matching softened teacher logits as well as ground-truth labels [28]. The logits are softened by applying a temperature scaling function in the softmax, effectively smoothing out the probability distribution and revealing inter-class relationships learned by the teacher. The MobileNet architecture uses a width multiplier alpha to control the number of filters in the convolutional layers within each inverted residual block, and thus student models can be distilled with several values of alpha, allowing independent variation of the width (via alpha) and depth (via truncation) of the student model while sampling a wide range of parameter counts [60]. We therefore distill the "TRILL" embedding into a student model which is trained to map the input spectrogram to the output representation produced by "TRILL". Such student embeddings are then used as input representations for solving paralinguistics tasks on edge devices.
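The NNCF quantization workflow described above (cf. Figure 5) can be sketched as follows. The import paths vary across NNCF releases, the configuration keys follow NNCF's JSON schema but the values are placeholders, and the stand-in encoder and calibration data are illustrative rather than EDGY's actual models.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

from nncf import NNCFConfig
from nncf.torch import create_compressed_model, register_default_init_args  # paths vary by NNCF release

# Stand-in FP32 encoder (the real models are the CPC/VQ encoders described above).
fp32_encoder = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.ReLU(),
    nn.Conv1d(64, 64, kernel_size=8, stride=4), nn.ReLU(),
)

# A small calibration set so NNCF can initialize the quantization ranges.
calibration_loader = DataLoader(
    TensorDataset(torch.randn(32, 1, 16000), torch.zeros(32)), batch_size=8
)

# Key names follow NNCF's JSON schema; the values here are placeholders, not EDGY's settings.
nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 1, 16000]},   # one second of 16 kHz audio
    "compression": {"algorithm": "quantization"},    # INT8 weights and activations by default
})
nncf_config = register_default_init_args(nncf_config, calibration_loader)

# Returns a compression controller plus the wrapped, quantization-aware model (cf. Figure 5).
compression_ctrl, quantized_encoder = create_compressed_model(fp32_encoder, nncf_config)

# Fine-tune `quantized_encoder` for a few epochs as usual, then export for deployment:
# compression_ctrl.export_model("encoder_int8.onnx")
```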
For system performance, we consider two metrics of computational efficiency on a MacBook Pro and a Raspberry Pi 4B: (1) CPU execution time, measured in seconds (s), and (2) memory usage, measured in megabytes (MB). Focusing on the linguistic level, we evaluate the quality of the learned embeddings using the linguistic metrics introduced by [40]. These metrics are derived either by computing pseudo-distances or by computing (pseudo-)probabilities. Distance-based metrics require models to provide a pseudo-distance computed over pairs of embeddings. Probability-based metrics are computed over pairs of inputs. One example is to evaluate the syntactic abilities of language models by comparing the probabilities of grammatical versus ungrammatical sentences (i.e., the syntactic level), while another is to evaluate the lexical level by comparing the pseudo-probabilities associated with words and non-words. They give interpretable scores at each linguistic level: phonetics (ABX score), lexicon (spot-the-word), syntax (acceptability judgment), and semantics (similarity score). We draw on the evaluations presented in the Zero Resource challenge 2021 [53] and select metrics that enable us to evaluate two linguistic levels: the acoustic and semantic levels. This allows us to evaluate the learning of linguistic representations from raw audio signals without any labels, as well as without the need for a reconstruction step. To evaluate the quality of paralinguistic embeddings, we train a set of simple models (i.e., using the Scikit-Learn library) using the embeddings as input representations to solve each paralinguistic classification task. Embeddings for each utterance are averaged in time to produce a single embedding vector. We report the test accuracy (i.e., the ratio of correctly predicted labels) across combinations of downstream classifiers.

In this section, we evaluate our results in terms of (i) system performance, and (ii) privacy estimation. To evaluate the system performance, we measure the CPU execution time and the memory usage of the compressed models during inference. We perform a detailed analysis of these models, each of which requires: loading the model, pre-processing the raw recordings, and producing the encoded representation. This analysis helps us make good decisions in choosing the appropriate optimization methods for EDGY deployment and in understanding the associated cost.

Figure 6: CPU execution time required by each (linguistics) encoder model (left), and memory usage consumed by these models in two deployment environments, ARM Cortex-A72 and Intel Core i7 (right).

We compare the performance of three different models that learn discrete linguistic representations as baseline models, namely 'CPC-Kmean-FP32', 'CPC-VQ-FP32', and 'VAE-VQ-FP32', trained using 32-bit floating-point precision. First, we divide the models into encoder and vector quantization modules to enable better performance analysis and more distributed deployment settings. We run the pre-trained vector-quantization-based models on the MacBook Pro and the Raspberry Pi 4, and we measure each module separately (i.e., encoder and vector quantization). As shown in Figure 6, the results indicate that we can deploy these models on the different edge/cloud devices with promising overall inference time and memory usage across all the modules (i.e., encoder and vector quantization) per model.
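The paper does not specify the exact measurement harness, so the following is only an assumed sketch of how per-module CPU time and memory could be measured with the Python standard library; note that tracemalloc tracks Python-level allocations only, and native tensor memory would need a tool such as psutil.

```python
import time
import tracemalloc

def profile_module(fn, inputs, warmup=3, runs=20):
    """Average CPU execution time and peak Python-level memory for one module
    (e.g., the encoder or the vector-quantization step). Illustrative only."""
    for _ in range(warmup):            # warm up caches and lazy initialization
        fn(inputs)
    tracemalloc.start()
    start = time.perf_counter()
    for _ in range(runs):
        fn(inputs)
    elapsed = (time.perf_counter() - start) / runs
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak / 1e6         # seconds per run, peak MB

# Example usage with hypothetical callables `encoder` and `quantizer`:
# t_enc, mem_enc = profile_module(lambda x: encoder(x), waveform)
# t_vq, mem_vq = profile_module(lambda z: quantizer(z), latent)
```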
For example, the 'CPC-Kmean-FP32' model requires 0.071 s and 0.001 s (i.e., encoding and quantizing) and 0.801 s and 0.025 s (i.e., encoding and quantizing) for inference time, while memory consumption is 0.005 MB and 0.004 MB (i.e., encoding and quantizing) and 0.006 MB and 0.005 MB (i.e., encoding and quantizing) on the Raspberry Pi 4 and MacBook Pro, respectively. Despite the increase of the inference time consumed by these modules on the Raspberry Pi 4 compared to the MacBook Pro, the memory consumption is similar in all cases. Moreover, we show the performance results for the compressed models with INT8 quantization in Figure 6 , 'FP32' represents floating-point-32 models and 'INT8' represents 8-bit-integers quantization models. Interestingly, INT8 model quantization was able to reduce the model size by almost 23.6x, yielding a model about half or less the baseline'FP32' models. We also measure the inference time and memory usage on the MacBook Pro and Raspberry Pi 4, and we find that it speeds up the inference time by about 2.4x compared with the 'FP32' baseline models. We show the performance results for two different models that learned universal representations useful for several paralinguistics tasks (e.g., speaker recognition, emotion recognition, and accent identification), namely 'TRILL' and 'TRILLdistilled' in Figure 7 . We measure the inference time and memory usage on the MacBook Pro and Raspberry Pi 4. as shown in Figure 7 , distillation can effectively reduce (about 2.9x) the inference time from 800 to 275 milliseconds in encoding processes on Raspberry Pi 4. For all models, the memory usage is very small, ∼12 kB. However, there is a slight increase in the 'TRILL-distilled' model size about 5 MB over the 'TRILL' model due to the size of its final dense layers (i.e., more parameters), see Figure 7 . System Performance Summary. Quantization and distillation are two techniques commonly used to address the model size and performance challenges. We found that when using these methods (i.e., quantization and knowledge distillation) we obtained a model that can be deployed on the edge with better performance in terms of inference time and memory usage. The optimized models only need about a maximum 30 MB (i.e., the linguistics layer) and 103.22 MB (i.e., the paralinguistics layer) for installation, which is suitable for limited-capacity devices. However, for the paralinguistics layer, further quantization and distillation might be needed to get even lighter models (i.e., less than 103.22 MB). We also investigated the effect of using these methods on speeding up the inference time, achieving approximately a 2.9x improvement over the baseline models, but as expected we trade-off between performance and accuracy, see Table 3 . However, we note that the performance of the quantized models might be varied based on the input data (i.e., batch size) and the used hardware (i.e., support for INT8 inference). We leave for future work additional efforts to explore these optimization approaches deployed for even more constrained devices. To estimate the privacy protection, we first measure the quality of the linguistics representations by aiming high performance for ASR, while achieving low performance on paralinguistics tasks (i.e., the increase in inference accuracy over random guessing). Then, we evaluate the accuracy of the paralinguistics representations to have good accuracy in the target tasks. The overall results are in Tables 3 & 4. 5.2.1 Linguistics Task. 
We conduct two types of experiments using these representations: one to measure the quality of these representations in understanding what has been said, and the other to estimate to what extent these representations preserve paralinguistic information (i.e., privacy leakage). For the first experiment, we report ABX and similarity scores using the development set of the Zero Resource challenge 2021 [53]. For the acoustic level, we use the ABX score (i.e., lower is better) to estimate the discriminability between phonemes. ABX is calculated by computing the distance between the representations associated with three acoustic tokens (a, b, and x), two of which belong to the same category A (a and x) and one of which belongs to a different category B (b). Thus, the score is the estimated probability that a and x are closer to one another than a and b. In this paper, we use the ABX score developed in Libri-light and report both the within-speaker ABX score (where a, b, and x belong to the same speaker) and the across-speaker ABX score (where a and b belong to the same speaker and x to a different one). At the semantic level, the sSIMI similarity score (i.e., higher is better) is used to compute the similarity of the representations of pairs of words and compare it to human similarity judgments. The score is a correlation coefficient: if the model perfectly predicts human judgments the score will be 1, and 0 otherwise. To obtain this, the outputs from a hidden layer of the language model for the two discretized sequences are aggregated with a pooling function to produce a fixed-length representation vector for each sequence, and the cosine similarity between the two representation vectors is computed. We use the sSIMI similarity score developed on the mturk-771 dataset and report the similarity distance over pairs of the same voice for the synthetic subset, and over all possible pairs for the LibriSpeech subset. The overall results are in Table 3, which shows that the lower the ABX score, the better the linguistic representations: the 'CPC-Kmean-FP32' model scored the lowest compared to the 'CPC-VQ-FP32' and 'VAE-VQ-FP32' models, with a further reduction of about 0.2% in ABX score when applying the optimization techniques. Although 'CPC-Kmean-FP32' also achieved a higher similarity score than the 'CPC-VQ-FP32' and 'VAE-VQ-FP32' models, the overall semantic scores show the need for further improvement. For the second type of experiment, we report the accuracy of the classification tasks using these linguistic representations. Interestingly, based on the drop in classification accuracy across the various paralinguistic tasks, as shown in Table 4, the VQ-based embeddings clearly act as an information bottleneck, forcing the models to discard speaker-related details, and show that learning privacy-preserving linguistic representations is feasible. To support the principle of configurable privacy, and since we indicated in Sec. 2.2.2 that this kind of information could be useful for potential critical applications such as healthcare, we estimate privacy by using two types of representations over two layers (i.e., linguistics and paralinguistics) and training shallow classifiers on top of these representations. We train these small models to solve various downstream paralinguistic tasks (i.e., speaker identification, emotion recognition, accent identification, and gender recognition). The following describes each of them: Speaker Identification.
Speaker identification involves determining which speaker produced a given utterance [49]. We use the VoxCeleb1 dataset and treat it as a multiclass classification task (i.e., a distribution over the 1,251 different speakers). As shown in Table 4, by using paralinguistic representations, we can achieve reasonable accuracy in identifying the speaker. This identification accuracy decreases sharply, by 14-38%, when using linguistic representations (i.e., after applying vector-quantization/clustering and optimization techniques to obtain these representations). Emotion Recognition. Emotion recognition involves classifying vocal emotional expressions in sentences spoken in a range of basic emotional states (e.g., happy, angry, sad, and neutral) [12]. We use the CREMA-D and SAVEE datasets and consider emotion recognition as a multiclass classification task (i.e., a distribution over basic emotions). We observe that the disentanglement in learning linguistic representations, combined with optimization techniques, yields a considerable drop in emotion recognition accuracy, i.e., by 18-59% compared with the paralinguistic representations. Accent/Language Identification. Language identification involves classifying the language being spoken by a speaker [6]. We use the Common Voice dataset (English set only) and evaluate accent identification as a multiclass classification task (i.e., a distribution over 17 English language accents). We note that using linguistic representations combined with optimization techniques shows a significant drop in language identification accuracy, i.e., by 40-64% compared with the paralinguistic representations. Gender Recognition. Gender recognition involves distinguishing the speaker's gender from a given utterance. We use the Common Voice dataset (English set only) and estimate gender recognition as a binary classification task (i.e., a distribution over male and female). We find that disentanglement in learning linguistic representations, combined with optimization techniques, can decrease gender recognition accuracy to roughly half of that achieved with the learned paralinguistic representations. Interestingly, even in a binary classification task such as gender classification, the disentanglement approach to learning linguistic representations achieves a promising level of protection, pushing accuracy down towards random guessing. Privacy Estimation Summary. Regarding linguistic representations, we composed two zero-shot tests probing two linguistic levels, acoustic and semantic; these metrics help to evaluate the quality of the linguistic representations learned from raw signals (i.e., unsupervised systems) without the need to reconstruct them. The results in Table 3 are therefore promising for learning linguistic representations directly from raw signals (i.e., without text or labels) while being invariant to background noise and speaker characteristics, among other information. Learning such discrete speech units may help to develop more robust and inclusive speech technology, especially for low-resource languages with no textual resources. Besides, we show that these discrete speech units could be a promising solution for protecting paralinguistic information within raw speech signals, and thus open up new areas of investigation for privacy-aware, cross-language architectures for voice analytics systems. We used two types of representation in estimating paralinguistic privacy, as shown in Table 4.
It shows a sharp drop in classification accuracy when comparing the performance of linguistic and paralinguistic representations over various paralinguistic tasks (e.g., speaker recognition and emotion recognition), with a drop of about 34% to 58% in detecting emotions, for example; interestingly, this performance drop increased by a further 6% when applying optimization techniques (e.g., precision quantization). The clustering approaches show their effectiveness in learning linguistic representations (see Table 3) while reducing irrelevant paralinguistic information (i.e., measured by the accuracy drop), as shown in Table 4. Such representations help to protect sensitive data when sharing audio, as well as speeding up its transmission [38]. To address the configurable privacy principle, we also consider the scenario where paralinguistic information might be needed for authentication purposes or medical diagnosis; thus we disentangle the representation learning (i.e., linguistic and paralinguistic) using a composable framework. Such non-semantic, on-device embeddings can then be tuned for privacy-sensitive applications (e.g., speaker recognition), see Table 4.

Preserving privacy in speech processing is still at an immature stage, and has not been adequately investigated to date. Our experiments and findings indicate that it is possible to achieve a fair level of privacy protection at the edge while maintaining a high level of functionality for voice-controlled applications. Our results can be extended to highlight different design considerations characterized as trade-offs, which we discuss as follows. Performance vs. Optimization. First, we asked the following question: "is it possible to develop lightweight models for representation learning that can work on the edge in near real-time while maintaining accuracy?" Deep learning model compression techniques (e.g., pruning, quantization, and knowledge distillation) aim to reduce the size of the model and its memory footprint while speeding up inference time and saving on memory use. To do this, a pre-trained dense model is transformed into a sparse one that preserves the most important model parameters. For example, Peplinski et al. [60] evaluate the importance of adopting knowledge distillation to develop a set of efficient models that can learn generally useful non-semantic speech representations and run on-device inference and training. In our work, we have adapted several popular techniques that are currently used to compress models and enable them to be deployed at the edge. One of the primary reasons for taking an edge computing approach is to filter data locally before sending it to the cloud. Local filtering may be used to enhance the protection of users' privacy. As shown in Figure 7, combining precision quantization with the linguistics models enables us to reduce the model size by almost 23.6x, yielding a model about half the size of the baseline 'FP32' models or less. Model compression is also shown to speed up inference time by about 2.4x. Interestingly, it can improve the quality of the learned linguistic representations by about 0.2% (Table 3). Moreover, having a paralinguistics embedding model trained via knowledge distillation can effectively reduce the model size and the inference time (by about 2.9x). This model, in particular, is fast enough to be run in near real-time on an edge device to enable many privacy-sensitive applications.
We were able to obtain lightweight models for representation learning; as future work, we will look to push their deployment onto devices with even more limited resources, e.g., by using optimization techniques such as per-module architecture reductions. For example, combining adversarial training [25], scalar quantization [54], and Kronecker products [73] may help produce even lighter models. Such models should be fast enough to run in real-time on a mobile device while incurring minimal performance degradation on the task of interest.

Disentanglement vs. Privacy. Second, we examined the following question: "is it possible to increase user privacy by learning privacy-preserving representations from speech data while also increasing transparency by giving users control over the sharing of these representations?" Speech data has complex distributions and carries crucial information beyond linguistic content, including background noise and paralinguistic (i.e., speaker-related) information. Current speech processing systems are trained without regard to these sources of variability, which can affect their effectiveness: only a portion of the captured information is relevant to ASR, while the remainder is nuisance variation that can impinge upon ASR performance. Likewise, applying disentanglement to speaker-related representations can enhance their robustness and help overcome common speaker recognition issues such as spoofing [61]. Recent studies have suggested that disentangled speech representations improve the interpretability and transferability of the representation of the speech signal [30]. Although such work seeks to improve the quality and effectiveness of speech processing systems, it has not considered applying disentanglement to privacy protection. We observe that learning disentangled representations brings enhanced robustness, interpretability, and controllability.

Our proposed system aims to achieve the configurable privacy principle through a number of design choices. First, we implement multi-layer processing based on the assumption that the information in the speech signal can be extracted via two basic layers (i.e., linguistic and paralinguistic) [30]. Second, within each layer we split the processing pipeline into independent modules, each performing a specific task (mainly encoding, quantization, and classification), as sketched below. Finally, we apply the optimization mechanisms to the module with the greatest computational overhead (i.e., the encoding module) to enable deployment on embedded/mobile devices. Consequently, the proposed system can support a variety of future privacy-aware arrangements between users and service providers, and gives users more control over consenting to share their data. Learning disentangled representations not only serves our purpose of protecting user privacy, but is also useful for finding robust representations for speech processing tasks with limited data [44, 53]. In the future, we will attempt to combine techniques such as adversarial training [32] and Siamese networks [42] with disentanglement, and add further constraints grounded in information theory (e.g., the triple information bottleneck [63]) to improve disentangled representation learning.
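To illustrate the modular, layered pipeline described above, the following sketch composes an encoder, a vector-quantization bottleneck, and a lightweight task head as independent modules; the quantizer snaps each frame embedding to its nearest codebook entry, which is the step that discards much of the continuous, speaker-specific detail. The layer types, dimensions, and codebook size are illustrative assumptions and do not reproduce EDGY's exact architecture.

```python
# Illustrative encoder -> vector-quantizer -> head pipeline built from
# independent modules; dimensions and layer choices are assumptions.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Snap each frame embedding to its nearest codebook vector."""
    def __init__(self, num_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))

    def forward(self, z):                          # z: (batch, frames, dim)
        # Squared distances from every frame to every codebook entry.
        d = (z.unsqueeze(-2) - self.codebook).pow(2).sum(-1)
        idx = d.argmin(dim=-1)                     # discrete unit per frame
        z_q = self.codebook[idx]                   # quantized embeddings
        # Straight-through estimator so gradients still reach the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, idx

class LinguisticBranch(nn.Module):
    """Encoder + quantizer + head composed as separate modules, so each can
    be optimized (e.g., quantized or distilled) independently of the others."""
    def __init__(self, feat_dim: int = 80, dim: int = 64, num_classes: int = 29):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, dim, batch_first=True)
        self.quantizer = VectorQuantizer(dim=dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, feats):                      # feats: (batch, frames, feat_dim)
        z, _ = self.encoder(feats)
        z_q, units = self.quantizer(z)
        return self.head(z_q), units               # logits + discrete units

# Hypothetical usage on a batch of 80-dimensional frame features:
# logits, units = LinguisticBranch()(torch.randn(2, 100, 80))
```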
Compactness vs. Robustness. Finally, we investigated the following question: "is it possible to take advantage of model optimization techniques to obtain edge-friendly models while also enhancing the protection of sensitive paralinguistic information?" We first focus on the effect of compactness on privacy: as shown in Table 4, compression techniques can improve the filtering of sensitive information that we may wish to keep local. For example, comparing classification accuracy across the paralinguistic tasks when using the linguistic embeddings, the compressed 'INT8' models score about 6% lower than the 'FP32' baseline. It is therefore also interesting to further investigate how performance optimization techniques can enhance disentangled representation learning: the higher the degree of disentanglement between the linguistic and paralinguistic representations, the finer the control we have over the privacy configurations that can be applied. The vulnerability of DL algorithms to adversarial examples is an ongoing concern [1], but it is beyond the scope of this paper. We pose an additional question for future work toward trustworthy deployment: "can model compression techniques be used as a defense that improves both the privacy and the security objectives of voice-controlled systems?" In computer vision, for example, Gui et al. [25] proposed the adversarially trained model compression (ATMC) algorithm to harden convolutional neural networks (CNNs) against adversarial attacks that aim to fool these models into making wrong predictions. The trade-off between accuracy, optimization, and security therefore also merits further investigation. Concerning trade-offs between robustness and performance, in future work we will attempt to strengthen the robustness of the representation learning models against potential attacks using advanced optimization techniques such as adversarial compression, or combinations of optimization techniques (e.g., pruning with quantization), for security- and trust-sensitive voice-controlled IoT applications.

In this paper, we proposed EDGY, a hybrid privacy-preserving approach to delivering voice-based services that incorporates on-device paralinguistic information learning and filtering with cloud-based processing. We leverage disentangled representation learning to explicitly learn independent factors in the raw data. Model optimization is essential for deep learning embedded on mobile or IoT devices, so we further combined the multi-layer, modular design with optimization to accelerate deep learning inference at the edge, gaining approximately a 2.4x-2.9x speedup over the floating-point models. We successfully deployed our models on representative edge/cloud devices, including a Raspberry Pi 4 and a MacBook Pro i7, and showed that they run in tens of milliseconds, with a 0.2% relative improvement in ABX score in learning linguistic representations and minimal impact on accuracy across the various voice analysis tasks. Using EDGY, we evaluated the trade-off between lightweight implementation and performance, and showed that striking the right balance depends on the services with which we interact. We can expect further trade-offs between performance and accuracy when considering deployment on even more constrained devices.
Our future work includes developing a scheme to automatically choose a compression method, and/or combine a subset of these methods, so that optimization can be performed automatically when deploying deep models under given computational resources, latency requirements, and privacy constraints.

REFERENCES
The Faults in our ASRs: An Overview of Attacks against Automatic Speech Recognition and Speaker Identification Systems
Preech: A system for privacy-preserving speech transcription
Emotion Filtering at the Edge
Privacy-Preserving Voice Analysis via Disentangled Representations
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Common voice: A massively-multilingual speech corpus
2020. vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
Estimating or propagating gradients through stochastic neurons for conditional computation
DYSAN: Dynamically sanitizing motion sensor data against sensitive inferences through adversarial networks
Model Compression
Crema-d: Crowd-sourced emotional multimodal actors dataset
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
Unsupervised speech representation learning using wavenet autoencoders
Speech analysis for health: Current state-of-the-art and the increasing impact of deep learning
An Overview on Audio, Signal, Speech
When Speakers Are All Ears: Characterizing Misactivations of IoT Smart Speakers
Audio set: An ontology and human-labeled dataset for audio events
Towards learning fine-grained disentangled representations from speech
Improving on-device speaker verification using federated learning with privacy
Improving on-device speaker verification using federated learning with privacy
Model compression with adversarial robustness: A unified optimization framework
Audio-visual feature selection and reduction for emotion classification
Streaming end-to-end speech recognition for mobile devices
Distilling the Knowledge in a Neural Network
Speech perception as categorization
Unsupervised learning of disentangled and interpretable representations from sequential data
Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis
Unsupervised Representation Disentanglement Using Cross Domain Features and Adversarial Learning in Variational Autoencoder Based Voice Conversion
Quantization and training of neural networks for efficient integer-arithmetic-only inference
Privacy enhanced multimodal neural representations for emotion recognition
Voice-based determination of physical and emotional characteristics of users
Efficient Neural Audio Synthesis
Disentangling by Factorising
Skoglund, and Hengchin Yeh. 2021. Generative Speech Coding with Predictive Variance Regularization
Nikolay Lyalyushkin, and Yury Gorbachev. 2020. Neural network compression framework for fast model inference
Adelrahman Mohamed, and Emmanuel Dupoux
Evaluating Voice Conversion-Based Privacy Protection against Informed Attackers
Unsupervised feature learning for speech using correspondence and Siamese networks
Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion Recognition
Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends
DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization
Mobile sensor data anonymization
Andrea Cavallaro, and Hamed Haddadi. 2020. Privacy and utility preserving sensor-data transformations
Replacement autoencoder: A privacy-preserving algorithm for sensory data analysis
VoxCeleb: A Large-Scale Speaker Identification Dataset
Leveraging Acoustic Cues and Paralinguistic Embeddings to Detect Expression from Voice
Exploring sparsity in recurrent neural networks
Preserving privacy in speaker and speech characterisation
The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling
Scalable Model Compression by Entropy Penalized Reparameterization
Wavenet: A generative model for raw audio
Representation learning with contrastive predictive coding
A Hybrid Deep Learning Architecture for Privacy-Preserving Mobile Analytics
Librispeech: an asr corpus based on public domain audio books
Unsupervised Speech Domain Adaptation Based on Disentangled Representation Learning for Robust Speech Recognition
Jake Garrison, and Shwetak Patel. 2020. FUN! Fast, Universal, Non-Semantic Speech Embeddings
An empirical analysis of information encoded in disentangled neural speaker representations
Hidebehind: Enjoy Voice Input with Voiceprint Unclonability and Anonymity
Unsupervised Speech Decomposition via Triple Information Bottleneck
Olympus: sensor privacy through utility aware obfuscation
Unsupervised pretraining transfers well across languages
Paralinguistics in speech and language-State-of-the-art and the challenge
Ira Shavitt, Dotan Emanuel, and Yinnon Haviv. 2020. Towards Learning a Universal Non-Semantic Representation of Speech
Overlearning Reveals Sensitive Attributes
Privacy-preserving adversarial representation learning in ASR: Reality or illusion?
Mohamed Maouche, Aurélien Bellet, and Marc Tommasi. 2020. Design choices for x-vector based speaker anonymization
Fully-Hierarchical Fine-Grained Prosody Modeling For Interpretable Speech Synthesis
Compressing rnns for iot devices by 15-38x using kronecker products
Neural discrete representation learning
Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge
A comparison of self-supervised speech representations as input features for unsupervised acoustic word embeddings
Convergence of edge computing and deep learning: A comprehensive survey
End-to-end Anchored Speech Recognition
Utterance-level aggregation for speaker recognition in the wild
Privacy risk in machine learning: Analyzing the connection to overfitting
SqueezeWave: Extremely Lightweight Vocoders for On-device Speech Synthesis
Pdvocal: Towards privacy-preserving parkinson's disease detection using non-speech body sounds
Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis