title: Smart at what cost? Characterising Mobile Deep Neural Networks in the wild
authors: Almeida, Mario; Laskaridis, Stefanos; Mehrotra, Abhinav; Dudziak, Lukasz; Leontiadis, Ilias; Lane, Nicholas D.
date: 2021-09-28

With smartphones' omnipresence in people's pockets, Machine Learning (ML) on mobile is gaining traction as devices become more powerful. With applications ranging from visual filters to voice assistants, intelligence on mobile comes in many forms and facets. However, Deep Neural Network (DNN) inference remains a compute-intensive workload, with devices struggling to support intelligence at the cost of responsiveness. On the one hand, there is significant research on reducing model runtime requirements and supporting deployment on embedded devices. On the other hand, the drive to maximise the accuracy of a task is supported by deeper and wider neural networks, making mobile deployment of state-of-the-art DNNs a moving target. In this paper, we perform the first holistic study of DNN usage in the wild in an attempt to track deployed models and match how these run on widely deployed devices. To this end, we analyse over 16k of the most popular apps in the Google Play Store to characterise their DNN usage and performance across devices of different capabilities, both across tiers and generations. Simultaneously, we measure the models' energy footprint, as a core cost dimension of any mobile deployment. To streamline the process, we have developed gaugeNN, a tool that automates the deployment, measurement and analysis of DNNs on devices, with support for different frameworks and platforms. Results from our study paint the landscape of deep learning deployments on smartphones and indicate their popularity across app developers. Furthermore, our study shows the gap between bespoke techniques and real-world deployments and the need for optimised deployment of deep learning models in a highly dynamic and heterogeneous ecosystem.

The recent popularity of Deep Neural Networks (DNNs) has seen them applied to a myriad of areas, from computer vision [29] to speech recognition [10] and machine translation [58]. DNNs are no longer only deployed in datacenters [28], as they have found their way into mobile devices, ranging from IoT devices to flagship smartphones and self-driving cars. In fact, a large part of what makes smartphones smart can be attributed to the ever-increasing support for machine learning, be it in the form of camera optimisations, intelligent assistants or text predictions. While DNNs have become more and more accurate, this has frequently come at the expense of an increased number of parameters, energy consumption and computational load [3, 29, 32, 55], often resulting in poor performance on resource-restricted mobile and embedded devices [3, 40, 72]. To address these challenges, there has been significant research towards mobile-specific DNN optimisations. Firstly, researchers have designed various mobile-specific architectures, either manually [31, 39] or automatically through Network Architecture Search (NAS) [59]. Secondly, numerous works have looked into reducing computation through weight sparsification and pruning [41] and quantisation [26]. Thirdly, kernel optimisations have been proposed for mobile SoCs [13].
Last but not least, inference offloading is an alternative approach where computation is partly or wholly outsourced to a remote endpoint for faster results [35, 38]. At the same time, recent developments in mobile SoCs enable smartphones to support higher DNN computational throughput at lower energy budgets [33, 65], either through heterogeneous multicore processors (e.g. ARM big.LITTLE and DynamIQ) or through specialised hardware (e.g. DSPs and NPUs). However, the device ecosystem remains very heterogeneous, ranging from cheaper devices with older processors to flagship devices with dedicated processing units. As a result, it is extremely hard for developers to assess the performance and optimise their DNN models for each possible device tier [68].

In this work, we attempt to measure what the actual mobile ML landscape looks like in the wild by studying real-world DNNs, as deployed with the most popular applications of the Google Play Store. Our goal is to examine whether real-life deployments follow the state-of-the-art of ML research and to identify performance bottlenecks over devices of different tiers and generations. The gained experience will provide insights on the system- and model-level optimisations required to push the current frontier of mobile intelligence. In particular, we make the following contributions:

• We design a system, named gaugeNN, that automates the extraction, analysis and benchmarking of DNN models found in the most popular apps in the wild.
• Using gaugeNN, we analyse over 16k (33k across two snapshots) Google Play Store apps with respect to their DNN models. We characterise these models in terms of their usage, architecture, layer operations and optimisations, as well as external cloud-based DNN API calls.
• We compare our latest snapshot with a previous version of the most popular Google Play apps from 12 months earlier and comment on the trajectory of DNN mobile penetration in the past year.
• We perform a runtime measurement of hundreds of these DNN models across heterogeneous devices of different capabilities to further characterise these models in terms of their achieved latency and energy consumption.
• We analyse model- and system-level optimisations supported by publicly available toolsets and provide an overview of the current DNN optimisation landscape available to developers, along with practical guidelines for improving the development and deployment of future DNNs.

With our study, we aim to answer the following Research Questions (RQ) that arise:
RQ#1: Given the forefront of ML research and the multitude of tools and devices in the wild, what kind of models are being deployed in mobile apps and utilised by developers, and for which tasks?
RQ#2: In a highly heterogeneous ecosystem of smartphones, how are these models deployed and are they able to perform efficiently across different targets and tasks?
RQ#3: What are the common model- and system-level optimisations being used to make inference in the wild faster on smartphones? Can they be improved?

Results: Our results indicate that mobile developers choose to deploy simple off-the-shelf models on-device, potentially pretrained or fine-tuned for different tasks, and often rely on cloud offloading to support larger tasks. This minimises the burden on the app developer and capitalises on widely available existing models.
Furthermore, we witness that devices of different tiers and generations have widely varying performance over the benchmarked models, with low-tier devices being significantly slower in DNN-based tasks. When it comes to performance per watt, we notice a general trajectory of devices getting incrementally more efficient from generation to generation, with SoCs integrating more and more specialised hardware in the die. However, the same trajectory cannot be traced for battery technology, which remains largely the same and mainly varies depending on the device's form factor. Last, we have observed that off-the-shelf model-level optimisations deployed with major frameworks more often than not do not result in latency or memory benefits during inference, but are focused on the compressibility of the model. Simultaneously, SoC vendor-specific tools offer a significant runtime benefit, at the expense of generality of the deployed models. Still, we found no significant evidence of target-specific model deployment in the wild.

To fulfil these diverse characterisation goals, we employ the three-step methodology depicted in Fig. 1. First, we crawl the Google Play Store to find the DNN models within the most popular apps among mobile users and extract their associated ML models, validating them against certain rules (grey boxes). Second, we perform a device-agnostic app and model analysis (purple boxes). Specifically, we look at the app's store metadata, where the DNN is used, as well as the model's layers and operations. Finally, we benchmark the models on different devices to analyse their performance upon deployment (blue box). To automate this process and analyse ML models at scale we designed gaugeNN. We describe each component below in greater detail.

The first step in our methodology is to find, extract and validate the DNNs from the Google Play Store's most popular apps.

App crawling. First, gaugeNN mimics the web API calls made by the Google Play Store client of a typical mobile device to crawl the Google Play Store. In these requests, both the user-agent and locale headers are defined, which determine the variant of the store and apps retrieved. To perform the crawling, we fetch the list of the top free apps per category, which returns a maximum of 500 apps. Additionally, gaugeNN stores the Play Store metadata for each app, including popularity, category, reviews, etc., in an ElasticSearch instance for quick ETL (Extract, Transform, Load) analytics and cross-snapshot investigations (Sec. 4).

Model extraction. Given the downloaded apps, gaugeNN proceeds to extract the DNN models from each application's package. Traditionally, Android applications are packaged in a zip file, i.e. an apk, which comes with the Java/Kotlin "bytecode" along with resources used by the app (e.g. textures, images, fonts). Apks have a size limit of 100MB, and files such as DNN weights can have a larger storage footprint. As a result, Google Play allows additional content to be shared either with expansion files [21] (OBBs) or through Android App Bundles via Play Asset Delivery [20]. The former supplements the main apk file and is hosted and served by Google Play, whereas the latter offers the possibility of downloading assets on demand, as needed for a given device. gaugeNN supports file extraction from i) the base apk, ii) expansion files (OBBs) and iii) Android App Bundles, but does not track asset delivery outside of Google Play.
Extracted files are matched against a compiled list of 69 known DNN framework formats (listed in the Appendix) to identify potential DNN models.

Model validation. Many models use generic file formats (e.g., protobuf). Therefore, the number of candidate model files and extensions is quite large, and benchmarking all prospective ones quickly becomes computationally prohibitive at scale. Thus, inspired by the open-source Netron [53] tool, gaugeNN employs a lightweight, framework- and format-specific validation process to remove files that are not DNN models. This validation consists of checking the binary signature of the file for the presence of specific identifiers that a framework uses. For example, for TFLite, we know that the FlatBuffer files representing models include specific headers at certain positions of the binary file, thus we check for the existence of e.g. the string "TFL3" there. On the downside, encrypted and obfuscated models do not match such validation rules and are not extracted in our analysis. Moreover, models downloaded on demand by the application outside of the official Google Play distribution mechanisms are omitted from our benchmarks. However, we do track applications using such models indirectly by means of library inclusion in the application code and native libraries, even without explicitly analysing the models. The native code detection follows the methodology of Xu et al. [70].

After collecting the top apps from each category, we analyse the usage of Deep Neural Networks in the wild. Apps can use DNN models in different ways: i) they can execute the models on-device or ii) offload the computation to external resources (e.g. cloud providers).

In-app DNN models. After identifying the model files within an application, gaugeNN extracts their DNN architecture either by parsing the file directly or by using the associated framework's interpreter. A DNN model is typically represented as a DAG (Directed Acyclic Graph), where layers are represented by vertices and data flows by edges. By going through each model's graph, gaugeNN registers the type of each layer, its parameters (weights) and operations in a trace-based manner and uses this information to estimate the total operations (#FLOPs) and model size (#parameters). Furthermore, we can later individually run these models and measure their inference latency, energy and memory footprint.

DNN Cloud APIs. Alternatively, applications might integrate ML functionality through cloud-backed APIs, by means of offloading inference to a remote endpoint. To detect the usage of cloud-based DNN models, gaugeNN inspects the app code to search for common DNN framework API calls. Android apps are typically developed in Kotlin or Java and then compiled into the dex format [16] and packaged within the app binary. It is possible to extract this dex binary from the app package and decompile it into a human-readable (smali [14]) format using apktool [63] to inspect the original code's API calls. gaugeNN automates the process of decompiling these binaries and performs string matching on the smali files to detect such API calls.
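To make the extraction and validation steps described above concrete, the following is a minimal sketch, not gaugeNN's actual implementation: the extension list is only an illustrative subset of the 69 tracked formats, and the check for the "TFL3" identifier at byte offset 4 follows the standard FlatBuffers convention for TFLite files mentioned above.

```python
import zipfile
from pathlib import Path

# Illustrative subset of the model-file extensions matched by the full format list.
CANDIDATE_EXTENSIONS = {".tflite", ".lite", ".pb", ".caffemodel", ".param", ".dlc"}

def extract_candidates(apk_path: str, out_dir: str) -> list[Path]:
    """Unzip an apk (a regular zip archive) and keep files whose extension
    matches a known DNN framework format."""
    out, kept = Path(out_dir), []
    with zipfile.ZipFile(apk_path) as apk:
        for name in apk.namelist():
            if Path(name).suffix.lower() in CANDIDATE_EXTENSIONS:
                kept.append(Path(apk.extract(name, out)))
    return kept

def looks_like_tflite(path: Path) -> bool:
    """FlatBuffer-based TFLite models carry the file identifier 'TFL3' at
    byte offset 4; other frameworks need their own signature rules."""
    with open(path, "rb") as f:
        header = f.read(8)
    return len(header) == 8 and header[4:8] == b"TFL3"

# Usage sketch:
#   candidates = extract_candidates("app.apk", "extracted")
#   models = [p for p in candidates if looks_like_tflite(p)]
```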
Next, we describe how gaugeNN assesses the on-device runtime and energy consumption of DNNs.

Devices. To assess the performance of the deployed DNN models at runtime, i.e. latency, energy, memory and CPU utilisation, we deploy these models on the devices of Table 1. The devices of the first group represent three distinct tiers of smartphones (low to high-end) and showcase the performance across heterogeneous clients, while the development boards of the second group represent high-tier SoCs from different generations, whose open design allows us to measure energy consumption through cable probes connected to a Monsoon power monitor (Fig. 2).

Benchmark workflow. All benchmarks are written in native code and compiled for aarch64 with the Android NDK. gaugeNN adopts a master-slave architecture, depicted in Fig. 2. The server, where the models initially reside, is responsible for orchestrating the deployment and benchmarking of the models across client devices (phones) connected over USB. To control the power passthrough of mobile devices, we use a USB controller board [71] that can programmatically disable data and power channels during measurements. This component was necessary, as connecting the device over USB charges it, interfering with the energy measurements. The benchmarking workflow is depicted in Fig. 3. Initially, the master (left side) pushes all the necessary dependencies to the device (right side) through adb and asserts the initial device state (WiFi and sensors off, maximum screen timeout, etc.). The benchmark consists of an unattended, headless script that runs on the device upon disconnection of the USB power, controlled through the USB board. This script is launched as a daemon process and performs the following tasks: 1) it waits until the USB power is off; 2) it runs a configurable number of warmup inferences to remove cold-cache outliers; 3) it runs the actual benchmark inferences with a configurable inter-experiment sleep period; 4) it turns on WiFi upon completion and communicates a TCP message through netcat to the server that the experiment is over. Subsequently, the server re-enables the USB power, connects over adb and gathers the job results before cleaning up and launching the next job.
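A rough sketch of the host-side orchestration loop just described is shown below. The usb_power() helper, the device-side script name, port number and completion message are hypothetical placeholders, while the adb commands (push, shell, pull) are the standard ones; the socket listener mirrors the netcat notification sent by the device.

```python
import socket
import subprocess

RESULT_PORT = 5005  # port on which the device reports completion (assumed)

def adb(*args: str) -> None:
    """Run a single adb command and fail loudly on error."""
    subprocess.run(["adb", *args], check=True)

def usb_power(enabled: bool) -> None:
    """Placeholder for toggling the USB controller board's power channel."""
    raise NotImplementedError

def run_job(model: str) -> None:
    # 1) Push the model and benchmark dependencies, then start the headless
    #    daemon script on the device (script name is an assumption).
    adb("push", model, "/data/local/tmp/model.bin")
    adb("shell", "sh /data/local/tmp/run_benchmark.sh &")

    # 2) Cut USB power so charging does not interfere with energy readings.
    usb_power(False)

    # 3) Wait for the device to re-enable WiFi and announce completion.
    with socket.create_server(("0.0.0.0", RESULT_PORT)) as srv:
        conn, _ = srv.accept()
        conn.recv(64)  # e.g. b"done"

    # 4) Restore power, collect results and clean up before the next job.
    usb_power(True)
    adb("pull", "/data/local/tmp/results.json", "results/")
```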
In the following sections, we present the findings of our experiments run with gaugeNN. First, we present an offline analysis of the apps and models found by crawling the Google Play Store (Sec. 4), and then we move to the runtime analysis of these models on devices (Sec. 5) and to specific optimisations (Sec. 6).

In this section, we attempt to answer RQ#1 with regards to DNN deployment in the wild. To this direction, we first analyse our collected data with respect to the existence of DNN models in the top Google Play Store apps and their distribution to user devices. Then we move to more specific model and app categorisation and characterisation, and finally draw conclusions about the trajectory of mobile ML deployment from our temporal analysis results.

As shown in Table 2, we collected two snapshots of the top free Google Play apps, on the 14th of February 2020 and on the 4th of April 2021. At these points in time, Android devices represented 73.3% and 72.19% of the mobile OS market share [15, 56], respectively. Data was collected from a UK-based account associated with a Samsung S10 (SM-G977B), downloading the most popular apps across all categories of the Google Play Store (up to 500 apps per category). This accounts for the top 0.6% of the total applications available in the store (the Google Play Store is estimated to have 2.9M apps at the time of the latest snapshot [6]). In general, app downloads tend to follow a power-law distribution [64]. Therefore, the most popular apps are installed on most users' phones while the rest follow a long tail. While we could not scale a study of paid apps for monetary reasons, these account for a very small percentage of downloaded apps [64]. For the rest of the paper, we report on the latest Play Store snapshot, unless explicitly stated otherwise.

As described in Sec. 3.1, models in Android applications can be distributed post-installation (e.g. through OBBs or Asset Delivery). This allows developers to bypass the 100MB apk limit and to provide customised models for devices with different capabilities (e.g. devices with a specific NPU). To identify any models that are distributed post-installation, we downloaded all companion files and Google Play assets. We found no models being distributed outside of the main apk. Furthermore, we downloaded an extra snapshot with a device profile three Android generations older and found no evidence of device-specific model customisation.

Observations: Our results indicate that the functionality offered by Play Services to download device-specific models may be underutilised in the realm of mobile ML, or that developers choose not to specialise their models per device SoC or model. While specialising the model distribution per device target can be beneficial for performance and energy, it requires offline vendor-specific customisation of the model. Evidently, app developers seem to prefer generality of their deployment solutions, in line with [68], and defer optimisation to middleware in the stack, such as NNAPI drivers or specific hardware delegates [33].

Next, we look into the models found per ML framework. Specifically, Fig. 4 depicts the number of models successfully extracted, validated and benchmarked, per category and ML framework. These models represent 90.72% of the total apps including ML libraries in their codebase (Table 2), with the rest accounting for obfuscated, encrypted or lazily downloaded models. In total, these account for 1,666 models: 1,436 (86.19%) TFLite, 176 (10.56%) caffe, 46 (2.76%) ncnn, 5 (0.3%) TensorFlow and 3 (0.18%) SNPE. TFLite is expectedly first in popularity, as the recommended solution from the OS provider for mobile ML inference. However, it is surprising to see caffe so widely used, since it has long been deprecated, replaced by caffe2 in 2017 and now by PyTorch Mobile.

Observations: These results illustrate a long latency between the state-of-the-art frontier of ML frameworks and their adoption for in-the-wild deployment.

Here, we perform a quantitative analysis of DNN models and their respective apps and correlate them with metadata from the Google Play Store. Our aim is to categorise the most popular DNN-powered apps and characterise their usage. Fig. 4 shows the number of ML models per framework and Google Play category. We observe that the top DNN-powered apps belong to "communication" and "finance" tools, with several DNNs for face and object detection (e.g. for detecting a card or ID to make transactions in the latter case). These are followed by more traditionally DNN-backed categories, such as "photography" and "beauty", which typically contain DNN-based filters to enhance photos. Potentially less expected categories include "food and drink", "dating" and "parenting". By manually examining these models, we found anecdotal examples of apps within these categories using DNNs to detect or recognise objects (e.g. a bottle of wine or a face), for recommendation systems (e.g.
partner matching, advertising and food recipe recommendation) and even for baby monitoring.

To dig deeper into the purpose of each AI model, we manually looked into the naming, input/output dimensions and layer types of the encountered DNN models in order to characterise their usage. This labour-intensive job was done across three ML researchers with a majority vote on the results. We were able to identify the usage of 1,531 models, accounting for 91.9% of all models, with around 67% having names which hint at either the model, the task at hand, or both (e.g. "hair_segmentation_mobilenet.tflite"). Our characterisation shows that the most popular task for deploying Deep Learning is computer vision (>89% of all models), followed by NLP (17 models) and audio (15 models). Last, we found traces of DNN models (4 models) utilising sensor data, such as accelerometer, gyroscope, etc. Two anecdotal use-cases for sensor ML are horse movement tracking and car crash detection in insurance apps. Task-specific results are shown in Table 3, where it can be seen that most vision models target object, face and contour detection, most audio tasks ambient sound recognition, most NLP tasks text completion, and sensor tasks movement tracking.

Observations: Vision models are the most prevalent, with a focus on object and face detection and text recognition, and are used mostly across communication, photography and beauty apps.

Diving deeper into the models distributed amongst the most popular applications, we found that not all models are bespoke or unique. Overall, we witness DNN models spread across different application categories, with a significant portion of these being off-the-shelf models without customisation. In fact, after checking for unique checksums on these models and their respective weights, we find that only 318 models (19.1% of the models, as shown in Table 3) are unique. For the most prevalent vision task, i.e. object detection, FSSD [43] seems to be the most popular model. We found such occurrences even within popular Google apps (e.g. "Gallery Go" and "Arts & Culture"). For face detection, Blazeface [8] is another very popular model. Spanning across tasks, MobileNet [31] seems to be the most popular architecture, with variants (e.g. FSSD) being used for other vision tasks including semantic segmentation, pose estimation and classification. Last, we encounter multiple occurrences of models tackling a common task, e.g. recognising information from credit cards [60], such as names and dates.

Model fine-tuning. Taking this analysis one step further, we perform a checksum-based analysis at a finer granularity (layer-level) to see to what degree developers train their own models from scratch or fine-tune the last layers through transfer learning [49]. The intuition is that the first layers of the network typically extract low-level features (e.g. edges, shapes, etc. for vision tasks) that are shared between similar tasks, and only deeper in the DNN do the task-specific and semantically relevant features get extracted. Results from our analysis show that, excluding duplicate models, 9.02% of the remaining models share at least 20% of their weights with at least one other model. In fact, 4.2% of the models differ in no more than three layers, indicating that some developers only fine-tune small portions of the network, resulting in a significantly smaller training footprint and exploiting transfer learning from other (typically off-the-shelf) networks.
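The layer-level comparison can be illustrated with a small sketch along the following lines, assuming the per-layer weight tensors have already been extracted into NumPy arrays (the extraction itself is framework specific and omitted here).

```python
import hashlib
import numpy as np

def layer_hashes(weights: dict[str, np.ndarray]) -> set[str]:
    """Hash each layer's weight tensor so identical layers can be matched
    across models, regardless of how the layers are named."""
    return {hashlib.sha256(w.tobytes()).hexdigest() for w in weights.values()}

def shared_fraction(a: dict[str, np.ndarray], b: dict[str, np.ndarray]) -> float:
    """Fraction of model a's layers whose weights appear verbatim in model b."""
    ha, hb = layer_hashes(a), layer_hashes(b)
    return len(ha & hb) / max(len(ha), 1)

# Two models sharing most hashes but differing only in the last few layers
# are likely fine-tuned variants of the same off-the-shelf network.
```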
Moreover, we checked for traces of online fine-tuning done on-device (e.g. through TFLiteTransferConverter [61]) and found none, indicating that on-device fine-tuning is not yet widely exploited in the wild due to the significant computation requirements and the limited availability of labelled, high-quality on-device datasets.

Observations. Based on this type of evidence, we deduce that it is common for developers to leverage a widely available pre-trained model and pay the significantly smaller cost of training offline only a subset of the last DNN layers. While online on-device training is a prominent future avenue, be it through fine-tuning or federated learning, current support in mobile frameworks is limited and so are such deployments.

As aforementioned, we took two distinct snapshots of the most popular apps in the Google Play Store, 12 months apart from each other. In this part of our analysis, we compare and contrast these two snapshots in terms of app popularity and in-the-wild DNN deployment, and draw conclusions about the trajectory of ML penetration in today's smartphones. What is unique about our dataset is that we happened to measure DNN deployment across the COVID-19 pandemic, which had a crucial impact on human activity during the course of 2020/2021. For this reason, we also compare our temporal analysis with similar analyses done in the past [70], both to i) identify potential biases of our dataset during these exceptional circumstances and ii) see how app popularity and, by extension, DNN adoption have been affected by these circumstances.

Results from our temporal analysis indicate a surging number of DNN models being deployed on the Android platform, essentially doubling in the course of 12 months. Specifically, our traced models went from 821 to 1,666 in our latest snapshot (Table 2), with most additions belonging to vision tasks. TFLite remains the dominant mobile inference framework, going from 81.6% to 86.1% of the total models found (2.15×). The increase in models was less pronounced for ncnn (1.18×) and caffe (1.69×). The latter is surprising given that it has been deprecated and newer frameworks have taken its place (caffe2 and PyTorch Mobile). Finally, we observe a drop in TF adoption (0.56×), which is expected given the increasing popularity of its mobile counterpart.

Figure 5: Individual models removed/added between two snapshots taken one year apart.

Next, we analyse the DNN models across snapshots per category of application to which they belong. Fig. 5 depicts the number of individual models that were removed/added across our snapshots, sorted by the difference between the two. Interestingly, most additions of ML models happened for communication tools, taking the lead from "photography" applications, which was the top ML-powered category of 2020. This can potentially indicate that communication apps became more important due to the pandemic, and developer focus was diverted to this category. A similar trend could be witnessed for "finance" applications, where we observed many models aimed at the automated identification of people and their ID cards. Whilst this traditionally constituted a manual process done in person in financial institutions (e.g. banks), the pandemic might have created a new need for ML models to fill. Last, apps related to "health" and "medical" purposes seem to have a surging deployment of DNN models.
On the other side of the spectrum, "lifestyle", "food & drinks" and "Android Wear" applications seem to be falling in terms of popularity, something that could potentially be attributed to people staying at home more. Next, we integrate the results of previous analyses [57, 70] to shape a more general trend for DNN adoption in the Android ecosystem. In [70], the authors report the total of ML-backed apps going from 166 in June 2018 to 211 in September 2018. In [57], the authors traced 178 ML-powered apps at some point between [70] and the work's venue submission date (the exact snapshot date is not reported).

After having coarsely characterised the models based on their input modality, target task and app category, we take a finer-grained look into the models and analyse their structure in terms of the layers and operations they contain.

DNN layers and operation types. First, we go through the graph representing each DNN and trace the layer types it contains, grouping results per input modality. Results are shown in Fig. 6 for TFLite, NCNN and Caffe. We see convolution layers being amongst the most popular layer types across modalities (34%, 10%, 20% for image, text and audio, respectively). Originally applied in visual tasks, their usage nowadays spreads across recommender systems, natural language processing and time-series analysis. Variants such as depthwise-separable convolutions (depth_conv) [31] are computationally lighter and aimed at mobile deployments. Dense (or linear) layers are fully-connected layers that are typically found in the output of classification tasks, or in the implementation of RNNs. The majority of these layers are found in audio (19%) and text (9%) models. Activations impose non-linearity in DNNs and can be fused with the previous layer in terms of implementation; thus, the existence of such operations as distinct layers is framework dependent. Last, "helper" layers, such as math, quant, resize and slice operations, perform math or matrix-representation operations and can be found across modalities.

DNN #operations and #parameters. Next, we estimate the number of operations (in FLOPs) and parameters that each model contains by going through the graph in a trace-based manner. Concretely, we generate a random input with the DNN-specified input dimensions and perform a DNN inference. During the forward propagation step, we analytically measure the number of operations performed per layer (dependent on the kind of layer) and the number of trainable parameters associated with it. Fig. 7 shows the result of this analysis per DNN task. We see that, among the traced models, on average the heaviest deployed vision models belong to classification, hair reconstruction, segmentation and beauty tasks. For NLP the heaviest tasks belong to text auto-completion, whereas for audio the heaviest deployed task is sound recognition. At this point, we note that these numbers only refer to the traced deployed models and do not represent a generic commentary on the overhead of models per task. In fact, in many cases the opposite holds if we only take the task into consideration (e.g. classification vs. segmentation or speech vs. sound recognition). Also, we note that the number of models found for each task category varies significantly.
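As an illustration of the trace-based counting above, the per-layer arithmetic for a standard 2D convolution can be written as follows. Counting one multiply-accumulate (MAC) as two FLOPs is one common convention; tools differ on whether they report MACs or FLOPs.

```python
def conv2d_cost(h_out: int, w_out: int, c_in: int, c_out: int,
                k_h: int, k_w: int, bias: bool = True) -> tuple[int, int]:
    """Return (flops, params) for one Conv2D layer, given its output spatial
    size, input/output channel counts and kernel size."""
    macs = h_out * w_out * c_out * k_h * k_w * c_in
    flops = 2 * macs                      # 1 MAC = 1 multiply + 1 add
    params = k_h * k_w * c_in * c_out + (c_out if bias else 0)
    return flops, params

# Example: a 3x3 convolution producing a 112x112x32 map from a 3-channel
# input (a MobileNet-style first layer) costs roughly 21.7 MFLOPs.
print(conv2d_cost(112, 112, 3, 32, 3, 3))
```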
Observations: We find that convolutions dominate the mobile DNN landscape due to their wide use in vision models, as well as the fact that they map well onto mobile hardware for efficient execution, compared to e.g. recurrent layers [72]. While depth-wise convolutions can significantly improve performance, their deployments are scarcer as they can impact the quality of the model. Furthermore, we find that there is huge variance in FLOPs and parameters (four orders of magnitude) across the traced models. This might be attributed to the granularity of the task corresponding to a single inference. For example, in image recognition the input is typically an RGB image, while in next-word prediction the input can be a couple of words.

Up until now, we have focused our efforts on analysing the DNN models in an offline manner. In this section, we turn to on-device benchmarking and report on performance and energy when running the encountered models across the devices presented in Table 1. This analysis provides important insights about how real-world AI applications perform on a heterogeneous set of devices, thus answering RQ#2. Prior work [3, 33] has shown that FLOPs is not necessarily a good proxy for estimating a model's on-device performance. Reasons for such discrepancies include the underutilisation of hardware due to e.g. memory-bound operations, thermal throttling under continuous inference, or scheduling on cores with different dynamics because of energy-saving scheduler policies on Heterogeneous Multi-Processors [36]. To further corroborate this fact, in Fig. 8 we depict the FLOPs and the actual measured inference latency across devices for different models. Our analysis of real-world models on different devices reinforces this non-linear (line-fit) relationship, as it not only varies for different model architectures but also differs from one device to another.

To investigate this further, in Fig. 9 we show the ECDF of model runtime across all available devices. From the graph it is evident that the computing gap between a low-end device (A20) and a mid-tier device (A70) is considerably larger than the difference between mid-tier and high-end (S21). Specifically, the low-end and mid-tier devices (A20 and A70) are 3.4× and 1.51× slower compared to the S21. Across generations of high-end SoCs of the same manufacturer (Q845, Q855, Q888), we see incremental but noticeable performance gains (i.e., average latencies of 76, 58 and 35 ms), to the point that a next-generation mid-tier phone may perform better than the high-end SoC of a prior generation, despite claims of significant boosts in AI acceleration between generations. Last, we note that for the two devices integrating the same SoC (Q888 and S21), the open-deck design of the development board along with the vanilla variant of the OS leads to incrementally better results and faster inference overall. Heat dissipation of the open design, cross-manufacturer configurations and low-level configuration of the Android scheduler can all be contributing factors.

Observations: We observe a wide variability of inference latency across devices, even for models that have similar FLOP counts, which reaffirms the need for on-device benchmarking. Devices of different tiers and generations offer variable dynamics, with the lower tier falling significantly behind in performance.
Even devices integrating the same SoC can offer variable performance due to vendor-specific configurations, the installed apps and drivers, or even different thermal characteristics. Therefore, given this heterogeneity, it is hard for developers to accurately predict the users' experience without testing their models on a large sample of devices.

In mobile settings, one cannot simply optimise for performance without taking energy consumption into consideration. While smartphone capabilities are growing larger every year, the same developments have not been witnessed in battery technology. Therefore, quantifying the cost of being smart in terms of energy is an important component in the mobile world. In this section, we report on the energy, power and efficiency of on-device inference, across frameworks, for the three Snapdragon boards representing different generations of devices.

Fig. 10a shows the distribution of models with respect to the energy required per inference across our three devices. Expectedly, we see from the kernel density function lines that all three devices follow a similar trajectory, indicating that a similar amount of energy is required for similar workloads regardless of the device. On the other hand, this is not the case in terms of power consumption (Fig. 10b), where we can see newer generations of devices consistently drawing more power to run models. This is a direct implication of the fact that newer generations of devices can execute models faster, as shown in Fig. 9, while the energy required remains similar. Following these observations, we calculate the inference efficiency of each model as the number of floating-point operations that can be executed in one second per Watt (effectively, FLOPs per Joule). As can be seen in Fig. 10c, trends in efficiency stay mostly the same across different devices, following energy consumption, but unlike energy we can see a minor improvement of the newer devices over the Q845 in the middle of the distribution, suggesting that relatively more models can run more efficiently (median efficiency of 730, 765 and 873 MFLOP/s/W, after removing outliers) on the newer hardware.

Use-case driven energy consumption. Up to here, we have seen performance and energy consumption for single inferences. However, the quanta of data associated with each inference may vary considerably between tasks or modalities, as noted in Sec. 4.7. Thus, we dive deeper into three selected tasks, representative of each modality, namely i) sound recognition for audio, ii) auto-completion for text and iii) semantic segmentation for vision. We make certain realistic assumptions on the data sizes, input granularity and frequency of results, and then assess all relevant models belonging to each category. Specifically, for speech recognition, we assumed each model is run in order to recognise 1 hour of audio input. To derive how long a model would need to be run, we manually investigated the models and assumed the most likely amount of audio input per inference, considering the model's input dimensions and common practices in speech ML [10, 47, 51]. For text auto-completion, we assumed each model is run once per new word typed by a user, and further assumed a workload of 275 words, derived from WhatsApp's statistics about the average daily number and length of messages [12, 54, 66].
Last, for semantic segmentation, we assumed each model is used to segment a human at 15 FPS during a 1-hour-long video call in order to apply background effects; we further assumed that the model processes one frame per inference, which is the usual approach [11, 45, 73].

Table 4: Scenario-driven energy consumption (battery discharge in mAh) for three devices and use-cases in audio, text and vision.

Results across the development boards are depicted in Table 4. We see that one hour of segmentation can result in a significant average reduction of 26.6% to 30.54% of a common 4000mAh battery capacity (e.g. A20 and S21). Moreover, the most energy-hungry segmentation models can almost deplete the full battery capacity within an hour, with an 80.9% to 95.9% reduction. On the other end, models like auto-completion are ubiquitous across messaging apps and deliver both in terms of performance and efficiency, allowing their frequent use without a significant impact on battery.

Observations. Energy consumption is a major component in mobile, and intelligence comes at a cost to battery life. Unlike latency, which is visibly improved with new generations of devices, energy consumption seems to be predominantly dependent on the model architecture. Even though newer hardware might improve in power-efficiency, the differences are much less pronounced compared to performance improvements, and even less observable across different model architectures. This suggests that it is the AI developers who can optimise battery life the most, unlike plain latency, which can be improved at multiple levels, including by manufacturers.

After examining how real-world DNNs run on a heterogeneous set of devices, we now look into RQ#3 by means of DNN-specific as well as system-level optimisations aiming to improve inference and deployment performance. In this section, we focus on the adoption of three model-level optimisations, namely i) weight clustering, ii) pruning and iii) quantisation, for the identified TFLite models.

Clustering: Clustering refers to the technique of reducing the number of distinct weight values by representing them through their clusters' centroids [26]. We identify clusters of shared weights by searching for layers with a "cluster_" prefix in TFLite models. Despite the advertised potential for significant model size reductions [22], we report that none of the models in the wild seem to use weight clustering. This may be a result of either accuracy drops or the fact that the current clustering implementation does not reduce runtime memory and targets model compression only [22].

Pruning: Pruning refers to the technique of zeroing out specific weights/channels of the network that have minimal impact on the output, due to representational redundancy in DNNs. Weight pruning can be detected during training by searching for layers with a "prune_" prefix for TFLite models; nonetheless, this prefix is often removed for inference [23]. We report that we did not find any occurrence of such layers either. While this approach has the potential to skip zero-weight computations during inference, the current implementation benefits only from increased sparsity [62] which, like clustering, results only in model compressibility. To find out whether there is potential for adopting magnitude-based weight pruning, we measured the weight sparsity of the tracked TFLite models. We find that, overall, 3.15% of weights are near zero (within ±10⁻⁹), which might indicate limited prospects for weight magnitude-based pruning.
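The sparsity check above boils down to a few lines, again assuming the weight tensors have already been extracted as NumPy arrays; the ±10⁻⁹ threshold mirrors the one used in the measurement.

```python
import numpy as np

def near_zero_fraction(weights: list[np.ndarray], threshold: float = 1e-9) -> float:
    """Fraction of weights whose magnitude falls below the threshold,
    i.e. candidates for magnitude-based pruning."""
    total = sum(w.size for w in weights)
    near_zero = sum(int(np.count_nonzero(np.abs(w) < threshold)) for w in weights)
    return near_zero / max(total, 1)

# A value around 0.03 (3%), as measured across the tracked models, suggests
# little headroom for pruning without re-training.
```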
Quantisation: Finally, quantisation constitutes a prominent method for minimising the computational and memory demands of DNNs by means of reducing their representation precision [34, 69]. To study its adoption, we analysed the layer types and their weight and input bitwidth representations. We report that 10.3% of the models make use of the dequantize layer, which indicates the deployment of lower-precision models as a way to perform model compression. Furthermore, by examining each model's weights, we found that 20.27% of the models use int8 for the weight tensors, whereas 10.31% of the models work with int8 activations. Recent hardware advances have led to NPUs that support multiple arithmetic precisions [7, 44, 52]. Such examples are the Hexagon 698 processor on the Qualcomm Snapdragon 865 (SDM865) [52] and the Arm Ethos processor [7], which support 16-bit activations and 8-bit weights (A16W8). These schemes enable a better compromise between faster low-precision compute and having enough representational power to achieve good accuracy. In spite of the new opportunities offered by these hardware architectures, not only do existing deployment methodologies fail to exploit them, but we also found no evidence of their adoption. We revisit the issue of quantisation with hardware-specific optimisations in Sec. 6.3, where we use Google's NNAPI and Qualcomm's SNPE to target specific processors in the SoC.

Observations: While the research community has developed numerous ways to optimise DNNs for mobile execution, out-of-the-box support for such optimisations in modern frameworks can be primitive and might not translate to runtime gains, while still coming at the expense of accuracy. Furthermore, most optimisations typically require model re-training and access to large-scale datasets. As such, we find that such optimisations are not widely adopted by mobile AI developers. Quantisation, which can also be used to target different SoC accelerators, is the most widely-used optimisation. However, more advanced hybrid quantisation schemes remain unsupported.

Upon deploying a model, developers have different setup choices that can affect the model's performance. In this section, we discuss the impact of different tuneable model and system parameters on model performance.

Impact of batch size. One common way of increasing a model's throughput is batching input samples together. By taking advantage of the SIMD instructions of SoCs and accelerators, this technique increases a DNN's throughput by producing multiple inference results in one forward pass. In Fig. 11, we show the batch throughput across devices when processing 2, 5, 10 and 25 samples at a time with 4 threads. We only consider TFLite models that successfully ran all batch sizes across all devices (149 in total). As expected, we see that throughput increases with batch size. In fact, throughput scales almost linearly, which indicates that no bottleneck is hit up to that point. Moving the comparison across devices, we see that the S21 offers significantly faster inference, with throughput being 2.14× and 5.42× higher compared to the A70 and A20, respectively, at the highest batch size. This result is in line with our conclusions from Sec. 5.1. We anticipate that, when scaling to higher batch sizes, devices with lower core counts and memory will hit memory bandwidth bottlenecks or out-of-memory errors, but we defer this to future work.
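The batching experiment can be approximated off-device with the Python TFLite interpreter, as sketched below. The paper's harness is native C++ running on the phones themselves, and not every model accepts a resized batch dimension, so this is only an illustration of the mechanism rather than the actual setup.

```python
import time
import numpy as np
import tensorflow as tf

def batched_throughput(model_path: str, batch: int, threads: int = 4, runs: int = 20) -> float:
    """Inferences per second when processing `batch` samples per forward pass."""
    interp = tf.lite.Interpreter(model_path=model_path, num_threads=threads)
    inp = interp.get_input_details()[0]
    # Resize the leading (batch) dimension, keeping the rest of the input shape.
    interp.resize_tensor_input(inp["index"], [batch] + list(inp["shape"][1:]))
    interp.allocate_tensors()
    x = np.random.rand(batch, *inp["shape"][1:]).astype(inp["dtype"])
    interp.set_tensor(inp["index"], x)
    interp.invoke()                      # warm-up to avoid cold-cache outliers
    start = time.perf_counter()
    for _ in range(runs):
        interp.invoke()
    return runs * batch / (time.perf_counter() - start)

# e.g. compare batched_throughput("model.tflite", b) for b in (1, 2, 5, 10, 25)
```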
Impact of thread count. Another tuneable parameter during mobile execution is the number of threads allocated for execution on the CPU. By default, all cores of the device can be used simultaneously during execution (ARM DynamIQ). However, in Heterogeneous Multi-core Processors (HMP) there usually exist multiple islands of cores, offering different dynamics and computational power. In Fig. 12 we show how the models' throughput varies when executed with different thread counts (2, 4, 8) and affinities (2, 4). For the latter, we use process pinning to select which cores to target from the heterogeneous core sets. We observe that the optimal thread count can vary across devices, with the A20, A70 and S21 performing better with 4, 2 and 4 threads, respectively. We also see that 8-threaded performance drops significantly across devices, indicating bottlenecked execution. Digging deeper into thread performance, we further plot four additional setups where we set the CPU affinity to run over a varying number of the largest cores. For example, 4a2 means 4 threads with affinity 2, i.e. 4 threads running over the top 2 cores of the mobile SoC. As expected, we observe that any setup that sets the number of threads higher than the number of CPU-affinity cores (4a2 and 8a4) results in significant performance degradation. This happens due to time-sharing, with the extra thread pinned on the same core left waiting. Nonetheless, we also witness some less expected findings, such as the fact that setting the affinity to the same number of top cores does not yield any significant gain, contrary to our initial hypothesis that it would reduce process migration between cores. In fact, 4a4 performs worse than 4 threads on the A70, and similarly for 2a2 versus 2 threads on the A20. Predicting the optimal number of threads for mobile inference can be challenging, as mobile devices have different CPU architectures with varying core frequencies, as well as DVFS-enabled schedulers implementing energy-preserving policies [36]. Moreover, most mobile devices nowadays incorporate HMP SoCs (i.e. ARM big.LITTLE, DynamIQ) with a varying number of cores per island (e.g. the Q888 has 1×X1, 3×A78 and 4×A55 ARM Cortex cores, whereas the Q675 has 2×A76 and 2×A55 cores). Therefore, scheduling across core islands can bring sub-optimal results for DNN execution. However, when selecting the optimal thread count and affinity for each device, we see up to 2× throughput gains overall. This suggests that tuning the scheduling and thread count of DNN execution on heterogeneous devices and processors can yield significant improvements.

Observations: Results from model-level optimisation indicate that there are alternative parameters for boosting inference throughput, but they should be tweaked in tandem with system-level factors, including the SoC topology and memory hierarchy, to make efficient use of the underlying hardware.
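On a Linux-based system, affinity configurations such as "4a2" can be reproduced by pinning the benchmark process to a subset of cores before running inference, as in the rough sketch below; the core IDs of the "big" cluster are device specific, so the values shown are only placeholders.

```python
import os
import tensorflow as tf

def pin_and_run(model_path: str, threads: int, cores: set[int]) -> None:
    """Emulate an 'N threads over M cores' setup (e.g. the paper's 4a2)."""
    os.sched_setaffinity(0, cores)  # restrict this process to the given core IDs
    interp = tf.lite.Interpreter(model_path=model_path, num_threads=threads)
    interp.allocate_tensors()
    interp.invoke()  # input left as allocated zeros; timing omitted for brevity

# "4a2": four threads time-sharing two big cores, expected to degrade throughput.
# Placeholder core IDs for a 2-core big cluster:
# pin_and_run("model.tflite", threads=4, cores={6, 7})
```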
In the previous section, we have visited certain setup "hyperparameters", namely batch size and process affinity, that, depending on the use-case, can enhance inference performance. In this section, we investigate framework-specific optimisations that can enhance performance, either by means of optimised operator kernel implementations or by moving computation to a different device altogether, i.e. targeting the GPU/NPU/DSP of the SoC. To this direction, we run experiments measuring the performance and energy of framework-specific optimisations on TFLite and caffe models across three alternative backends, namely NNAPI, XNNPACK and SNPE, on the Q845 board. We refer the reader to the Appendix for more information on these frameworks.

Traces of hardware-specific acceleration. In our latest snapshot, we found some traces of hardware-specific acceleration. Specifically, we found 71 (23.8%) apps using NNAPI, a single application using XNNPACK and three using SNPE. It is interesting to note that, in the last case, these models get blindly distributed to all devices, irrespective of whether they have a Qualcomm-based SoC or not. In fact, they deploy both TFLite and dlc variants of the same model. Overall, we see that many app models are missing out on the efficiency promises of targeting specialised hardware or using target-optimised kernel operations.

Optimisation opportunities. As a way to measure the potential benefit of using each of the aforementioned framework optimisations on different processing elements, we run two experiments: one on TFLite models for NNAPI and XNNPACK (Fig. 13) and another on TFLite and caffe models for SNPE (Fig. 14). In each case, we compare the performance of framework-specific optimisations to the baseline CPU and GPU runs. The reason we do not compare across them is that the number of commonly compatible models is low. This highlights one distinct characteristic of such optimisations: the rudimentary support for operators across heterogeneous targets, which in turn can hinder their widespread adoption. Results from our evaluation indicate that for CPU execution (Fig. 13), one is better off using the XNNPACK delegate, executing DNN inference 1.03× faster and 1.13× more efficiently on average. NNAPI did not prove its potential in our experiments, with its performance lagging behind the default CPU execution (0.49× slower and 1.66× less efficient on average). This could potentially be attributed to unoptimised NN drivers from the vendor. On the other hand, when one deploys with a vendor-specific platform, SNPE in our case, performance is better for the DSP and GPU (Fig. 14) compared to vanilla CPU and GPU runs. Specifically, these are 5.72× and 2.28× faster and 20.3× and 8.39× more efficient on average, compared to CPU runs. In comparison to GPU runs, they are 2.97× and 1.19× faster and 2.69× and 1.11× more efficient on average. In the case of the CPU, however, the picture is similar to our last experiment, further corroborating the case of non-optimised CPU drivers from the vendor. Note that CPU and GPU runs are executed at full precision (float32), while the DSP runs in int8. Depending on the task, this can result in accuracy variations, but we do not have access to model-specific data and labels to assess that.

Observations. Results from our experiments tell a mixed story about hardware- and framework-specific optimisations. While they can yield noticeably better performance across models, this is not always the case, due to driver implementations or other low-level confounding factors. The dilemma of target generality vs hardware-specific optimisation ultimately lies in the hands of the developer and the resources they have at their disposal to extract every bit of performance from the hardware.

Another approach to accelerate inference and bring intelligence to mobile apps, without the need to specialise per target device, is offloading to the cloud.
We can envision this approach being popular amongst developers who do not implement or train their own models, or for models that are too computationally intensive to run locally on a mobile device or too expensive to optimise for each available target while offering a similar QoE. As mentioned in Sec. 3.2, gaugeNN tracks app invocations of known cloud-based machine learning APIs in their code. This includes calls to Google (Google Cloud and Firebase ML) and Amazon services. Fig. 15 shows the number of applications invoking each of the cloud-based ML APIs across our dataset. Overall, we find 524 distinct applications that use cloud AI APIs, a considerable increase of 2.33× over our 2020 dataset. More specifically, 452 and 72 apps use Google AI services and Amazon, respectively. This increase is in line with the increase in models deployed within the apps (Sec. 4.6). Furthermore, we observe that developers primarily use cloud-based image and video analytics to perform face identification, bar/QR code recognition, video analytics and chatbots.

Observations: Our results indicate that cloud APIs from Google and Amazon are gaining in popularity, as they allow developers to quickly deploy AI capabilities without the need for specialised ML expertise and costly infrastructure for training. Moreover, developers do not need to maintain training data on-premise, and the resulting apps can be supported by heterogeneous devices with similar QoE.

In the past, there have been numerous studies performing large-scale analyses of the Google Play Store, but with different aims, such as characterising mobile apps [64] and their API usage [1, 48]. Closer to the ML community, there has been an increasing effort to benchmark state-of-the-art models across different devices and frameworks [3, 24, 25, 27, 33, 67]. Although these studies have done a great job of extensively benchmarking state-of-the-art models, we still lack the knowledge as to whether these models are representative of the ones deployed today in mobile apps.
Moreover, there is a lack of understanding of how the latest trends in DNN optimisation affect the latest DNN-based mobile apps. To the best of our knowledge, there are largely two works that have investigated DNN usage in the wild. One is from Xu et al. [70] and focuses on investigating who the early adopters of DNNs are and what the use-cases for Deep Learning in mobile apps are. While they do conduct a lightweight analysis of DNN operations, they only measured model footprint and performance in an offline and device-agnostic manner, by means of measuring the FLOPs of DNN layers. However, it has been shown that FLOPs is not a good proxy for a model's runtime [3, 33], especially across different hardware configurations. Therefore, there is still limited understanding about the actual performance of DNN models in the wild, across a heterogeneous ecosystem of more and less capable devices. A more privacy-centric work has been presented in [57], which investigates DNN model protection on mobile devices and illustrates succinctly that many Android apps do not protect their DNN models, which means these can be easily leaked or extracted for analysis. Nevertheless, it does not perform any performance analysis. These two works serve as a starting point for our study, which aims to answer the question of how widely deployed DNNs found in the most popular Android apps actually perform on widely deployed devices, essentially correlating the state of Deep Learning mobile deployment in the wild. To this end, we conduct an in-depth benchmarking of models used in the latest, most trending mobile apps. This includes analyses of latency, energy, and system- and model-level parameters and optimisations, providing a better comprehension of the current limitations when deploying DNNs on mobile phones of different tiers and generations.

Proliferation of mobile AI. Our results indicate that both on-device and cloud-supported DNN applications are increasing rapidly (doubled within a year). This is mostly driven by the availability of pre-trained models and easy-to-use cloud-based APIs, focusing mostly on vision tasks such as image detection and recognition.

Model reuse. While there is much research on bespoke model architectures, customisation and fine-tuning [37, 49], we observe that most developers use off-the-shelf DNN architectures. In fact, close to 80.9% of the models are shared across two or more applications and a further 9.02% of the remaining models share some layers (i.e., are derived from a common model after fine-tuning). Simultaneously, there is a parallel trend of resorting to cloud-powered inference, further demonstrating a preference of developers towards turnkey solutions, instead of bespoke customised ones. With the current trajectory of AI, we expect more developers specialising in ML-based app development, at least until the middleware (e.g. NNAPI) which abstracts away ML-specific parameters becomes more prevalent.

DNNs and mobile hardware resources. We witness that most applications do not take advantage of SoC-specific accelerators to accelerate their inference runtime, but rather target generality of their solutions, either by shipping vanilla CPU-only execution or by integrating framework-specific middleware options (e.g. NNAPI). Last, offloading inference to the cloud offers a consistent QoE, which is not dependent on the target device, at the expense of privacy [4, 38] and monetary cost. This behaviour comes as a consequence of the fragmentation in the Android ecosystem in terms of hardware capabilities and software support (e.g. vendor-specific NNAPI drivers). Consequently, we anticipate the need for automated solutions for optimised development and deployment of ML solutions in mobile apps, which abstract away the complexity of efficiency and heterogeneity of the ecosystem.

Energy as a bottleneck. While Deep Learning adoption is undisputed, with an accelerating trajectory in the future, manufacturers turn to specialised hardware for faster and more efficient ML (e.g. NPUs). However, the same cannot be stated for battery technology and capacity, which remain relatively stagnant. Given what we observed for the segmentation scenario in Sec. 5.2.2, we anticipate energy sooner or later becoming a bottleneck in DNN deployment, requiring novel solutions to support mobile intelligence on the go.

DNN co-habitation. With more and more applications shipping DNN-powered solutions, we also anticipate the co-existence and parallel runtime of more than one DNN in the future. Thus, researchers will need to tackle this emerging problem to efficiently support such runtimes, by means of OS- or hardware-level solutions.

On-device learning and personalisation. Last, so far in the paper we have only visited the task of mobile inference. In this setup, the weights of the model come pretrained on some centralised dataset and the device only performs forward propagation.
In this work we have shed light on the use and performance of DNNs in real-world applications. However, we only focused on the Android smartphone landscape, due to its larger market share and wide device fragmentation. These findings might only partially hold for other mobile ecosystems. Furthermore, we have only analysed the models that could be identified as DNN models. Obfuscated and encrypted models, as well as models downloaded from outside the Google Play Store, were not benchmarked, although we still track the respective applications as ML-powered. While there might be a different distribution of obfuscated models in the wild, the results from [57] indicate otherwise.

Our analysis included both offline introspection and dynamic benchmarking of the models. However, we did not investigate particular invocation paths or the frequency of inference per app. We expect that some of these models are rarely used (e.g. credit card scanning) while others are utilised more frequently (e.g. activity detection). However, measuring the real-world usage of these models would require device instrumentation and the collection of telemetry data over a large user base. While previous works [2, 48] have proposed large-scale crowd-testing of virtualised mobile apps with real user interaction, these generally preclude testing sensor-input-dependent functionality, on which DNNs depend. We leave this as future work. Last, while we characterise DNN cloud offloading, we acknowledge that we miss any developers who use their own custom (e.g. REST-based) APIs to access remote execution.

In this work, we have carried out a comprehensive empirical study of the most popular DNN-powered mobile apps. Using gaugeNN, we analyse thousands of mobile apps in the wild and identify a significant chasm between the deployed models and state-of-the-art architectures and optimisation techniques. This is the first work to dig deeper into these aspects so as to provide guidelines for both the mobile application and the DNN-framework developer communities.

In Sec. 3.1 of the paper, we stated that gaugeNN supports file extraction from i) the base APK, ii) expansion files (OBBs) and iii) Android App Bundles. The extracted files are matched against a compiled list of known DNN framework formats and validation rules to identify potential DNN models. The complete list of formats is shown in Table 5.
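The sketch below illustrates this kind of format matching on extracted files. The extension list is only an example subset, and the TFLite FlatBuffer-identifier check is one possible validation rule; neither reproduces the complete rule set of Table 5.

```python
from pathlib import Path

# Example subset of model formats (the full rule set is in Table 5).
MODEL_EXTENSIONS = {
    ".tflite": "TensorFlow Lite",
    ".pb": "TensorFlow (frozen graph)",
    ".caffemodel": "Caffe",
    ".onnx": "ONNX",
    ".param": "ncnn",
}

def looks_like_tflite(path: Path) -> bool:
    """TFLite models are FlatBuffers carrying the 'TFL3' file identifier at offset 4."""
    with open(path, "rb") as f:
        header = f.read(8)
    return len(header) == 8 and header[4:8] == b"TFL3"

def find_candidate_models(extracted_dir: str):
    """Yield (path, framework) for files whose extension and, where possible,
    header match a known DNN model format."""
    for path in Path(extracted_dir).rglob("*"):
        framework = MODEL_EXTENSIONS.get(path.suffix.lower())
        if framework is None or not path.is_file():
            continue
        if path.suffix.lower() == ".tflite" and not looks_like_tflite(path):
            continue  # extension matched but the FlatBuffer identifier did not
        yield path, framework

if __name__ == "__main__":
    for model_path, framework in find_candidate_models("extracted_apks/com.example.app"):
        print(f"{framework}: {model_path}")
```

Extension matching alone produces false positives (e.g. .pb files that are arbitrary protocol buffers rather than frozen graphs), which is why header or schema validation is useful before attempting to benchmark a candidate file.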
As per Sec. 6.3, we run our TFLite models against alternative backends, namely NNAPI, XNNPACK and SNPE. Below we provide additional information on each one.

NNAPI (https://developer.android.com/ndk/guides/neuralnetworks). The Neural Networks API (NNAPI) is a middleware-level library in Android that sits between the machine learning framework used by an application (e.g. TFLite) and the Android Hardware Abstraction Layer (HAL). It essentially provides an abstraction layer, handling hardware acceleration through vendor- and hardware-specific NN drivers, which provide efficient operator implementations for CPUs, GPUs, DSPs, NPUs or other kinds of specialised hardware. Execution falls back to the CPU in the absence of such drivers or for unsupported operators. TFLite is at the forefront of NNAPI delegation, and PyTorch Mobile has announced support for it. Nonetheless, NNAPI is still in its infancy and comes with shortcomings, mainly in terms of OS version support (Android P and above), NN driver availability and heterogeneity in performance gains.

XNNPACK (https://github.com/google/XNNPACK). XNNPACK provides a low-level, highly optimised library of NN inference operators across platforms. Specifically for ARM, it supports efficient operator implementations through Neon instructions, as well as inference on sparse networks, which offers a practical solution to the problem described in Sec. 6.1. Despite the claimed performance benefits, operator support is limited and, if one is not careful, can lead to performance penalties rather than gains compared to the baseline CPU delegate.

SNPE. The Snapdragon Neural Processing Engine (SNPE) is a vendor-specific runtime for executing DNNs on Qualcomm SoCs, targeting the CPU, Adreno GPU or Hexagon DSP of the SoC and handling quantisation at the appropriate precision internally. It uses its own representation for NNs (the .dlc format) and supports conversion from different frameworks, including Caffe and TFLite. However, while SNPE can potentially take advantage of hardware-specific optimisations, it can only target Qualcomm SoCs, trading off generality for performance. Operator support can also be an issue in SNPE, with CPU fallback in case of unsupported hardware-specific operations.
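A minimal sketch (not gaugeNN's actual implementation) of how per-backend latency measurements could be automated over adb is shown below. It assumes TFLite's benchmark_model binary has already been pushed to each device; the on-device paths are illustrative, and the --use_nnapi/--use_xnnpack flags follow the publicly documented benchmark tool, whose exact options and output format vary across versions. SNPE ships its own benchmarking toolchain and is omitted here.

```python
import re
import subprocess

# Illustrative on-device paths -- benchmark_model must have been pushed beforehand.
BENCH_BIN = "/data/local/tmp/benchmark_model"
DEVICE_MODEL_PATH = "/data/local/tmp/model.tflite"

def adb(serial, *args, check=True):
    """Run an adb command against one device; return combined stdout/stderr."""
    result = subprocess.run(["adb", "-s", serial, *args],
                            capture_output=True, text=True, check=check)
    return result.stdout + result.stderr

def list_devices():
    """Serial numbers of all devices currently visible to adb."""
    out = subprocess.run(["adb", "devices"], capture_output=True, text=True).stdout
    return [line.split()[0] for line in out.splitlines()[1:]
            if line.strip() and line.split()[-1] == "device"]

def benchmark(serial, model_path, use_nnapi=False, use_xnnpack=False, num_runs=50):
    """Run the TFLite benchmark tool on one device; return average latency (us) or None."""
    adb(serial, "push", model_path, DEVICE_MODEL_PATH)
    out = adb(serial, "shell", BENCH_BIN,
              f"--graph={DEVICE_MODEL_PATH}",
              f"--num_runs={num_runs}",
              f"--use_nnapi={'true' if use_nnapi else 'false'}",
              f"--use_xnnpack={'true' if use_xnnpack else 'false'}",
              check=False)
    # The tool reports an average inference time; the output format varies across
    # versions, so try a couple of known patterns.
    for pattern in (r"Inference \(avg\):\s*([\d.eE+]+)", r"avg=([\d.]+)"):
        match = re.search(pattern, out)
        if match:
            return float(match.group(1))
    return None

if __name__ == "__main__":
    for serial in list_devices():
        for backend, kwargs in [("cpu", {}), ("nnapi", {"use_nnapi": True}),
                                ("xnnpack", {"use_xnnpack": True})]:
            latency = benchmark(serial, "models/extracted_model.tflite", **kwargs)
            print(f"{serial} [{backend}]: {latency if latency is not None else 'n/a'} us")
```

Energy measurements additionally require external instrumentation (e.g. a power monitor between the device and its power source), which cannot be captured from adb alone.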
REFERENCES
An Empirical Study of Android Alarm Usage for Application Scheduling
Chimp: Crowdsourcing human inputs for mobile phones
EmBench: Quantifying Performance Variations of Deep Neural Networks across Modern Commodity Devices
DynO: Dynamic Onloading of Deep Neural Networks from Cloud to Device
Number of Android apps on Google Play
Blazeface: Sub-millisecond neural face detection on mobile gpus
Towards Federated Learning at Scale: System Design
Listen, attend and spell
FasterSeg: Searching for Faster Real-time Semantic Segmentation
Two Billion Users - Connecting the World Privately
Searching for Winograd-aware Quantized Networks
Mobile operating systems' market share worldwide from
Android Runtime and Dalvik
Google Cloud APIs
Google Cloud APIs
Google. 2020. Optimize for Doze and App Standby
Google. 2021. About Android App Bundles
Google. 2021. APK Expansion Files
Google. 2021. Tensorflow: Clustering
Tensorflow: pruning with keras
An Empirical Study towards Characterizing Deep Learning Development and Deployment across Different Frameworks and Platforms
Characterizing the Deployment of Deep Neural Networks on Commercial Edge Devices
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. International Conference on Learning Representations (ICLR)
Latency and Throughput Characterization of Convolutional Neural Networks for Mobile Computer Vision
Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective
Deep Residual Learning for Image Recognition
FjORD: Fair and Accurate Federated Learning under heterogeneous targets with Ordered Dropout
Mobilenets: Efficient convolutional neural networks for mobile vision applications
Densely connected convolutional networks
AI Benchmark: All About Deep Learning on Smartphones in 2019
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge
Enhancing energy efficiency of multimedia applications in heterogeneous mobile multi-core processors
Adaptive Inference through Early-Exit Networks: Design, Challenges and Directions
SPINN: Synergistic Progressive Inference of Neural Networks over Device and Cloud
HAPI: Hardware-Aware Progressive Inference
On-Device Neural Net Inference with Mobile GPUs
SNIP: Single-Shot Network Pruning based on Connection Sensitivity
It's Always Personal: Using Early Exits for Efficient On-Device CNN Personalisation
FSSD: feature fusion single shot multibox detector
DaVinci: A Scalable Architecture for Neural Network Computing
Fully convolutional networks for semantic segmentation
Communication-efficient learning of deep networks from decentralized data
NAS-Bench-ASR: Reproducible Neural Architecture Search for Speech Recognition
A Family of Droids-Android Malware Detection via Behavioral Modeling: Static vs Dynamic Analysis
A survey on transfer learning
Chris Vandevelde, et al. 2021. Federated Evaluation and Tuning for On-Device Personalization: System Design & Applications
Scaling Up Online Speech Recognition Using ConvNets
Qualcomm. 2021. Snapdragon Neural Processing Engine
A Study of WhatsApp Usage Patterns and Prediction Models without Message Content
Very Deep Convolutional Networks for Large-Scale Image Recognition
Mobile operating systems' market share worldwide from
Mind Your Weight(s): A Large-scale Study on Insufficient Machine Learning Model Protection in Mobile Apps
Sequence to sequence learning with neural networks
MnasNet: Platform-Aware Neural Architecture Search for Mobile
Faceter Team. 2020. Pay Cards Recognizer
Example on-device model personalization with TensorFlow Lite
Trim insignificant weights
A Measurement Study of Google Play
Neural Network Inference on Mobile SoCs
Whatsapp. 2021. Whatsapp daily messages
Machine Learning at Facebook: Understanding Inference at the Edge
Machine Learning at Facebook: Understanding Inference at the Edge
Quantized Convolutional Neural Networks for Mobile Devices
A first look at deep learning apps on smartphones
Yepkit YKUSH 3 USB 3.1 Switchable Hub
Towards Memory Friendly Long-Short Term Memory Networks (LSTMs) on Mobile GPUs
ICNet for Real-Time Semantic Segmentation on High-Resolution Images