FLSys: Toward an Open Ecosystem for Federated Learning Mobile Apps
Authors: Hu, Han; Jiang, Xiaopeng; Mayyuri, Vijaya Datta; Chen, An; Shila, Devu M.; Larmuseau, Adriaan; Jin, Ruoming; Borcea, Cristian; Phan, NhatHai
Date: 2021-11-17
§ The first two authors contributed equally to this work.

Abstract. This paper presents the design, implementation, and evaluation of FLSys, a mobile-cloud federated learning (FL) system that supports deep learning models for mobile apps. FLSys is a key component toward creating an open ecosystem of FL models and apps that use these models. FLSys is designed to work with mobile sensing data collected on smart phones, balance model performance with resource consumption on the phones, tolerate phone communication failures, and achieve scalability in the cloud. In FLSys, different DL models with different FL aggregation methods in the cloud can be trained and accessed concurrently by different apps. Furthermore, FLSys provides a common API for third-party app developers to train FL models. FLSys is implemented in Android and AWS cloud. We co-designed FLSys with a human activity recognition (HAR) in the wild FL model. HAR sensing data was collected in two areas from the phones of 100+ college students during a five-month period. We implemented HAR-Wild, a CNN model tailored to mobile devices, with a data augmentation mechanism to mitigate the problem of non-Independent and Identically Distributed (non-IID) data that affects FL model training in the wild. A sentiment analysis (SA) model is used to demonstrate that FLSys effectively supports concurrent models; it uses a dataset with 46,000+ tweets from 436 users. We conducted extensive experiments on Android phones and emulators, showing that FLSys achieves good model utility and practical system performance.

Federated Learning (FL) [3] has the potential to bring deep learning (DL) to mobile devices, while preserving user privacy during model training. FL balances model performance and user privacy through three design features. First, each device trains a local model on its raw data. Second, the gradients of the local models from multiple users are sent to a server for aggregation to compute a global model that is more accurate than individual local models. Third, the server shares the global model with all users. During this federated training, the raw data of individual users never leave their devices. A wide range of mobile apps, e.g., predicting or classifying health conditions based on mobile sensing data, can benefit from running DL models on smart phones using FL, which offers privacy-preserving global training that incentivizes user participation.

Despite progress on theoretical aspects and algorithm/model design for FL [32], [34], [40], [9], [36], the lack of a publicly available FL system has precluded the widespread adoption of FL models on smart phones, despite the potential of such models to apply DL on mobile sensing data, in a privacy-preserving manner, for novel mobile apps. Furthermore, this lack has also limited our understanding of how real-world applications can benefit from FL. To the best of our knowledge, the only existing FL systems are either unavailable to the research and practice communities (e.g., Google [3], FedVision [27]), under development [13], or do not support mobile devices [10].
Most of the existing FL studies are based on simulations [29], [32], [34], [40], [9], [36], which may lead to an oversimplified view of the applicability of FL models in the real world. Meanwhile, although demonstrated in several scenarios such as keyboard typing prediction [39], FL lacks real-world applications that can drive the design of FL systems. Indeed, real-world benchmarks for FL are pivotal to help shape the development of FL systems [23].

In this paper, we take a unique application-system co-design approach to design, build, and evaluate an FL system. Our system design is informed by a critical mobile app: human activity recognition (HAR) on the phones, which is important for industry, public health, and research. Simply put, mobile apps using HAR can leverage human physical activities recognized from data collected by phone sensors. From an industry point of view, accurate HAR can help smart phone manufacturers allocate resources intelligently and extend battery life. The Covid-19 pandemic highlights the public health importance of understanding individual and population behaviors under government orders and health emergencies [17]; furthermore, combining user activities with mental wellness surveys and prediction has the potential to enable personalized interventions that help individuals better cope with anxiety, stress, substance abuse, and other important societal issues [8]. Current research on HAR models uses centralized learning on data collected in controlled lab environments with standardized devices and controlled activities [16], [30], [15], [22], [2], [6], [7]. Instead, we use HAR in the wild (open environments, where user mobility, activities, and application usage are not controlled in any way). The privacy-sensitive nature of the mobile sensor data makes this application ideal for studying the design of FL systems. Furthermore, this paper presents the first HAR study under FL.

In addition to HAR, we analyzed other real-life applications [19], [3], [37], [39], [27] to inform our system design. A list of important questions emerges, and many of these questions are not addressed in existing FL system designs [3], [39], [13], [37], which largely ignore the constraints of mobile devices: How can we balance FL model performance with resource constraints on the phones? How can we ensure that the training conducted on phones completes on time, despite limited resources, i.e., computation power and battery life? How can the server achieve seamless scalability in the presence of large and variable numbers of users who typically train different models, and how can the system simultaneously cope with potential communication failures (e.g., lost connectivity on the phone)? How does the server aggregate individual training outcomes efficiently for an accurate model? After a global model is shared with the phones, how can a third-party DL app utilize this model? Currently, there is no FL system in the literature that can address most of these questions. In terms of design, the closest related work is [3], which focuses on system scalability, secure aggregation, and fault tolerance. However, it presents neither a system implementation and evaluation, nor a way to use third-party models and make them available to third-party apps on mobile devices.
Mobile operating system (OS) providers use FL in their OSs for applications such as next word prediction on the keyboard [39], but their solutions are application-specific and lack system details. Verma et al. [35] and Liu et al. [27] introduce web-services based FL architectures, which are not tailored to mobile devices. An FL system under construction [13] does not focus on application-system co-design aspects such as efficient data collection or scalability. Yet another open-source system, FATE [10], currently has only elementary support for deep learning and does not have any support for mobile devices.

Key Contributions. This paper presents FLSys, the first FL system in the literature created using an application-system co-design approach to address the aforementioned research questions. FLSys is a key component toward creating an open ecosystem of FL models and apps that use these models. Such an FL ecosystem will allow third-party model/app developers to easily develop and deploy FL models/apps on smart phones. Consequently, the users will benefit from novel FL apps based on mobile sensing data collected on the phones. To tackle fault tolerance and resource constraints on the phones, FLSys utilizes an asynchronous interaction model between mobile devices and the cloud, which 1) allows the devices to self-select for training when they have enough data and resources, and 2) allows the server to operate correctly in the presence of communication failures with the phones. For energy-efficient data collection, FLSys supports on-demand configuration of sensor types, sampling rates, and the period for data flushing from memory to storage. The FL server in the cloud achieves good scalability through a design based on function-as-a-service computation and scalable storage. FLSys is flexible, in the sense that it can train multiple models concurrently. It provides a common API for third-party apps to train different DL models with different FL aggregation methods in the cloud. While implemented in Android and AWS, FLSys has a general system design and API that can be extended to other mobile OSs and cloud platforms.

To study how HAR can be supported by FLSys in the wild, we collected data from 100+ college students in two areas from April to August 2020. The students used their own Android phones, and their daily-life activities were not constrained in any way by our experiment. Data collected on mobile devices is non-IID, which affects FL-trained models [20]. We have evaluated a variety of HAR models in both centralized and federated training, and designed HAR-Wild, a Convolutional Neural Network (CNN) model with a data augmentation mechanism to mitigate the non-IID problem. To showcase the ability of FLSys to work with different FL models, we also built and evaluated a natural language sentiment analysis (SA) model on a dataset with 46,000+ tweets from 436 users.

We carried out a comprehensive evaluation of FLSys together with HAR-Wild and SA to quantify the model utility and the system feasibility in real-life conditions. We performed the evaluation under three training settings: 1) centralized training, 2) simulated FL, and 3) Android FL. Centralized training provides an upper bound on model accuracy and is used to compare our HAR-Wild model with baseline approaches. The results demonstrate that HAR-Wild outperforms the baseline models in terms of accuracy.
Furthermore, the federated HAR-Wild performance using simulations (TensorFlow and DL4J), Android emulations, and Android phone experiments is close to the upper-bound performance achieved by the centralized model. The results on smart phones demonstrate that FLSys can perform communication and training tasks within the allocated time and resource limits, while the FL server is able to handle a variable number of users. Finally, micro-benchmarks on Android phones show that FLSys with HAR-Wild and SA is practical in terms of training and inference time, memory, and battery consumption.

The rest of the paper is organized as follows. Section II discusses related work. Section III explains the design of FLSys, while Section IV describes its prototype implementation. Section V presents the HAR model and data. Section VI shows the experimental results. The paper concludes in Section VII with lessons learned and future work.

This section reviews related work on FL systems, the non-IID issue in federated training, and HAR models. FL can be categorized into Horizontal FL, Vertical FL, and Federated Transfer Learning (FTL) [38]. In Horizontal FL, data are partitioned by device user IDs, such that users share the same feature space [38]. In Vertical FL, different organizations have a large overlapping user space with different feature spaces. These organizations aim at jointly training a model to predict the same model outcomes, without sharing their data. In FTL, the datasets of these organizations differ in both the user space and the feature space. In Vertical FL and FTL, different organizations need to align their common users and exchange intermediate results by applying encryption techniques [11]. The server cannot just average the gradients; it needs to minimize a joint loss. At the inference stage, the organizations may have to send their individual intermediate results to the server to compute a final result. The systems in these two categories rely on cryptography, and their interactions are more complex. Our FLSys focuses on Horizontal FL, with an option for extension to Vertical FL and FTL in the future. For simplicity, we will use FL to indicate Horizontal FL in the rest of the paper.

The work in [3] describes the design of a scalable FL system for mobile devices. This system shares several of our design goals (e.g., scalability, fault tolerance). However, this work does not present a system implementation and evaluation, as we do for FLSys. Also, FLSys addresses additional unanswered design questions, such as how to concurrently train multiple models for different applications and how third-party app developers can use the system. Furthermore, unlike this work, FLSys focuses on data collected from the phone's sensors, which adds challenges related to efficient and effective data collection. Another system that shares some goals with FLSys is FedML [13]. However, this open-source system is still under construction. In addition, FedML focuses more on software engineering aspects, rather than on system aspects such as efficient sensor data collection or scalability, as in FLSys. A third open-source system, FATE [10], is still in its infancy, with limited support for deep learning, and does not work on mobile devices. While a significant amount of FL research focuses on improving the security and privacy of federated training procedures [31], [4], this is outside the scope of our research. We focus on system design, implementation, and evaluation using the HAR and SA models.
A well-reported issue restricting the performance of models trained by FL is the non-IID data distribution across users [21], [40]. Different from centralized learning, the datasets of different users may follow different distributions in FL, because of heterogeneous devices, imbalanced class distributions, different user behaviors, etc. As a result, DL models trained with FL algorithms usually suffer from inferior performance when compared with centralized models [21]. To mitigate the non-IID issue, several algorithms have been proposed [32], [34], [40], [9], [36]. In FedProx [32], a regularization term is introduced to mitigate the gradient distortion from each device. Sarkar et al. [34] presented a cross-entropy loss that down-weights easy-to-classify examples and focuses training on hard-to-classify examples. Verma et al. [36] proposed to estimate the global objective function by averaging different objective functions given a common region of features among users, and to keep different objective functions estimated from local users' data in different regions of the feature space. Data augmentation approaches have been proposed [40], including a global data distribution based data augmentation [9]. The federated training of our HAR-Wild and SA models uses a uniform data augmentation method, similar to these techniques.

Our HAR model focuses on sensing and classification of physical activities through smart phone sensors. Recent works show that deep learning models are effective in HAR tasks. For example, Ignatov [16] proposed a CNN-based model to classify activities with raw 3-axis accelerometer data and statistical features computed from the data. Several works [30], [15], [7] proposed LSTM-based models and achieved similar performance. Most research on HAR models uses centralized learning on data collected in controlled lab environments with standardized devices and controlled activities, in which the participants only focus on collecting sensor data with a usually high and fixed sampling rate, i.e., 50 Hz or higher. Although there are good publicly available HAR datasets, e.g., WISDM [22], UCI HAR [2], and Opportunity [6], they are not representative of real-life situations. Different from existing works, this paper shows that HAR-Wild over FLSys performs well on data collected in the wild, which is subject to fluctuating sampling rates and non-IID data distribution.

This section presents the design of FLSys. Specifically, it describes the system requirements derived from an application-system co-design, the FLSys architecture that addresses these requirements, and the three operation phases of FLSys, namely data collection and processing, federated training, and inference at the phones. Our aim is to design and build an FL system that addresses the list of important questions mentioned in Section I. We use the HAR model, detailed in Section V, to illustrate an entire category of FL models based on mobile sensing data collected in the wild.
We extract five key requirements derived from this model and from other real-world FL applications, such as next word prediction, on-device search query suggestion [39], on-device robotic navigation [26], on-device item ranking [3], object recognition [27], sentiment analysis, etc., and utilize them to guide our FLSys design: (R1) Effective data collection: The data collection on the phone must balance resource consumption (e.g., battery) with the sampling rates required by different models; (R2) Tolerate phone unavailability during training: Since the phones may sometimes be disconnected from the network or choose not to communicate to save battery power, the interaction between the phones and the cloud must tolerate such unavailability during federated training; (R3) Scalability: The cloud-based FL server of our system must be able to scale to large numbers of users in terms of both computation and storage; (R4) Model flexibility: The system must support different DL models for different application scenarios and different aggregation functions in the cloud; and (R5) Support for third-party apps: The system must provide programming support for third-party apps to concurrently access different models on the phones.

FLSys addresses requirements R1-R5 synergistically. Figure 1a shows the overall process of one training round, and Figure 1b shows the system architecture. These figures emphasize four novel contributions made in FLSys, compared with existing FL systems [3], [39], [35], [13], [37]: (1) FLSys allows the phones to self-select for training when they have enough data and resources; (2) FLSys has an asynchronous design (Figure 1a), in which the server in the cloud tolerates client failures/disconnections and allows clients to join training at any time; (3) FLSys supports multiple DL models that can be used concurrently by multiple apps, and each phone trains and uses only the models for which it has subscribed; and (4) FLSys acts as a "central hub" on the phone to manage the training, updating, and access control of FL models used by different apps. These features balance model utility with mobile device constraints, and can help create an ecosystem of FL models and associated apps. FLSys allows different developers to build FL models/apps and provides a simple way for users to take advantage of these apps, as it offers a unifying system for the development and deployment of FL models and apps that use these models. FLSys acts as a common middleware layer for all these apps and models. The users just need to download/install the apps, and FLSys will take care of downloading/installing the FL models used by the apps, will perform FL training as needed, and will run FL inference on behalf of the apps.

The architecture (Figure 1b) has two main components: (1) the FL Phone Manager, which coordinates the FL activities on the phone; and (2) the FL Cloud Manager, which coordinates the FL activities in the cloud. These two components work together to support the three phases of the FL operation: data collection and preprocessing, model training and aggregation, and mobile apps using inference. In the following, we describe each phase and explain how the system architecture satisfies the five system requirements.

Data Collection and Preprocessing. The FL Phone Manager controls the data collection using one or multiple Data Collectors. A basic Data Collector is tasked with collecting data from one sensor at a given sampling rate.
Such basic Data Collectors can be embedded in more complex ones to collect different types of data at the same time. It is important to have one app that coordinates data collection, because having multiple apps collecting overlapping sets of data multiple times is inefficient. Having the FL Phone Manager coordinate the data collection also simplifies sensor access control. To satisfy requirement R1, FLSys supports on-demand configuration of sensor types, sampling rates, and the period for data flushing from memory to storage. Each model informs the FL Phone Manager of the type of data and sampling rate it needs. In this way, the FL Phone Manager knows which Data Collectors to invoke and which sampling rates are needed. The FL Phone Manager balances sensing accuracy (i.e., high sampling rate) with resource consumption. To regulate and keep this balance aligned with the user experience, FLSys has three features: (1) it includes several built-in sampling rate settings, with empirical values from our experience; (2) it collects key statistics of the data collection (e.g., CPU time consumed, battery life impact, etc.) and shows them to the user upon request; and (3) it provides global-level controls for the user to adjust the data collection behavior, should the user feel that their experience is impacted by data collection. The Data Collectors store the sensed data in the Raw Data Storage and inform the FL Phone Manager each time new data is added to the Raw Data Storage. For efficiency, the Data Collectors can buffer a certain amount of sensed data in memory before committing it to storage. The data flushing period, which defines when the data is written to storage, is set by the Data Collectors and can be dynamically reconfigured by the FL Phone Manager. Some models may use the raw data directly, while others may require additional processing. The FL Phone Manager decides when to invoke the model-specific Data Processors, which store the data in the Processed Data Storage. This is a matter of policy and can be done any time new data is available in the Raw Data Storage or at a regular interval. The only constraint is to have all the data preprocessed before a new local model training operation.

Federated Training. To satisfy requirement R2, we make two design decisions. First, FLSys allows the phones to self-select for training when they have enough data and resources. This is different from traditional FL architectures [3], where the server selects the phones that participate in training, even though they may not be available or may not have enough data or resources for training. Second, in FLSys, the communication between the phones and the cloud is asynchronous, to cope with phone disconnections. The software at the cloud side is designed to tolerate missing messages from the phones. Overall, FLSys reduces communication overhead and increases client utility, at the expense of less control in the client sampling process, compared to [3]. In order to use a given model on the phone, the FL Phone Manager first registers the phone with the FL Cloud Manager. If the phone model and mobile OS are known to work with the model, the FL Cloud Manager registers the phone with the New Model Notification Service, which works as a publish-subscribe cloud service, and returns the subscription to the phone. This subscription allows the phone to receive asynchronous notifications when a new global model is available for download.
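As a concrete illustration of the on-demand data-collection configuration discussed above (requirement R1), the sketch below shows how per-model sensing requests could be reconciled into a single Data Collector configuration. The field names and values are hypothetical, not FLSys's actual schema.

```python
# Illustrative sketch only: hypothetical per-model data-collection requests and how a
# phone-side manager might reconcile them. Field names are assumptions, not FLSys's schema.

MODEL_SENSOR_REQUESTS = {
    "har_wild": {"sensor": "accelerometer", "sampling_rate_hz": 5, "flush_period_s": 60},
    "another_model": {"sensor": "accelerometer", "sampling_rate_hz": 20, "flush_period_s": 300},
}

def resolve_collector_config(requests):
    """Group requests by sensor and keep the highest sampling rate and the shortest
    flush period among the subscribed models, so that one Data Collector per sensor
    satisfies every model's minimum requirements."""
    config = {}
    for req in requests.values():
        sensor = req["sensor"]
        current = config.setdefault(sensor, {"sampling_rate_hz": 0, "flush_period_s": float("inf")})
        current["sampling_rate_hz"] = max(current["sampling_rate_hz"], req["sampling_rate_hz"])
        current["flush_period_s"] = min(current["flush_period_s"], req["flush_period_s"])
    return config

print(resolve_collector_config(MODEL_SENSOR_REQUESTS))
# {'accelerometer': {'sampling_rate_hz': 20, 'flush_period_s': 60}}
```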
(Figure 1a caption, steps of one training round: (1) the Phone Manager of Client #1 registers with the FL Cloud Manager, which grants registration based on training settings; (2) the Phone Manager of Client #1 fetches the up-to-date global model from a designated storage, trains it with local data, and uploads local gradients to a designated storage; (3) the Phone Manager of Client #2 tries to register, but is denied; (4) the Phone Manager of Client #2 successfully registers at a later time, but the training misses the deadline, thus its gradient upload is denied; (5) Clients #1 and #2 try to register during server aggregation and are denied; (6) each model's Aggregator loads the gradient updates, aggregates them, and saves the aggregated model.)

The FL Phone Manager downloads the model at a time determined based on the model usage frequency and power settings. The training for each model is done in rounds. The FL Cloud Manager decides the duration of a round, based on preferences associated with each model. For example, the server may start a new aggregation (i.e., by invoking the Model Aggregator for a certain model) when a given time interval has passed or when a certain number of local training updates have been received from the phones. The FL Phone Manager decides when to participate in training. This decision is based on local policies that attempt to balance inference accuracy, the amount of input data for training, and the resources consumed during training. The intention to participate in training for a given model is conveyed by a message sent to the FL Cloud Manager. Based on the model preferences (e.g., the amount of data and the number of users in a training round), the server may decide to ask the phone to train for the model and to provide the FL Phone Manager with a URL to upload the results to the Cloud Local Gradients Storage. If there is a deadline for participation in the round, the FL Cloud Manager lets the FL Phone Manager know about it. The FL Phone Manager invokes the Model Trainer for the given model and passes as a parameter the location of the data in the Processed Data Storage. After the training is done, the Model Trainer stores the newly computed gradients in the Phone Local Gradients Storage. The FL Phone Manager decides when to upload these gradients to the Cloud Local Gradients Storage.

The FL Cloud Manager invokes the Model Aggregator for the model when the duration of the round expires or when enough updates have been uploaded. The Model Aggregator reads the updates from the Cloud Local Gradients Storage, computes the aggregated weights, and stores them in the Cloud Global Model Weights Storage. The intermediate training state is stored in the Training State Storage, which provides lower I/O latency compared with the other types of cloud storage in our design; this is because FLSys needs frequent access to these data during training. Then, the Model Aggregator sends a notification via the New Model Notification Service to let the phones know that a new model version is available. The cloud-side system satisfies requirement R3, as it can scale to large numbers of users due to its modular design that decouples computation, communication, storage, and notification services. The cloud elasticity features of each service allow different services to scale up or down according to the workload. As we observe from the architecture, each model is managed individually by FLSys, and multiple models can co-exist both at the phones and in the cloud. In the cloud, different models use independent cloud resources, which can be scaled independently.
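As an illustration of the aggregation step described above, here is a minimal sketch of federated averaging over uploaded weight deltas. It assumes equal client weighting, as FLSys's Model Aggregator does, and uses in-memory NumPy arrays in place of the actual cloud storage.

```python
import numpy as np

def aggregate_round(global_weights, client_deltas):
    """Federated averaging for one round: average the per-client weight deltas and
    apply them to the current global model, assuming equal client contributions."""
    if not client_deltas:
        return global_weights  # round aborted: no updates uploaded before the deadline
    averaged = [np.mean([delta[i] for delta in client_deltas], axis=0)
                for i in range(len(global_weights))]
    return [w + d for w, d in zip(global_weights, averaged)]

# Usage sketch: two clients, a toy two-tensor model.
global_w = [np.zeros((3, 3)), np.zeros(3)]
deltas = [[np.ones((3, 3)), np.ones(3)], [3 * np.ones((3, 3)), 3 * np.ones(3)]]
new_global = aggregate_round(global_w, deltas)  # every weight becomes 2.0
```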
On the phone, independent model trainers and inference runners are responsible for different applications. The cloud contains all the models in the system, while each phone contains only the models for which it has subscribed. This modular design allows our system to satisfy requirement R4.

Mobile Apps Using Inference. We decouple mobile apps that need inference on the phones from the models that provide the inference. This allows an app to use multiple models, while the same model can be used by multiple apps. FLSys provides an API and a library that can be used by third-party app developers to perform inference using DL models on the phone. In this way, the system architecture satisfies requirement R5. When an app needs an inference from a model, it sends a request to the FL Phone Manager using one of the OS IPC mechanisms. The FL Phone Manager then generates the input for the inference from the data stored in the Processed Data Storage or Raw Data Storage, and invokes the Model Runner with this input. The Model Runner sends the result to the app using IPC.

Model Concurrency. Given the design of FLSys, both the FL Phone Manager and the FL Cloud Manager are able to handle multiple models concurrently. However, the meaning of concurrency is slightly different on each side. The FL Cloud Manager needs to handle the aggregation of all models that are registered with it. It also needs to communicate with a potentially large number of clients for each model at the same time. FLSys handles this concurrency through services provided by the underlying cloud platform, which support concurrency by design; FLSys just needs to orchestrate the invocation of these services. The FL Phone Manager needs to handle concurrent training and inference. Our preliminary experiments on smart phones show that parallel training of multiple models is very slow due to resource contention. It also affects the user experience on the phones. Therefore, we decided to train models sequentially. The FL Phone Manager can request to participate in training rounds for multiple models concurrently, but it locally decides a sequential order in which to train these models, based on parameters such as the frequency of model usage by apps, the training round deadlines, and the historical training latency of each model. Finally, the inference requests from the apps are executed as soon as they are received, to maintain a good user experience.

We implemented an end-to-end FLSys prototype in Android and AWS cloud, which were chosen because they are the market leaders for mobile OSs and cloud platforms, respectively. However, the FLSys design is general, and it can be implemented in other mobile OSs and cloud platforms. The prototype implements all of the components described in the system architecture (Figure 1b). This section reviews the implementation technologies and the reasons for selecting them, and then focuses on the Android implementation and the AWS implementation of FLSys.

Deep Learning Framework. We choose Deep Learning for Java (DL4J) as the underlying framework for the on-device DL-related operations (i.e., training and model execution) because it is the only mature framework that supports model training on Android devices. While the Model Aggregator in the cloud could be implemented using other DL technologies, for consistency, we implement it in DL4J as well. The models are stored as zipped JSON and bin files, kept in folders on the phone and in AWS S3 buckets in the cloud.

On-device Communication.
For IPC among Android apps/services, we use Android Bound Service and Android Intent. A bound service can efficiently serve another application component because it does not run in the background indefinitely. Through IPC, the FL Phone Manager can provide third-party apps with an interface to request inference results without revealing the model or the data. Furthermore, it can communicate with the Data Collector.

Cloud Platform and Services. We opt to utilize the Function-as-a-Service (FaaS) architecture for our cloud computation. The core cloud components of FLSys are implemented and deployed as AWS Lambda functions [1]. We chose FaaS for our implementation for five reasons. First, it matches our asynchronous, event-based design, as Lambda functions are triggered by events. Second, it provides fine-grained scalability at the function level, thus leading to lower resource consumption in the cloud; furthermore, computation and storage are scaled automatically and independently by the cloud platform. Third, unlike other cloud computing approaches, it does not require running virtual machines when no computation is necessary; this saves additional resources and reduces cost. Fourth, FaaS simplifies the development and deployment of our prototype because it does not require software installation, system configuration, etc. Fifth, different functions can be implemented in different programming languages, making the implementation even more flexible. Lambda functions are triggered in different ways in our prototype. We use the AWS API Gateway to define and deploy HTTP and REST APIs. For instance, we create a REST API to relay clients' requests to participate in the FL training to the Lambda function that handles these requests. We also use AWS EventBridge to define rules that trigger and filter events for Lambda functions. FLSys uses a number of cloud services for storage, authentication, and publish-subscribe communication. For model storage, validation datasets, and FL Cloud Manager configuration files, we use AWS S3, which offers a reliable and cost-effective solution for data accessed infrequently. More importantly, AWS S3 buckets can be accessed directly by phones, which simplifies the asynchronous communication in FLSys. To authenticate clients and allow them to upload and download models from AWS S3, FLSys uses Identity Pool in AWS Cognito. To store data that is accessed frequently, such as training round states and model states, we use AWS DynamoDB, a reliable NoSQL database. AWS SNS is utilized in conjunction with Google FCM to notify clients when newly trained models are ready. The use of a Google Cloud service in our AWS implementation was necessary in order to push notifications directly to apps on the phones when a new global model is ready in the cloud.

The phone implementation (left side of Figure 1b) consists of three apps: an FL Phone Manager, a HAR Data Collector, and a Testing App used to test model inference.

Data Collector. We implemented a HAR Data Collector app designed for long-term and battery-efficient data collection. To that end, sensor values are not collected at an enforced fixed high frequency, but are instead collected independently through Android listeners whose actual frequency is variable, determined by the underlying OS. This is appropriate for data collection in the wild.
In our experience, this tends to be much friendlier to the performance and battery life of the user devices, lowering the risk that a user abandons FLSys prematurely due to concerns about how it is affecting their device resources. Furthermore, users are given the option to pause or stop data collection for all or a subset of sensors in case they have resource consumption or privacy concerns. For simplicity, the raw data and the processed data are stored as files.

FL Phone Manager. The FL Phone Manager app decides to initiate an on-device training round based on evaluating a Ready To Config policy (RTCp). We implemented a simple policy that checks whether the phone is charging and is connected to the network before declaring its availability for training. If yes, it sends a Ready To Config message (RTCm) to the FL Cloud Manager. The RTCm is implemented as an HTTP request with a JSON payload and is sent to a REST API URL in AWS. The FL Cloud Manager either accepts or denies the phone's participation in the training round, based on a simple Accept/Deny for Training policy (A/DFTp) that checks the phone model and client identity. The phone is accepted for a round of training when it receives an Accept For Training message (AFTm). The AFTm contains the AWS S3 locations from which to download the latest global model weights and to which to upload the local gradients. The message also contains the deadline for this training round's completion. The FL Phone Manager evaluates a Start To Train policy (STTp), based on the available device resources and the round's deadline, to determine whether to actually perform the on-device training for this round. The FL Phone Manager creates the corresponding Model Trainer if it decides to train. The Model Trainer is implemented with the Android native AsyncTask class to ensure the trainer is not terminated by Android, even when the app is idle. AsyncTask also enables multiple trainers to train in the background. Once the training is complete, the Model Trainer uploads the local gradients to the corresponding AWS S3 location. Model inference is implemented as a background service with the Android Interface Definition Language (AIDL), and it receives inference requests from third-party apps. When such a request is received, the FL Phone Manager uses the current sensor data from the Data Collector as input for the model, runs the inference, and responds to the third-party app with the inference results.

Testing App. We implemented a simple Testing App to test model inference. The app uses AidlConnection to interface with the FL Phone Manager. Let us note that the app itself does not access any data or model.

The cloud implementation (right side of Figure 1b) consists of two main components: the FL Cloud Manager and the Model Aggregator.

FL Cloud Manager. The FL Cloud Manager is implemented as a series of Lambda functions (the FaaS service in AWS). When starting a training round, it reads a configuration file and determines the deadline for the round (i.e., the time when the round must finish). During the period between the start time and the deadline, the FL Cloud Manager accepts or denies clients' requests for training (RTCm). When the deadline is reached, the FL Cloud Manager executes the Model Aggregator according to the Start for Aggregation policy (SFAp). The current policy checks whether enough clients (a configurable parameter) have submitted their local gradients to AWS S3.
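For concreteness, below is a simplified sketch of what the cloud-side handler behind the RTCm REST API could look like as a Python Lambda function. The payload fields, bucket names, and the use of presigned URLs are illustrative assumptions; the actual FLSys prototype authenticates clients through AWS Cognito and lets phones access S3 directly.

```python
import json
import time
import boto3

s3 = boto3.client("s3")
SUPPORTED_MODELS = {"Pixel 3", "Pixel 3a"}   # illustrative A/DFTp input, not the real policy
ROUND_DEADLINE_S = 6 * 60                    # e.g., a 6-minute round

def handle_rtcm(event, context):
    """Accept or deny a phone's Ready To Config message for the current round."""
    request = json.loads(event["body"])      # RTCm arrives as an HTTP request with a JSON payload
    if request.get("phone_model") not in SUPPORTED_MODELS:
        return {"statusCode": 200, "body": json.dumps({"decision": "DENY"})}

    model_id, client_id = request["model_id"], request["client_id"]
    aft = {
        "decision": "AFT",
        "deadline": int(time.time()) + ROUND_DEADLINE_S,
        # Presigned URLs stand in for FLSys's Cognito-based direct S3 access.
        "global_model_url": s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": "flsys-global-models", "Key": f"{model_id}/latest.zip"}),
        "gradient_upload_url": s3.generate_presigned_url(
            "put_object",
            Params={"Bucket": "flsys-local-gradients", "Key": f"{model_id}/{client_id}.zip"}),
    }
    return {"statusCode": 200, "body": json.dumps(aft)}
```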
Then, the Lambda function implementing the FL Cloud Manager schedules an event for itself to perform the next training round, and terminates. The training process stops when the pre-defined number of rounds is reached or, if the model developers provided a validation dataset, when the desired performance (model accuracy) is achieved.

Model Aggregator. For implementation simplicity, the Model Aggregator uses the federated averaging technique [28], with the assumption that each client contributes equally to the global model in each training round. When invoked, it loads the uploaded local gradients and aggregates them into the global model for this round. Once the global model is updated, the Model Aggregator invokes AWS SNS to notify clients that they can download the newly aggregated model. Note that the Model Aggregator is called dynamically through reflection, such that different aggregation functions can be swapped dynamically.

Algorithm 1 shows the pseudo-code of our asynchronous federated averaging process. The algorithm consists of three procedures, which execute asynchronously. "ClientLoop" (lines 1-12) runs at the clients and executes a round of training (lines 7-12), if the phone self-selects for training and the cloud accepts it (lines 1-6). "ServerRTCmHandler" (lines 13-17) is part of the FL Cloud Manager and decides whether a phone is accepted for training. "ServerLoop" (lines 18-40) also runs at the FL Cloud Manager; it performs the aggregation of local gradients and controls the progression of training. The clients participating in a training round must submit their local gradients before the round's deadline expires. When the deadline comes, the procedure first evaluates the Start for Aggregation policy, which checks whether there are enough local gradient updates to perform aggregation. If yes, the aggregation is performed (lines 24-26); if not, the round is aborted, but the uploaded gradient updates are carried over to the next round. After aggregation, the procedure may check against pre-defined conditions to decide whether the aggregation outcome should be accepted or not (lines 27-30). Finally, the procedure checks whether a new round should be started by evaluating the Start New Round policy. If a new round is to be started, a new deadline is set (lines 33-36). Otherwise, the procedure terminates.

Algorithm 1: Asynchronous federated averaging (excerpt, as recoverable from the original layout; elided lines are marked with "..."):
    ...
    4:  if readyToConfig then
    5:      response ← SendRTCm( )
    6:      if response == "AFT" then
    7:          B ← Sampling(D_L)
    8:          θ_l ← θ_t
    9:          for batch b ∈ B do
    10:             θ_l ← θ_l − η∇L(θ_l; b)
    11:         Δ_l ← θ_l − θ_t
    12:         UploadClientGradients(Δ_l)
    13: procedure ServerRTCmHandler(RTCm)
    14:     if EvaluateAcceptForTrainingPolicy(RTCm) then
    15:         ReturnResponse("AFT")
    16:     ...
    18: procedure ServerLoop( )
    19:     deadlineTriggered ← false
    20:     SetupDeadline( )   (deadlineTriggered ← true when triggered)
    21:     while true do
    22:         if deadlineTriggered then
    23:             if EvaluateStartForAggregationPolicy( ) then
    24:                 ...  (aggregation, lines 24-26)
            ...
            Wait( )

By design, FLSys acts as a service provider that handles multiple FL models with minimum input from the users. The setup procedures for FLSys are divided into two stages. The first stage involves the FL Cloud Manager and the app developers, without user involvement. The second stage involves the FL Phone Manager and the mobile apps that use FL models, and it requires minimum user involvement. The FL Cloud Manager is deployed before the first stage, and the FL Phone Manager should be installed on the user's device before the second stage.
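The client side of Algorithm 1 (lines 7-12) can also be rendered as a short runnable sketch. The NumPy stand-ins below are illustrative; they are not the DL4J Model Trainer used in FLSys.

```python
import numpy as np

def client_round(theta_t, local_batches, grad_fn, learning_rate=0.01):
    """One client-side round, following Algorithm 1 (lines 7-12): start from the
    downloaded global weights theta_t, run local SGD over the sampled batches,
    and return the weight delta to upload as the local gradients."""
    theta_l = [w.copy() for w in theta_t]
    for batch in local_batches:
        grads = grad_fn(theta_l, batch)                       # backprop (DL4J on the phone)
        theta_l = [w - learning_rate * g for w, g in zip(theta_l, grads)]
    return [l - t for l, t in zip(theta_l, theta_t)]          # Delta_l = theta_l - theta_t

# Toy usage: a one-parameter "model" with a squared-error gradient on scalar batches.
toy_grad = lambda weights, batch: [2 * (weights[0] - np.mean(batch))]
delta = client_round([np.array([0.0])],
                     local_batches=[np.array([1.0, 2.0])] * 5,
                     grad_fn=toy_grad)
```

The returned delta is exactly what the Model Aggregator averages across clients in each round.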
To illustrate these stages, let us briefly explain the setup workflow using the HAR app as an example. In the first stage, the developers of the HAR app need to register the model with the FL Cloud Manager. If the model is developed by the app developers, then the developers need to provide the FL model to be trained and the training plan (e.g., training frequency, number of rounds, number of participants in a round, etc.) to register the app. If the developers plan to use an existing FL model, then they need to specify which model to use when registering the app. After registration, a unique key for authentication between the app and the FL Phone Manager in the second stage is provided. The second stage is typically triggered during the installation of the HAR app on the user's device. The app communicates with the FL Phone Manager and authenticates itself using the aforementioned unique key. Once the app is successfully authenticated, the FL Phone Manager performs a series of operations and eventually becomes ready to serve the FL model for the app.

We co-designed FLSys with a HAR model, which was used to extract the main requirements for FLSys, and then to demonstrate the efficiency and effectiveness of FLSys. To show that FLSys works with different concurrent models, we also implemented and evaluated a sentiment analysis (SA) model, as described in Section VI. In this section, we describe the HAR dataset, our HAR-Wild model, and its training algorithm, which uses data augmentation to deal with non-IID data in the wild.

Although there are good HAR datasets publicly available, e.g., WISDM [22] and UCI HAR [2], they are not representative of real-life situations. These datasets were collected in rigorously controlled environments with standardized devices and controlled activities, in which the participants only focused on collecting sensor data with a usually high and fixed sampling rate, i.e., 50 Hz or higher. Thus, given our goal to test FLSys with data collected in the wild, we used our Data Collector, described in Section IV-B, to collect data from 116 users at two universities. The data collection was approved by the IRBs at both universities. The students collected data from April 1st to August 10th, 2020. Each user provided mobile accelerometer data and labels of their activities on their personal Android phones. We provided labels in five categories for participants to choose from: "Walking," "Sitting," "In Car," "Cycling," and "Running". The phones were naturally heterogeneous, and the daily-life activities were not constrained by our experiments. Therefore, we collected a novel HAR dataset in the wild that differs from the existing datasets in the following three aspects: (1) The sensors' sampling rates vary from time to time and from user to user, due to battery constraints, device variability, and usability targets; (2) The same basic activity generates different signals, since different users have different habits of carrying smart phones; (3) Label distributions are not just biased, but vary significantly among users.

Our data processing consists of the following steps: (1) Any duplicated data points (e.g., data points that have the same timestamp) are merged by taking the average of their sensor values; (2) Using 300 milliseconds as the maximum gap threshold, consecutive data points are grouped into continuous data sessions; (3) Data sessions that have unstable or unsuitable sampling rates are filtered out.
We only keep the data sessions that have a stable sampling rate of 5 Hz, 10 Hz, 20 Hz, or 50 Hz; (4) Data sessions are also filtered with the following two criteria to ensure good quality: (a) The first 10 seconds and the last 10 seconds of each data session are trimmed, due to the fact that users were likely operating the phone during these time periods; (b) Any data session longer than 30 minutes is trimmed down to 30 minutes, in order to mitigate potentially inaccurate labels due to users' negligence (forgetting to turn off labeling); and (5) We sample data segments of 100 data points with sliding windows. Different overlap percentages were used for different classes and different sampling rates. The majority classes have 25% overlap to reduce the number of data segments, while the minority classes have up to 90% overlap to increase the number of available data segments. The same principle is applied to sessions with different sampling rates. We sample 15% of the data for testing, while the rest is used for training. Details are shown in Table I.

Data Normalization. In our models, the accelerometer data is normalized as x ∈ [−1, 1]^d to achieve better model utility. We compute the mean and variance of each axis (i.e., X, Y, and Z) using only training data, to avoid information leakage from the training phase to the testing phase. Then, both training and testing data are normalized with the z-score, based on the mean and variance computed from the training data. Based on these statistics, we choose to clip the values to [min, max] = [−2, 2] for each axis, which covers at least 90% of the possible data values. Finally, all values are linearly scaled to [−1, 1] to finish the normalization process:

x̂ = clip((x − μ) / σ, −2, 2) / 2,   (1)

where μ and σ denote the per-axis mean and standard deviation estimated on the training data.

The design of our HAR-Wild model has two requirements: low computational complexity and a small memory footprint. Satisfying these requirements ensures that the model can work efficiently on phones. Figure 2 shows our model architecture. For low computational complexity, HAR-Wild is based on a CNN (instead of an RNN, e.g., LSTM) and tailored to work well on mobile devices. In addition, instead of using the raw data streams directly, the accelerometer data are processed into data segments of shape [3, 100], representing 100 data points on the 3 axes: X, Y, and Z. We adapt the recipe of the ResNet model [14] into a small-size model, by feeding the processed accelerometer data as input to (1) a sequence of a 1D-CNN, a Batch Norm, a 1D-CNN, a Batch Norm, and a Flatten layer; and (2) a sequence of a 1D-CNN, a Batch Norm, and a Flatten layer. The two Flatten layers are concatenated before feeding them into a sequence of a Dropout layer, a Dense layer, and an Output layer. By doing so, HAR-Wild can memorize and transfer the low-level latent features learned by the very first 1D-CNN, directly derived from the input data, to the output layer for better classification. We use Global Average Pooling [25], given its small memory footprint, instead of the popular Local Max/Average Pooling [12]. A small-size model is expected to perform better on data collected in the wild, since the data will likely have more distribution drift, increasing the chance of overfitting for large-size models.

The performance of FL models is negatively affected by non-IID data distribution [21], [40], [20], and we observed this to be true for HAR-Wild as well. Figure 3 shows the data distribution of the HAR-Wild dataset.
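To make the normalization pipeline above concrete, the following is a minimal NumPy sketch of Equation (1): z-score with training-set statistics, clipping to [−2, 2], and linear scaling to [−1, 1]. The function and variable names are illustrative, not FLSys's actual preprocessing code.

```python
import numpy as np

def fit_normalizer(train_segments):
    """Compute per-axis mean and standard deviation from training data only,
    to avoid information leakage into the test set."""
    stacked = np.concatenate(train_segments, axis=1)     # shape [3, total_points]
    return stacked.mean(axis=1, keepdims=True), stacked.std(axis=1, keepdims=True)

def normalize(segment, mean, std, clip=2.0):
    """z-score each axis, clip to [-clip, clip], then scale linearly to [-1, 1]."""
    z = (segment - mean) / std
    return np.clip(z, -clip, clip) / clip

# Usage: segments are accelerometer windows of shape [3, 100] (axes X, Y, Z).
train = [np.random.randn(3, 100) for _ in range(10)]
mean, std = fit_normalizer(train)
x_hat = normalize(train[0], mean, std)                   # values now lie in [-1, 1]
```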
To address this problem, we leverage data augmentation training [18] and tailor it to mitigate the distortion in computing gradients at the client side, by balancing the client data with a small number of augmentation data samples, without undue computational cost. The pseudo-code for HAR-Wild Asynchronous Augmented Learning is described in Algorithm 2. This algorithm is integrated into Algorithm 1 by replacing lines 7-12 of Algorithm 1 with the AugmentedGradients procedure of Algorithm 2.

Algorithm 2: HAR-Wild Asynchronous Augmented Learning (lines 12-14 reconstructed from the text description and from the matching lines of Algorithm 1):
    1:  procedure Init(clients)
    2:      augmentation pool A ← SampleAugmentData(clients)
    3:      DeliverAugmentPool(A, clients)
    4:  procedure AugmentedGradients(Round t, Client i)
    5:      Augmentation data pool A
    6:      Local data pool L_i
    7:      θ_l ← θ_t
    8:      augmentation data D_A = SampleAugmentData(A)
    9:      local data D_L = SampleData(L_i)
    10:     training data D_T = Concatenate(D_A, D_L)
    11:     for batch b ∈ D_T do
    12:         θ_l ← θ_l − η∇L(θ_l; b)
    13:     Δ_l ← θ_l − θ_t
    14:     UploadClientGradients(Δ_l)

Before the whole training process starts, the FL Cloud Manager executes the procedure Init (lines 1-3, Algorithm 2), which first collects a small pool of random samples for each class to be used for data augmentation (line 2). These data can be collected from a small number of volunteers or controlled users who share IID data with the FL Cloud Manager in FLSys. The augmentation data pool could also come from publicly available datasets. Then, the augmentation data pool A is delivered to each client (line 3). In each training round, each client (i.e., phone) randomly samples the augmentation data (line 8). Then, the sampled augmentation data D_A is combined with the local data D_L (line 10, Concatenate(D_A, D_L)) to compute the local gradients (lines 11-13, local training). The local gradients are then sent to the cloud for the asynchronous average aggregation and model update (line 14).

In order to deliver the augmentation data to the clients (line 3), we consider two objectives: (i) privacy preservation and (ii) communication efficiency. One naive approach is to send data to augment the missing classes at the clients in each training round, since the locally missing data can change over time. In this approach, the FL Cloud Manager needs to know which classes are missing for each client in each training round. This could increase the communication cost and significantly increase the data privacy risk, since the cloud learns certain aspects of user behavior based on which classes are missing data over time. To achieve both privacy preservation and communication efficiency, the approach implemented in FLSys (Algorithm 2) delivers the entire augmentation data to every client only once, at the beginning of the training process. Then, the clients use only the data necessary to augment their missing data in each training round. Given the small size of the augmentation data, the overhead of also sending unnecessary augmentation data is minimal, without causing extra data privacy risk.

The evaluation has two main goals: (i) analyze the performance of the two FL models, HAR-Wild and sentiment analysis (SA), to understand if they perform similarly to their centralized counterparts; and (ii) quantify the system performance of FLSys with HAR-Wild and SA on Android and AWS. In terms of system performance, we investigate energy efficiency and memory consumption on the phone, system tolerance to phones that do not upload local gradients, and FL aggregation scalability in the cloud. We also study the overall response time for third-party apps that use FLSys on the phone.
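To illustrate the per-round augmentation step (lines 8-10 of Algorithm 2), here is a sketch of how a client could combine local data with samples drawn from the augmentation pool. The balancing heuristic (topping up each class toward the size of the largest local class) is an assumption for illustration, not the exact sampling policy of HAR-Wild.

```python
import random
from collections import defaultdict

def build_training_data(local_data, augmentation_pool, num_classes):
    """Combine local samples with augmentation samples so that classes that are
    missing (or under-represented) locally still contribute to the local gradients.
    local_data / augmentation_pool: lists of (features, label) pairs."""
    by_class = defaultdict(list)
    for sample in local_data:
        by_class[sample[1]].append(sample)

    target = max((len(v) for v in by_class.values()), default=0)
    training_data = list(local_data)
    for label in range(num_classes):
        deficit = target - len(by_class[label])
        candidates = [s for s in augmentation_pool if s[1] == label]
        if deficit > 0 and candidates:
            training_data += random.choices(candidates, k=min(deficit, len(candidates)))
    random.shuffle(training_data)
    return training_data

# Toy usage: a client with no samples at all for three of the five activity classes.
local = [([0.1] * 3, 0)] * 8 + [([0.2] * 3, 1)] * 2
pool = [([0.3] * 3, c) for c in range(5) for _ in range(4)]
balanced = build_training_data(local, pool, num_classes=5)
```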
For model evaluation, we use the Accuracy, Precision, Recall, and F1-score metrics. For system performance, we report execution time and memory consumption for both the phones and the cloud, and battery consumption on the phones. Most of the evaluation is done in the context of HAR-Wild, which illustrates a typical FL model based on mobile sensing data. However, to demonstrate that FL works for different models, we also show results for the SA model. The rest of the section is organized as follows: Section VI-A evaluates HAR-Wild under different scenarios. Section VI-B describes the SA model and shows its performance. Section VI-C shows the HAR-Wild performance over the FLSys prototype; since we did not have enough phones, we show the results using Android emulators to replay each user's data. Section VI-D presents scalability and fault-tolerance results for HAR-Wild over FLSys in AWS. Finally, Section VI-E presents results for HAR-Wild and SA over FLSys on two types of Android phone models.

Table I shows the basic information of our collected dataset, used for all HAR-Wild experiments. Some of the users have very limited numbers of labeled activities; thus, we select data from 51 users who labeled a reasonable number of samples.

Table I (per-class sample counts):
Training: 48855, 51499, 49185, 14281, 1920
Testing: 8514, 8828, 8595, 2514, 319

We perform centralized and simulated evaluation to assess HAR-Wild's utility compared with several baselines. Centralized training works as an upper bound on the performance of FL models. In addition, it allows us to fine-tune the model's hyperparameters. The evaluation includes three variants of HAR-Wild: HAR-W-32, HAR-W-64, and HAR-W-128, which have the number of convolution channels set to 32, 64, and 128, respectively. We derive two versions of our HAR-W-64 model in simulated FL, with (HAR-W-64-uniform) and without (HAR-W-64-stock) data augmentation. The augmentation data, consisting of 640 samples per class, is fixed and shared with all clients.

Baseline approaches. To demonstrate that HAR-Wild is competitive with respect to state-of-the-art HAR models, we consider two baseline models: (1) a Bidirectional LSTM with 3-axial accelerometer data as input, which is a typical model for time-series data and which we fine-tune based on a grid search of hyperparameters; and (2) the CNN-based models proposed by Ignatov [16], with additional features (CNN-Ig) and without additional features (CNN-Ig-featureless), using the recommended settings in [16]. For a fair comparison, we used TensorFlow implementations for all models. Table II shows all the hyperparameters and model configurations.

Results. Figure 4a shows that the HAR-Wild models outperform the baseline approaches. On average, HAR-W-64 performs best and achieves 82.49% accuracy, compared with 78.68%, 76.39%, and 77.08% for BiLSTM, CNN-Ig, and CNN-Ig-featureless, respectively. Our HAR-Wild models also achieve the best performance in all the other metrics (Table III). Overall, HAR-W-64 (60,613 trainable weights) has the best trade-off among model accuracy, convergence speed, and model size, and we use it in all the following HAR-Wild experiments. In the simulated FL, we replay the data collected in the wild for each user. HAR-Wild trained with FL and data augmentation achieves 71.8% accuracy, which is about 10% less than the accuracy of the centrally trained HAR-Wild. This is the cost of the privacy protection provided by FL. We notice the substantial benefit of data augmentation, since HAR-W-64-uniform outperforms HAR-W-64-stock (without data augmentation) in all metrics.
FLSys is designed and implemented to be flexible, in the sense that the training and inference of multiple models can run concurrently. On the server, different applications use independent AWS resources. On the phone, independent model trainers and inference runners are responsible for different applications. This subsection showcases the training performance of the SA model, a text analysis application that interprets and classifies the emotions (positive or negative) in text data. With the inferred emotions of mobile users' private text data, a smart keyboard may automatically generate emojis to enrich the text before sending. We build the SA model specifically for tweet data. We use the FL benchmark dataset Sentiment140, which consists of 1,600,498 tweets from 660,120 users. We select the users with at least 70 tweets, and this sub-dataset contains 46,000+ samples from 436 users. Figure 5 shows our SA model architecture. We first extract a feature vector of size 768 from each tweet with DistilBERT [33]. Then, we apply two fully connected layers, with ReLU and softmax activations, respectively, to classify the feature vector as positive or negative. The number of hidden units of the first fully connected layer is set to 128 to balance convergence speed and model size. In the FL version of the model, 5% of the users are used for data augmentation, and the rest of the users follow a 4:1 train-test split.

Results. While the reference implementation associated with this benchmark dataset reached 70% accuracy [5] using 100 users with a stacked LSTM in FL simulation, our SA model achieves superior performance, as shown in Table IV. Centralized learning achieves 81% accuracy, while FL achieves 79% accuracy (an acceptable drop).

This set of experiments tests FLSys with HAR-Wild by running the actual phone code in Android emulators. However, since Android emulation is slow and costly, we run several larger-scale experiments with the same DL4J algorithms and functions in Linux, which is much faster; this enables us to scale the experiments to larger numbers of clients. All the phone components of the prototype, except for the Data Collector and the Data Preprocessor, run in the emulators. The cloud part of the prototype runs in AWS. The Android emulators run on top of virtual machines (VMs) in Google Cloud, as AWS does not support nested virtualization. We run 10 VMs in Google Cloud, and each VM has 16 vCPUs and 60 GB of memory. On each instance, we run 4 Android v10 emulators from the AVD Manager in Android Studio. Each emulator is loaded with 3 users' data files, and each file is sampled twice as different clients. In each round, each Android emulator participates in training on behalf of a few clients. We set the deadline for the round in the FL Cloud Manager to 6 minutes.

Results. Figure 4b shows that HAR-Wild with 64 emulated clients, in both Android and Linux over FLSys, achieves accuracy comparable to the simulated FL with TensorFlow, i.e., 69.07%, 68.50%, and 66.00%. Figure 4c shows the results of HAR-Wild with a higher number of clients (up to 960) using Linux emulations. The client data was over-sampled from the original 51 users. The HAR-Wild model achieves up to 69.17% accuracy, and more clients help the model converge faster with better performance.

Fault Tolerance. In daily life, some clients may fail to upload a trained model to the FL Cloud Manager due to network or computation issues. This set of experiments verifies the fault tolerance of FLSys in terms of model performance as a function of the percentage of clients that randomly drop out in each round.
Figure 4c shows the accuracy of HAR-Wild when up to 50% of the 480 clients drop out randomly in each round. With 1,000 rounds of training, the accuracy is reduced by at most 3.11%. This is a promising result, showing that FLSys can tolerate reasonably large dropout rates during training. Scalability. As discussed in Section IV, computation and storage scale independently in the cloud for FLSys. This set of experiments verifies the scalability of FLSys across training rounds. The only FL function that may be computationally intensive in the cloud is the Model Aggregator. Figure 6 shows that the Model Aggregator in AWS scales linearly with the number of participating clients. We benchmarked FLSys with HAR-Wild and SA on Android phones using a testing app to evaluate training and inference performance, and we also assessed the resource consumption on the phones. We used two phones with different specifications (Google Pixel 3 and Pixel 3a). Training Performance. Table VI shows the training time and the resource consumption on the phones. The training time is recorded by training on 650 samples for 5 epochs for HAR-Wild, and on 100 samples for 5 epochs for SA, which are the optimal settings determined in Section VI-C. Foreground training is done with the screen on, and it uses the full capacity of a single core; it provides a lower bound for the training time. In practice, however, we expect training to be done in the background, either on battery or on a charger, where other apps or system processes may interfere with training. We take 10 measurements for each benchmark and report the mean and standard deviation. Training for one round is fast on the phones. The foreground training time on the more powerful phone, the Pixel 3, is just 0.7 min for HAR-Wild and 0.22 min for SA. The background training time on a charger, which is the expected situation for FL training, is acceptable for any practical scenario: the phones experience a higher training time compared with the foreground case, but still complete one training round in less than 4 minutes. The background training time on battery is notably longer, since Android attempts to balance computation with battery saving. The results show that training is also feasible in terms of resource consumption. The maximum RAM usage of the app is less than 165MB, and modern phones are equipped with sufficient RAM to handle it. While we did not measure battery consumption for foreground training (that test was used only to obtain a lower bound on computation time), we measured battery consumption for background training on battery. The phones could easily perform hundreds of rounds of training on a fully charged battery. It is worth noting that, typically, one round of training per day is enough, as the users need time to collect new data. Inference Performance. The results in Table VII demonstrate that FLSys can be used efficiently by third-party apps. The inference time is measured within the testing app, and thus includes the latency due to both FLSys and the FL models. We continuously perform predictions for 30 minutes and report the average values. The inference time for the three scenarios in the third-party app (foreground, background on charger, and background on battery) follows a trend similar to training. FLSys and HAR-Wild/SA have reasonable resource consumption, which makes them effective in practice.
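For concreteness, the SA inference path whose latency is reported above can be sketched as follows. This is a minimal reconstruction of the pipeline from Section VI-B using Keras and Hugging Face Transformers; the mean pooling of DistilBERT's token embeddings into the 768-dimensional feature is our assumption, and the on-phone prototype runs the classifier with DL4J rather than Keras.

```python
import tensorflow as tf
from transformers import DistilBertTokenizerFast, TFDistilBertModel

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
encoder = TFDistilBertModel.from_pretrained("distilbert-base-uncased")

# The FL-trained part of the SA model: two fully connected layers on top of
# the 768-dimensional DistilBERT feature (128 hidden units, then softmax).
classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(768,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])

def predict_sentiment(text):
    tokens = tokenizer(text, return_tensors="tf", truncation=True)
    hidden = encoder(**tokens).last_hidden_state  # (1, seq_len, 768)
    feature = tf.reduce_mean(hidden, axis=1)      # pool into one 768-d vector
    return classifier(feature)                    # positive/negative probabilities
```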
This paper presented our experience with designing, building, and evaluating FLSys, an end-to-end federated learning system. FLSys was designed based on requirements derived from real-life applications that learn from mobile user data collected in the wild, such as human activity recognition (HAR). Compared with existing FL systems, FLSys balances model utility with resource consumption on the phones, tolerates client failures/disconnections and allows clients to join training at any time, supports multiple DL models that can be used concurrently by multiple apps, and acts as a "central hub" on the phone to manage the training, updating, and access control of FL models used by different apps. We built a complete prototype of FLSys in Android and AWS, and used this prototype to demonstrate that FLSys is effective and efficient in practice in terms of model performance, resource usage, and latency. We believe FLSys can open the path toward creating an FL ecosystem of models and apps for privacy-preserving deep learning on mobile sensing data. In terms of actual deployment in practice, we believe FLSys can be offered as FL as a Service (FLaaS) by cloud providers. Next, we report lessons learned and future work. Build mechanisms to cope with non-IID data. Since our data collection happened during the Covid-19 pandemic, we expected to see somewhat similar data from users who mostly stayed indoors. However, the data was still non-IID, strengthening the idea that data collected in the wild will almost always be non-IID. Future work on FLSys includes support for model- and data-specific augmentation and other approaches to cope with non-IID data. Beware of simulation pitfalls. A common practice in FL simulations is to use the same in-memory instances/placeholders for the different clients. Such simulations must carefully reset these instances for each client to avoid information leakage among clients, leakage that can never happen in a real system. Our initial experiments showed unexpectedly different results between simulations and Android emulators with DL4J for the same settings. The first problem we discovered was that Batch Normalization (BN) is not supported in DL4J for specific data shapes. We implemented our own BN in DL4J, but the simulation results still did not match the experimental results. Finally, we realized that BN does not work well for FL (consistent with [24]), but it appears to work in simulations because of the instances shared among the simulated clients. Thus, the FL models used in the reported experiments do not use BN. The second problem we noticed was that the Adam optimizer worked well in simulation, but not in the Android emulator experiments. This was also caused by instances shared by all clients in the simulation; such sharing cannot happen in practice, as it would amount to privacy leakage across clients. The lesson learned is that simulations may show better results than experiments with real FL systems. Most FL papers in the literature are based on simulations, which may suffer from problems similar to the ones described here. We believe FLSys offers an opportunity to test such FL models in real-life conditions.
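The isolation issue can be summarized with a short, self-contained sketch; the tiny model, toy client data, and training settings below are illustrative placeholders rather than FLSys code.

```python
import copy
import numpy as np
import tensorflow as tf

def build_model():
    """Tiny stand-in model; the actual HAR-Wild/SA architectures are not shown."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(5, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

# Two toy "clients" with random local data.
client_datasets = [(np.random.rand(32, 8), np.random.randint(0, 5, 32)) for _ in range(2)]
global_weights = build_model().get_weights()

# Pitfall: one shared model instance for all simulated clients. Optimizer state
# (e.g., Adam moments) and, if present, BatchNorm statistics leak from one
# client's training into the next, which cannot happen on real, isolated devices.
shared_model = build_model()
for x, y in client_datasets:
    shared_model.set_weights(global_weights)      # resets weights, but not optimizer state
    shared_model.fit(x, y, epochs=5, verbose=0)

# Safer simulation: a fresh model (and thus fresh optimizer state) per client,
# seeded from the current global weights.
for x, y in client_datasets:
    model = build_model()
    model.set_weights(copy.deepcopy(global_weights))
    model.fit(x, y, epochs=5, verbose=0)
```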
Balance mobile resources and model accuracy. In the current FL literature, there are no results showing that FL models work well on mobile devices while consuming a limited amount of resources on these devices (e.g., battery power, memory). A lesson that we understood early on is that FLSys needs to balance resource usage on the mobile devices with model accuracy. Therefore, FLSys uses an asynchronous design in which policies on the mobile devices are evaluated to decide when it makes sense for a device to participate in training and consume resources. Our results show that good model accuracy can be achieved even when a significant number of mobile devices do not participate in training in order to save resources. Let us also note that real systems cannot expect to run the same number of rounds that we observe in simulations. For example, it is common to see 10,000 rounds in simulations. However, in real life, mobile devices may not train more than once a day, due to both resource consumption and the lack of enough new data. In such a situation, running 10,000 rounds would take over 27 years. Thus, models must be optimized for a realistic number of rounds. Design for flexibility. FLSys was designed for model flexibility on the phones from the beginning (i.e., allowing apps to use multiple interchangeable models). However, we did not originally design for flexibility in the cloud. At first, we used virtual machines in the cloud and durable cloud storage for all FL operations. However, when we analyzed scalability and performance issues, we realized that an FaaS solution and different types of storage are necessary. Therefore, we changed the design of FLSys in the cloud to allow for different types of cloud platforms and storage options. Thus, FLSys can easily be ported to other cloud platforms beyond AWS. Future Work. In the near future, we will add features to allow FLSys to support continuous data collection, which is what we expect to see in real-life scenarios. We will also focus on designing and implementing privacy and security components for FLSys. Finally, we plan to improve FLSys from a DevOps point of view: we will evaluate the system performance under concurrent training of multiple models, support plug-and-play modules, and add a dashboard.
REFERENCES
A public domain dataset for human activity recognition using smartphones.
Towards federated learning at scale: System design.
Practical secure aggregation for privacy-preserving machine learning.
LEAF: A benchmark for federated settings.
The OPPORTUNITY challenge: A benchmark database for on-body sensor-based activity recognition.
LSTM networks for mobile human activity recognition.
A second pandemic: Mental health spillover from the novel coronavirus (COVID-19).
Self-balancing federated learning with global imbalanced data in mobile systems.
An industrial grade federated learning framework.
SecureGBM: Secure multi-party gradient boosting.
Deep learning.
Qiang Yang, Murali Annavaram, and Salman Avestimehr. FedML: A research library and benchmark for federated machine learning.
Deep residual learning for image recognition.
Human activity recognition on smartphones using a bidirectional LSTM network.
Real-time human activity recognition from accelerometer data using convolutional neural networks.
Statewide COVID-19 stay-at-home orders and population mobility in the United States.
Communication-efficient on-device machine learning: Federated distillation and augmentation under non-IID private data.
Praneeth Vepakomma. Advances and open problems in federated learning.
Federated optimization: Distributed machine learning for on-device intelligence.
Activity recognition using cell phone accelerometers.
Federated learning: Challenges, methods, and future directions.
FedBN: Federated learning on non-IID features via local batch normalization.
Lifelong federated reinforcement learning: A learning architecture for navigation in cloud robotic systems.
FedVision: An online visual object detection platform powered by federated learning.
Communication-efficient learning of deep networks from decentralized data.
PrivacyFL: A simulator for privacy-preserving and secure federated learning.
Deep recurrent neural networks for human activity recognition.
Privacy-preserving deep learning via additively homomorphic encryption.
On the convergence of federated optimization in heterogeneous networks.
DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter.
Fed-Focal Loss for imbalanced data classification in federated learning.
Federated AI for the enterprise: A web services based implementation.
Approaches to address the data skew problem in federated learning.
Federated machine learning: Concept and applications.
Applied federated learning: Improving Google keyboard query suggestions.
Federated learning with non-IID data.