title: Securing Federated Sensitive Topic Classification against Poisoning Attacks authors: Chu, Tianyue; Garcia-Recuero, Alvaro; Iordanou, Costas; Smaragdakis, Georgios; Laoutaris, Nikolaos date: 2022-01-31 We present a Federated Learning (FL) based solution for building a distributed classifier capable of detecting URLs containing GDPR-sensitive content related to categories such as health, sexual preference, political beliefs, etc. Although such a classifier addresses the limitations of previous offline/centralised classifiers, it is still vulnerable to poisoning attacks from malicious users that may attempt to reduce the accuracy for benign users by disseminating faulty model updates. To guard against this, we develop a robust aggregation scheme based on subjective logic and residual-based attack detection. Employing a combination of theoretical analysis, trace-driven simulation, as well as experimental validation with a prototype and real users, we show that our classifier can detect sensitive content with high accuracy, learn new labels fast, and remain robust in view of poisoning attacks from malicious users, as well as imperfect input from non-malicious ones. Most people are not aware of the fact that tracking services are present even at sensitive web domains. Being tracked on a cancer discussion forum, a dating site, or a news site with non-mainstream political affinity can be considered an "elephant in the room" when it comes to the anxieties that many people have about their online privacy. The General Data Protection Regulation (GDPR) [11] puts specific restrictions on the collection and processing of sensitive personal data "revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, also genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person's sex life or sexual orientation". So do other public bodies around the world, e.g., in California (California Consumer Privacy Act (CCPA) [43]), Canada [46], Israel [44], Japan [48], and Australia [45]. In a recent paper, Matic et al. [39] showed how to train a classifier for detecting whether the content of a URL relates to any of the above-mentioned sensitive categories. The classifier was trained using 156k sensitive URLs obtained from the Curlie [13] crowdsourced web taxonomy project. Despite the demonstrated high accuracy, this method has limitations that stem from being centralised and tied to a fixed training set. The first limitation means that the method cannot be used "as is" to drive a privacy-preserving distributed classification system. The second limitation implies that it is not straightforward to cover new labels related to yet unseen sensitive content. For example, in their work the Health category could be classified with accuracy greater than 90%. However, the training labels obtained from Curlie in 2020 did not include any labels related to the COVID-19 pandemic. Therefore, as will be shown later, this classifier classifies COVID-19 related sites that contain sensitive health content with only 53.13% accuracy. 
Federated learning (FL) [33, 40] offers a natural solution to the above-mentioned limitations since it allows different clients to train their classification models locally without revealing new or existing sensitive URLs that they label locally, while collaborating by sharing model updates that can be combined to build a superior global classification model. FL has proved its value in a slew of realworld applications, ranging from mobile computing [17, 23, 58] to health and medical applications [8, 25, 36] . However, due to its very nature, FL is sensitive to so-called poisoning attacks [2, 4] mounted by malicious clients that may intentionally train their local models with faulty labels, and then disseminate the resulting updates with the intention of reducing the classification accuracy for other benign clients. State-of-the-art approaches for defending against such attacks depend on robust aggregation techniques [5, 18, 19, 57] which, however, as we will demonstrate later, are slow to converge, thereby making them impractical for the sensitivecontent classification problem that we tackle in this paper. Our Contributions: In this paper, we employ FL for the first time for sensitive content classification. We show how to develop a robust FL method for classifying arbitrary URLs that may contain GDPR sensitive content. Such a FL-based solution allows building a distributed classifier that can be offered to end-users in the form of a web browser add-on in order to 1) warn them before and while they navigate into such websites, especially when they are populated by trackers, and 2) allow them to contribute new labels, e.g., such as health-related websites about COVID-19, and thus keeping the classifier always up-to-date. To the best of our knowledge this method represents the first use of Federated Learning in such task. Our second major contribution is the development of a reputation score for protecting our FL-based solution from poisoning attacks [2, 4] . Our approach is based on a novel combination of subjective logic [29] with residual-based attack detection. Our third contribution is the development of an extensive theoretical and experimental performance evaluation framework for studying the accuracy, convergence, and resilience to attacks of our proposed mechanism. Our final contribution is the implementation of our methods in a prototype system called EITR (standing for "Elephant In the Room" of privacy) and our preliminary experimental validation with real users tasked to provide fresh labels for the accurate classification of COVID-19 related URLs. Our findings: Using a combination of theoretical analysis, tracedriven simulation, and experimentation with real users, we have conducted the following: • Demonstrate experimentally that our FL-based classifier achieves a comparable accuracy with the centralised one presented in [39] . • Prove analytically that subjected to a poisoning attack from malicious users, our reputation-based robust aggregation built around subjective logic, converges to a near-optimal solution of the corresponding Byzantine fault tolerance problem under mild assumptions. The resulting performance gap is determined by the percentage of malicious users. • Evaluate experimentally our solution against state-of-the-art algorithms such as Federated Averaging, Median, FoolsGold, Residual-base aggregation, and show that our algorithm is robust under Byzantine attacks by using different real-world datasets. 
We show that our solution outperforms these popular solutions in terms of convergence speed by a factor ranging from 3.3× to 5.3× while achieving at least a 2.2%-3.4% higher average accuracy. In terms of Attack Success Rate (ASR) our method yields an average 3.8%-15.2% improvement against all other methods, with the exception of FoolsGold that, however, is the slowest one to converge and with the lowest accuracy. • Validate using our EITR add-on that our FL-based solution can quickly learn to classify health-related sites about COVID-19 even in view of noisy/inconsistent input provided by real users. The remainder of the article is structured as follows: Section 2 introduces the background for our topic. Section 3 presents our reputations scheme for FL-based sensitive content classification as well as its theoretical analysis and guarantees. Section 4 covers our extensive performance evaluation against the state-of-the-art and Section 5 some preliminary results from our EITR add-on. Section 6 concludes the paper and points to on-going and future work. Matic et al. [39] have shown how to develop a text classifier able to detect URLs that contain sensitive content. This classifier is centralised and was developed in order to conduct a one-off offline study aimed at estimating the percentage of the web that includes such content. Despite achieving an accuracy of at least 88%, utilising a high-quality training set meticulously collected by filtering the Curlie web-taxonomy project [13] , this classifier cannot be used "as is" to protect real users that visit sensitive URLs populated by third party tracking services. From offline to online: The classifier in [39] was trained using a dataset of 156k sensitive URLs. Despite being the largest dataset of its type in the recent literature, this dataset is static and thus represents sensitive topics up to the time of its collection. This does not mean, of course, that a classifier trained with this data won't be able to accurately classify new URLs pertaining to those sensitive categories. This owes to the fact that categories such as Health, involve content and terms that do not change radically with time. Of course, new types of sensitive content may appear that, for whatever reason, may not be accurately classified using features extracted from past content of the same sensitive category. Content pertaining to the recent COVID-19 pandemic is such an example. Although the Health category had 74,764 URLs in the training set of [39] which lead to a classification accuracy of 88% for Health, as we will show later (Section 5.4), the classifier of [39] classifies accurately as Health only 53.13% of the COVID-19 URLs with which we tested it. This should not come as a surprise since the dataset of [39] corresponds to content generated before the first months of 2020, during which COVID- 19 was not yet a popular topic. Therefore, we need to find a way to update an existing classifier so that it remains accurate as new content appears pertaining to existing or new sensitive categories. From centralised to distributed: A natural way to keep a classifier up-to-date is to ask end-users to label new sensitive URLs as they encounter them. End-users can report back to a centralised server such URLs which can then be used to retrain the classification model. 
This, however, entails obvious privacy challenges of "Catch-22" nature, since to protect users by warning them about the presence of trackers on sensitive URLs, they would first be required to report to a potentially untrusted centralised server that they visit such URLs. Federated Learning (FL), as already mentioned, is a promising solution for avoiding the above Catch22 by conducting a distributed, albeit, privacy-preserving, model training. In an FL approach to our problem, users would label new URLs locally, e.g., a COVID-19 URL as Health, retrain the classifier model locally, and then send model updates, not labelled data, to a centralised server that collects such updates from all users, compiles and redistributes the new version of the model back to them. In Section 3 we will show how to develop a distributed version of the sensitive topic classifier of [39] using FL. The price, however, of using FL, is that the distributed learning group becomes vulnerable to attacks, such as "label-flipping" poisoning attacks discussed in Section 4. This paper develops a reputation scheme for mitigating such attacks. Other types of attacks and measures for preserving the privacy of users that participate in a FL-based classification system for sensitive content are discussed in Section 6. Privacy preserving crowdsourcing: Similar challenges to the ones discussed in the previous paragraph have been faced in services like the Price $heriff [27] and eyeWnder [26] that have used crowdsourcing to detect online price discrimination and targeted advertising, respectively. Secure Multi-Party Computation (SMPC) techniques such as private -means [51] and count-min sketches [12] were used in these cases in order to allow end-users to report data in a centralised server in a secure and privacy-preserving manner. The centralised computation performed by Price $heriff and eye-Wnder is not of ML nature, thus leaving data anonymisation as the main challenge, for which SMPC is a good fit. Classifying content as sensitive or not is a more complex ML-based algorithm for which FL is a more natural solution than SMPC. General works on federated learning: Federated Learning [33, 40] is a compelling technique for training large-scale distributed machine learning models while maintaining security and privacy. The motivation for FL is that local training data is always kept by the clients and the server has no access to the data. Due to this benefit that alleviates privacy concerns, several corporations have utilised FL approaches in some applications to transfer training from a central to a distributed way [6, 40] . In mobile devices, FL is used to predict keyboard input [23] , human mobility [17] and behaviour for the Internet of Things [58] . FL is also applied in healthcare to predict disease [8, 25] , detect patient similarity [36] to overcome the privacy constrain. For the classification, FL is not only implemented for image classification [1] but also text classification [37] . Resilience to poisoning attacks: Owing to its nature [2, 4] , FL is vulnerable to poisoning attacks, such as label flipping [18] and backdoor attacks [2] and, therefore, several defence methods have been developed [18, 19, 57] . While these state-of-the-art approaches perform excellently in some scenarios, they are not without limitations. First, they are unsuitable for our sensitive content classification, which necessitates a classifier that responds very fast to fresh sensitive information appearing on the internet. 
In existing methods, the primary objective is to achieve a high classification accuracy. This is achieved via statistical analysis of client-supplied model updates and the discarding of questionable outliers before the aggregation stage. However, since the server distrusts everyone by default, even if an honest client discovers some fresh sensitive labels, its corresponding updates may be discarded or assigned low weights until more clients start discovering such new labels. This results in an inefficiently slow learning rate from new data. Second, recent studies [2, 4] have shown that existing Byzantine-robust FL methods are still vulnerable to local model poisoning since they are stateless (forgetful) and do not track information from previous aggregation rounds. Thus, an attacker can attack across time, eventually leading to divergence towards ineffective models [22]. For example, [31] recently showed that even after infinite training epochs, any aggregation which is neglectful of the history cannot converge to an efficient solution. The preceding studies demonstrate the importance of incorporating clients' previous long-term performance for evaluating their trustworthiness, in order to establish trust between the server and the users. Hence, in the next sections we show how to design a robust aggregation based on the historical reputation of clients. In this section, we first show how to build an FL-based classifier for sensitive content. Moving on, we design a reputation score for protecting against poisoning attacks. We analyse theoretically the combined FL/reputation-based solution and establish convergence and accuracy guarantees under common operating assumptions. Table 1 presents the notation that we will use in the remainder of the paper. Our goal in using FL for the sensitive classification task is to learn a global model for the server by aggregating the local models of clients, which have the same structure as the global model. Here, we apply the model mentioned in Section 4 as the global model to predict five sensitive categories (Ethnicity, Health, Political Beliefs, Religion and Sexual Orientation) and a non-sensitive category. Suppose we have $K$ clients participating in our classification training task and the dataset $\mathcal{D} = \bigcup_{k=1}^{K} \mathcal{D}_k$, where $\mathcal{D}_k \sim (\mu_k, \sigma_k^2)$ denotes the local data of client $k$, drawn from a non-independent and non-identically distributed (non-IID) source with mean $\mu_k$ and standard deviation $\sigma_k$. In our task, the clients' data is the textual content of URLs stripped of HTML tags. The objective function of FL, $\mathcal{L}: \mathbb{R}^d \rightarrow \mathbb{R}$, which in our task is the negative log-likelihood loss, can be described as
$$\mathcal{L}(\mathbf{w}) = \sum_{k=1}^{K} \frac{n_k}{n}\, \mathcal{L}_k(\mathbf{w}; \mathcal{D}_k),$$
where $\mathcal{L}_k(\mathbf{w}; \mathcal{D}_k)$ is the cost function of parameter $\mathbf{w} \in \Theta \subseteq \mathbb{R}^d$, $n_k$ is the number of local samples of client $k$ and $n = \sum_{k=1}^{K} n_k$. Here we assume $\Theta$ is a compact convex domain with bounded diameter. Therefore, the task becomes
$$\mathbf{w}^* = \arg\min_{\mathbf{w} \in \Theta} \mathcal{L}(\mathbf{w}).$$
To find the optimal $\mathbf{w}^*$, stochastic gradient descent (SGD) is carried out to optimise the objective function. During the broadcast phase, the server broadcasts the classification task and training instructions to clients. Then, the clients apply the following standard pre-processing steps to the webpage content: transformation of all letters into lowercase and removal of stop words. Next, the clients extract the top 1k features utilising Term Frequency-Inverse Document Frequency (TF-IDF) [49], as in [39], and send a dictionary with feature names and values to the server.
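To make this client-side step concrete, the following is a minimal sketch of the pre-processing and TF-IDF feature extraction that a client could run locally before reporting its feature dictionary; the helper name and the use of scikit-learn are our own illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (our assumption, not the authors' exact code) of the local
# pre-processing and TF-IDF step: lowercase the page text, drop stop words,
# keep the top-1k TF-IDF features and report {feature: score} to the server.
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_local_features(pages, top_k=1000):
    """pages: list of webpage texts already stripped of HTML tags."""
    vectorizer = TfidfVectorizer(lowercase=True,        # letters to lowercase
                                 stop_words="english",  # remove stop words
                                 max_features=top_k)    # keep top-1k terms
    tfidf = vectorizer.fit_transform(pages)
    # Aggregate per-term scores over the client's local pages.
    scores = tfidf.sum(axis=0).A1
    return dict(zip(vectorizer.get_feature_names_out(), scores))

# Each client sends its dictionary to the server, which merges them
# (weighted by local sample counts) into the final top-1k feature list.
```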
After receiving all the dictionaries, the server ranks the features by aggregating their values, weighted by the percentage of each client's training sample size, and returns a list of the top 1k features to the clients. Finally, clients prepare the input for training based on this list and their own dataset. At iteration $t$, client $k$ receives the current global model and then, following the training instructions from the server, trains the local model on its training data and optimises $\mathbf{w}_k^{t} = \arg\min_{\mathbf{w}} \mathcal{L}_k(\mathbf{w})$ by using SGD:
$$\mathbf{w}_k^{t} \leftarrow \mathbf{w}_k^{t} - \eta\, \nabla \mathcal{L}_k(\mathbf{w}_k^{t}; x_k^{i}),$$
where $x_k^{i}$ and $n_k$ denote the $i$-th sample and the number of samples of client $k$ respectively, and $\eta$ is the fixed positive step size of SGD. In every iteration, after finishing the training process, the clients send their local updates back to the server. Then, the server computes a new global model update by combining the local model updates via an aggregation method AGG as follows:
$$\mathbf{w}^{t} = \mathrm{AGG}\left(\mathbf{w}_1^{t}, \dots, \mathbf{w}_K^{t}\right).$$
Here we utilise the basic aggregation method (FedAvg) [40], which uses the fraction of each client's training sample size over the total number of training samples as the averaging weights:
$$\mathbf{w}^{t} = \sum_{k=1}^{K} \frac{n_k}{n}\, \mathbf{w}_k^{t}.$$
We introduce other, robust aggregation methods in the next subsection. Subsequently, the server uses the global model update to renew the global model. Based on this FL framework for classifying sensitive content, Figure 1 shows an initial result: the impact of increasing the total number of clients on average accuracy with a fixed text dataset (see details about this dataset in Section 4). We show results on average accuracy as well as results for the Health and Religion sensitive categories as described in [39]. A first observation from these results is that when a fixed dataset is divided into multiple segments and distributed to more clients, the model's accuracy decreases. This is due to the fact that each client has less data for training. Compared to the centralised classifier using the same data, the FL classifier is slightly less accurate. Losing accuracy is normal when training is distributed to a larger number of clients. The average difference in average accuracy between the FL classifier and the centralised classifier is 5.76%, and the overall loss (3%) throughout distribution is acceptable. Looking at the particular sensitive categories of Health and Religion, we see that the FL-based classifier achieves an accuracy very close to the corresponding one of the centralised classifier for these categories (on average 0.8533 vs. 0.88 and 0.9366 vs. 0.94, respectively).
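The server-side FedAvg step of this baseline (a sample-size-weighted average of the clients' local models) can be sketched as follows; the function name and tensor layout are illustrative assumptions rather than the authors' code.

```python
# Minimal FedAvg sketch (illustrative assumption): average client model
# parameters, weighting each client by its share of the total training samples.
import torch

def fedavg(client_states, client_sizes):
    """client_states: list of state_dicts with identical keys;
       client_sizes:  list of local sample counts n_k."""
    total = float(sum(client_sizes))
    weights = torch.tensor([n / total for n in client_sizes])
    global_state = {}
    for key in client_states[0]:
        stacked = torch.stack([s[key].float() for s in client_states])
        # Broadcast the per-client weights over the parameter dimensions
        # and take the weighted average over the client dimension.
        w = weights.view(-1, *[1] * (stacked.dim() - 1))
        global_state[key] = (w * stacked).sum(dim=0)
    return global_state
```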
Figure 2 shows the overview of our reputation-based aggregation algorithm, which consists of three components: the attack detection scheme, the reputation model, and the aggregation module. The attack detection scheme re-scales and rectifies damaging updates received from clients. Then, the reputation model calculates each client's reputation based on their past detection results. Finally, the aggregation module computes the global model's final updates by averaging the updates of clients using their reputation scores as weights. We detail each component in the following subsections. 3.2.1 Attack detection scheme. The attack detection scheme aims to identify and eliminate suspicious updates that may have been the result of an attack. To start with, when all model updates from clients during one iteration arrive at the server, the server uses Algorithm 1 to rescale the range of values of these parameters. This restriction on the value range aims not only to eliminate the impact of abnormal updates from attackers, but also to limit the slope of the repeated median regression. Considering the $j$-th parameter in round $t$ from all the participants, we calculate the standard deviation $\sigma_{j,t}$ of this series. Then we sort the values in ascending order and determine the range by subtracting the lowest value from the highest one. If the result is above a threshold, we rescale the highest and lowest values by deducting and adding one standard deviation, respectively, to further bound their range. Then, a robust regression [50] is carried out to identify outliers among the updates in the current round. Outlier detection is a well-established topic in statistics, and robust regression methods often handle this issue using the median estimator. Median-based aggregation methods have a rich and longstanding history in the area of robust statistics [42]. However, the methods developed in traditional robust statistics can only withstand a small fraction of Byzantine clients, resulting in a low "breakdown point" [38]. Different from many other variations of the univariate median, the repeated median [50] is impervious to atypical points even when their percentage is nearly 50%. The repeated median is defined as a modified U-statistic, and the concept behind it is to utilise a succession of partial medians for computing an approximation $\hat{\theta}$ of the parameter $\theta$: for $r \in \mathbb{N}$, the value of the parameter $\theta(x_1, \cdots, x_r)$ is determined by a subset of $r$ data points $x_1, \cdots, x_r$, and
$$\hat{\theta} = \underset{i_1}{\operatorname{median}}\; \underset{i_2 \neq i_1}{\operatorname{median}} \cdots \underset{i_r \neq i_1, \ldots, i_{r-1}}{\operatorname{median}}\; \theta\left(x_{i_1}, \cdots, x_{i_r}\right).$$
In our case, the intercept $\hat{\alpha}_j$ and slope $\hat{\beta}_j$ are estimated by the repeated median as below:
$$\hat{\beta}_j = \underset{i}{\operatorname{median}}\; \underset{k \neq i}{\operatorname{median}}\; S_j(i, k), \qquad \hat{\alpha}_j = \underset{i}{\operatorname{median}}\left(y_{i,j} - \hat{\beta}_j\, x_{i,j}\right),$$
where $S_j(i, k) = \frac{y_{k,j} - y_{i,j}}{x_{k,j} - x_{i,j}}$. Next, we employ the IRLS scheme [55] to generate each parameter's confidence score based on the normalised residual from the repeated median regression, computed from the confidence interval $\Psi(\cdot)$ and the hat matrix, as is also done in the residual-based aggregation method of [18]. The distance between a point and the robust line is described by the confidence score derived from the normalised residual, which can be used to evaluate whether the point is anomalous. Following the computation of the parameter's confidence score, and in light of the fact that some attackers want to generate updates with abnormally large magnitude in order to boost the damage, a useful protection is to identify low confidence values based on a threshold. Once the server recognises an update $w_{k,j}$ of client $k$ with a confidence value below this threshold, rather than altering this update to the repeated median estimate, our technique replaces it with the median of the $j$-th parameter across clients. With the above, we not only bound the range of updates, but also improve the aggregation by introducing a robust estimator.
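To illustrate the detection steps above, here is a compact sketch of one round of the scheme for a single model parameter: Algorithm-1-style rescaling of the extremes, a repeated-median line fitted over the sorted values, and replacement of low-confidence values by the median. The thresholds, the use of rank indices as the regression x-axis, and the simplified confidence formula are our own assumptions for illustration, not the paper's exact Algorithm 1.

```python
# Illustrative sketch (our assumptions, not the paper's exact algorithm) of the
# per-parameter attack detection: clip the extremes, fit a repeated-median
# line over the sorted values, and replace low-confidence points by the median.
import numpy as np

def repeated_median_line(x, y):
    """Siegel's repeated median estimator for slope and intercept."""
    n = len(x)
    slopes = [np.median([(y[k] - y[i]) / (x[k] - x[i])
                         for k in range(n) if k != i]) for i in range(n)]
    beta = np.median(slopes)
    alpha = np.median(y - beta * x)
    return alpha, beta

def detect_and_rectify(values, range_threshold=1.0, conf_threshold=0.2):
    """values: one model parameter, one entry per client, for this round."""
    v = np.asarray(values, dtype=float).copy()
    std = v.std() + 1e-12
    hi, lo = v.argmax(), v.argmin()
    # Algorithm-1-style rescaling: if the spread is too large, pull the
    # extremes back by one standard deviation each.
    if v[hi] - v[lo] > range_threshold:
        v[hi] -= std
        v[lo] += std
    # Robust line through the sorted values, regressed against their ranks.
    order = np.argsort(v)
    x = np.arange(len(v), dtype=float)
    alpha, beta = repeated_median_line(x, v[order])
    residuals = np.abs(v[order] - (alpha + beta * x)) / std
    confidence = 1.0 / (1.0 + residuals)   # simplified IRLS-style confidence
    rectified = v.copy()
    rejected = order[confidence < conf_threshold]
    rectified[rejected] = np.median(v)     # rectified values = negative observations
    return rectified, set(int(i) for i in rejected)
```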
3.2.2 Reputation model. During the aggregation phase in FL, we use a subjective logic model to produce client reputation scores. The subjective logic model is a subset of probabilistic logic that depicts probability values of belief and disbelief as degrees of uncertainty [29]. In the subjective logic model, the reputation score for client $k$ in iteration $t$ corresponds to a subjective belief in the dependability of the client's behaviour, as measured by the belief metric opinion [30]. An opinion comprises three elements: belief $b$, disbelief $d$ and uncertainty $u$, with the restrictions that $b + d + u = 1$ and $b, d, u \in [0, 1]$. The reputation score may be calculated as the expected value of an opinion, which can be regarded as the degree of trustworthiness in client $k$. As a result, the value of the client's reputation is defined as follows:
$$R_k^t = E\left(b_k^t, d_k^t, u_k^t\right) = b_k^t + a\, u_k^t,$$
where $a \in [0, 1]$ denotes the prior probability in the absence of belief, which reflects the fraction of uncertainty that may be converted to belief. On the other side, the distinct observations determined by the rectification phase in our Algorithm 2 are used to count belief, disbelief, and uncertainty opinions. A positive observation indicates that the update from client $k$ is accepted in the current round, whereas a negative observation indicates that the update is rejected. As a consequence, positive observations boost the client's reputation, and vice versa. To penalise the negative observations arising from unreliable updates, a higher weight is assigned to negative observations than to positive observations, with the two weights constrained to sum to 1. Therefore, we use the Beta distribution below:
$$\mathrm{Beta}(p \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\; p^{\alpha - 1} (1 - p)^{\beta - 1},$$
with the constraints $0 \le p \le 1$, parameters $\alpha > 0$, $\beta > 0$, and $p \ne 0$ if $\alpha < 1$ and $p \ne 1$ if $\beta < 1$. The parameters $\alpha$ and $\beta$, which represent the weighted positive and negative observations respectively, are computed from the observation counts together with the non-informative prior weight $W$, whose default value is 2 [29]. As a consequence, the expected value of the Beta distribution, which also stands for the reputation value, can be calculated as follows:
$$E\left[\mathrm{Beta}(p \mid \alpha, \beta)\right] = \frac{\alpha}{\alpha + \beta}.$$
Based on Equation (5) and Equation (8), we can then derive the reputation value of client $k$ in terms of its accepted and rejected updates. In addition, in order to take the client's historical reputation values from previous rounds into consideration, a time decay mechanism is included to lower the relevance of past performances without disregarding their influence. In other words, the reputation value from the most recent iteration contributes the most to the reputation model; we use exponential time decay, so that the contribution of the reputation obtained in round $\tau$ decays exponentially with $t - \tau$. We also include a sliding window with a fixed window length, which allows us to compute a reputation over a certain time interval rather than over the entire training procedure. We remove expired tuples with timestamps outside the window period during computation, since they cannot provide meaningful information for the reputation. Hence, the final reputation score $\tilde{R}_k^t$ is the time-decayed combination of the per-round reputation values that fall inside the window.
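The sketch below shows one way to turn the per-round accept/reject observations into a decayed, windowed reputation score in the spirit of the model above; the specific observation weights, decay factor, window length and Beta-prior mapping are illustrative assumptions rather than the paper's exact equations.

```python
# Illustrative reputation sketch (our assumptions): per-round accept/reject
# observations -> Beta/subjective-logic expectation -> exponentially decayed,
# windowed reputation score used as the client's aggregation weight.
import math
from collections import deque

class ReputationTracker:
    def __init__(self, neg_weight=0.7, prior=0.5, prior_weight=2.0,
                 decay=0.5, window=10):
        self.neg_weight = neg_weight        # penalty weight on rejected updates
        self.pos_weight = 1.0 - neg_weight  # reward weight on accepted updates
        self.prior = prior                  # prior probability a
        self.prior_weight = prior_weight    # non-informative prior weight W
        self.decay = decay                  # exponential time-decay rate
        self.window = window                # sliding-window length (rounds)
        self.history = {}                   # client -> deque of (round, score)

    def update(self, client, n_accepted, n_rejected, round_idx):
        # Beta parameters from weighted positive/negative observations.
        alpha = self.pos_weight * n_accepted + self.prior_weight * self.prior
        beta = self.neg_weight * n_rejected + self.prior_weight * (1 - self.prior)
        score = alpha / (alpha + beta)      # expected value of Beta(alpha, beta)
        hist = self.history.setdefault(client, deque())
        hist.append((round_idx, score))
        while hist and round_idx - hist[0][0] >= self.window:
            hist.popleft()                  # drop rounds outside the window

    def reputation(self, client, round_idx):
        hist = self.history.get(client, [])
        num = sum(math.exp(-self.decay * (round_idx - t)) * s for t, s in hist)
        den = sum(math.exp(-self.decay * (round_idx - t)) for t, s in hist)
        return num / den if den else self.prior
```

In the aggregation step, such scores would take the place of the sample-size weights used by FedAvg.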
To demonstrate how the reputation model evolves, we consider four scenarios where each client: (i) only attacks once, at the same iteration, (ii) attacks continuously after launching an attack at the same iteration, (iii) only attacks once, at a different iteration, (iv) attacks continuously after launching an attack at a different iteration. Here, clients conduct attacks as described in Section 4.1.2 by training the local model on polluted data, and our attack detection mechanism identifies them. Figure 3 shows the first two scenarios, (i) and (ii), for clients with different numbers of parameters in their local models, under single and continuous attack. In Figure 3(a), all of the clients attack only once, at the third iteration. When they start attacking, their reputation score plummets dramatically. In both scenarios, we observe that the client with more parameters has a larger relative decline in reputation score. This is also compatible with Corollary 3.2 in Section 3.3, that is, increasing the number of parameters (actual weights) in the global model results in a lower error rate. Figure 4 shows the last two scenarios, (iii) and (iv), with clients who have 20 parameters in their local models, under single and continuous attack. In Figure 4(a), each client launches only a single attack, each at a different iteration. We observe that even a single attack leads to at least a 25.11% relative decrease in reputation score. In Figure 4(b), each client launches an attack at a different iteration and keeps attacking in the following iterations. We observe in the end that 80% of their reputation scores are below 0.5, nearly half of the reputation score that honest clients have, implying that the damage that they can inflict during the aggregation process is greatly reduced. 3.2.3 Algorithm. Algorithm 2 explains our aggregation method based on the attack detection scheme and the subjective logic reputation model. First, the server sends all clients the pretrained global model with initial parameters. Then, using their own data samples, clients train the global model locally and send the trained parameters back to the server. At this point, the server executes the attack detection scheme. In round $t$, if the $j$-th update parameter $w_{k,j}^t$ from client $k$ has been rectified by the re-weighting scheme in Section 3.2.1 to the median value, the server regards it as a negative observation, whereas no rectification represents a positive observation. Then, the server punishes the negative observation by reducing the corresponding client's reputation. Both types of observations are accumulated over all the parameters of client $k$ to obtain the reputation value $\tilde{R}_k^t$ in round $t$ for client $k$, and likewise for all the other clients. After the server gets the corrected updates and the renewed reputation of each client, it aggregates the updates using the reputation scores as averaging weights to obtain the global model update for the current iteration. In this way, even over many training rounds, the attackers remain incapable of shifting parameters notably from the target direction, which ensures the quality of the resulting global model, as will be demonstrated experimentally and analytically next. In addition, to demonstrate the impact of model hyper-parameters, we grid search the associated hyper-parameters in Appendix B. The result demonstrates that our approach is very stable and efficient in terms of hyper-parameter selection, and it achieves a high degree of precision. Furthermore, the result is compatible with the theoretical analysis in Section 3.3. In this subsection (Section 3.3) we prove the convergence of our reputation-based aggregation method. Our major results are Theorem 3.1 and Corollary 3.2, which state that convergence is guaranteed in bounded time. As will be shown later, the performance of our algorithm in terms of average accuracy and convergence is consistent with this theoretical analysis. We start by stating our assumptions, which are standard and common for such types of results and in accordance with recent works such as [56, 57]. Assumption 1 (Smoothness). The loss function of each client $\mathcal{L}_k(\mathbf{w}, \cdot)$, as well as the server loss $\mathcal{L}(\mathbf{w})$, is $L$-smooth on $\mathbb{R}^d$, which means it is continuously differentiable and its gradient is Lipschitz-continuous with Lipschitz constant $L > 0$:
$$\left\|\nabla\mathcal{L}(\mathbf{w}_1) - \nabla\mathcal{L}(\mathbf{w}_2)\right\| \le L\, \left\|\mathbf{w}_1 - \mathbf{w}_2\right\| \quad \forall\, \mathbf{w}_1, \mathbf{w}_2.$$
Suppose that the fraction of attackers among all clients is $\epsilon$ and that all clients in the system participate in every training iteration. Let $\eta$ be the learning rate ($\eta < 1/L$), and let $\hat{\cdot}$ denote the maximum of the corresponding per-client quantity over all $K$ clients. For every client $k$ we denote
$$m_k(\mathbf{w}^t) = \begin{cases} * & \text{if } k \text{ is a malicious client,} \\ \nabla\mathcal{L}_k(\mathbf{w}^t) & \text{if } k \text{ is an honest client,} \end{cases}$$
where $*$ stands for an arbitrary value from the malicious clients. Considering the assumptions and lemmas presented in Appendix C, we have Theorem 3.1: under Assumptions 1, 2, 3 and 4, there exists a constant such that, after $T$ rounds, Algorithm 2 converges with high probability, with the failure probability and the residual error expressed through $\Phi(\cdot)$, the cumulative distribution function of the Wald distribution. From Theorem 3.1 we also obtain Corollary 3.2, which bounds the resulting error rate. Remark 1. Observing the experimental results in Figures 3 and 4 and Appendix B, when varying the hyper-parameters of the reputation model the results are consistent with this error rate.
Remark 2. Derived from Corollary 3.2 and Remark 1, there is a trade-off between convergence speed and error rate according to the level of reward and punishment applied by the reputation model. This trade-off is mainly based on the fact that if the model penalises the bad behaviour of clients heavily, it decreases their reputation dramatically, so the model takes a longer time to converge. On the other hand, mitigating the punishment to increase the reward leads to an increase in the error rate. The objectives of the experimental evaluation are the following: (a) evaluate the performance of our aggregation method against other state-of-the-art robust aggregation methods, (b) benchmark it in three different scenarios, namely no attack, under label flipping attack and under backdoor attack, (c) do so using different real-world datasets, namely SURL and CIFAR-10, and, in addition, (d) observe whether the results are consistent with our theoretical analysis. 4.1.1 Datasets. We evaluate our aggregation method on the text dataset SURL from [39]. SURL is our main evaluation dataset since it relates to sensitive content classification. The SURL dataset comes from a crowdsourced taxonomy in the Curlie project, containing six categories of URLs: five sensitive categories as defined by GDPR (i.e., Health, Politics, Religion, Sexual Orientation, Ethnicity) and one for non-sensitive URLs, with a total of 442,190 webpages. The numbers of URLs in sensitive and non-sensitive categories are equally balanced. Each sample contains the content, metadata and a class label of the webpage. For the SURL text classification task, we train a neural network with three fully connected layers and a final softmax output layer. We also employ the image dataset CIFAR-10 [35] for completeness of the evaluation of FL methods on typical datasets. CIFAR-10 is irrelevant in terms of sensitive content classification, but we use it as a secondary means of assessing the robustness of our reputation scheme to poisoning attacks. The CIFAR-10 dataset is a 32 × 32 colour image dataset that includes ten classes, with a total of 50k images for training and 10k images for testing. Here we use a ResNet-18 [24] model pretrained on ImageNet [14]. In order to fulfil the fundamental FL setting of a heterogeneous and unbalanced dataset, we sample from a Dirichlet distribution [41] with concentration parameter 0.9 as in [2], which controls the imbalance level of the dataset, and then assign the sampled fraction of the samples of each class to each client, with the intention of generating non-IID and unbalanced data partitions.
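A minimal sketch of this kind of Dirichlet-based partitioning is shown below; the function name and the direct use of NumPy are our own illustrative assumptions, with 0.9 as the concentration value mentioned above.

```python
# Illustrative non-IID partitioning sketch (our assumption): for every class,
# draw per-client proportions from Dirichlet(0.9) and split that class's
# sample indices accordingly, yielding unbalanced, heterogeneous client shards.
import numpy as np

def dirichlet_partition(labels, num_clients, concentration=0.9, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet([concentration] * num_clients)
        # Cumulative cut points that split this class across clients.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, shard in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(shard.tolist())
    return client_indices
```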
4.1.2 Threat Model. The threat model under which we demonstrate the effectiveness of the evaluated FL methods is as follows. Attack capability: In the FL setting, the malicious clients have complete control over their local training data, training process and training hyper-parameters, e.g., the learning rate, iterations and batch size. They can pollute the training data as well as the parameters of the trained model before submitting it to the server, but cannot impact the training process of other clients. We follow the common practice in the computer security field of overrating the attacker's capability rather than underrating it, so we limit our analysis to worst-case scenarios. There, an attacker has perfect knowledge about the learning algorithm, the loss function and the training data, and is able to inspect the global model parameters. However, attackers still have to train with the model published by the server, thus complying on their local data with the training scheme prescribed by FL. Furthermore, the percentage of Byzantine clients is an important factor that determines the level of success of the attack. We assume that the number of attackers is less than the number of honest clients, which is a common setting in methods [16] similar to the ones we evaluate and compare our method with. Attack strategy: We study two typical attack strategies, namely (i) the label flipping attack [3] and (ii) the backdoor attack [2]. We apply these attacks to two different datasets, namely the aforementioned SURL and CIFAR-10. In a label flipping attack, the attacker flips the labels of training samples to a targeted label and trains the model accordingly. In particular, in the SURL dataset, attackers change the label "Health" to "Non-sensitive"; in the CIFAR-10 dataset, attackers switch the label of "cat" images to "dog". A successful label flipping attack produces a model that misclassifies the polluted data, i.e., the samples whose labels have been altered to the target label. In a backdoor attack, attackers inject a designed pattern into their local data and train on these manipulated data together with clean data, in order to develop a local model that learns to recognise such a pattern. We realise backdoor attacks by inserting the top 10 most frequent words, with their frequencies, of the "Health" category in the SURL dataset, and a special pixel pattern [18] in the CIFAR-10 dataset, at training time; the backdoor targets are the labels "Non-sensitive" and "bird", respectively. A successful backdoor attack yields a global model that predicts the backdoor target label for data containing the specific pattern. For both attacks, instead of a single-shot attack where an attacker only attacks in one round during training, we strengthen the attacker with a repeated attack schedule in which it submits malicious updates in every round of the training process. Moreover, we allow extra training epochs for an attacker, namely the ability to train the local models for 5 more epochs. We compare the performance of our aggregation method with the existing "vanilla" aggregation FedAvg [40] and popular robust aggregation methods, namely Coordinate-wise median [57], FoolsGold [19] and Residual-based re-weighting [18]. FedAvg is a state-of-the-art FL method that demonstrates impressive empirical performance in non-adversarial settings [40]. Nevertheless, even a single adversarial client can easily control the global model in FedAvg [5]. This method averages the local model updates of clients into a global model update, weighted by the fraction of each client's training sample size over the total number of training samples. We use it as a baseline to assess the performance of our method. Median uses the coordinate-wise median for aggregation: after receiving the updates in round $t$, the global update is equal to the coordinate-wise median of the updates, where the median is the one-dimensional median. FoolsGold presents a strong defence against attacks in FL based on a similarity metric. This approach identifies attackers based on the similarity of client updates and decreases the aggregation weights of participating parties that frequently provide indistinguishable gradient updates, while keeping the weights of parties that offer distinct gradient updates. It is an effective defence, but it requires more iterations to converge to an acceptable accuracy.
Residual-based re-weighting weights each local model by accumulating the outcome of its residual-based parameter confidence multiplying the standard deviation of parameter based on the robust regression through all the parameters of this local model. In our reputation-based aggregation method, we implement the same re-weighting scheme IRLS [55] as residual-based aggregation, but choose the collection of reputation as the weights of clients' local models. In addition, there is existence of targeted attacks that aim to attack a specific label while keeping the accuracy of classification on other labels unaltered. Therefore, instead of Avg-ACC, we choose the attack success rate (ASR) to measure how many of the samples that are attacked, are classified as the target label chosen by a malicious client, namely: ASR = # successfully attacked samples # attacked samples A robust federated aggregation method would obtain higher Avg-ACC as well as a lower ASR under poisoning attacks. An ideal aggregation method can achieve 100% Avg-ACC and has the ASR as low as the fraction of attacked samples from the target label. For the malicious attack, we assume 30% of the clients are malicious for SURL and CIFAR-10 dataset as in [5] , which is also a common byzantine consensus threshold for resistance to failures is in a typical distributed system [10] . For the server-side setting, in order to evaluate the reliability of the local model updates sent by client to server, we assume that the server has the ability to look into and verify the critical properties of the updates from clients before aggregation. Furthermore, there are some secure, server-side techniques to preserve the privacy of client updates by disguising them with noise [7] . We can not yet assume obscuring updates on the server-side, so we are able to employ our detection scheme by tracking fraudulent updates so rather than addressing privacy protection, we emphasise the robustness of our method. In future work we will explore aggregating updates with differential privacy or similar privacy methods. Also, we only consider FL to be executed in a synchronous manner, as most existing FL defences require [5, 16, 56, 57] . For all of the reviewed aggregation methods under attack for both of the datasets, we perform 100 iterations in our main Dataset (a) SURL and 20 iterations in our secondary Dataset (b) CIFAR-10, similar to those utilised in prior FL research [34] , a batch size of 64 and 10 clients, same setting as in [2, 18, 57] , unless otherwise stated. More details of the training setting are presented in Appendix A. In Figure 5 , we analyse the performance of our method in the no attack scenario and compare the convergence and accuracy of our method with others during training. We show the training loss (left axis) and average accuracy (right axis) during 100 training iterations for 5 methods under the SURL dataset. Our aggregation starts with the lowest training loss and maintains it throughout the training process; in the meantime, it only takes 23 iterations to achieve 82% accuracy and then converge to 82.13%. In comparison, FedAvg and Residual-based have almost identical training loss and take 65 and 52 iterations to reach 82% accuracy and practically converge to 81.70% and 81.79%, respectively, which is 2.8× and 2.3× slower than our reputation method. Median reaches 81% at 83 rounds and after that converges to 79.94%, which amounts to a 3.6× slower converge rate than our method. 
Notably, FoolsGold is slow to converge and does not do so within 100 iterations, so our convergence rate is at least 4.3× better than FoolsGold's. This demonstrates that our reputation model benefits both convergence speed and accuracy, because it gives higher weight to trustworthy clients. In summary, we have two comments to make about these results: • Our method converges 2.3× to at least 4.3× faster than all competing state-of-the-art methods. • Our method is at least as good as, or outperforms, competing methods in terms of classification accuracy. The above validates that our reputation scheme is helpful even in the no attack scenario. This is due to the fact that previous Byzantine-robust federated learning techniques discard certain local model updates when aggregating them into the global model, whereas our algorithm partially retains them by assigning different weights that relate to the reputation of different clients, which, in the case of no attack, is high for all clients except those with low-quality updates. We begin by analysing the performance with a static percentage (30%) of attackers, and then move on to the performance with a varying percentage of attackers under label flipping and backdoor attacks. Static percentage of attackers: Figure 6 shows the convergence of the aforementioned methods under label flipping attack. Our method converges 3.7× to at least 5.3× faster than all competing state-of-the-art methods under attack, surpassing its performance in the no attack scenario. In addition, our method outperforms competing methods by at least 2.2% in terms of classification accuracy. Varied percentage of attackers: We then analyse the situation where the number of attackers increases. Figure 7 shows the change in performance metrics for a varying percentage of attackers, in both datasets, for the four evaluated methods other than FoolsGold under label flipping attack. The reason FoolsGold is not included in Figure 7 is that it performs consistently, with 56.53%-57.65% Avg-ACC in SURL, and with 9.90%-10.27% Avg-ACC and less than 1% ASR in CIFAR-10, when we increase the percentage of attackers from 10% to 50% under label flipping attack. This shows that FoolsGold is a strong defence, but needs more iterations to reach acceptable accuracy. Figure 7(a) shows that when the percentage of attackers is at most 30%, our method is resistant against label flipping attacks, with a small loss in accuracy and a consistent attack success rate. As the percentage of attackers approaches 50%, as expected, all defences become ineffective in mitigating the attack. The Avg-ACC of FedAvg and Median decreases linearly as their ASR increases, whereas the Avg-ACC of the residual-based methods remains steady (below 50% attackers). Figure 7(b) shows that, in CIFAR-10, accuracy decreases only slightly (with the exception of the Median approach) and the attack success rate increases slowly as the number of honest clients is reduced. Moreover, under label flipping attack throughout the whole process, our reputation-based method has the highest accuracy, outperforming other methods by at least 1% in both the SURL and CIFAR-10 datasets. Ours also has the lowest ASR, excluding FoolsGold, for both datasets, outperforming other methods on average by 15.2% and 10.2% in SURL and CIFAR-10, respectively. Static percentage of attackers: Figure 8 shows the convergence of the aforementioned methods under backdoor attack. As in the no attack and label flipping scenarios, our method converges 3.3× to at least 5× faster than all competing state-of-the-art methods.
In addition, our method outperforms competing methods by at least 3.4% in terms of classification accuracy. Varied percentage of attackers: We examine the scenario in which the percentage of attackers increases. Figure 9 shows the performance when varying percentage of attackers for the four evaluated methods except Foolsgold in both datasets under backdoor attack. The reason that Foolsgold is not included in Figure 9 is that it performs steadily with 59.22%-62.34% Avg-ACC and less than 7.74% ASR in SURL. As well, 9.95%-10.3% Avg-ACC and less than 1% ASR in CIFAR-10. In both, the is changed from 10% to 50% under backdoor attack. Figure 9 (a) demonstrates that under backdoor attack in SURL, all methods gain a linear increase in ASR and decrease in Avg-ACC when changes from 10% to 50%. Figure 9 (b) demonstrates that accuracy of the methods only decrease slightly with the exception of the Median,but the attack success rate doubles when the proportion of attackers increases from 20% to 30%, after which the ASR exceeds 70%. In both datasets, our reputation-based method has the highest accuracy throughout the process with a low attack success rate, which is on average 3.8% and 10.1% in SURL and CIFAR-10 respectively, both lower than other methods excluding Foolsgold. In conclusion, even under the two different attacks, our method: • converges 3.3× to at least 5.3× faster than all competing stateof-the-art methods. • Outperforms in at least 1% competing methods in terms of classification accuracy. • Yields an average 3.8%-15.2% improvement in ASR against all other methods, with the exception of FoolsGold which is the slowest one to converge and with the lowest accuracy. Furthermore, the result is consistent with the theoretical analysis: as the increases, so does the error rate. In this section we provide a high level description of our EITR system. We then present some preliminary results with real users demonstrating the ability of the system to quickly learn how to classify yet unseen sensitive content, in our case COVID-19 URLs pertaining to the category Health, even in view of inaccurate user input. A full in depth description of the system and its performance with more users over a longer time period is the topic of ongoing efforts and will be covered in a future work. The EITR system is based on the client-server model. The back-end server is responsible to distribute the initial classification model and the consequent updated model(s) to the clients and receive new annotations from the different clients of the system. The client(s) is in the form of a web browser add-on that is responsible to fetch and load the most recent global classification model to the users' browser from the back-end server. The loaded model can then be used to label website in real time into the 5 different sensitive topics as defined by GDPR, i.e., Religion, Health, Politics, Ethnicity and Sexual Orientation. Next, we provide more details for each part of the system. Back-end server: The back-end server is written in JavaScript using the node.js Express [15] framework. To build the initial classification model we use the dataset provided by Matic et al. [39] , and the TensorFlow [52] and Keras [32] machine learning libraries. The final trained model is then exported using the TensorFlow.js [53] library in order to be able to distribute it to the system's clients (browser add-on). 
The back-end server also includes additional functionalities such as the creation and distribution of users' tasks -a short list of URLs that the users need to visit and annotate -and an entry point that collects the resulting users annotations during the execution of the task. Web browser add-on: Currently the browser add-on only supports the Google Chrome browser and is implemented in JavaScript using the Google Chrome Extension APIs [21] . To handle the classification model the add-on utilises the TensorFlow.js [53] library to load, use and update the model. The main functionality of the add-on is to classify the visited website in real time and provide information to the user related to the predicted class as depicted in Figure 10 . The website classification is based on the metadata (included in the website HTML tag) and the visible text of the website. The add-on also allows users to provide their input related to their agreement or disagreement with the predicted class using a drop down list as depicted in Figure 10 with the label "Choose a new Class". The goal of the real-user experiment is to evaluate our federated reputation-based method on real user activity (instead of systematic tests), and demonstrate that even with real users with different comprehensions of the definition of sensitive information, our method can learn new content fast and achieve higher accuracy than centralised classifiers, which is compatible with our simulation experiment. For the setup, the participant is directed to visit the experiment website that provides the necessary information and instructions on the goal of our study, the definition of sensitive information provided by the current General Data Protection Regulation (GDPR) and how to participate in our study. In order to have access to the browser add-on and the installation instructions the user must in advance give explicit consent and accept the data privacy policy. Upon successful installation of the add-on, new users are asked to provide a valid email address (to contact them for the reward) and then receive their task, a list of 20 URLs, that they need to visit and provide their labels in order to successfully complete their task. The list of 20 URLs is sampled by the Dirichlet distribution with = 0.9 for each participant from a database, which includes 300 URLs with sensitive and non-sensitive content related to COVID-19. We have ensured compliance with the General Data Protection Regulation pertaining to collecting, handling and storing data generated by real users. To that end, we have acquired all the proper approvals from our institutions. Furthermore, the participants are directed to visit a pre-selected set of URLs selected by us to avoid collecting the actual visiting patterns of our users. In addition, the user input is only collected if and only if the user explicitly provide input to the drop-down list labelled "Choose a new Class" to avoid collecting the visiting patterns of the user accidentally while they are executing their tasks. Finally, we only use the users' email address to contact them for the reward. The mapping between the user input and their email address is based on a random identifier that is generated during the installation time of the add-on. Data Collection: We eventually have 50 users participate in our experiment. 
In order to evaluate the performance of the global model produced by the reputation-based FL method on real-user data, we define a methodology to label ground truth for COVID-19 related health websites. Ground truth methodology: To set the ground truth for COVID-19 sites relative to our sensitive or non-sensitive labels, we create a database of 100 websites, which we collect by searching on Google with the query "sensitive websites about COVID-19" and choosing the top 100 sites returned by the query. Then, four experts in the privacy field independently annotate them based on their professional expertise, in order to reach an agreement on whether each of those sites includes sensitive or non-sensitive content. Ground truth annotation: In order to evaluate the annotation of the 100 websites by the human experts, we calculate the inter-rater agreement among them using Fleiss' Kappa. We obtain a Fleiss' Kappa of 0.56, which is an acceptable agreement because values of Fleiss' Kappa above 0.5 are regarded as good. Furthermore, given that COVID-19 is a controversial issue, it is difficult for humans to agree on what constitutes sensitive content relating to it. Even so, we still attain a valid ground truth of 85 items belonging to the Health sensitive category with agreement ratings of at least 0.5. Note that we also classify the above 100 websites using the centralised classifier proposed in [39] and get only 53.13% accuracy. Result with real users: Figure 11 shows the results of accuracy and reputation score for the 50 real users in the experiment. Figure 11(a) shows that the majority of users have reputation scores falling in the intermediate range, with some having a very high reputation and a few having a very low reputation. This indicates the divergence in users' interpretation of sensitive information, as we expected. In Figure 11(b) we compare the accuracy of the centralised classifier and of the global model of our reputation-based FL approach for COVID-19 URLs. Despite the diversity of reputation scores of real users, our method converges as rapidly as in simulation and achieves an average accuracy of 80%, thereby verifying the quick convergence and high accuracy results presented in the previous sections. In other words, our method achieves good performance even when trained on labels provided by real users. This means that, as new sensitive content appears and/or is defined by GDPR or new upcoming legislation, we will be able to continue training our FL model for this type of task with quick convergence and good accuracy. The empirical results in Figure 11(b) also show that there is a quick convergence to the accuracy's stable value within a small number of iterations (around 30), in line with the theoretical results in Section 3. In this paper we showed how to use federated learning to implement a distributed classifier for GDPR-sensitive content that is robust to poisoning attacks. Having demonstrated the benefits of our approach in terms of convergence rate and accuracy against other state-of-the-art approaches, we implemented and validated it with real users using our EITR browser add-on and system. Collectively, our performance evaluation has shown that our reputation-based approach to thwarting poisoning attacks consistently converges faster than the state-of-the-art while maintaining or improving the classification accuracy.
We are currently working towards disseminating EITR to a larger user base and using it to classify additional sensitive and non-sensitive types of content, including categories defined by the users themselves, for various purposes, not necessarily related to sensitive content classification. In the above context we are also considering additional attacks, such as the physical backdoor attack [54], the sub-population data poisoning attack [28] and the category inference attack [20] in FL. Overall, we believe that our subjective-logic based reputation scheme may be beneficial to other types of machine learning applications going beyond classification, but this is something to be demonstrated by future work. Our simulation experiments are implemented with the PyTorch framework [47] on the cloud computing platform Google Colaboratory Pro (Colab Pro), with access to Nvidia K80s, T4s, P4s and P100s and 25 GB of Random Access Memory. For the training setting, the local models are trained independently and sequentially in each round of training before being aggregated into the global model. We also presume that clients do not leave the system. Table 2 shows the default settings in our experiments. We employ four hyper-parameters in our reputation model: the rewarding weight, the prior probability, the time decay parameter and the window length. As shown in Remark 1, the time decay parameter and the window length do not affect the performance of our model, so we only consider the rewarding weight, which controls the reward given to positive observations, and the prior probability, which controls the fraction of uncertainty converted to belief. To demonstrate the impact of these two hyper-parameters of our reputation model, we grid search the former in [0.1, 0.2, 0.3, 0.4] and the latter in [0.1, 0.3, 0.5, 0.7, 0.9]. The setup is the same as on the SURL dataset under label flipping attack. The ultimate accuracy and stability of reputation-based aggregation are shown in Figure 12. Note that these results are obtained for the label flipping attack and, according to the theory, they also hold for the backdoor attack. The result in Figure 12 demonstrates that our approach is very stable and efficient in terms of hyper-parameter selection, and that it achieves a high degree of precision.
Unsupervised deep transfer feature learning for medical image classification
Yiqing Hua, Deborah Estrin, and Vitaly Shmatikov. 2020. How to backdoor federated learning
The security of machine learning
Analyzing federated learning through an adversarial lens
Machine learning with adversaries: Byzantine tolerant gradient descent
Towards federated learning at scale: System design
Practical secure aggregation for privacy-preserving machine learning
Federated learning of predictive models from federated electronic health records
Convex optimization: Algorithms and complexity
Practical byzantine fault tolerance
Data protection in the EU, The General Data Protection Regulation (GDPR)
An improved data stream summary: the count-min sketch and its applications
Curlie - The Collector of URLs
Imagenet: A large-scale hierarchical image database
Express - Fast, unopinionated, minimalist web framework for Node
Local model poisoning attacks to byzantine-robust federated learning
PMF: A privacy-preserving human mobility prediction framework via federated learning
Attack-resistant federated learning with residual-based reweighting
Mitigating sybils in federated learning poisoning
Secure Aggregation is Insecure: Category Inference Attack on Federated Learning. IEEE Transactions on Dependable and Secure Computing
Google Chrome Extension APIs
The hidden vulnerability of distributed learning in byzantium
Federated learning for mobile keyboard prediction
Deep residual learning for image recognition
Patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records
Beyond content analysis: Detecting targeted ads via distributed counting
Who is fiddling with prices? Building and deploying a watchdog service for e-commerce
Niklas Pousette Harger, and Alina Oprea. 2021. Subpopulation data poisoning attacks
Subjective logic
Trust network analysis with subjective logic
Learning from history for byzantine robust optimization
Federated learning: Strategies for improving communication efficiency
Flaas: Federated learning as a service
Learning multiple layers of features from tiny images
Privacy-preserving patient similarity learning in a federated environment: development and analysis
Xiang Ren, and Salman Avestimehr. 2021. FedNLP: Benchmarking Federated Learning Methods for Natural Language Processing Tasks
Breakdown points of affine equivariant estimators of multivariate location and covariance matrices
Georgios Smaragdakis, and Nikolaos Laoutaris. 2020. Identifying Sensitive URLs at Web-Scale
Communication-efficient learning of deep networks from decentralized data
Estimating a Dirichlet distribution
Geometric median and robust estimation in Banach spaces
Protection of privacy regulations (data security)
Australian Privacy Principles guidelines; Australian Privacy Principle 5 - Notification of the collection of personal information
Amended Act on The Personal Information Protection and Electronic Documents Act
Automatic differentiation in pytorch
Amended Act on the Protection of Personal Information
Term-weighting approaches in automatic text retrieval. Information processing & management
Robust regression using repeated medians
Differentially private k-means clustering
TensorFlow - An end-to-end open source machine learning platform
TensorFlow.js is a library for machine learning in JavaScript
Backdoor Attacks Against Deep Learning Systems in the Physical World
Introduction to robust estimation and hypothesis testing
Zeno: Distributed stochastic gradient descent with suspicion-based fault-tolerance
Byzantine-robust distributed learning: Towards optimal statistical rates
Learning context-aware policies from multiple smart homes via federated multi-task learning
The following are the lemmas we utilise in the proof of Theorem 3.1. Lemma C.1. From Assumptions 1 and 4, $\mathcal{L}(\cdot)$ is $L$-smooth and $\mu$-strongly convex. Then, for all $\mathbf{w}_1, \mathbf{w}_2 \in \Theta$,
$$\left\langle \nabla\mathcal{L}(\mathbf{w}_1) - \nabla\mathcal{L}(\mathbf{w}_2),\; \mathbf{w}_1 - \mathbf{w}_2 \right\rangle \ge \frac{\mu L}{\mu + L}\left\|\mathbf{w}_1 - \mathbf{w}_2\right\|^2 + \frac{1}{\mu + L}\left\|\nabla\mathcal{L}(\mathbf{w}_1) - \nabla\mathcal{L}(\mathbf{w}_2)\right\|^2.$$
Lemma C.2. The difference between $m(\mathbf{w})$ and $\nabla\mathcal{L}(\mathbf{w})$ is bounded in every iteration $t$. Proof sketch: based on Assumption 4, $\mathcal{L}(\mathbf{w})$ is strongly convex; from [9, Section 3.6] together with Assumption 1 we obtain the stated bounds, which concludes the proofs of Lemma C.1 and Lemma C.2. For the proof of Theorem 3.1, we first consider a general problem of robust estimation of a one-dimensional random variable. Suppose that there are $K$ clients and that a fraction $\epsilon$ of them are malicious and own adversarial data. In iteration $t$, we split the deviation of the aggregated update into two parts and bound each in turn. The first part is bounded using Assumption 4 and Lemma C.1, combined with the smoothness guaranteed by Assumption 1, under the step-size condition $\eta < 1/L$. For the second part, based on Lemma C.2 we know that
$$\left\|m(\mathbf{w}) - \nabla\mathcal{L}(\mathbf{w})\right\|^2 \le \left\|m_0(\mathbf{w}) - \nabla\mathcal{L}(\mathbf{w})\right\|^2 + \sqrt{\Delta_1},$$
and a constant equal to 0.4748 appears in the resulting bound. After obtaining the bounds of both parts, we can prove Theorem 3.1 through iterations, using the formula for a finite geometric series.