key: cord-0207262-b5ebvyac
authors: Huang, Yangsibo; Li, Xiaoxiao; Li, Kai
title: EMA: Auditing Data Removal from Trained Models
date: 2021-09-08
journal: nan
DOI: nan
sha: 378666038fc2693234952f9852f8e729b764de1e
doc_id: 207262
cord_uid: b5ebvyac

Data auditing is a process to verify whether certain data have been removed from a trained model. A recently proposed method (Liu et al., 2020) uses the Kolmogorov-Smirnov (KS) distance for such data auditing, but it fails under certain practical conditions. In this paper, we propose a new method, Ensembled Membership Auditing (EMA), for auditing data removal that overcomes these limitations. We compare both methods using benchmark datasets (MNIST and SVHN) and Chest X-ray datasets with multi-layer perceptrons (MLP) and convolutional neural networks (CNN). Our experiments show that EMA is robust under various conditions, including the failure cases of the previously proposed method. Our code is available at: https://github.com/Hazelsuko07/EMA.

An important aspect of protecting privacy in machine learning is to verify whether certain data were used in the training of a machine learning model, i.e., data auditing. Regulations such as GDPR [18] and HIPAA [1] require institutions to allow individuals to revoke previous authorizations for the use of their data. In this case, such data should be removed not only from storage systems, but also from trained models. Previous work focuses on data removal rather than data auditing. Some studies investigate how training data can be memorized in model parameters or outputs [20, 3], demonstrating the importance of data removal. Others study methods for removing data from trained models, especially methods that do not require retraining the model [4, 2]. However, independent of how data are removed, in order to comply with data privacy regulations it is important, especially for healthcare applications such as medical imaging analysis, to have a robust data auditing process to verify whether certain data were used in a trained model.

The data auditing problem is an under-studied area. The most closely related work is by Liu et al. [10], who proposed an auditing method, based on the Kolmogorov-Smirnov (KS) distance and a calibration dataset, to verify whether a query dataset has been removed. However, the method may fail under certain practical conditions, such as when the query dataset is similar to the training dataset or when the calibration dataset is not of high quality. To overcome these limitations, we propose an Ensembled Membership Auditing (EMA) method, inspired by membership inference attacks [16], to audit data removal from a trained model (see Fig. 1). It is a 2-step procedure that ensembles multiple metrics and statistical tools to audit data removal. To verify whether a trained model memorizes a query dataset, the EMA auditor first infers whether the model memorizes each sample of the query dataset based on various metrics. Second, EMA ensembles the membership metrics and uses statistical tools to aggregate the sample-wise results into a final auditing score. Our contributions are summarized as follows:

1. We propose Ensembled Membership Auditing (EMA), an effective method to measure whether certain data are memorized by a trained model.
2. EMA improves the cost-efficiency of the previous approach [10], as it does not need to train a model on the query dataset.
3. Our experiments on benchmark datasets and Chest X-ray datasets demonstrate that our approach is robust under various practical settings, including conditions under which the previous method fails.

Our formulation of the data auditing problem is similar to that proposed by Liu et al. [10]. Suppose the dataset D is sampled from a given distribution 𝒟 over R^d, where d denotes the input dimension. A machine learning model f_D : R^d → C is trained on D to learn the mapping from an input to a label in the output space C. We denote inference on a data point x ∈ R^d as f_D(x). The auditing institution (or the auditor) aims to tell whether a query dataset D_q is memorized by the trained model f_D. In real applications, most machine learning models for healthcare are provided as Application Programming Interfaces (APIs); users only have access to the model outputs rather than the model parameters, which is referred to as black-box access. Hence, similar to [10], we assume a black-box setting for data auditing: the auditor has access to 1) the algorithm to train f_D, and 2) f_D(D_q), the probability outputs of the query dataset D_q on f_D. The auditor has access to neither the training dataset nor the network parameters of f_D.

Let us use D to denote the training dataset and D_cal to denote the calibration dataset. Liu et al. propose an auditing method [10] that uses the Kolmogorov-Smirnov (KS) distance to compare the distance between f_D(D_q) and f_{D_q}(D_q) with that between f_{D_cal}(D_q) and f_{D_q}(D_q), where D and D_cal are drawn from the same domain. The criterion is defined as

    ρ_KS = KS(f_D(D_q), f_{D_q}(D_q)) / KS(f_{D_cal}(D_q), f_{D_q}(D_q)),    (1)

where ρ_KS ≥ 1 indicates that the query dataset D_q has been forgotten by f_D. However, the ρ_KS formula may fail in the following scenarios: when the query dataset is very similar to the original training dataset, the numerator is small, which leads to a false negative result; when the calibration set is of low quality, the denominator is small, which leads to a false positive result. Section 4 provides experimental results on these limitations of ρ_KS.
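As a concrete illustration of the criterion above, the following sketch estimates ρ_KS with scipy, assuming each model's outputs on D_q are summarized as one-dimensional arrays of per-sample confidence scores; the function and variable names are illustrative, not from [10].

```python
# Minimal sketch of the rho_KS criterion of Liu et al. [10], assuming each
# model's outputs on D_q are summarized as a 1-D array of per-sample
# confidence scores. All names here are illustrative.
import numpy as np
from scipy.stats import ks_2samp

def rho_ks(out_target, out_query, out_cal):
    """out_target: f_D(D_q); out_query: f_{D_q}(D_q); out_cal: f_{D_cal}(D_q)."""
    num = ks_2samp(out_target, out_query).statistic  # KS(f_D(D_q), f_{D_q}(D_q))
    den = ks_2samp(out_cal, out_query).statistic     # KS(f_{D_cal}(D_q), f_{D_q}(D_q))
    return num / den  # rho_KS >= 1 suggests D_q has been forgotten by f_D

# Failure modes from the text: if D_q resembles the training data, the
# numerator shrinks (false negative); if D_cal is of low quality, the
# denominator shrinks (false positive).
```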
This section presents Ensembled Membership Auditing (EMA), a 2-step procedure to audit data removal from a trained model. The key idea of our approach is inspired by the Membership Inference Attack (MIA) [16], which shares the same black-box setting as auditing data removal. Given black-box access to a model, an MIA applies a decision rule h to the model's output on a data point to predict whether that point was a member of the training set; the final results are binarized by a threshold, and 1 indicates membership. To formulate the decision rule h, in addition to the trained model's outputs, an MIA requires another set of data (which we refer to as calibration data), assumed to be similar to the training dataset. Previous work suggests that the decision rule h can either be a machine learning model trained on the calibration data [16, 11, 15], or thresholds on certain metrics computed using the calibration data [17]. Motivated by recent successes of MIA on single data points, we propose a framework that adapts MIA to audit whether a set of data points has been removed.

Algorithm 1 Ensembled Membership Auditing (EMA)
Input: A, the training algorithm; f_D, the target model; D_q, the query dataset; D_cal, the calibration dataset; g_1, ..., g_m, m different metrics for membership testing.
Output: ρ_EMA ∈ [0, 1], the probability that D_q is memorized by f_D.
1: procedure EnsembledMembershipAuditing
2:   τ_1, ..., τ_m ← InferMembershipThresholds(A, D_cal, g_1, ..., g_m)
3:   for i = 1, ..., |D_q| do
4:     compute g_j(f_D, x_i) for j = 1, ..., m
5:     M_i ← 1 if g_j(f_D, x_i) ≥ τ_j for at least one metric g_j, and 0 otherwise
6:   end for
7:   ρ_EMA ← 2Samp-pvalue(M, 1)    ▷ 2Samp-pvalue() returns the p-value of a two-sample statistical test, which determines whether two populations are from the same distribution
8:   return ρ_EMA
9: end procedure

We propose Ensembled Membership Auditing (EMA), a 2-step auditing scheme for data removal (see Algorithm 1): to verify whether a query dataset is memorized by a trained model, the auditor first infers whether each sample is memorized based on certain metrics, and then uses statistical tools to aggregate the sample-wise results and infer the probability that the query dataset is memorized. We call this probability the EMA score and denote it by ρ_EMA.

Step 1: Infer whether each sample is memorized. Given the target model f_D, which is trained on the training dataset D, and the query dataset D_q, the first step infers whether each sample in D_q is memorized by f_D (see Algorithm 1, lines 2 to 6). The auditor first computes τ_1, ..., τ_m, thresholds for m different metrics, by running a standard membership inference pipeline [17] on the calibration set.

Algorithm 2 Infer Membership Thresholds [17]
Input: A, the training algorithm; D_cal, the calibration dataset; g_1, ..., g_m, m different metrics for membership testing.
Output: τ_1, ..., τ_m, thresholds for the m different metrics for membership inference.
1: procedure InferMembershipThresholds
2:   Split D_cal into two disjoint halves, D_cal^tr and D_cal^ts
3:   f_cal ← A(D_cal^tr)    ▷ samples in D_cal^tr are members of f_cal; samples in D_cal^ts are non-members
4:   for j = 1, ..., m do
5:     compute g_j on f_cal's outputs for D_cal^tr (members)
6:     compute g_j on f_cal's outputs for D_cal^ts (non-members)
7:     Infer the threshold τ_j based on Eq. 2
8:   end for
9:   return τ_1, ..., τ_m
10: end procedure

To select thresholds that identify training data, we define the balanced accuracy on the calibration data in terms of the True Positive Rate (TPR) and True Negative Rate (TNR):

    acc_bal(τ) = (TPR(τ) + TNR(τ)) / 2,    (2)

where, given a threshold τ, TPR(τ) is the fraction of calibration members whose metric value is at least τ, and TNR(τ) is the fraction of calibration non-members whose metric value is below τ. The best threshold is selected to maximize the balanced accuracy (see Algorithm 2). Each sample in D_q is then inferred as a member, i.e., memorized by the target model, if it gets a membership score higher than the threshold for at least one metric (Algorithm 1, lines 3 to 6). The auditor stores the membership results in M ∈ {0, 1}^{|D_q|}: M_i = 1 indicates that the i-th sample in D_q is inferred as memorized by f_D, and M_i = 0 indicates otherwise. Our scheme uses three metrics for membership inference, following [17].
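To make the threshold search of Algorithm 2 and Eq. 2 concrete, here is a minimal sketch for a single metric; the function names are illustrative, and the example confidence metric is an assumption in the style of [17], not necessarily one of the paper's three metrics.

```python
# Minimal sketch of per-metric threshold selection (Algorithm 2, Eq. 2),
# assuming higher metric values indicate membership. Names are illustrative.
import numpy as np

def infer_threshold(scores_members, scores_nonmembers):
    """scores_members: metric values on the half of D_cal used to train f_cal;
       scores_nonmembers: metric values on the held-out half of D_cal."""
    candidates = np.unique(np.concatenate([scores_members, scores_nonmembers]))
    best_tau, best_acc = None, -1.0
    for tau in candidates:
        tpr = np.mean(scores_members >= tau)    # members correctly flagged
        tnr = np.mean(scores_nonmembers < tau)  # non-members correctly passed
        acc_bal = 0.5 * (tpr + tnr)             # balanced accuracy, Eq. 2
        if acc_bal > best_acc:
            best_tau, best_acc = tau, acc_bal
    return best_tau

# Example metric (an assumption, in the style of [17]): prediction confidence,
# i.e., the probability the model assigns to its predicted class.
def confidence(probs):  # probs: array of shape (n, num_classes)
    return probs.max(axis=1)
```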
Step 2: Aggregate sample-wise auditing results. Given M, the sample-wise auditing results from Step 1, the auditor infers whether the whole query dataset is memorized. A simple approach is to perform majority voting on M; however, state-of-the-art MIA approaches [17] achieve only ~70% accuracy on benchmark datasets, so majority voting may not yield reliable results. The unreliability of a single entry of M motivates us to consider the distribution of M instead: ideally, if a query dataset D_q^* is memorized, it should give M_{D_q^*} = 1 (the all-one vector). Thus, we run a two-sample statistical test: we fix one sample to be 1, and use M as the second sample. We set the null hypothesis to be that the two samples are drawn from the same distribution (i.e., that M is the sample-wise auditing result of a memorized query dataset). The test returns a p-value, which is the final output of our EMA scheme, and we denote it by ρ_EMA. We interpret ρ_EMA as follows: if ρ_EMA ≤ α, the auditor can reject the null hypothesis and conclude that the query dataset is not memorized. Here, α is the threshold for statistical significance and is set to 0.1 by default.

Comparison with the previous method. Table 1 lists the differences between our method and that of Liu et al. [10]. As shown, our approach is more cost-efficient, since it does not require training a model on the query dataset. It also addresses the limitations of the previous method by avoiding possible false positives (due to low-quality calibration data) and false negatives (due to query data that are similar to the training data), as we show in the next section.
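Step 2 can be sketched with scipy's two-sample t-test (the aggregation used for the main results in Section 4); the all-one guard and the function name are implementation assumptions.

```python
# Minimal sketch of Step 2: aggregate the sample-wise membership bits M with a
# two-sample t-test against the all-one vector. Names are illustrative.
import numpy as np
from scipy.stats import ttest_ind

def ema_score(M, alpha=0.1):
    """M: binary array of sample-wise membership decisions from Step 1."""
    M = np.asarray(M, dtype=float)
    if M.min() == 1.0:          # M equals the all-one vector: p-value is 1
        return 1.0, True
    rho_ema = ttest_ind(M, np.ones_like(M)).pvalue
    return rho_ema, rho_ema > alpha  # fail to reject H0 -> treat as memorized
```

For instance, a mostly-one M yields a large ρ_EMA (consistent with memorization), while a mostly-zero M drives ρ_EMA below α, so the auditor concludes the query dataset is not memorized.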
We conduct two experiments to validate EMA and compare it with the method by Liu et al. [10]. The first (see Section 4.1) uses benchmark datasets (MNIST and SVHN) and the second (see Section 4.2) uses Chest X-ray datasets. Both methods are implemented in the PyTorch framework [13]. We present the main results using the t-test as the statistical aggregation step of EMA. Appendix B provides results for EMA with different statistical tests, as well as further results under various constraints on the query dataset.

We start by verifying the feasibility of EMA and explaining the experimental setting on benchmark datasets for ease of understanding. The MNIST dataset [9] contains 60,000 images of size 28 × 28. The SVHN dataset [12] contains 73,257 images of natural scenes with size 32 × 32. We generate the training dataset, the calibration set, and the query dataset as follows.

Training dataset. We randomly sample 10,000 images from MNIST as the training dataset and split them equally into 5 non-overlapping folds. Each fold contains 2,000 images.

Calibration set. We sample 1,000 images from MNIST (disjoint from the training dataset) as the calibration set. To simulate a low-quality calibration set in practice, we keep k% of the original images, add random Gaussian noise to (100 − k)/2% of the images, and randomly rotate the other (100 − k)/2% of the images. We vary k in our evaluation.

Table 2: Auditing scores of both methods on benchmark datasets. Each column corresponds to a query dataset, and each row corresponds to a calibration set with quality controlled by k. False positive results are in red, while false negative results are in blue.

Query dataset. We design the following three kinds of query dataset:
- {M1, M2, M3, M4, M5}: 5 folds of MNIST images used in training, each with 2,000 images;
- M6: 2,000 images randomly selected from the MNIST dataset (disjoint from the training dataset and the calibration set);
- S: 2,000 images randomly selected from the SVHN dataset.

Target model. The target model is a three-layer multi-layer perceptron with hidden sizes (256, 256). It is trained with the SGD optimizer [14] with learning rate 0.05 for 50 epochs. The learning rate decay is set to 10^−4.
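A sketch of this benchmark target model in PyTorch is given below; the ReLU activations and the reading of "learning rate decay" as SGD weight decay are assumptions beyond the stated configuration.

```python
# Sketch of the benchmark target model: a three-layer MLP with hidden sizes
# (256, 256), trained with SGD (lr 0.05, 50 epochs). ReLU activations and the
# weight_decay interpretation of "learning rate decay" are assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),                        # 28x28 MNIST image -> 784-dim vector
    nn.Linear(28 * 28, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),                  # 10-way classification
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

def train(loader, epochs=50):
    """loader yields (image batch, label batch) pairs from the training folds."""
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
```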
Results and discussion. Fig. 2(a) shows that the distribution of the metrics on M1 (memorized) is clearly distinguishable from those on M6 and S (not memorized). This validates that EMA can be used to infer whether a query dataset is memorized by the target model. In several of these settings, the method by Liu et al. produces incorrect auditing scores (see Table 2(a)); by contrast, EMA is robust in such scenarios. Both methods give correct answers for the query dataset S from SVHN, whose appearance is significantly different from that of MNIST.

We further evaluate EMA on medical image analysis. We use two Chest X-ray datasets: COVIDx [19], a recent public medical image dataset which contains 15,173 Chest X-ray images, and the Childx dataset [7], which contains 5,232 Chest X-ray images from children. We perform pneumonia/normal classification on both datasets. Appendix A.1 provides details and sample images of both datasets. We describe the training dataset, the calibration set, and the query dataset as follows.

Training dataset. We randomly sample 4,000 images from COVIDx as the training dataset and split them equally into 5 non-overlapping folds. Each fold contains 800 images.

Calibration set. We generate the calibration set using a subset of the COVIDx dataset that is disjoint from the training dataset and also contains 4,000 images. To simulate a potentially low-quality calibration set, we keep k% of the original images and add random Gaussian noise to the remaining (100 − k)% of the images.

Query dataset. We evaluate with different query datasets, including:
- {C1, C2, C3, C4, C5}: 5 folds of COVIDx images used in training, each with 800 images;
- C6: 800 images randomly selected from the COVIDx dataset (disjoint from the training dataset and the calibration set);
- R: 800 images randomly selected from the Childx dataset.

Fig. 3: f_D(D_q) highly overlaps with f_{D_q}(D_q), but the KS distance between them is larger than the KS distance between f_{D_cal}(D_q) and f_{D_q}(D_q). This suggests that KS distance may not be a good measure of distributions of prediction outputs.

Table 3: Auditing scores of both methods on Chest X-ray datasets. Each column corresponds to a query dataset, and each row corresponds to a calibration set with quality controlled by k. False positive results are in red, while false negative results are in blue.

Target model. The target model is ResNet-18 [5]. We use the Adam optimizer [8] with learning rate 2 × 10^−5 and train for 30 epochs (weight decay is set to 10^−7).
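The corresponding setup for the Chest X-ray experiments might look as follows; the 2-class output head and the absence of pretrained weights are assumptions beyond the stated architecture and optimizer settings.

```python
# Sketch of the Chest X-ray target model: ResNet-18 trained with Adam
# (lr 2e-5, 30 epochs, weight decay 1e-7). The 2-class head replacement and
# the weight initialization are assumptions not specified in the text.
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(weights=None)                  # random initialization
model.fc = nn.Linear(model.fc.in_features, 2)   # pneumonia vs. normal
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, weight_decay=1e-7)
criterion = nn.CrossEntropyLoss()               # train for 30 epochs as above
```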
Results and discussion. The results further validate that EMA can be used to infer whether a query dataset is memorized by the target model. As shown in Fig. 2(b), the distribution of the membership metrics on C1 (memorized) is clearly distinguishable from those on C6 and R (not memorized); however, the difference between the metric distributions for memorized and not-memorized query datasets is smaller than on the benchmark datasets. One potential explanation for this difference is that we perform 10-way classification on the benchmark datasets but only binary classification on the Chest X-ray datasets. Thus, the auditor may get less information from the final prediction of the target model on the Chest X-ray datasets, as the final prediction has fewer classes. When the query dataset is a subset of the training dataset (columns C1 to C5 in Table 3), EMA correctly indicates that the query dataset is memorized (ρ_EMA = 1), whereas the results of the method by Liu et al. [10] are all false positives. For the case where the query dataset is not included in the training dataset (columns C6 and R in Table 3), EMA always gives correct answers when the quality level of the calibration set is at least k = 60, i.e., with less than 40% noisy data. However, the method by Liu et al. gives a false positive result for C6 when k = 100, and all false positive results for R when k > 60. A possible explanation for why the method by Liu et al. fails is that KS distance may not be a good measure when the number of classes is small (see Fig. 3).

This paper presents EMA, a robust 2-step data auditing procedure to verify whether certain data were used in a trained model, or whether certain data have been removed from it. By examining whether each data point of a query set is memorized by a target model and then aggregating the sample-wise auditing results, this method not only overcomes two main limitations of the state of the art, but also improves efficiency. Our experimental results show that EMA is robust for medical images, compared with the state of the art, under practical settings such as lower-quality calibration datasets and statistically overlapping data sources. Future work includes testing EMA on more medical imaging tasks, and studying more factors that may affect the algorithm's robustness, such as the requirements on the calibration data, different training strategies and models, and other aggregation methods.

A.1 Details of Chest X-ray datasets

The Covid19 X-ray dataset. COVIDx [19] is a recent public medical image dataset which contains nearly 16,000 chest X-ray (CXR) images. Some participants are associated with more than one CXR. To simplify membership evaluation, we keep only the patients with one CXR, ending up with 15,173 cases and CXR images. We show some examples in Figure 4.

The Childx dataset. The Childx dataset is a Chest X-ray dataset selected from retrospective cohorts of pediatric patients of one to five years old from Guangzhou Women and Children's Medical Center, Guangzhou [6]. This dataset contains a total of 5,232 chest X-ray images from children, including 3,883 characterized as depicting pneumonia and 1,349 as normal. We show some examples in Figure 5.

B Ablation study

We first evaluate the performance of EMA using two different statistical tests: the two-sample Kolmogorov-Smirnov test and the two-sample t-test between the sample-wise auditing result M and 1. 1) The Kolmogorov-Smirnov statistic quantifies the distance between the empirical distribution functions of the two samples; 2) the t-test statistic is defined as

    t = (X̄_1 − X̄_2) / sqrt(s_1²/n_1 + s_2²/n_2),

where X̄_i, s_i², and n_i denote the mean, variance, and size of the i-th sample. As shown in Table 4, the t-test consistently outperforms the Kolmogorov-Smirnov test, across different tasks and with calibration datasets of different qualities. A possible explanation is that both M and 1 are binary vectors, and thus not well suited to the Kolmogorov-Smirnov statistic, which is designed to measure the distance between continuous distributions.

Table 4: Auditing results of EMA using different statistical tests on the benchmark (first row) and Chest X-ray (second row) datasets. False positive results are in red, while false negative results are in blue.
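The behavior of the two tests on a binary membership vector can be checked directly with scipy; this is a minimal sketch on synthetic data, not the paper's evaluation code.

```python
# Compare the two aggregation tests on a synthetic sample-wise result M:
# two-sample KS test vs. two-sample t-test against the all-one vector.
import numpy as np
from scipy.stats import ks_2samp, ttest_ind

M = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1], dtype=float)  # mostly "memorized"
ones = np.ones_like(M)

print("KS-test p-value:", ks_2samp(M, ones).pvalue)
print("t-test  p-value:", ttest_ind(M, ones).pvalue)
# On binary vectors the KS statistic can only take a few discrete values,
# which illustrates why the t-test tends to be the more informative choice.
```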
We also evaluate the robustness of the EMA approach by varying |D_q|, the size of the query dataset. As shown in Table 5, for the benchmark dataset, EMA gives correct answers if |D_q| > 200; when |D_q| ≤ 200, the performance of EMA is affected if the calibration dataset is not of perfect quality, i.e., k < 100. For the Chest X-ray dataset, EMA gives correct answers if |D_q| > 20 (see Table 6), which suggests that EMA is quite robust even with a very small query dataset.

Table 5: Auditing results of EMA with query datasets of different sizes on the benchmark dataset. Sizes of the query datasets are annotated as |D_q| in subcaptions. False positive results are in red, while false negative results are in blue.

Table 6: Auditing results of EMA with query datasets of different sizes on the Chest X-ray dataset. Sizes of the query datasets are annotated as |D_q| in subcaptions. False positive results are in red, while false negative results are in blue.

References
[1] Health Insurance Portability and Accountability Act of 1996.
[2] Machine unlearning.
[3] The secret sharer: Evaluating and testing unintended memorization in neural networks.
[4] Certified data removal from machine learning models.
[5] Deep residual learning for image recognition.
[6] Labeled optical coherence tomography (OCT) and chest X-ray images for classification.
[7] Identifying medical diagnoses and treatable diseases by image-based deep learning.
[8] Adam: A method for stochastic optimization.
[9] MNIST handwritten digit database.
[10] Have you forgotten? A method to assess if machine learning models have forgotten data.
[11] Machine learning with membership privacy using adversarial regularization.
[12] Reading digits in natural images with unsupervised feature learning.
[13] PyTorch: An imperative style, high-performance deep learning library.
[14] An overview of gradient descent optimization algorithms.
[15] ML-Leaks: Model and data independent membership inference attacks and defenses on machine learning models.
[16] Membership inference attacks against machine learning models.
[17] Systematic evaluation of privacy risks of machine learning models.
[18] The EU General Data Protection Regulation (GDPR).
[19] COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images.
[20] The secret revealer: Generative model-inversion attacks against deep neural networks.

Acknowledgments. This project is supported in part by a Princeton University fellowship and Amazon Web Services (AWS) Machine Learning Research Awards. The authors would like to thank Liwei Song and Dr. Quanzheng Li for helpful discussions.