NanoBatch DPSGD: Exploring Differentially Private Learning on ImageNet with Low Batch Sizes on the IPU
Edward H. Lee, Mario Michael Krell, Alexander Tsyplikhin, Victoria Rege, Errol Colak, Kristen W. Yeom
2021-09-24

Differentially private SGD (DPSGD) has recently shown promise in deep learning. However, compared to non-private SGD, the DPSGD algorithm introduces computational overheads that can undo the benefit of batching on GPUs. Microbatching is a standard method to alleviate this and is fully supported in the TensorFlow Privacy library (TFDP). However, while this technique improves training times, it also reduces the quality of the gradients and degrades classification accuracy. Recent works, for example those using the JAX framework, show promise in alleviating this as well, but they still exhibit a throughput degradation from non-private to private SGD on CNNs and have not yet demonstrated ImageNet implementations. In our work, we argue that low batch sizes combined with group normalization on ResNet-50 can yield high accuracy and privacy on Graphcore IPUs. This enables DPSGD training of ResNet-50 on ImageNet in just 6 hours (100 epochs) on an IPU-POD16 system.

Differentially private stochastic gradient descent (DPSGD) [2] is a technique for training neural networks on sensitive, personal data while providing provable guarantees of privacy. Since [2], subsequent works have been limited to small datasets and networks, in part due to computational challenges. Overcoming these challenges has required large mini-batches [19]. However, differential privacy, and especially larger mini-batches, impacts the privacy loss as well as the accuracy [4]. Whereas the original implementation of DPSGD was in TensorFlow [1, 2, 18], more recent approaches use JAX with some success [3, 26], and there are several other approaches that tackle acceleration at the framework level [6, 23, 27]. This paper explores the application of DPSGD to the ImageNet dataset with a ResNet-50 [8] architecture and analyzes the effect of selected parameters on accuracy, speed, and privacy on two very different hardware architectures.

Recently, Graphcore's Intelligence Processing Unit (IPU) has been introduced. Key properties of the Mk2 IPU are its 1,472 processor tiles, 896 MiB of on-chip SRAM, 7.8 TB/s inter-tile communication, and its MIMD architecture [9]. This allows for fine-grained operations on the chip without excessive communication to the host for fetching weights or instructions; in most cases, the instructions and intermediate activations reside on-chip. Thus, in numerous applications where other accelerator hardware is challenged, the IPU has shown significant performance advantages, for example on EfficientNet [16], approximate Bayesian computation [10], multi-horizon forecasting [29], bundle adjustment [22], and particle physics [15].

This paper focuses on processing ImageNet [24], which we use as a proxy for large-scale image processing. However, numerous other applications can benefit from this work, such as brain tumor segmentation [13, 25], cancer detection [21], COVID-19 lung scan analysis [12], and many more in the clinical context [5, 7], all of which have grown both in the number of training examples and in the size of each example. For example, MRI datasets contain multiple sequences per patient, and each sequence contains volumetric data.
Larger and more complex models are therefore needed. Since its introduction, DPSGD has also seen increased usage in natural language processing [3, 6, 20, 26]. A common approach is to pretrain on a public dataset without privacy and then fine-tune on the privacy-sensitive data [14]. This allows larger networks to be trained, but it does not address the challenge of training on big data while still respecting privacy.

An interesting aspect of image processing is the choice of normalization technique. A common choice for ResNet-50 [8] is batch normalization. However, batch normalization mixes information across different data samples and thus violates privacy. We therefore use group normalization [28] in this paper. Whereas batch norm has an optimal batch size between 32 and 64, group norm enables high accuracy at much lower batch sizes [17]. Recently, an alternative method, proxy norm [11], has been developed; it combines the benefits of batch norm (the most effective way to normalize the data) with those of group norm (which works even at a batch size of one) and can speed up EfficientNet significantly [16].

This paper addresses acceleration at the hardware level for image processing, which has so far been underrepresented in the literature. For the GPU experiments, we used the public ResNet-50 implementation, replaced batch norm with group norm to satisfy the privacy requirements, and used vmap with TFDP for the DPSGD part. For the IPU experiments, we used Graphcore's public examples repository for CNNs and added the respective code for clipping and noising to obtain a DPSGD implementation. For simplicity, we obtain larger total batch sizes via gradient accumulation on Mk1 (DSS8440) and Mk2 (IPU-POD16).

In DPSGD, we randomly sample a batch of images at step $t$, clip the per-example gradients such that $\|g_t\|_2 \le C$, accumulate the clipped gradients over the entire batch, and inject noise $\mathcal{N}(0, (\sigma C)^2 I)$, where $C$ and $\sigma$ are hyperparameters that establish the privacy budget $\epsilon$. The hyperparameters are usually chosen to maximize classification accuracy (utility) for a given $\epsilon$. In our experiments, we fix the total number of epochs and measure both the accuracy and $\epsilon$. We keep $\sigma$ and $C$ constant throughout training and across layers for all experiments, and we use the moments accountant implemented in TFDP [18] to compute the privacy budget. In both our Mk1 and Mk2 implementations, we clip and inject noise for each gradient example independently to encapsulate DPSGD within an already existing framework; further throughput gains could be obtained by noising only after accumulating. After clipping and noising, the gradients are used to update the parameters via SGD without momentum and with a stepped learning rate decay policy.

In our Mk1 experiments, the ResNet-50 model is pipelined and split into 4 stages. We motivate experiments with pipelining because it has become an important part of building ever larger and more complex models on larger datasets. To keep the throughput degradation small and to avoid limiting the pipelining scheme, we disallow any stage from communicating gradient signals (including the gradient norm) to other stages. To enable DPSGD, we therefore clip the per-stage gradients of stage $j$ across the $M$ stages independently while ensuring that the original gradient norm bound is respected, namely $\|g_t\|_2^2 = \sum_{j=1}^{M} \|g_{t,j}\|_2^2 \le C^2$. In our experiments, we choose a simple uniform partitioning by imposing $\|g_{t,j}\|_2 \le C/\sqrt{M}$ for each stage.
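To make the clipping and noising step concrete, the following is a minimal NumPy sketch, not the authors' implementation, of per-example, per-stage clipping followed by Gaussian noising. It assumes that per-example gradients for each of the M pipeline stages are already available as arrays; the function name, shapes, and the convention of noising once after accumulation (the faster variant mentioned above) are illustrative assumptions.

```python
import numpy as np

def dpsgd_accumulate(per_example_stage_grads, C, sigma, rng):
    """Sketch of the DPSGD gradient step described above (illustrative only).

    per_example_stage_grads: list over examples; each entry is a list of M
        per-stage gradient vectors (flattened). Shapes are placeholders.
    C: overall clipping norm; each stage is clipped to C / sqrt(M).
    sigma: noise multiplier; the noise standard deviation is sigma * C.
    """
    M = len(per_example_stage_grads[0])
    stage_bound = C / np.sqrt(M)          # per-stage budget so the summed squares stay <= C^2
    accumulated = [np.zeros_like(g) for g in per_example_stage_grads[0]]

    for stage_grads in per_example_stage_grads:            # loop over examples (micro-batch size 1)
        for j, g in enumerate(stage_grads):
            norm = np.linalg.norm(g)
            g_clipped = g * min(1.0, stage_bound / (norm + 1e-12))   # clip each stage independently
            accumulated[j] += g_clipped

    # Add Gaussian noise N(0, (sigma * C)^2 I) once per batch, then average.
    n = len(per_example_stage_grads)
    noisy = [(acc + rng.normal(0.0, sigma * C, size=acc.shape)) / n for acc in accumulated]
    return noisy

# Example usage with random per-example gradients (2 examples, M = 4 stages):
rng = np.random.default_rng(0)
grads = [[rng.normal(size=16) for _ in range(4)] for _ in range(2)]
noisy_grads = dpsgd_accumulate(grads, C=1.0, sigma=1.0, rng=rng)
```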
This places a tighter constraint on the gradient norm than in the non-pipelined case. The use of pipelining motivates future work on adaptive and layer-wise clipping strategies.

What is the optimal batch size for maximizing throughput? On the A100 GPU, we ran the throughput experiments shown in Table 2. For DPSGD, the total batch size that maximizes GPU throughput is 8, independent of the micro-batch size, whereas for SGD, throughput is (usually) maximized at larger batch sizes.

While larger gradient accumulation yields better gradient signal quality, smaller accumulation leads to more incremental, but noisier, weight updates. In Fig. 1, we illustrate this with a small ResNet-8 model on CIFAR-10: gradient accumulation counts down to 1 (and a batch size of 1) can achieve better convergence than a count of 16. Furthermore, larger batch sizes increase the privacy budget (TFDP [18]) for a fixed number of epochs. The motivation for large batch sizes is to ensure a large gradient signal-to-noise ratio in the hope of improving utility. However, in order to meet the same privacy budget as an equivalent experiment with smaller batch sizes, the number of epochs must be reduced.

With this in mind, we performed ImageNet experiments on the Mk1, analyzing the interplay between the gradient accumulation count, micro-batching, and two schemes for adjusting the learning rate to compensate for the total batch size. In the first scheme, we kept the initial learning rate at 1.0, the highest value that allowed fast convergence without diverging. In the second, the initial learning rate is scaled linearly with the gradient accumulation count. For all experiments, the learning rate is decayed step-wise by a factor of 0.1 after 25, 50, 75, and 90 epochs. The accuracy results after 100 epochs of training are displayed in Figure 1. Our learning rate scaling approach slightly improves performance, whereas increasing the micro-batch size from 1 to 2 clearly decreases performance. Hence, for the following experiments, we use a micro-batch size of 1.

Multiple publications address the challenge of making differential privacy run fast. Hence, we compare two fundamentally different hardware architectures in this section: GPUs and IPUs. In Section 2.3, we showed that a micro-batch size of 1 (µBS = 1) delivers the best accuracy; smaller micro-batch sizes also result in better privacy. We therefore focus on this setting for the comparison of the different hardware and of SGD versus DPSGD. For GPUs, the total batch size is the product of the per-device batch size and the number of replicas. For the IPU, we use a local batch size of 1 and then use gradient accumulation and replicas to reach the respective larger total batch sizes. We compare machines with 8 GPUs to machines with 16 IPUs to match the systems' TDP (in watts) and typical packaging. The results are displayed in Table 3. On GPUs, DPSGD reduces performance by 50% to 90%. On IPUs, the reduction is only 10% on Mk1 and up to 25% on Mk2. All compared hardware takes an additional performance hit because DPSGD is more memory constrained, whereas plain SGD can run with much higher batch sizes. Given that the Mk2 and A100 are the successors of the Mk1 and V100 and share the same lithographic node (TSMC 7N for Mk2/A100 and 12N for Mk1/V100), it is natural to compare those pairs.
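To illustrate the interaction between batch size, epoch count, and the privacy budget discussed above, the following sketch queries the moments accountant shipped with TensorFlow Privacy. The import path and signature of compute_dp_sgd_privacy are assumed from recent tensorflow_privacy releases, and the noise multiplier is a placeholder rather than the value used in the paper's experiments.

```python
# Sketch: epsilon for a fixed number of epochs at two different batch sizes,
# using the TFDP moments accountant (signature assumed, not from the paper).
from tensorflow_privacy import compute_dp_sgd_privacy

N = 1_281_167            # ImageNet training-set size
noise_multiplier = 1.0   # sigma (placeholder value)
epochs = 100
delta = 1e-6

for batch_size in (8, 1024):
    eps, _ = compute_dp_sgd_privacy(
        n=N,
        batch_size=batch_size,
        noise_multiplier=noise_multiplier,
        epochs=epochs,
        delta=delta,
    )
    print(f"batch size {batch_size:>4}: epsilon ~ {eps:.2f}")
```

For a fixed number of epochs, the larger batch size yields a larger epsilon, which is why matching the privacy budget of a small-batch run forces the epoch count down.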
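The learning-rate policy described above (an initial rate scaled linearly with the gradient accumulation count and decayed step-wise by 0.1x after 25, 50, 75, and 90 epochs) can be written as a small helper. This is a sketch only: the base rate of 1.0 comes from the text, while the reference accumulation count is a hypothetical value, since the paper does not state which count the base rate refers to.

```python
def learning_rate(epoch, grad_accum_count, base_lr=1.0, ref_accum=8):
    """Stepped learning-rate policy sketched from the description above.

    base_lr corresponds to the initial rate of 1.0 reported in the text;
    ref_accum is an assumed reference gradient accumulation count. The rate is
    scaled linearly with the accumulation count and decayed by 0.1x after
    25, 50, 75, and 90 epochs.
    """
    lr = base_lr * grad_accum_count / ref_accum   # linear-scaling variant
    for boundary in (25, 50, 75, 90):
        if epoch >= boundary:
            lr *= 0.1
    return lr

# Example: the rate used at epoch 60 with a gradient accumulation count of 16.
print(learning_rate(epoch=60, grad_accum_count=16))
```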
All A100 experiments were performed on a Google Cloud Platform a2-highgpu-8g instance with 96 vCPUs and 680 GB of memory, and the V100 experiments on a Thinkmate workstation with dual Intel Xeon Gold 6248 CPUs, 755 GB of memory, and 8 V100s. In this setting, GPUs are 8 to 11 times slower than IPUs. This means that, compared to the 6-hour training run that reaches 71% accuracy on an Mk2 IPU-POD16 system, it would take at least 3.5 days to obtain the same results with A100 GPUs.

A summary of the DPSGD ImageNet experiments, with and without pipelining, is shown in Table 4. Without pipelining, we achieve 71% accuracy with $\epsilon = 11.4$ ($\delta = 10^{-6}$). These values are in the expected range, given a maximum possible accuracy of 76% and the epsilons commonly observed in other applications. With pipelining, due to the tighter constraint imposed by pipelined clipping, we achieve a slightly worse accuracy-privacy trade-off than in the non-pipelined case.

In this paper, we showed that training with DPSGD on a large dataset like ImageNet is not only feasible with a per-device batch size of 1 but that smaller batch sizes are in fact preferable. We showed that we can train ResNet-50 on ImageNet for 100 epochs in 6 hours on Graphcore's Mk2 IPU-POD16, whereas comparable hardware takes 10 times longer. In the future, we would like to establish an ImageNet benchmark for DPSGD and explore other optimizers, learning rate schedules, and proxy norm to obtain faster convergence and thus better privacy guarantees. Further acceleration improvements are also of interest, such as reimplementing and running the experiments in the JAX framework. From the application point of view, we want to transfer these findings to federated learning and to the analysis of sensitive data such as lung scans for COVID-19 [12].

References
TensorFlow: A system for large-scale machine learning
Deep Learning with Differential Privacy
Large-Scale Differentially Private BERT. arXiv
Differential Privacy Has Disparate Impact on Model Accuracy
Privacy-Preserving Distributed Deep Learning for Clinical Data
An Efficient DP-SGD Mechanism for Large Scale NLP Models
Differentially Private Federated Learning: A Client Level Perspective
Deep residual learning for image recognition
Dissecting the Graphcore IPU architecture via microbenchmarking
Hardware-accelerated Simulation-based Inference of Stochastic Epidemiology Models for COVID-19
Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence
Deep COVID DeteCT: an international experience on COVID-19 lung detection and prognosis using chest CT
Privacy-Preserving Federated Brain Tumour Segmentation
Scalable differential privacy with sparse network finetuning
Studying the Potential of Graphcore IPUs for Applications in Particle Physics
Making EfficientNet More Efficient: Exploring Batch-Independent Normalization
Revisiting Small Batch Training for Deep Neural Networks
A General Approach to Adding Differential Privacy to Iterative Training Procedures
Communication-Efficient Learning of Deep Networks from Decentralized Data
Learning Differentially Private Recurrent Language Models
A shallow convolutional neural network predicts prognosis of lung cancer patients in multi-institutional computed tomography image datasets
Bundle Adjustment on a Graph Processor
Scalable Private Learning with PATE
Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation
Enabling Fast Differentially Private SGD via Just-in-Time Compilation and Vectorization
Three Tools for Practical Differential Privacy. arXiv
Multi-Horizon Forecasting for Limit Order Books: Novel Deep Learning Approaches and Hardware Acceleration using Intelligent Processing Units. arXiv