key: cord-0226912-6ndis84s authors: Bai, Xiang; Wang, Hanchen; Ma, Liya; Xu, Yongchao; Gan, Jiefeng; Fan, Ziwei; Yang, Fan; Ma, Ke; Yang, Jiehua; Bai, Song; Shu, Chang; Zou, Xinyu; Huang, Renhao; Zhang, Changzheng; Liu, Xiaowu; Tu, Dandan; Xu, Chuou; Zhang, Wenqing; Wang, Xi; Chen, Anguo; Zeng, Yu; Yang, Dehua; Wang, Ming-Wei; Holalkere, Nagaraj; Halin, Neil J.; Kamel, Ihab R.; Wu, Jia; Peng, Xuehua; Wang, Xiang; Shao, Jianbo; Mongkolwat, Pattanasak; Zhang, Jianjun; Liu, Weiyang; Roberts, Michael; Teng, Zhongzhao; Beer, Lucian; Sanchez, Lorena Escudero; Sala, Evis; Rubin, Daniel; Weller, Adrian; Lasenby, Joan; Zheng, Chuangsheng; Wang, Jianming; Li, Zhen; Schonlieb, Carola-Bibiane; Xia, Tian title: Advancing COVID-19 Diagnosis with Privacy-Preserving Collaboration in Artificial Intelligence date: 2021-11-18 journal: nan DOI: nan sha: 10f8b87efe810eb6e20cb19455c5072b58160316 doc_id: 226912 cord_uid: 6ndis84s Artificial intelligence (AI) provides a promising alternative for streamlining COVID-19 diagnoses. However, concerns surrounding security and trustworthiness impede the collection of large-scale representative medical data, posing a considerable challenge for training a well-generalised model in clinical practice. To address this, we launch the Unified CT-COVID AI Diagnostic Initiative (UCADI), where the AI model can be trained in a distributed manner and executed independently at each host institution under a federated learning (FL) framework without data sharing. Here we show that our FL model outperformed all the local models by a large margin (test sensitivity/specificity in China: 0.973/0.951, in the UK: 0.730/0.942), achieving comparable performance with a panel of professional radiologists.
We further evaluated the model on hold-out data (collected from two additional hospitals that did not participate in the FL) and heterogeneous data (acquired with contrast materials), provided visual explanations for the decisions made by the model, and analysed the trade-offs between model performance and communication costs in the federated training process. Our study is based on 9,573 chest computed tomography scans (CTs) from 3,336 patients collected from 23 hospitals located in China and the UK. Collectively, our work advanced the prospects of utilising federated learning for privacy-preserving AI in digital health. As the gold standard for identifying COVID-19 carriers, reverse transcription-polymerase chain reaction (RT-PCR) is the primary diagnostic modality to detect viral nucleotide in specimens from cases with suspected infection. However, due to the varying disease courses in different patients, the detection sensitivity hovers at only around 0.60-0.71 [1] [2] [3] [4] , which results in a considerable number of false negatives. As such, clinicians and researchers have made tremendous efforts searching for alternatives [5] [6] [7] and complementary modalities 2, [8] [9] [10] [11] to improve the testing scalability and accuracy for COVID-19 and beyond. It has been reported that coronavirus carriers present certain radiological features in chest CTs, including ground-glass opacity, interlobular septal thickening, and consolidation, which can be exploited to identify COVID-19 cases. Chest CTs have thus been utilised to diagnose COVID-19 in some countries and regions, with reported sensitivity ranging from 0.56 to 0.98 [12] [13] [14] [15] . However, these radiological features are not explicitly tied to COVID-19, and the accuracy of CT-based diagnostic tools depends heavily on the radiologists' own knowledge and experience. A recent study 16 has further documented substantial discrepancies among different radiologists in differentiating COVID-19 from other viral pneumonia.
Such inconsistency is undesirable for any clinical decision system. Therefore, there is an urgent demand for an accurate and automatic method to address the deficiencies of current CT-based approaches. Successful development of an automated method relies on a sufficient amount of data accompanied by precise annotations. We identified three challenges, specifically data-related, for developing a robust and generalised AI model for CT-based COVID-19 identification: (i) Incompleteness. The high-quality CTs used for training were only a small subset of the entire cohort and are therefore unlikely to cover the complete set of radiological features useful for identification. (ii) Isolation. CTs acquired across multiple centres are difficult to transfer for training due to security and privacy concerns, while a locally trained model may not generalise to, or be improved by, the data collected at other sites. (iii) Heterogeneity. Due to different acquisition protocols (e.g., contrast agents and reconstruction kernels), even CTs collected within a single hospital are not well standardised; it is therefore challenging to train a precise model on a simple combination of the data 17 . Furthermore, it remains an open question whether COVID-19 patients from diverse geographies and varying demographics show similar or distinct patterns. All these challenges impede the development of a well-generalised AI model and, thus, of a global intelligent clinical solution. It is worth noting that these challenges are encountered in essentially any attempt to apply AI models in clinical practice, not only those related to COVID-19. To tackle these problems, we launched the Unified CT-COVID AI Diagnostic Initiative (UCADI, in Fig. 1 and 2).
It was developed based on the concept of federated learning 18, 19 , which enables machine learning engineers and clinical data scientists to collaborate seamlessly without sharing patient data. Thus, in UCADI, every participating institution can benefit from, and contribute to, the continuously evolving AI model, helping deliver even more precise diagnoses for COVID-19 and beyond. Training an accurate AI model requires comprehensive data collection. Therefore, we first gathered, screened, and anonymised the chest CTs at each UCADI participating institute (5 hospitals in China and 18 hospitals in the UK), comprising a total of 9,573 CTs of 3,336 cases. We summarised the demographics and diagnoses of the cohort in Supplementary Tables 1 and 2. Developing an accurate diagnostic model requires a sufficient amount of high-quality data. Consequently, we identified the three branches of the Wuhan Tongji Hospital Group (Main Campus, Optical Valley and Sino-French) and the National COVID-19 Chest Imaging Database (NCCID) 20 as individual UCADI participants. Each site contains adequate high-quality CTs for the development of the 3D CNN model. We used 80% of the data for training and validation (trainval) and the remaining 20% for testing. Additionally, we utilised the CTs collected from Wuhan Tianyou Hospital and Wuhan Union Hospital as hold-out test sets. We consistently used the same partition in both the local and federated training processes for a fair comparison. NCCID is an initiative established by NHSX, providing a large volume of CT and CXR studies of COVID-19 and non-COVID-19 patients from over 18 partner hospitals in the UK. Since each hospital's data quantity and categorical distribution are quite uneven, we pooled all the CTs and treated the entire NCCID cohort as a single participant. Unlike the CTs procured from China, which are all non-contrast, around 80% of the CTs from NCCID were acquired with contrast materials (e.g., iodine).
Contrast materials attenuate X-rays more strongly and therefore appear with higher attenuation on CTs, which helps emphasise tissues such as blood vessels and intestines (Supplementary Fig. 1 and Table 3 ). However, in practice, we found that a simple combination of the contrast and non-contrast CTs did not support the training of a well-generalised model, owing to the intrinsic differences induced by their acquisition procedures 21 . Therefore, to overcome the data heterogeneity between the contrast and non-contrast CTs in the NCCID, we applied an unpaired image-to-image translation method called CycleGAN 22 to transform the contrast CTs into non-contrast variants as augmentations during local model training. In Supplementary Table 4, we compared CycleGAN with two other recent image translation methods (CouncilGAN 23 and ACL-GAN 22 ). We showed that the model trained on CycleGAN-transformed contrast CTs had the best performance (tested on the non-contrast CTs). However, this modality transformation is not always helpful, as performance degraded when training on the raw plus translated contrast CTs. We developed a densely connected 3D convolutional neural network (CNN) model on this large cohort to deliver precise diagnoses with AI approaches. We term it 3D-DenseNet and report its architectural design and training optimisations in the Methods and Supplementary Fig. 2 . We examined the predictive power of 3D-DenseNet on a four-class pneumonia classification task as well as on COVID-19 identification. In the first task, we aimed at distinguishing COVID-19 (Fig. 3a, Supplementary Fig. 3 and Table 5 ) from healthy cases and two other pneumonia types, namely non-COVID-19 viral and bacterial pneumonia (Fig. 3b ).
We preferred a four-class taxonomy since further distinguishing COVID-19 from community-acquired pneumonia (CAP) 24, 25 can help deliver more appropriate clinical treatments; bacterial and viral infections are the two primary pathogen classes of CAP 26 (Fig. 2c ). However, because the participating institutions follow different annotation protocols, it is more feasible for the model to learn to discriminate COVID-19 from all non-COVID-19 cases. Therefore, we base the experimental results in the main text on this two-category classification. We report the four-class experiments based on the Wuhan Tongji Hospital Group's cohort in Supplementary Fig. 3 and Table 5 . In Supplementary Tables 6 and 7, we further compared 3D-DenseNet with two other 3D CNN baseline models: 3D-ResNet 27 and 3D-Xception 28 . We demonstrated that 3D-DenseNet achieved better performance with a smaller model size, making it highly suitable for federated learning. To interpret the learned features of the model, we performed gradient-weighted class activation mapping (GradCAM) 29 analysis on the CTs from the test set and visualised the regions that lead to identification decisions. We found that the generated heatmaps (Fig. 3c ) primarily highlighted local lesions that overlap substantially with the radiologists' annotations, suggesting the model learns robust radiological features rather than simply overfitting 30 . Such heatmaps can help radiologists localise lesions more quickly when delivering diagnoses in an actual clinical environment. Moreover, localising the lesions also provides guidance for further CT acquisition and clinical tests. A similar idea has been described as "region-of-interest (ROI) detection" in a similar previous study 31 . To examine the cross-domain generalisation ability of the locally trained models, we tested China's locally trained model on Britain's test set and vice versa. We report the numerical results in Fig. 4 .
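The Grad-CAM computation behind such heatmaps reduces to a gradient-weighted sum of the last convolutional layer's feature maps. A minimal numpy sketch, assuming the activations and the gradients of the COVID-19 logit have already been extracted from the network (function and variable names are ours, not the paper's):

```python
import numpy as np

def grad_cam(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Core Grad-CAM computation for a 3D CNN.

    feature_maps: (C, D, H, W) activations of the last conv layer.
    gradients:    (C, D, H, W) gradients of the target-class logit
                  with respect to those activations.
    Returns a (D, H, W) heatmap normalised to [0, 1].
    """
    # Channel importance = global-average-pooled gradients.
    weights = gradients.mean(axis=(1, 2, 3))  # shape (C,)
    # Weighted sum over channels, then ReLU to keep positive evidence only.
    cam = np.maximum((weights[:, None, None, None] * feature_maps).sum(axis=0), 0.0)
    if cam.max() > 0:
        cam /= cam.max()  # scale to [0, 1] for overlaying on the CT
    return cam
```

The heatmap would then be upsampled to the CT resolution and overlaid on the slices, as in Fig. 3c.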
However, due to the incompleteness, isolation, and heterogeneity of the various data sources, we found that all the locally trained models exhibited less-than-ideal test performance on other sources. Specifically, the model trained on NCCID non-contrast CTs had a sensitivity/specificity/AUC of 0.313/0.907/0.745 in identifying COVID-19 on the Chinese test set, lower than the locally trained counterpart, and vice versa. Next, we describe how we incorporate federated learning for cross-continent privacy-preserving collaboration on training a generalised AI diagnostic model, mitigating the domain gaps and data heterogeneity.
-Enabling multinational privacy-preserving collaboration with federated learning
We developed a federated learning framework to facilitate the collaboration nested under UCADI and NCCID, integrating diverse cohorts as part of a global joint effort on developing a precise and robust AI diagnostic tool. In traditional data science approaches 17, 31 , sensitive and private data from different sources are directly gathered and transported to a central hub where the models are deployed. However, such procedures are infeasible in real clinical practice; hospitals are usually reluctant (and often not permitted) to disclose data due to privacy concerns and legislation 32 . In contrast, the federated learning technique proposed by Google 33 is an architecture in which the AI model is distributed to and executed at each host institution without data centralisation. Furthermore, transmitting only the model parameters effectively reduces the latency and cost associated with sending large amounts of data over the internet. More importantly, this privacy-by-design strategy enables medical centres to collaborate on developing models without sharing sensitive clinical data with other institutions. Recently, Swarm Learning 34 was proposed to decentralise models via edge computing.
However, we conjecture it is not yet mature for privacy-preserving machine learning 35 applications involving massive data collections and many participants, due to the exponential increase in computation. In UCADI, we provide: (i) an online diagnostic interface allowing people to query diagnostic results for COVID-19 identification by uploading their chest CTs; (ii) a federated learning framework that enables UCADI participants to collaboratively contribute to improving the AI model for COVID-19 identification. During the collaborative training process, each UCADI participant sends its model weights back to the server every few iterations via a customised protocol. To further mitigate the potential for data leaks during this transmission, we applied an additive homomorphic encryption method called Learning with Errors (LWE) 36 to encrypt the transmitted model parameters. In this way, data remain within each participant's own infrastructure, and the central server has no access to them whatsoever. After receiving the transmitted packages from the UCADI participants, the central server aggregates the global model without being able to read the model parameters of any individual participant. The updated global model is then distributed to all participants, again protected with LWE encryption, enabling the continuation of model optimisation at the local level. Our framework is designed to be highly flexible, allowing dynamic participation and breakpoint resumption (detailed in Methods). With this framework, we deployed the same experimental configurations to validate the federated learning concept for developing a generalised CT-based COVID-19 diagnostic model (detailed in Methods). We compared the test sensitivity and specificity of the federated model to the local variants (Fig. 4 ). We plotted the ROC curves and calculated the corresponding AUC scores, along with 95% confidence intervals (CI) and p-values, to validate the model's performance (Fig. 4 ).
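The LWE scheme itself is involved; as a simplified stand-in, the following numpy sketch of pairwise additive masking illustrates the same aggregation property the protocol relies on: the server can recover the exact sum of the clients' updates while each individual masked update looks like noise. This is an illustration, not the paper's encryption method:

```python
import numpy as np

def mask_updates(updates, seed=0):
    """Toy secure aggregation via pairwise additive masking.

    NOT the paper's LWE scheme: a simplified stand-in showing the
    additively homomorphic property. Each ordered client pair (i, j),
    i < j, shares a random mask; client i adds it and client j subtracts
    it, so the masks cancel in the sum while every individual masked
    update is obscured from the server.
    """
    rng = np.random.default_rng(seed)
    n = len(updates)
    masked = [np.asarray(u, dtype=float).copy() for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            r = rng.normal(size=masked[0].shape)  # shared mask r_ij
            masked[i] += r                        # client i adds the mask
            masked[j] -= r                        # client j subtracts it
    return masked
```

The server then simply sums the masked vectors: the pairwise masks cancel, so the aggregate equals the sum of the true updates without any single client's weights being exposed.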
As confirmed by the curves and numbers, the federated model, when applied to the test set of the NCCID (from 18 UK hospitals), vastly outperformed all the locally trained models. We based the performance measure on the CT level instead of the patient level, consistent with the prior study 31 . We showed that the federated framework is an effective way to mitigate the fact that medical data cannot be centralised from hospitals worldwide due to privacy concerns and legislation. We further conducted a comparative study on the same task with a panel of expert radiologists. Six qualified radiologists (with an average of 9 years' experience) from the Department of Radiology, Wuhan Tongji Hospital (Main Campus), were asked to classify each CT from China into one of the four classes. The six experts were first asked to provide diagnoses individually, then to form an integrated diagnostic opinion via majority vote (consensus) in a plenary meeting. We presented the radiologists and the AI models with the same data partition for a fair comparison. In differentiating COVID-19 from non-COVID-19 cases, the six radiological experts achieved an average sensitivity of 0.79 (0.88, 0.90, 0.55, 0.80, 0.68, 0.93, respectively) and an average specificity of 0.90 (0.92, 0.97, 0.89, 0.95, 0.88, 0.79, respectively). In practice, clinical decisions are usually made by consensus among experts. Here, we use the majority vote among the six expert radiologists to represent this decision-making process. We provide the detailed diagnostic decisions of each radiologist in Supplementary Table 5 . We found that the majority vote helps reduce potential bias and risk: the aggregated diagnoses outperform those of any individual radiologist. In Fig.
4a, we plotted the majority vote as blue markers (sensitivity/specificity: 0.900/0.956) and note that the federatively trained 3D-DenseNet showed comparable performance (sensitivity/specificity: 0.973/0.951) to the expert panel. We further present and discuss the models' performance on the hold-out test sets (645 cases from Wuhan Tianyou Hospital and 506 cases from Wuhan Union Hospital) in Supplementary Table 8 . We showed that the federatively trained model also performed better on these two hold-out datasets, although its confidence is sometimes not well calibrated. During the federated training process, each participant is required to synchronise the model weights with the server every few training epochs using web sockets. Intuitively, more frequent communication should lead to better performance; however, each synchronisation adds extra time. To investigate the trade-off between model performance and communication cost during federated training, we conducted parallel experiments with identical settings but different numbers of training epochs between consecutive synchronisations. We report the models' subsequent test performance in Fig. 5a and time usage in Fig. 5b . We observe that, as expected, more frequent communication leads to better performance. Compared with the least frequent communication scenario, i.e., downloading the model once at the beginning and training locally without intermediate communication, synchronising at every epoch achieved the best test performance with less than a 20% increase in time usage. COVID-19 is a global pandemic. Over 200 million people have been infected worldwide, with hundreds of thousands hospitalised and mentally affected 37, 38 , and, as of October 2021, more than four million are reported to have died. There are borders between countries, yet the only barrier is the boundary between humankind and the virus. We urgently need a global joint effort to confront this illness effectively.
In this study, we introduced a multinational collaborative AI framework, UCADI, to assist radiologists in streamlining and accelerating CT-based COVID-19 diagnoses. First, we developed a new CNN model that achieved performance comparable to expert radiologists in identifying COVID-19. The predicted diagnoses can be used as references, while the generated heatmaps help with faster lesion localisation and further CT acquisition. We then built a federated learning framework to enable global training of a CT-based model for precise and robust diagnosis. With CT data from 22 hospitals, we have herein confirmed the effectiveness of the federated learning approach. We have shared the trained model and open-sourced the federated learning framework. It is worth mentioning that our proposed framework evolves continually; it is not confined to the diagnosis of COVID-19 but also provides infrastructure for future use. Uncertainty and heterogeneity are inherent characteristics of clinical work. Because medical understanding of the vast majority of diseases, including their pathogenesis, pathological processes, and treatment, remains limited, AI offers a means of studying the medical characteristics of diseases. Along this avenue, research on large (and sometimes isolated) samples can become more instructive and convenient, which is especially suitable for transferring knowledge when studying emerging diseases. However, certain limitations are not well addressed in this study. The first is the potential bias in the comparison between experts and models. Due to legal legislation, it is infeasible to share the UK medical data with radiologists and researchers in China, or vice versa; thus, the radiologists were all from nearby institutions. Although their diagnostic decisions differed considerably, it would be unrealistic to claim that our setting and evaluation process eliminate all biases. The second is engineering effort.
Although we have developed mechanisms such as dynamic participation and breakpoint resumption, participants still occasionally dropped out of the federated training process due to unstable internet connections. Also, the computational efficiency of the 3D CNN model still has room for improvement (Supplementary Table 7 ). There are always engineering advancements that can be incorporated to refine the framework. In this section, we first describe how we constructed the dataset, then discuss the details of our implementation for collaboratively training the AI model, and finally provide further analysis of our methods. Additionally, we collected independent cohorts including 507 COVID-19 cases from Wuhan Union Hospital and 645 COVID-19 cases from Wuhan Tianyou Hospital. These hold-out test sets were used for testing the generalisation of the locally trained models as well as the federated model. Since these data sources only contained COVID-19 cases, we did not utilise them during the training process. We also summarised and reported the demographic information (i.e., gender and age) of the cohort in Supplementary Table 1. - For the 2,682 CTs acquired from the 18 partner hospitals located in the United Kingdom (see Supplementary Table 3), the acquisition devices include, among others, the Toshiba Aquilion ONE/PRIME. Settings such as filter sizes, slice thickness and reconstruction protocols are also quite diverse among these CTs. This diversity might explain why the NCCID locally trained model failed to perform as well as the Chinese locally trained variant (see Fig. 4c ). Regarding the material differences, 2,145 out of 2,682 CTs were taken after the injection of an iodine contrast agent (i.e., contrast CTs), which, as pointed out by a previous study 21 , differ intrinsically from non-contrast CTs. We also noticed that a small subset of the CTs contained only partial lung regions; we removed these insufficient CTs, whose number of slices is less than 40.
As for our selection criteria in this regard, although partial lung scans might be infeasible for training segmentation or detection models, we believe a sufficient number of slices ensures the model can effectively capture the requisite features for precise classification in medical diagnosis. We report patient demographic information (i.e., gender and age) of the cohort in Supplementary Table 2 . However, the reported demographics are incomplete, since the demographic attributes of the non-COVID-19 cases were not recorded. Compared with the demographic information of the COVID-19 cases acquired from China, the COVID-19 cases in the UK were older on average and included more male patients. These demographic differences might also explain why the UK locally trained model failed to perform well when applied to the CTs acquired from China. -Data pre-processing, model architecture and training setting We pre-processed the raw CTs for standardisation as well as to reduce the burden on computing resources. We utilised an adaptive sampling method to select 16 slices from all sequential images of a single CT case, using random starting positions and scalable transversal intervals. During training and validation, we sampled once for each CT study, while in testing we repeated the sampling five times independently to obtain five different subsets. We then standardised the sampled slices by removing the channel-wise offsets and rescaling the variation to uniform units. During testing, the five independent subsets of each case were fed to the trained CNN classifier to obtain prediction probabilities over the four classes, and we averaged the predictive probabilities over these five runs to make the final diagnostic prediction for that case. In this way, we effectively include information from different levels of the lung while keeping computation scalable.
To further improve computing efficiency, we utilised trilinear interpolation to resize each slice from 512 to 128 pixels along each axis and clipped intensities to a lung window of -1200 to 600 Hounsfield units before feeding the data into the network model. We named our model 3D-DenseNet (Supplementary Fig. 2 ). It was developed based on DenseNet 39 , a densely connected convolutional network model that performed remarkably well in classifying 2D images. To adapt this design to 3D CT representations, we customised the architecture into fourteen 3D convolution layers distributed over six dense blocks and two transmit blocks (insets of Supplementary Fig. 2 ). Each dense block consists of two 3D convolution layers and an inter-residual connection, whereas each transmit block is composed of a 3D convolution layer and an average pooling layer. We placed 3D DropBlock 40 , instead of simple dropout 41 , before and after the six dense blocks, which proved more effective in regularising the training of convolutional neural networks. We set the momentum of batch normalisation 42 to 0.9 and the negative slope of the LeakyReLU activation to 0.2. During training, the 3D-DenseNet took the pre-processed CT slice sequences as input and output prediction scores over the four possible outcomes (pneumonia types). Due to the data imbalance, we defined the loss function as the weighted cross entropy between the predicted probabilities and the true categorical labels. The weights were set to 0.2, 0.2, 0.4, and 0.2 for healthy, COVID-19, other viral pneumonia, and bacterial pneumonia cases, respectively. We utilised an SGD optimiser with a momentum of 0.9 to update the network parameters via backpropagation, training with a batch size of 16. During the first five training epochs, we linearly increased the learning rate from zero to the initial value of 0.01.
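The lung-window step can be sketched in numpy as below; the paper specifies only the window bounds (-1200 to 600 HU), so the final [-1, 1] output range is our assumption:

```python
import numpy as np

def lung_window(ct_hu, lo=-1200.0, hi=600.0):
    """Clip a CT volume to the lung window used in the paper
    (-1200 to 600 HU) and rescale to [-1, 1]. The output range is our
    assumption; only the window bounds come from the text."""
    clipped = np.clip(np.asarray(ct_hu, dtype=np.float32), lo, hi)
    return (clipped - lo) / (hi - lo) * 2.0 - 1.0
```

Clipping discards intensities irrelevant to lung tissue (dense bone above 600 HU, air below -1200 HU) so that the network's input dynamic range is spent on the diagnostically useful interval.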
This learning-rate warm-up heuristic proved helpful, since using a large learning rate at the very beginning of training may result in numerical instability 43 . We then used cosine annealing 44 to decrease the learning rate to zero over the remaining 95 epochs (100 epochs in total). During both the local and federated training processes, we utilised five-fold cross-validation on the trainval split, selected the best model, and reported its test performance (in Fig. 4 and Supplementary Fig. 2 ). At the central server, we adapted the FedAvg 33 algorithm to aggregate the updated model parameters from all clients (i.e., UCADI participants), that is, to combine the weights with respect to the clients' dataset sizes and the number of local training epochs between consecutive communications. To ensure secure transmission between the server and the clients, we used an encryption method called "Learning with Errors" (LWE) 36 to protect all the transmitted information (i.e., model parameters and metadata). LWE is an additively homomorphic public-key encryption scheme; participant information therefore cannot leak even to the server, which has no access to the explicit model weights. Compared with other privacy-preserving techniques, such as differential privacy (DP) 45 , Moving Horizon Estimation (MHE) 46 and Model Predictive Control (MPC) 47 , LWE is distinguished by enabling the clients to achieve performance identical to variants trained without encryption. However, the LWE method adds costs to the federated learning framework in terms of the extra encryption/decryption process and the increased size of the encrypted parameters during transmission.
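The server-side aggregation rule can be written in a few lines; FedAvg itself is the cited algorithm, but folding the local-epoch factor into the size weights, as done here for brevity, is our simplification:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg aggregation (McMahan et al.): a dataset-size-weighted
    average of client parameter vectors. The paper additionally weights
    by the number of local epochs between communications; folding that
    factor into `client_sizes` gives the same arithmetic."""
    total = float(sum(client_sizes))
    agg = np.zeros_like(np.asarray(client_weights[0], dtype=float))
    for w, s in zip(client_weights, client_sizes):
        agg += (s / total) * np.asarray(w, dtype=float)
    return agg
```

For example, a client holding twice as much data pulls the global model twice as strongly toward its local update, which is why the uneven NCCID cohort was pooled into a single participant rather than treated as 18 tiny clients.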
The typical time usage of a single encryption-decryption round is 2.7 s (averaged over 100 trials in a test environment with a single CPU, Intel Xeon E5-2630 v3 @ 2.40 GHz), and encryption grows the model size from 2.8 MB to 62 MB, which increases the transmission time from 3.1 s to 68.9 s in a typical international bandwidth environment 48 of 900 KB/s (Fig. 5 ). - We further conducted a comparative study on this four-type classification between the CNN model and expert radiologists. We asked six qualified radiologists (an average of 9 years of clinical experience, ranging from 4 to 18 years) from the Tongji Hospital Group to make diagnoses based on the CTs. We provided the radiologists with the CTs and their labels from the China-derived trainval split. We then asked them to diagnose each CT from the test split into one of the four classes. We reported the performance of each single radiologist and the majority vote on the COVID-19 vs non-COVID-19 CTs in Fig. 4 (detailed comparisons are presented in Supplementary Tables 5 and 9 ). When the vote was tied between classes, the radiologist panel discussed further until reaching a consensus. Following procedures similar to previous work 21 , in Supplementary Table 3 we reported the test performance of these trained models on the non-contrast and contrast CTs respectively. We observed that augmenting the non-contrast CTs with CycleGAN resulted in a better identification ability of the model, while this did not hold when converting the non-contrast CTs into contrast ones. The clinical data collected from the 23 hospitals and utilised in this study remain under their custody. Part of the data is available to qualified teams via application. Please refer to the NCCID website (https://www.nhsx.nhs.uk/covid-19-response/data-and-covid-19/national-covid-19-chest-imaging-databasenccid/) for more details. - The online application to join UCADI is provided at http://www.covid-ct-ai.team.
Codes are publicly available at: https://github.com/HUST-EIC-AI-LAB/UCADI (ref 50). We don't have tables in the main text.
References
1. Correlation of Chest CT and RT-PCR Testing for Coronavirus Disease 2019 (COVID-19) in China: A Report of 1014 Cases
2. Sensitivity of chest CT for COVID-19: Comparison to RT-PCR
3. Essentials for radiologists on COVID-19: An update-radiology scientific expert panel
4. Variation in False-Negative Rate of Reverse Transcriptase Polymerase Chain Reaction-Based SARS-CoV-2 Tests by Time Since Exposure
5. Massively multiplexed nucleic acid detection with Cas13
6. CRISPR-Cas12a target binding unleashes indiscriminate single-stranded DNase activity. Science
7. CRISPR-Cas12-based detection of SARS-CoV-2
8. Diagnostic performance between CT and initial real-time RT-PCR for clinically suspected 2019 coronavirus disease (COVID-19) patients outside Wuhan, China
9. Diagnostics for SARS-CoV-2 detection: A comprehensive review of the FDA-EUA COVID-19 testing landscape
10. Artificial intelligence-enabled rapid diagnosis of patients with COVID-19
11. The role of chest radiography in confirming covid-19 pneumonia
12. Clinical characterization and chest CT findings in laboratory-confirmed COVID-19: a systematic review and meta-analysis. medRxiv
13. Radiological findings from 81 patients with COVID-19 pneumonia in Wuhan, China: a descriptive study
14. Chest CT findings in 2019 novel coronavirus (2019-nCoV) infections from Wuhan, China: Key points for the radiologist. Radiology
15. CT imaging features of 2019 novel coronavirus (2019-nCoV)
16. Performance of Radiologists in Differentiating COVID-19 from Non-COVID-19 Viral Pneumonia at Chest CT
17. Artificial intelligence for the detection of COVID-19 pneumonia on chest CT using multinational datasets
18. Federated Learning: Strategies for Improving Communication Efficiency
19. Towards Federated Learning at Scale: System Design
20. National COVID-19 Chest Image Database (NCCID)
21. Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks
22. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks
23. Breaking the cycle: colleagues are all you need
24. Clinical and radiological findings of adult hospitalized patients with community-acquired pneumonia from SARS-CoV-2 and endemic human coronaviruses
25. Using artificial intelligence to detect COVID-19 and community-acquired pneumonia based on pulmonary CT: evaluation of the diagnostic accuracy
26. Diagnosis and treatment of community-acquired pneumonia in adults: 2016 clinical practice guidelines by the Chinese Thoracic Society
27. Would mega-scale datasets further enhance spatiotemporal 3D CNNs? arXiv preprint
28. Xception: Deep learning with depthwise separable convolutions
29. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
30. AI for radiographic COVID-19 detection selects shortcuts over signal
31. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography
32. European Commission. Reform of EU data protection rules
33. Communication-efficient learning of deep networks from decentralized data
34. Swarm Learning for decentralized and confidential clinical machine learning
35. End-to-end privacy preserving deep learning on multi-institutional medical imaging
36. Privacy-Preserving Deep Learning via Additively Homomorphic Encryption
37. Prevalence and predictors of general psychiatric disorders and loneliness during COVID-19 in the United Kingdom
38. The unfolding COVID-19 pandemic: A probability-based, nationally representative study of mental health in the United States
39. Densely connected convolutional networks
40. DropBlock: A regularization method for convolutional networks
41. Dropout: A simple way to prevent neural networks from overfitting
42. Batch normalization: Accelerating deep network training by reducing internal covariate shift
43. Bag of tricks for image classification with convolutional neural networks
44. SGDR: Stochastic gradient descent with warm restarts
45. The algorithmic foundations of differential privacy
46. Moving Horizon Observers and Observer-Based Control
47. Model predictive control: Theory and practice. A survey
48. International Telecommunication Union. Yearbook of Statistics, Telecommunication/ICT Indicators
49. Deep residual learning for image recognition
50. COVID-19 Diagnosis With Federated Learning
Supplementary Table 9 | Confusion matrices of locally/federatively trained models.
Supplementary Table 3 | COVID-19 pneumonia identification performance of CNN models trained on the contrast and non-contrast splits of the NCCID dataset (UK). "Real + Synthetic" means the training and validation images include the original split (non-contrast/contrast) as well as the images synthesized from their counterpart (contrast/non-contrast) via CycleGAN. We conduct no modification on the test set.
Such inconsistency is undesirable for any clinical decision system. Therefore, there is an urgent demand for an accurate and automated method to help address the deficiencies of current CT-based approaches. Successful development of such a method relies on a sufficient amount of data accompanied by precise annotations. We identified three data-related challenges in developing a robust and generalised AI model for CT-based COVID-19 identification: (i) Incompleteness. The high-quality CTs used for training were only a small subset of the entire cohort and are therefore unlikely to cover the complete set of radiological features useful for identification. (ii) Isolation. CTs acquired across multiple centres are difficult to transfer for training due to security and privacy concerns, while a locally trained model may not generalise to, or be improved by, data collected at other sites. (iii) Heterogeneity. Owing to differing acquisition protocols (e.g., contrast agents and reconstruction kernels), even CTs collected within a single hospital are not well standardised; it is therefore challenging to train a precise model on a simple combination of the data 17 . Furthermore, it remains an open question whether COVID-19 patients from diverse geographies and varying demographics show similar or distinct patterns. All these challenges impede the development of a well-generalised AI model and, thus, of a global intelligent clinical solution. It is worth noting that these challenges arise in essentially all attempts to apply AI models in clinical practice, not only those related to COVID-19. To tackle these problems, we launched the Unified CT-COVID AI Diagnostic Initiative (UCADI; Fig. 1 and 2). 
It was developed based on the concept of federated learning 18, 19 , which enables machine learning engineers and clinical data scientists to collaborate seamlessly without sharing patient data. Thus, in UCADI, every participating institution can benefit from, and contribute to, the continuously evolving AI model, helping deliver ever more precise diagnoses for COVID-19 and beyond. Training an accurate AI model requires comprehensive data collection. Therefore, we first gathered, screened, and anonymised the chest CTs at each UCADI participating institute (5 hospitals in China and 18 hospitals in the UK), comprising a total of 9,573 CTs from 3,336 cases. We summarised the demographics and diagnoses of the cohort in Supplementary Tables 1 and 2. Developing an accurate diagnostic model requires a sufficient amount of high-quality data. Consequently, we identified the three branches of the Wuhan Tongji Hospital Group (Main Campus, Optical Valley and Sino-French) and the National COVID-19 Chest Imaging Database (NCCID) 20 as individual UCADI participants. Each site contains adequate high-quality CTs for the development of the 3D CNN model. We used 80% of the data for training and validation (trainval) and the remaining 20% for testing. Additionally, we utilised the CTs collected from Tianyou Hospital and Wuhan Union Hospital as hold-out test sets. We consistently used the same partition in both the local and federated training processes for a fair comparison. NCCID is an initiative established by NHSX, providing a large collection of CT and CXR studies of COVID-19 and non-COVID-19 patients from over 18 partner hospitals in the UK. Since data quantity and categorical distribution are quite uneven across hospitals, we pooled all the CTs and treated the entire NCCID cohort as a single participant. Unlike the CTs procured from China, which are all non-contrast, around 80% of the CTs from NCCID were acquired with contrast materials (e.g., iodine). 
Contrast materials are usually utilised to block X-rays and appear with higher attenuation on CTs, which helps emphasise tissues such as blood vessels and intestines (Supplementary Fig. 1 and Table 3 ). However, in practice, we found that a simple combination of the contrast and non-contrast CTs did not support the training of a well-generalised model, owing to the intrinsic differences induced by the acquisition procedures 21 . Therefore, to overcome the data heterogeneity between the contrast and non-contrast CTs in the NCCID, we applied an unpaired image-to-image translation method called CycleGAN 22 to transform the contrast CTs into non-contrast variants as augmentations during local model training. In Supplementary Table 4, we compared CycleGAN with two other recent image translation methods (CouncilGAN 23 and ACL-GAN 22 ). We showed that the model trained on CycleGAN-transformed contrast CTs had the best performance when tested on the non-contrast CTs. However, this modality transformation is not always helpful, as the performance degraded when training on the raw plus translated contrast CTs. Building on this massive cohort collection, we developed a densely connected 3D convolutional neural network (CNN) model to deliver precise diagnoses with AI approaches. We term it 3D-DenseNet and report its architectural design and training optimisations in the Methods and Supplementary Fig. 2 . We examined the predictive power of 3D-DenseNet on a four-class pneumonia classification task as well as on COVID-19 identification. In the first task, we aimed to distinguish COVID-19 ( Fig. 3a, Supplementary Fig. 3 and Table 5 ) from healthy cases and two other pneumonia types, namely non-COVID-19 viral and bacterial pneumonia (Fig. 3b ). 
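The "real + synthetic" training recipe described above can be sketched as follows. This is a schematic of the data flow only: `translate_to_noncontrast` is a hypothetical stand-in for the trained CycleGAN generator (reduced to a stub here), and the scan records are toy dictionaries rather than CT volumes.

```python
def translate_to_noncontrast(ct):
    """Placeholder for the trained CycleGAN generator G: contrast -> non-contrast.
    In the real pipeline this would run the CT volume through G; here it only
    relabels the record so the data flow can be traced."""
    return {**ct, "modality": "non-contrast", "synthetic": True}

def build_training_set(scans):
    """'Real + Synthetic' split: keep the non-contrast CTs as-is and add
    CycleGAN-translated versions of the contrast CTs."""
    real = [ct for ct in scans if ct["modality"] == "non-contrast"]
    synthetic = [translate_to_noncontrast(ct)
                 for ct in scans if ct["modality"] == "contrast"]
    return real + synthetic

# Toy cohort: 8 contrast scans, 2 non-contrast scans
scans = ([{"id": i, "modality": "contrast"} for i in range(8)] +
         [{"id": i, "modality": "non-contrast"} for i in range(8, 10)])
train = build_training_set(scans)
```

With real volumes, the generator would be trained on unpaired contrast/non-contrast scans; the test set is left untouched, matching the evaluation protocol above.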
We preferred a four-class taxonomy since further distinguishing COVID-19 from community-acquired pneumonia (CAP) 24, 25 , of which bacterial and viral infections are the two primary pathogen classes 26 , can help deliver more appropriate clinical treatments (Fig. 2c) . However, given that different institutions follow various annotation protocols, it is more feasible for the model to learn to discriminate COVID-19 from all non-COVID-19 cases. Therefore, we base the experimental results in the main text on this two-category classification. We report the four-class experiments based on the Wuhan Tongji Hospital Group's cohort in Supplementary Fig. 3 and Table 5 . In Supplementary Tables 6 and 7, we further compared 3D-DenseNet with two other 3D CNN baseline models: 3D-ResNet 27 and 3D-Xception 28 . We demonstrated that 3D-DenseNet had better performance and a smaller size, making it highly suitable for federated learning. To interpret the learned features of the model, we performed gradient-weighted class activation mapping (GradCAM) 29 analysis on the CTs from the test set. We visualised the featured regions that lead to identification decisions. We found that the generated heatmaps (Fig. 3c) primarily characterised local lesions that highly overlap with the radiologists' annotations, suggesting the model is capable of learning robust radiological features rather than simply overfitting 30 . These heatmaps can help radiologists localise lesions more quickly when delivering diagnoses in an actual clinical environment. Moreover, localising the lesions also provides guidance for further CT acquisition and clinical testing. A similar idea has been described as "region-of-interest (ROI) detection" in a previous similar study 31 . To examine the cross-domain generalisation ability of the locally trained models, we tested China's locally trained model on Britain's test set and vice versa. We report the numerical results in Fig. 4 . 
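The GradCAM computation described above reduces to weighting each feature channel by its spatially averaged gradient and rectifying the weighted sum. A minimal sketch on toy 2D activations (not the model's real 3D feature maps):

```python
def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap from one conv layer.

    feature_maps, gradients: per-channel 2D lists standing in for the
    activations A^k and the gradients dY/dA^k of the predicted class score.
    Each channel weight alpha_k is the spatially averaged gradient; the
    heatmap is the ReLU of the alpha-weighted sum over channels.
    """
    heatmap = None
    for A, dA in zip(feature_maps, gradients):
        cells = [g for row in dA for g in row]
        alpha = sum(cells) / len(cells)          # global-average-pooled gradient
        if heatmap is None:
            heatmap = [[0.0] * len(A[0]) for _ in A]
        for i, row in enumerate(A):
            for j, a in enumerate(row):
                heatmap[i][j] += alpha * a
    return [[max(0.0, v) for v in row] for row in heatmap]  # ReLU

# Two toy 2x2 channels: the second has negative gradients, so it is suppressed
maps = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 1.0], [1.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
cam = grad_cam(maps, grads)
```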
However, due to incompleteness, isolation, and heterogeneity in the various data sources, we found that all the locally trained models exhibited less-than-ideal test performance on other sources. Specifically, the model trained on NCCID non-contrast CTs had a sensitivity/specificity/AUC of 0.313/0.907/0.745 in identifying COVID-19 on the Chinese test set, lower than the locally trained ones, and vice versa. Next, we describe how we incorporated federated learning for cross-continent privacy-preserving collaboration on training a generalised AI diagnostic model, mitigating the domain gaps and data heterogeneity. -Enabling multinational privacy-preserving collaboration with federated learning We developed a federated learning framework to facilitate the collaboration nested under UCADI and NCCID, integrating diverse cohorts as part of a global joint effort to develop a precise and robust AI diagnostic tool. In traditional data science approaches 17, 31 , sensitive and private data from different sources are directly gathered and transported to a central hub where the models are deployed. However, such procedures are infeasible in real clinical practice; hospitals are usually reluctant (and often not permitted) to disclose data due to privacy concerns and legislation 32 . The federated learning technique proposed by Google 33 , in contrast, is an architecture where the AI model is distributed to and executed at each host institution without data centralisation. Furthermore, transmitting only the model parameters effectively reduces the latency and cost associated with sending large amounts of data over the internet. More importantly, this privacy-preserving-by-design strategy enables medical centres to collaborate on developing models without sharing sensitive clinical data with other institutions. Recently, Swarm Learning 34 was proposed to decentralise models via edge computation. 
However, we conjecture that it is not yet mature for privacy-preserving machine learning 35 applications built on massive data collections and many participants, owing to the exponential increase in computation. In UCADI, we have provided: (i) an online diagnostic interface allowing people to query the diagnostic results on identifying COVID-19 by uploading their chest CTs; (ii) a federated learning framework that enables UCADI participants to collaboratively contribute to improving the AI model for COVID-19 identification. During the collaborative training process, each UCADI participant sends its model weights back to the server every few iterations via a customised protocol. To further mitigate the potential for data leaks during this transmission, we applied an additively homomorphic encryption method based on Learning with Errors (LWE) 36 to encrypt the transmitted model parameters. By so doing, participants keep their data within their own infrastructure, with the central server having no access whatsoever. After receiving the transmitted packages from the UCADI participants, the central server aggregates the global model without comprehending the model parameters of any individual participant. The updated global model is then distributed to all participants, again using LWE encryption, enabling the continuation of model optimisation at the local level. Our framework is designed to be highly flexible, allowing dynamic participation and breakpoint resumption (detailed in Methods). With this framework, we deployed the same experimental configurations to validate the federated learning concept for developing a generalised CT-based COVID-19 diagnostic model (detailed in Methods). We compared the test sensitivity and specificity of the federated model to the locally trained variants ( Fig. 4) . We plotted the ROC curves and calculated the corresponding AUC scores, along with 95% confidence intervals (CI) and p-values, to validate the model's performance (Fig. 4) . 
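The LWE-based scheme used above is a full additively homomorphic cryptosystem. As a toy stand-in that illustrates only the additive property the aggregation relies on, the sketch below uses zero-sum random masks: the server can recover the sum of the client updates but cannot read any individual update. This is not the paper's LWE scheme, just a minimal demonstration of aggregation without access to plaintext weights.

```python
import random

def make_zero_sum_masks(n_clients, n_params, seed=0):
    """Random per-client masks constructed so that they sum to zero."""
    rng = random.Random(seed)
    masks = [[rng.uniform(-1, 1) for _ in range(n_params)]
             for _ in range(n_clients - 1)]
    masks.append([-sum(col) for col in zip(*masks)])  # last mask cancels the rest
    return masks

updates = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]        # plaintext client weights
masks = make_zero_sum_masks(len(updates), 2)

# Each client sends only its masked update to the server
ciphertexts = [[w + m for w, m in zip(u, mk)] for u, mk in zip(updates, masks)]

# The server adds the masked updates; the masks cancel, leaving only the sum
aggregate = [sum(col) for col in zip(*ciphertexts)]
```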
As confirmed by the curves and numbers, the federated model, when applied to the test set of the NCCID (from 18 UK hospitals), vastly outperformed all the locally trained models. We based the performance measure on the CT level rather than the patient level, consistent with the prior study 31 . We illustrated that the federated framework is an effective solution to the problem that medical data from hospitals worldwide cannot be centralised due to privacy concerns and legislation. We further conducted a comparative study on the same task with a panel of expert radiologists. Six qualified radiologists from the Department of Radiology, Wuhan Tongji Hospital (Main Campus), with an average of nine years' experience, were asked to assign each CT from China to one of the four classes. The six experts were first asked to provide diagnoses individually, and then to reach integrated diagnostic opinions via majority vote (consensus) in a plenary meeting. We presented the radiologists and AI models with the same data partition for a fair comparison. In differentiating COVID-19 from non-COVID-19 cases, the six radiological experts obtained an average sensitivity of 0.79 (0.88, 0.90, 0.55, 0.80, 0.68, 0.93, respectively) and an average specificity of 0.90 (0.92, 0.97, 0.89, 0.95, 0.88, 0.79, respectively). In practice, clinical decisions are usually made by consensus among experts. Here, we use the majority vote among the six expert radiologists to represent such a decision-making process. We provide the detailed diagnostic decisions of each radiologist in Supplementary Table 5 . We found that the majority vote helps reduce potential bias and risk: the aggregated diagnoses achieved the best performance among the individual radiologists. In Fig. 
4a, we plotted the majority votes as blue markers (sensitivity/specificity: 0.900/0.956) and noted that the federatively trained 3D-DenseNet showed comparable performance (sensitivity/specificity: 0.973/0.951) to the expert panel. We further presented and discussed the models' performance on the hold-out test sets (645 cases from Wuhan Tianyou Hospital and 506 cases from Wuhan Union Hospital) in Supplementary Table 8 . We showed that the federatively trained model also performed better on these two hold-out datasets, although its confidence is sometimes not well calibrated. During the federated training process, each participant is required to synchronise the model weights with the server every few training epochs using web sockets. Intuitively, more frequent communication should lead to better performance; however, each synchronisation adds extra time. To investigate the trade-off between model performance and communication cost during federated training, we conducted parallel experiments with the same settings but different numbers of training epochs between consecutive synchronisations. We report the models' subsequent test performance in Fig. 5a and time usage in Fig. 5b . We observed that, as expected, more frequent communication leads to better performance. Compared with the least frequent communication scenario (downloading the model at the beginning and training locally without intermediate communication), synchronising at every epoch achieves the best test performance with less than a 20% increase in time usage. COVID-19 is a global pandemic. Over 200 million people have been infected worldwide, with hundreds of thousands hospitalised and mentally affected 37, 38 , and, as of October 2021, more than four million are reported to have died. There are borders between countries, yet the only barrier is the boundary between humankind and the virus. We urgently need a global joint effort to confront this illness effectively. 
In this study, we introduced a multinational collaborative AI framework, UCADI, to assist radiologists in streamlining and accelerating CT-based COVID-19 diagnoses. First, we developed a new CNN model that achieved performance comparable to expert radiologists in identifying COVID-19. The predictive diagnoses can be used as references, while the generated heatmap helps with faster lesion localisation and further CT acquisition. We then formed a federated learning framework to enable the global training of a CT-based model for precise and robust diagnosis. With CT data from 22 hospitals, we have herein confirmed the effectiveness of the federated learning approach. We have shared the trained model and open-sourced the federated learning framework. It is worth mentioning that our proposed framework evolves continually and is not confined to the diagnosis of COVID-19; it also provides infrastructure for future use. Uncertainty and heterogeneity are characteristic of clinical work. Because medical understanding of the vast majority of diseases, including their pathogenesis, pathological processes, and treatment, remains limited, the medical characteristics of diseases can be studied by means of AI. Along this avenue, research on large (and sometimes isolated) samples becomes more instructive and convenient, and is especially suitable for transferring knowledge when studying emerging diseases. However, certain limitations are not well addressed in this study. The first is the potential bias in the comparison between experts and models. Due to legislation, it is infeasible to disclose the UK medical data to radiologists and researchers in China, or vice versa. Thus, the radiologists are all from nearby institutions. Although their diagnostic decisions differ considerably, it would be unrealistic to conclude that our setting and evaluation process eliminate all biases. The second concerns engineering efforts. 
Although we developed mechanisms such as dynamic participation and breakpoint resumption, participants still occasionally dropped out of the federated training process owing to unstable internet connections. Also, the computational efficiency of the 3D CNN model still has room for improvement (Supplementary Table 7 ). There are always engineering advancements that can be incorporated to refine the framework. In this section, we first describe how we constructed the dataset, then discuss the details of our implementation for collaboratively training the AI model, and finally provide further analysis of our methods. Additionally, we collected independent cohorts comprising 507 COVID-19 cases from Wuhan Union Hospital and 645 COVID-19 cases from Wuhan Tianyou Hospital. These hold-out test sets were used for testing the generalisation of the locally trained models as well as of the federated model. Since these data sources only contained COVID-19 cases, we did not utilise them during the training process. We also summarised and reported the demographic information (i.e., gender and age) of the cohort in Supplementary Table 1. - For the 2,682 CTs acquired from the 18 partner hospitals located in the United Kingdom (see Supplementary Table 3 ), the scanners used include the Toshiba Aquilion ONE/PRIME. Settings such as filter sizes, slice thickness and reconstruction protocols are also quite diverse among these CTs. This might explain why the NCCID locally trained model failed to perform as well as the Chinese locally trained variant (see Fig. 4c ). Regarding the material differences, 2,145 of the 2,682 CTs were taken after the injection of an iodine contrast agent (i.e., contrast CTs); as pointed out by a previous study 21 , such contrast CTs differ intrinsically from non-contrast ones owing to the acquisition procedures. We also noticed that a small subset of the CTs only contained partial lung regions; we removed these insufficient CTs, whose number of slices was less than 40. 
As for our selection criteria in this regard, although partial lung scans might be infeasible for training segmentation or detection models, we believe that a sufficient number of slices is enough to ensure the model effectively captures the requisite features and thereby supports precise classification in medical diagnosis. We report the patient demographic information (i.e., gender and age) of the cohort in Supplementary Table 2 . However, the reported demographics are not complete, since the demographic attributes of the non-COVID-19 cases were not recorded. In comparison with the demographic information of the COVID-19 cases acquired from China, the COVID-19 cases in the UK had a higher average age and a larger proportion of male patients. These demographic differences might also explain why the UK locally trained model failed to perform well when applied to the CTs acquired from China. -Data pre-processing, model architecture and training setting We pre-processed the raw acquired CTs for standardisation as well as to reduce the burden on computing resources. We utilised an adaptive sampling method to select 16 slices from all sequential images of a single CT case, using random starting positions and scalable transversal intervals. During the training and validation process, we sampled once for each CT study, while in testing we repeated the sampling five times independently to obtain five different subsets. We then standardised the sampled slices by removing the channel-wise offsets and rescaling the variation to uniform units. During testing, the five independent subsets of each case were fed to the trained CNN classifier to obtain the prediction probabilities of the four classes. We then averaged the predictive probabilities over these five runs to make the final diagnostic prediction for that case. By so doing, we effectively include information from different levels of the lung while retaining scalable computation. 
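One plausible reading of the adaptive slice sampling above can be sketched as follows; the exact interval formula is not spelled out in the text, so `n_slices // n_keep` is an assumption:

```python
import random

def sample_slice_indices(n_slices, n_keep=16, rng=random):
    """Pick n_keep slice indices from a CT stack: a random starting offset,
    then evenly spaced slices whose interval scales with the stack depth
    (the 'scalable transversal interval')."""
    interval = n_slices // n_keep
    start = rng.randrange(n_slices - interval * (n_keep - 1))
    return [start + k * interval for k in range(n_keep)]

# One training-time draw for a 120-slice volume; at test time this would be
# repeated five times and the predicted probabilities averaged.
idx = sample_slice_indices(120)
```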
To further improve computing efficiency, we utilised trilinear interpolation to resize each slice from 512 to 128 pixels along each axis and rescaled the lung windows to a range between -1200 and 600 Hounsfield units before feeding them into the network model. We named our model 3D-DenseNet ( Supplementary Fig. 2) . It was developed based on DenseNet 39 , a densely connected convolutional network model that performed remarkably well in classifying 2D images. To adapt this design to 3D CT representations, we customised the model architecture into fourteen 3D-convolution layers distributed across six dense blocks and two transmit blocks (insets of Supplementary Fig. 2 ). Each dense block consists of two 3D convolution layers and an inter-residual connection, whereas the transmit blocks are composed of a 3D convolution layer and an average pooling layer. We placed a 3D DropBlock 40 , instead of simple dropout 41 , before and after the six dense blocks, which proved more effective in regularising the training of convolutional neural networks. We set the momentum of batch normalisation 42 to 0.9, and the negative slope of the LeakyReLU activation to 0.2. During training, the 3D-DenseNet took the pre-processed CT slice sequences as input and output a prediction score over the four possible outcomes (pneumonia types). Due to the data imbalance, we defined the loss function as the weighted cross-entropy between the predicted probabilities and the true categorical labels. The weights were set as 0.2, 0.2, 0.4 and 0.2 for healthy, COVID-19, other viral pneumonia, and bacterial pneumonia cases, respectively. We utilised the SGD optimiser with a momentum of 0.9 to update the parameters of the network via backpropagation. We trained the networks using a batch size of 16. During the first five training epochs, we linearly increased the learning rate from zero to the initial set value of 0.01. 
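The learning-rate schedule, linear warm-up followed by cosine annealing to zero, can be sketched as a minimal function of the epoch index (a sketch of the stated recipe, not the paper's actual training code):

```python
import math

def learning_rate(epoch, base_lr=0.01, warmup=5, total=100):
    """Linear warm-up from 0 to base_lr over the first `warmup` epochs,
    then cosine annealing down towards zero over the remaining epochs."""
    if epoch < warmup:
        return base_lr * epoch / warmup
    progress = (epoch - warmup) / (total - warmup)   # 0 -> 1 over 95 epochs
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

schedule = [learning_rate(e) for e in range(100)]
```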
This learning-rate warm-up heuristic proved helpful, since using a large learning rate at the very beginning of training may result in numerical instability 43 . We then used cosine annealing 44 to decrease the learning rate to zero over the remaining 95 epochs (100 epochs in total). During both the local and federated training processes, we utilised five-fold cross-validation on the trainval split, then selected the best model and reported its test performance (in Fig. 4 and Supplementary Fig. 2 ). At the central server, we adapted the FedAvg 33 algorithm to aggregate the updated model parameters from all clients (i.e., UCADI participants), that is, to combine the weights with respect to the clients' dataset sizes and the number of local training epochs between consecutive communications. To ensure secure transmission between the server and the clients, we used an encryption method based on "Learning with Errors" (LWE) 36 to further protect all the transmitted information (i.e., model parameters and metadata). LWE is an additively homomorphic variant of a public-key encryption scheme, so participant information cannot leak even to the server; that is, the server has no access to the explicit weights of the model. Compared with other privacy-preserving methods, such as differential privacy (DP) 45 , moving horizon estimation (MHE) 46 and model predictive control (MPC) 47 , LWE distinguishes itself by enabling the clients to achieve performance identical to variants trained without encryption. However, the LWE method adds costs to the federated learning framework in terms of the extra encryption/decryption process and the increased size of the encrypted parameters during transmission. 
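The FedAvg-style aggregation described above can be sketched as below. This simplified version weights each client by dataset size only, whereas the actual framework also folds in the number of local epochs between communications (and operates on encrypted parameters):

```python
def fed_avg(client_weights, client_sizes):
    """Aggregate client models by a dataset-size-weighted average of each
    parameter; the server never needs the raw training data."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(n_params)]

# Two clients; the one holding 3x the data pulls the aggregate towards itself
global_w = fed_avg([[1.0, 0.0], [0.0, 1.0]], [300, 100])   # [0.75, 0.25]
```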
The typical time usage of a single encryption-decryption round is 2.7 s (averaged over 100 trials in a test environment consisting of a single CPU, an Intel Xeon E5-2630 v3 @ 2.40 GHz), and encryption increases the model size from 2.8 MB to 62 MB, which increases the transmission time from 3.1 s to 68.9 s in a typical international bandwidth environment 48 of 900 KB/s (Fig. 5 ). - We further conducted a comparative study on this four-type classification between the CNN model and expert radiologists. We asked six qualified radiologists (with an average of 9 years of clinical experience, ranging from 4 to 18 years) from the Tongji Hospital Group to make diagnoses based on the CTs. We provided the radiologists with the CTs and their labels from the China-derived trainval split. We then asked them to assign each CT from the test split to one of the four classes. We report the performance of each individual radiologist and of the majority vote on the COVID-19 vs non-COVID-19 CTs in Fig. 4 (detailed comparisons are presented in Supplementary Tables 5 and 9 ). If the votes are tied between different classes, the radiologist panel discusses further until reaching a consensus. Following similar procedures to previous work 21 , in Supplementary Table 3 we report the test performance of these trained models on the non-contrast and contrast CTs, respectively. We observed that augmenting the non-contrast CTs with CycleGAN resulted in better identification ability of the model, while this did not hold when converting the non-contrast CTs into contrast ones. The clinical data collected from the 23 hospitals utilised in this study remain under the hospitals' custody. Part of the data is available via application by qualified teams. Please refer to the NCCID website (https://www.nhsx.nhs.uk/covid-19-response/data-and-covid-19/national-covid-19-chest-imaging-databasenccid/) for more details. - The online application to join UCADI is provided at http://www.covid-ct-ai.team. 
Code is publicly available at https://github.com/HUST-EIC-AI-LAB/UCADI (ref. 50 ). We do not have tables in the main text.

References

1. Correlation of Chest CT and RT-PCR Testing for Coronavirus Disease 2019 (COVID-19) in China: A Report of 1014 Cases.
2. Sensitivity of chest CT for COVID-19: Comparison to RT-PCR.
3. Essentials for radiologists on COVID-19: An update - radiology scientific expert panel.
4. Variation in False-Negative Rate of Reverse Transcriptase Polymerase Chain Reaction-Based SARS-CoV-2 Tests by Time Since Exposure.
5. Massively multiplexed nucleic acid detection with Cas13.
6. CRISPR-Cas12a target binding unleashes indiscriminate single-stranded DNase activity. Science.
7. CRISPR-Cas12-based detection of SARS-CoV-2.
8. Diagnostic performance between CT and initial real-time RT-PCR for clinically suspected 2019 coronavirus disease (COVID-19) patients outside Wuhan, China.
9. Diagnostics for SARS-CoV-2 detection: A comprehensive review of the FDA-EUA COVID-19 testing landscape.
10. Artificial intelligence-enabled rapid diagnosis of patients with COVID-19.
11. The role of chest radiography in confirming covid-19 pneumonia.
12. Clinical characterization and chest CT findings in laboratory-confirmed COVID-19: a systematic review and meta-analysis. medRxiv.
13. Radiological findings from 81 patients with COVID-19 pneumonia in Wuhan, China: a descriptive study.
14. Chest CT findings in 2019 novel coronavirus (2019-nCoV) infections from Wuhan, China: Key points for the radiologist. Radiology.
15. CT imaging features of 2019 novel coronavirus (2019-nCoV).
16. Performance of Radiologists in Differentiating COVID-19 from Non-COVID-19 Viral Pneumonia at Chest CT.
17. Artificial intelligence for the detection of COVID-19 pneumonia on chest CT using multinational datasets.
18. Federated Learning: Strategies for Improving Communication Efficiency.
19. Towards Federated Learning at Scale: System Design.
20. National COVID-19 Chest Image Database (NCCID).
21. Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks.
22. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks.
23. Breaking the cycle - colleagues are all you need.
24. Clinical and radiological findings of adult hospitalized patients with community-acquired pneumonia from SARS-CoV-2 and endemic human coronaviruses.
25. Using artificial intelligence to detect COVID-19 and community-acquired pneumonia based on pulmonary CT: evaluation of the diagnostic accuracy.
26. Diagnosis and treatment of community-acquired pneumonia in adults: 2016 clinical practice guidelines by the Chinese Thoracic Society.
27. Would mega-scale datasets further enhance spatiotemporal 3D CNNs? arXiv preprint.
28. Xception: Deep learning with depthwise separable convolutions.
29. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization.
30. AI for radiographic COVID-19 detection selects shortcuts over signal.
31. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography.
32. European Commission. Reform of EU data protection rules.
33. Communication-efficient learning of deep networks from decentralized data.
34. Swarm Learning for decentralized and confidential clinical machine learning.
35. End-to-end privacy preserving deep learning on multi-institutional medical imaging.
36. Privacy-Preserving Deep Learning via Additively Homomorphic Encryption.
37. Prevalence and predictors of general psychiatric disorders and loneliness during COVID-19 in the United Kingdom.
38. The unfolding COVID-19 pandemic: A probability-based, nationally representative study of mental health in the United States.
39. Densely connected convolutional networks.
40. DropBlock: A regularization method for convolutional networks.
41. Dropout: A simple way to prevent neural networks from overfitting.
42. Batch normalization: Accelerating deep network training by reducing internal covariate shift.
43. Bag of tricks for image classification with convolutional neural networks.
44. SGDR: Stochastic gradient descent with warm restarts.
45. The algorithmic foundations of differential privacy.
46. Moving Horizon Observers and Observer-Based Control.
47. Model predictive control: Theory and practice - a survey.
48. International Telecommunication Union. Yearbook of Statistics, Telecommunication/ICT Indicators.
49. Deep residual learning for image recognition.
50. COVID-19 Diagnosis With Federated Learning.

Supplementary Table 9 | Confusion matrices of locally/federatively trained models.

Table 3 | COVID-19 pneumonia identification performance of CNN models trained on the contrast and non-contrast splits of the NCCID dataset (UK). "Real + Synthetic" means that the training and validation images include the original split (non-contrast/contrast) as well as images synthesised from its counterpart (contrast/non-contrast) via CycleGAN. The test set is unmodified.