key: cord-1010326-wfoxck7k authors: Warnat-Herresthal, Stefanie; Schultze, Hartmut; Shastry, Krishnaprasad Lingadahalli; Manamohan, Sathyanarayanan; Mukherjee, Saikat; Garg, Vishesh; Sarveswara, Ravi; Händler, Kristian; Pickkers, Peter; Aziz, N. Ahmad; Ktena, Sofia; Siever, Christian; Kraut, Michael; Desai, Milind; Monnet, Bruno; Saridaki, Maria; Siegel, Charles Martin; Drews, Anna; Nuesch-Germano, Melanie; Theis, Heidi; Netea, Mihai G.; Theis, Fabian; Aschenbrenner, Anna C.; Ulas, Thomas; Breteler, Monique M.B.; Giamarellos-Bourboulis, Evangelos J.; Kox, Matthijs; Becker, Matthias; Cheran, Sorin; Woodacre, Michael S.; Goh, Eng Lim; Schultze, Joachim L. title: Swarm Learning as a privacy-preserving machine learning approach for disease classification date: 2020-06-29 journal: bioRxiv DOI: 10.1101/2020.06.25.171009 sha: 704744e561cd3cb41e50bd132b64635ffd841c25 doc_id: 1010326 cord_uid: wfoxck7k Identification of patients with life-threatening diseases including leukemias or infections such as tuberculosis and COVID-19 is an important goal of precision medicine. We recently illustrated that leukemia patients are identified by machine learning (ML) based on their blood transcriptomes. However, there is an increasing divide between what is technically possible and what is allowed because of privacy legislation. To facilitate integration of any omics data from any data owner world-wide without violating privacy laws, we here introduce Swarm Learning (SL), a decentralized machine learning approach uniting edge computing, blockchain-based peer-to-peer networking and coordination as well as privacy protection without the need for a central coordinator thereby going beyond federated learning. Using more than 14,000 blood transcriptomes derived from over 100 individual studies with non-uniform distribution of cases and controls and significant study biases, we illustrate the feasibility of SL to develop disease classifiers based on distributed data for COVID-19, tuberculosis or leukemias that outperform those developed at individual sites. Still, SL completely protects local privacy regulations by design. We propose this approach to noticeably accelerate the introduction of precision medicine. Fast and reliable detection of patients with severe illnesses is a major goal of precision 70 medicine 1 . The measurement of molecular phenotypes for example by omics technologies 2 71 and the application of sophisticated bioinformatics including artificial intelligence (AI) 72 approaches 3-7 opens up the possibility for physicians to utilize large-scale data for diagnostic 73 purposes in an unprecedented way. Yet, there is an increasing divide between what is 74 points, we introduce the concept of Swarm Learning (SL). SL combines decentralized 120 hardware infrastructures, distributed ML technique based on standardized AI engines with a 121 permissioned blockchain to securely onboard members, dynamically elect the leader among 122 the members, and merge model parameters. All processes are orchestrated by an SL library 123 and an iterative learning procedure applying AI solutions to compute problems with 124 decentralized private data. 125 Medicine is a prime example to illustrate the advantages of this AI approach. 
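For illustration, the iterative procedure can be outlined in a few lines of Python. This is a conceptual sketch only, not the SL library implementation; the helper names (train_local, elect_leader, merge_parameters, swarm_round) are placeholders, and the per-node models are assumed to expose Keras-style get_weights/set_weights methods.

import numpy as np

def train_local(model, data, n_batches):
    # Placeholder: each node updates its own model copy on private, local data only.
    return model

def elect_leader(node_ids):
    # Placeholder for the blockchain smart contract that dynamically elects the leader.
    return np.random.choice(node_ids)

def merge_parameters(param_sets):
    # Merge peer parameters layer by layer; SL supports e.g. average, weighted average,
    # minimum, maximum or median as merge functions (plain average shown here).
    return [np.stack(layers).mean(axis=0) for layers in zip(*param_sets)]

def swarm_round(models, datasets, sync_batches):
    # One synchronization round: local training, leader election, merging, redistribution.
    for node_id, model in models.items():
        train_local(model, datasets[node_id], sync_batches)
    leader = elect_leader(list(models))            # no central parameter server is involved
    merged = merge_parameters([m.get_weights() for m in models.values()])
    for model in models.values():                  # every node continues from the merged model
        model.set_weights(merged)
    return leader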
Without any 126 doubt, numerous medical features including radiograms or computed tomographies, 127 proteomes, metagenomes or microbiomes derived from body fluids including nasal or throat 128 swaps, blood, urine or stool are all excellently suitable medical data for the development of AI-129 based diagnostic or outcome prediction classifiers. We here chose to evaluate the cellular 130 compartment of peripheral blood, either in form of peripheral blood mononuclear cells (PBMC) 131 or whole blood-derived transcriptomes, since blood-derived transcriptomes include important 132 information about the patients' immune response during a certain disease, which in itself is an 133 important molecular information 42, 43 . In other words, in addition to the use of blood-derived 134 high-dimensional molecular features for a diagnostic or outcome classification problem, blood 135 transcriptomes could be further utilized in the clinic to systematically characterize ongoing 136 pathophysiology, predict patient-specific drug targets and trigger additional studies targeting 137 defined cell types or molecular pathways, making this feature space even more attractive to 138 answer a wide variety of medical questions. Here, we illustrate that newly generated blood 139 transcriptome data together with data derived from more than 14,000 samples in more than 140 100 studies combined with AI-based algorithms in a Swarm Learning environment can be 141 successfully applied in real-world scenarios to detect patients with leukemias, tuberculosis or 142 active COVID-19 disease in an outbreak scenario across distributed datasets without the 143 necessity to negotiate and contractualize data sharing. infrastructure is sufficiently available locally, ML can be performed locally ('at the edge') ( Fig. 152 1a). However, often medical data are not sufficiently large enough locally and similar 153 approaches are performed at different locations in a disconnected fashion. These limitations 154 have been overcome by cloud computing where data are moved centrally to perform training 155 of ML algorithms in a centralized compute environment (Fig. 1b) . Compared to local 156 approaches, cloud computing can significantly increase the amount of data for training ML 157 algorithms and therefore significantly improve their results 26 . However, cloud computing has 158 other disadvantages such as data duplication from local to central data storage, increased 159 data traffic and issues with locally differing data privacy and security regulations 46 . As an 160 alternative, federated cloud computing approaches such as Google's federated learning 38 and 161 Facebook's elastic averaging SGD (Deep learning with Elastic Averaging SGD, 162 http://papers.neurips.cc/paper/5761-deep-learning-with-elastic-averaging-sgd.pdf) have been 163 developed. In these models, dedicated parameter servers are responsible for aggregating and 164 distributing local learning (Fig. 1c) . A disadvantage of such star-shaped system architectures 165 is the remainder of a central structure, which hampers implementation across different 166 jurisdictions and therefore still requires the respective legal negotiations. Furthermore, the risk 167 for a single point of failure at the central structure reduces fault-tolerance. 168 In an alternative model, which we introduce here as Swarm Learning (SL), we dismiss the 169 dedicated server and allow parameters and models to be shared only locally (Fig. 1d) . 
While 170 parameters are shared via the swarm network, the models are built independently on private 171 data at the individual sites, here referred to as swarm edge nodes (short 'nodes') ( Fig. 1e) . SL 172 provides security measures to guarantee data sovereignty, security and privacy realized by a 173 private permissioned blockchain technology which enables different organizations or consortia 174 to efficiently collaborate (Fig. 1f) . In a private permissioned blockchain network, each 175 participant is well defined and only pre-authorized participants can execute the transactions. 176 Hence, they use computationally inexpensive consensus algorithms, which offers better 177 performance and scalability. Onboarding of new members or nodes can be done dynamically 178 with the appropriate authorization measures to know the participants of the network, which 179 allows continuous scaling of learning (Extended Data Fig. 1a) . A new node enrolls via a 180 blockchain smart contract, obtains the model, and performs local model training until defined 181 conditions for synchronization are met. Next, model parameters are exchanged via a Swarm 182 API with the rest of the swarm members and merged for an updated model with updated 183 parameter settings to start a new round of training at the nodes. This process is repeated until 184 stopping criterions are reached, which are negotiated between the swarm nodes/members. 185 The leader is dynamically elected using a blockchain smart contract for merging the 186 parameters and there is no need for a central coordinator in this swarm network. The 187 parameter merging algorithm is executed using a blockchain smart contract thus protects it 188 from semi-honest or dishonest participants. The parameters can be merged by the leader 189 using different functions including average, weighted average, minimum, maximum, or median 190 functions. The various merge techniques and merge frequency enables SL to efficiently work 191 with imbalanced and biased data. As currently developed, SL works with parametric models 192 with finite sets of parameters, such as linear regression or neural network models. 193 At each node, SL is conceptually divided into infrastructure and application layer (Fig. 1g) . On 194 top of the physical infrastructure layer (hardware) the application environment contains the ML 195 platform, the blockchain, and the SL library (SLL) including the Swarm API in a containerized 196 deployment, which allows SL to be executed in heterogeneous hardware infrastructures (Fig. 197 1g, Supplementary Information) . The application layer consists of the content, the models 198 from the respective domain, here medicine (Fig. 1g) , for example blood transcriptome data 199 from patients with leukemias, tuberculosis and COVID-19 ( Fig. 1h- into non-overlapping training and test sets. The training sets were then distributed to three 214 nodes for training and classifiers were tested at a fourth node (independent test set) (Fig. 2a) . 215 By assigning the training data to the nodes in different distributions, we mimicked several 216 clinically relevant scenarios (Supplementary Table 1 ). As cases, we first used samples 217 defined as acute myeloid leukemia (AML), all other samples are termed 'controls'. 
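The node simulation underlying these scenarios can be sketched as follows: a non-overlapping training/test split, followed by assignment of cases and controls to the three training nodes at chosen ratios. The function name, column name and example counts are illustrative only; the distributions actually used are those listed in Supplementary Table 1.

import pandas as pd

def split_scenario(samples, node_sizes, case_fractions, test_frac=0.2, seed=0):
    # samples        : DataFrame with a boolean 'is_case' column (here, AML = case)
    # node_sizes     : training samples per node, e.g. [300, 800, 1200]
    # case_fractions : fraction of cases per node, e.g. [0.5, 0.01, 0.7]
    test = samples.sample(frac=test_frac, random_state=seed)   # independent test node (node 4)
    train = samples.drop(test.index)                            # never overlaps the test set
    cases = train[train["is_case"]].sample(frac=1, random_state=seed)
    controls = train[~train["is_case"]].sample(frac=1, random_state=seed)
    nodes, ci, xi = [], 0, 0
    for size, frac in zip(node_sizes, case_fractions):
        n_cases = int(round(size * frac))
        nodes.append(pd.concat([cases.iloc[ci:ci + n_cases],
                                controls.iloc[xi:xi + size - n_cases]]))
        ci += n_cases
        xi += size - n_cases
    return nodes, test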
Each node 218 within this simulation could stand for a large hospital or center, a network of hospitals 219 performing individual studies together, a country or any other independent institutional 220 organization generating such medical data with local privacy requirements. 221 In a first scenario, we randomly distributed samples per node as well as cases and controls 222 unevenly at the nodes and between nodes (dataset A2) (Fig. 2b) . Sample distribution between 223 sample sets was permuted 100 times (Fig. 2b, middle panel) to determine the influence of 224 individual samples on overall performance. Among the nodes, the best test results were 225 obtained by node one with a mean accuracy of 97.0%, mean sensitivity of 97.5% and mean 226 specificity of 96.3% with an even distribution between cases and controls, albeit this node had 227 the smallest number of overall training samples. Node 2 did not produce any meaningful 228 results, which was due to a too low ratio of cases to controls (1:99) for training. Surprisingly, 229 node 3 with the largest number of samples, but an uneven distribution (70% cases : 30% 230 controls) performed worse than node 1 with a mean balanced accuracy of 95.1%. Most 231 importantly, however, SL outperformed each of the nodes resulting in a higher test accuracy 232 in 97.0% of all permutations (mean balanced accuracy 97.7%) (Fig. 2b , right panel, 233 Supplementary Table 4 ). The balanced accuracy of SL was significantly higher (p < 0.001) 234 when compared to the performance of each of the three nodes, despite the fact that 235 information from the poorly performing node 2 was integrated. We also calculated this scenario 236 in datasets A1 and A3 and obtained rather similar results strongly supporting that the 237 performance improvement of SL over single nodes is independent of data collection (studies) 238 and even experimental technologies (microarray (datasets A1, A2), RNA-seq (dataset A3) 239 used for data generation (Extended Data Fig. 2) . 240 To test whether more evenly distributed samples at the nodes would improve individual node 241 performance, we distributed similar numbers of samples to each of the nodes but kept 242 case:control ratios as in scenario 1 (Fig. 2c, Extended Data Fig. 3) . While there was a slight 243 increase in test accuracy at nodes 1 and 2, node 3 performed worse with also higher variance. 244 More importantly, SL still resulted in the best performance metrics (mean 98.5% accuracy) 245 with slightly but significantly (p<0.001) increasing performance compared to the first scenario. 246 Results derived from datasets A1 and A3 echoed these findings (Extended Data Fig. 3) . 247 In a third scenario, we distributed the same number of samples across all three nodes, but 248 increased potential batch effects between nodes, by distributing samples of a clinical study 249 independently performed and published in the past only to a dedicated training node. In this 250 scenario, cases and control ratios varied between nodes and left out samples (independent 251 samples) from the same published studies were combined for testing at node 4. Performance 252 of the three nodes was very comparable, but never reached SL results (mean 98.3% accuracy, 253 swarm outperformed all nodes with p<0.001, Fig. 2d ., Extended Data Fig. 
4b , 254 Supplementary Data Table 4 ), which was also true for datasets A1 and A3 (Extended Data 255 nodes, albeit the variance in the results was increased both at each node and for SL, indicating 259 that study design has an overall impact on classifier performance and that this is still seen in 260 SL (mean 95.6% accuracy, Extended Data Fig. 4e ). 261 In a fourth scenario, we further optimized the nodes by increasing the overall sample size at 262 node 3 and keeping case:control ratios even at all nodes (Fig. 2e , Extended Data Fig. 5a-d) . 263 Clearly, node performance further improved with little variance between permutations, 264 however, even under these 'node-optimized' conditions, SL led to higher performance 265 parameters. 266 In a fifth scenario, we tested whether or not SL was 'immune' against the impact of the data 267 generation procedure (microarray versus RNA-seq) (Fig. 2f, Extended Data Fig. 5e,f) . We 268 recently demonstrated that classifiers trained on data derived by one technology (e.g. 269 microarrays) do not necessarily perform well on another (e.g. RNA-seq) 47 . To test this 270 influence on SL, we distributed the samples from the three different datasets (A1-A3) to one 271 node each, e.g. dataset A1 was used for training only at node 1. We used 20% of the data 272 (independent non-overlapping to the training data) from each dataset (A1-A3) and combined 273 them to form the test set (node 4). Node 3, trained on RNA-seq data, performed poorly on the 274 combined dataset due to the fact that two-thirds of the data in the test set were microarray-275 derived data. Nodes 1 and 2 performed reasonably well with mean accuracies of 96.1% (node 276 1) and 97.5% (node 2), however did not reach the test accuracy of SL (98.8%), which also 277 indicated that SL is much more robust toward effects introduced by different data production Collectively, these simulations using real-world transcriptome data collected from more than 283 100 individual studies illustrate that SL would not only allow data to be kept at the place of 284 generation and ownership, but it also outperforms every individual node in numerous 285 scenarios, even in those with nodes included that cannot provide any meaningful classifier 286 results. 287 288 Swarm learning to identify patients with tuberculosis 289 In infectious diseases, heterogeneity may be more pronounced compared to leukemia, 290 therefore we built a second use case predicting cases with tuberculosis (Tb) from full blood 291 transcriptomes. Of interest, previous work in smaller studies had already suggested that acute 292 tuberculosis or outcome of tuberculosis treatment can be revealed by blood transcriptomics 293 [48] [49] [50] [51] [52] . To apply SL, we generated a new dataset based on full blood transcriptomes derived by 294 PaxGene blood collection followed by bulk RNA-sequencing. We also generated new blood 295 transcriptomes and added existing studies to the dataset compiling a total of 1,999 samples 296 from nine individual studies including 775 acute and 277 latent Tb cases (Fig. 1k, Extended 297 Data Fig. 7a, Supplementary Table 2 ). These data are more challenging, since infectious 298 diseases show more variety due to biological differences with respect to disease severity, 299 phase of the disease or the host response. 
But also the technology itself is more variable with 300 numerous different approaches for full blood transcriptome sample processing, library 301 production and sequencing, which can introduce technical noise and batches between 302 studies. As a first scenario, we used all Tb samples (latent and acute) as cases and divided 303 Tb cases and controls evenly among the nodes (Extended Data Fig. S7a Fig. S7b ). To increase the challenge, we decided to assess prediction of acute Tb cases 307 only. In this scenario, latent Tb are not treated as cases but rather as controls (Extended Data 308 Fig. S7a ). For the first scenario, we kept cases and controls even at all nodes but further 309 reduced the number of training samples ( Fig. 3a-b) . As expected in this more challenging 310 scenario, distinguishing acute Tb from the control cohort (including latent Tb samples), overall 311 performance (mean balanced accuracy 89.1%, mean sensitivity 92.2%, mean specificity 312 86.0%) slightly dropped, but still SL performed better than any of the individual nodes (p<0.01 313 for swarm vs. each node, Fig. 3b ). To determine whether sample size impacts on prediction 314 results in this scenario, we reduced the number of samples at each training node (1-3) by 315 50%, but kept the ratio between cases and controls (Extended Data Fig. S7c ). Still, SL 316 outperformed the nodes, but all statistical readouts (mean accuracy 86.5%, mean sensitivity 317 87.8%, mean specificity 84.8%) at all nodes and SL showed lower performance, following 318 general observations of AI with better performance when increasing training data 26 . We next 319 altered the scenario by dividing up the three nodes into six smaller nodes (Fig. 3c , samples 320 per node reduced by half in comparison to Fig. 3a-b) , a scenario that can be envisioned in the 321 domain of medicine in many settings, for example if several smaller medical centers with less 322 cases would join efforts (Fig. 3d) . Clearly, each individual node performed worse, but for SL 323 the results did not deteriorate (mean accuracy 89.2%, mean sensitivity 90.7%, mean 324 specificity 88.2% with significant difference to each of the nodes in all performance measures, 325 see Supplementary Table 4) , again illustrating the strength of the joined learning effort, while 326 completely respecting each individual node's data privacy. 327 Albeit aware of the fact that -in general -acute Tb is an endemic disease and does not tend 328 to develop towards a pandemic such as the current COVID-19 pandemics, we utilized the Tb 329 blood transcriptomics dataset to simulate potential outbreak and epidemic scenarios to 330 determine benefits, but also potential limitations of SL and how to address them ( Fig. 3e-l) . 331 The first scenario reflects a situation in which three independent regions (simulated by the 332 nodes), would already have sufficient but different numbers of disease cases. Furthermore, 333 cases and controls were kept even at the test node ( Fig. 3e-f ). Overall, compared to the 334 scenario described in Fig. 3c , results for the swarm were almost comparable (mean accuracy 335 89.0%, mean sensitivity 94.4%, mean specificity 83.4%), while the results for the node with 336 the lowest number of cases and controls (node 2) dropped noticeable (mean accuracy 82.2%, 337 mean sensitivity 88.8%, mean specificity 75.4%, Fig. 3f ). When reducing the prevalence at 338 the test node by increasing the number of controls ( Fig. 
3g-h) , this effect was even more 339 pronounced, while the performance of the swarm was almost unaffected (mean balanced 340 accuracy 89.0%). 341 We decreased the number of cases at a second training node (node 1) ( Fig. 3i-l) , which clearly 342 reduced test performance for this particular node ( Fig. 3i-j) , while test performance of the 343 swarm was only slightly inferior to the prior scenario (mean balanced accuracy 87.5%, no 344 significant difference to the prior scenario). Only when reducing the prevalence at the test 345 node ( Fig. 3k-l) , we saw a further drop in mean specificity for the swarm (81.0%), while 346 sensitivity stayed similarly high (93.0%). Finally, we further reduced the prevalence at two 347 training nodes (node 2: 1:10; node 3: 1:5) as well as the test node (Extended Data Fig. 8a -348 b). Lowering the prevalence during training resulted in very poor test performance at these 349 two nodes (balanced accuracy node 2: 59.8%, balanced accuracy node 3: 74.8%), while 350 specificity was high (node 2: 98.4%, node 3: 93.8%). SL showed highest accuracy (mean 351 balanced accuracy 86.26%) and F-statistics (90.0%) but was outperformed for sensitivity by 352 node 1 (swarm: 80.0%, node1: 87.8%), which showed poor performance concerning 353 specificity (swarm: 92.4%, node1: 84.8%). Vice versa, node 2 outperformed the swarm for 354 specificity (98.4%), but showed very poor sensitivity (21.2%) (Extended Data Fig. 8b ). When 355 lowering prevalence at the test node (Extended Data Fig. 8c- Based on the promising results obtained for tuberculosis, we collected blood from COVID-19 365 patients at two sites in Europe (Athens, Greece; n=39 samples, Nijmegen, n=93 samples) and 366 generated whole blood transcriptomes by RNA-sequencing. We used the dataset described 367 for Tb as the framework and included the COVID-19 samples (Fig. 1l) for assessing whether 368 SL could be applied early on to detect patients with a newly identified disease. While COVID-369 19 patients are currently identified by PCR-based assays to detect viral RNA 53 , we use this 370 case as a proof-of-principle study to illustrate how SL could be used even very early on during 371 an outbreak based on the patients' immune response captured by analysis of the circulating 372 immune cells in the blood. Here, blood transcriptomes only present a potential feature space 373 to illustrate the performance of SL. Furthermore, assessing the specific host response, in 374 addition to disease prediction, might be beneficial in situations for which the pathogen is 375 unknown, specific pathogen tests not yet possible, and blood transcriptomics can contribute 376 to the understanding of the host's immune response 54 . Lastly, while we do not have the power 377 yet, blood transcriptome-based machine learning might be used to predict severe COVID-19 378 cases, which cannot be done by viral testing alone. 379 COVID-19 induces very strong changes in peripheral blood transcriptomes 54 . Following our 380 experience with the leukemia and tuberculosis use cases, we first tested classifier 381 performance for evenly distributed cases and controls at both training nodes and the test node 382 (Extended Data Fig. 9a Fig. 9c ), but only when we reduced the prevalence even further (1:44 ratio, Extended 388 Data Fig. 9d ), F1-statistics was clearly reduced, albeit SL again performing best. 
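That the F1 score reacts to test-set prevalence while sensitivity and specificity do not can be seen with a small back-of-the-envelope calculation (the numbers below are purely illustrative and not taken from the study): at a fixed operating point of 95% sensitivity and 95% specificity, F1 is about 0.95 at 1:1 prevalence but falls below 0.5 at 1:44, because the false positives then outnumber the few true positives.

def f1_at_prevalence(sens, spec, n_cases, n_controls):
    # F1 for a classifier with fixed sensitivity/specificity at a given case:control mix.
    tp = sens * n_cases
    fn = (1 - sens) * n_cases
    fp = (1 - spec) * n_controls
    return 2 * tp / (2 * tp + fp + fn)

print(f1_at_prevalence(0.95, 0.95, 50, 50))   # ~0.95 at 1:1 prevalence
print(f1_at_prevalence(0.95, 0.95, 1, 44))    # ~0.46 at 1:44 prevalence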
We next 389 reduced the cases at all training nodes (Extended Figure 10 ), but even under these 390 conditions, we observed still very high values for accuracy, sensitivity, specificity and F1 391 scores, both derived by training at individual nodes or by SL (Extended Figure 10a-f) . 392 We then reduced the cases at all three training nodes to very low numbers, a scenario that 393 might be envisioned very early during an outbreak scenario (Fig. 4a) . Node 1 contained only 394 20 cases, node 2 10 cases and node 3 only 5 cases. At each node, controls outnumbered 395 cases by 1:5, 1:10, or 1:20. At the test node, we varied the prevalence from 1:1 (Fig. 4b) , 1:2 396 ( Fig. 4c) to 1:10 (Fig. 4d) . Based on our findings for Tb (Extended Data Fig. 8) , we expected 397 classifier performance to deteriorate under these conditions. We only observed decreased 398 performance at nodes 2 and 3 in these scenarios with SL outperforming these nodes with 399 p<0.05 for all performance measures, e.g. at a test node prevalence of 1:10 (accuracy 400 (99.3%), sensitivity (95.1%), specificity (99.7%) and F1-statistics (99.7%) (Fig. 4d) . Finally, 401 we simulated a scenario with four instead of three training nodes with very few cases per node 402 (Extended Data Fig. 11a-d) , in an otherwise similar scenario as described for Fig. 4 The introduction of precision medicine based on high-resolution molecular and imaging data 415 will heavily rely on trustworthy machine learning algorithms in compute environments that are 416 characterized by high accuracy and efficiency, that are privacy-and ethics-preserving, secure, 417 and that are fault-tolerant by design 33-36 . At the same time, privacy legislation is becoming 418 increasingly strict, as risks of cloud-based and central data-acquisition are recognized. Here, 419 we introduce Swarm Learning, which combines blockchain technology and machine learning 420 environments organized in a swarm network architecture with independent swarm edge nodes 421 that harbor local data, compute infrastructure and execute the shared learning models that 422 make central data acquisition obsolete. During iterations of SL, one of the nodes is chosen to 423 lead the iteration, which does not require a central parameter server anymore thereby 424 restricting centralization of learned knowledge and at the same time increasing resiliency and 425 fault tolerance. In fact, these are the most important improvements over current federated 426 computing models. Furthermore, private permissioned blockchain technology, harboring all 427 rules of interaction between the nodes, is the Swarm Learning's inherent privacy-and ethics-428 preserving strategy. This is of particular interest to medical data and could be adapted by other 429 federated learning systems. To understand whether the concept of swarm learning would also 430 be characterized by high efficiency and high accuracy, we built three medical use cases based 431 on blood transcriptome data, which are high-dimensional data derived from blood, one of the 432 major tissues used for diagnostic purposes in medicine. First, utilizing three previously 433 compiled datasets (A1-3) of peripheral blood mononuclear cells derived from patients with 434 acute myeloid leukemia, we provide strong evidence that SL-based classifier generation using 435 a well-established neural network algorithm outperforms individual nodes, even in scenarios 436 where individual contributing swarm nodes were performing rather poorly. 
Most striking, 437 swarm learning was even improving performance parameters when training of individual 438 nodes was based on technically different data, a situation that was previously shown to 439 deteriorate classifier performance 47 . With these promising results, we generated a more 440 challenging use case in infectious disease patients, detecting Tb based on full blood 441 transcriptomes. Also in this case, SL outperformed individual nodes. Using Tb to simulate 442 scenarios that could be envisioned for building blood transcriptome classifiers for patients 443 during an outbreak situation, we further illustrate the power of SL over individual nodes. 444 Considering the difficulty to quickly negotiate data sharing protocols or contracts during an 445 epidemic or pandemic outbreak, we deduce from these findings that SL would be an ideal 446 strategy for independent producer of medical data to quickly team up to increase the power to 447 generate robust and reliable machine learning-based disease or outcome prediction classifier 448 without the need to share data or relocate data to central cloud storages. 449 In addition, we tested whether we could build a disease prediction classifier for COVID-19 in 450 an outbreak scenario. Building on our knowledge that blood transcriptomes of COVID-19 451 patients are significantly altered with hundreds of genes being changed in expression and with 452 a rather specific signature compared to other infectious diseases 54 , we hypothesized that it 453 should be possible to build such a classifier with a rather small number of samples. Here, we 454 provide evidence that classifiers with high accuracy, sensitivity, specificity, and also high F1-455 statistics can be generated to identify patients with COVID-19 based on their blood 456 transcriptomes. Moreover, we illustrate the power of SL that would allow to quickly increase 457 the power of classifier generation even under very early outbreak scenarios with very few 458 cases used at the training nodes, which could be e.g. collaborating hospitals in an outbreak 459 region. Since data do not have to be shared, additional hospitals could benefit from such a 460 system by applying the classifiers to their new patients and once classified, one could even 461 envision an onboarding of these hospitals for an adaptive classifier improvement schema. 462 Albeit technically feasible, we are fully aware that such scenarios require further classifier 463 testing and confirmation, but also an assessment of how this could be integrated in existing 464 legal and ethical regulations at different regions in the world 5,6 . Furthermore, we appreciate 465 that other currently less expensive data might be suitable for generating classifiers to identify 466 COVID-19 patients 10 . For example, if highly standardized clinical data would become 467 available, SL could be used to interrogate the clinical feature space at many clinics worldwide 468 without any need to exchange the data to develop high performance classifiers for detecting 469 COVID-19 patients. Similarly, recently introduced AI-systems using imaging data 21,22 might be 470 more easily scaled if many hospitals with such data could be connected via SL. Irrespective 471 of these additional opportunities using other parameter spaces, we would like to suggest blood 472 transcriptomics as a promising new alternative due to its very strong signal in COVID-19. 
A 473 next step will be to determine whether blood transcriptomes taken at early time points could 474 be used to predict severe disease courses, which might allow physicians to introduce novel 475 treatments at an earlier time point. Furthermore, we propose to develop an international 476 database of blood transcriptomes that could be utilized for the development of predictive 477 classifiers in other infectious and non-infectious diseases as well. It could be envisioned that 478 such an SL-based learning scheme could be deployed as a permanent monitoring or early 479 warning system that runs by default, looking for unusual movements in molecular profiles. 480 Collectively, SL together with transcriptomics but also other medical data is a very promising 481 approach to democratize the use of AI among the many stakeholders in the domain of 482 medicine while at the same time resulting in more data privacy, data protection and less data 483 traffic. 484 With increasing efforts to enforce data privacy and security of medical data 8 (hhs.gov, 485 https://www.hhs.gov/hipaa/index.html, 2020; Intersoft Consulting, General Data Protection 486 Regulation, https://gdpr-info.eu) and to reduce data traffic and duplication of large medical 487 data, a decentralized data model will become the preferred choice of handling, storing, 488 managing and analyzing medical data 26 . This will not be restricted to omics data as exemplified 489 here, but will extend to other large medical data such as medical imaging data 55 Consulting, General Data Protection Regulation, https://gdpr-info.ee) making it less appealing 495 to develop centralized AI systems. We introduce Swarm Learning as a decentralized learning 496 system with access to data stored locally that can replace the current paradigm of data sharing 497 and centralized storage while preserving data privacy in cross-institutional research in a wide 498 spectrum of biomedical disciplines. Furthermore, SL can easily inherit developments to further 499 preserve privacy such as functional encryption 64 , or encrypted transfer learning approaches 65 . 500 In addition, the blockchain technology applied here provides robust measures against semi-501 honest or dishonest participants/adversaries who might attempt to undermine a Swarm 502 Network. Another important aspect for wide employment of SL in the research community and 503 in real-world applications is the ease of use of the Swarm API, which will make it easier for 504 researchers and developers to include novel developments such as for example private 505 machine learning in TensorFlow 66 . 506 There is no doubt that numerous medical and other data types as well as a vast variety of 507 computational approaches can be used during a pandemic 14 . We do not want to imply that 508 blood transcriptomics would be the preferred solution for the many questions that AI and 509 machine learning could help to solve during such a crisis. Although, at the same time, we have 510 recently shown that blood transcriptomics can be used to define molecular phenotypes of 511 COVID-19, uncover the deviated immune response in severe COVID-19 patients, define 512 unique patterns of the disease in comparison to other diseases and can be utilized to predict 513 potential drugs to be repurposed for COVID-19 therapy (Aschenbrenner et al. unpublished 514 results). 
Therefore, we explored blood transcriptomics as a unique and rich feature space and 515 a good example to illustrate the advantages of SL in identifying COVID-19 patients. Once 516 larger datasets become available, SL could be used to identify patients at risk to develop 517 severe COVID-19 early after onset of symptoms. 518 Another important quest that has been proposed is global collaboration and data-sharing 13 . 519 While we could not agree more about the need for global collaboration -an inherent 520 characteristic of SL -we favor systems that do not require data sharing but rather support 521 global collaboration with complete data privacy preservation. Particularly, if using medical data 522 that can also be used to interrogate medical issues unrelated to COVID-19. Indeed, 523 statements by lawmakers have been triggered, clearly indicating that privacy rules also fully 524 apply during the pandemics (EU Digital Solidarity: a call for a pan-European approach against 525 the pandemic, Wojciech Wiewiórowski, https://edps.europa.eu/sites/edp/files/publication 526 /2020-04-06_eu_digital_solidarity_covid19_en.pdf, 2020). Particular in a crisis situation such 527 as the current pandemic, AI systems need to comply with ethical principles and respect human 528 rights 14 . We therefore argue that systems such as Swarm Learning that allow fair, transparent 529 and still highly regulated shared data analytics while preserving data privacy regulations are 530 to be favored, particularly during times of high urgency to develop supportive tools for medical 531 decision making. We therefore also propose to explore SL for image-based diagnostics of 532 COVID-19 from patterns in X-ray images or computed tomography (CT) scans 21,22 , structured 533 health records 67 , or wearables for disease tracking 14 . Swarm learning would also have the 534 advantage that model and code sharing as well as dissemination of new applications is easily 535 scalable, because onboarding of new swarm participants is structured by blockchain 536 technology, while scaling of data sharing is not even necessary due the inherent local 537 computing of the data 14 . Furthermore, swarm learning can reduce the burden of establishing 538 global, comprehensive, open, and verified datasets. 539 Collectively, we introduce Swarm Learning defined by the combination of blockchain 540 technology and decentralized machine learning in an entirely democratized approach 541 eliminating a central player and therefore representing a uniquely fitting strategy for the 542 inherently locally organized domain of medicine. We used blood transcriptomes in three 543 scenarios as use cases since they combine blood as the most widely used surrogate tissue 544 for diagnostic purposes with an omics technology producing high-dimensional data with many 545 parameters. Since the deployment of Swarm Learning due to ease of use of Swarm Learning 546 libraries is a rather simple task, we propose to expand the use of this technology and further 547 develop such classifiers in a unifying fashion across centers worldwide without any need to 548 share the data itself. for training, node 4 for testing. Swarm Learning (SL) was achieved by integrating nodes 1-3 695 for training following procedures described in detail in Supplementary Information. COVID-19 696 samples were used as cases. In this scenario, node 1 would be the outbreak node with the 697 highest prevalence. 
Training node 2 has fewer cases and is an early secondary node, and 698 node 3 acts as a later secondary node. The spreading is tested on the testing node with three 699 different prevalences (b,c,d) and shown as box-whisker plot (mean, 1st and 3rd quartile, 700 whisker type Min/Max). (b) Evaluation of (a) with even prevalence showing accuracy, 701 sensitivity, specificity and F1-score of fifty permutations for each training node and the SL 702 (node 4). (c) Evaluation (as described in (b)) of scenario (a) using a 1:2 ratio for cases and 703 controls in the test set. Main settings are identical to what is described in Fig. 2 for dataset A2. (a) The case:control 751 distribution is even, the training sets increase from node 1 to node 3. The test set is evenly 752 split. (b) Test accuracy for evaluation of dataset A2 (corresponding to Fig. 2e) . To establish a dataset based on whole blood transcriptomes we generated new data from 846 healthy controls (Rhineland Study) and combined these with previously generated data that 847 had been deposited in Gene Expression Omnibus (GEO). We screened for transcriptome 848 datasets derived from human whole blood samples, which were collected using the PAXgene 849 Blood RNA System. In total, nine independent datasets were selected to be included in the 850 present study (GSE101705 (n=44); GSE107104 (n=33), GSE112087 (n=120), GSE128078 851 (n=99), GSE66573 (n=14), GSE79362 (n=355), GSE84076 (n=36); GSE89403 (n=914)). analysis was sampled on day 0 to 11 after admission. In the cohort in Athens, blood samples 878 from ten healthy donors who were tested negative on SARS-CoV-2 were included as controls. 879 The newly generated samples from the COVID-19 patients and the controls from Athens were 880 combined with dataset B (see above) to establish Dataset C. As a result, in addition to the 881 1999 samples derived from Dataset B, Dataset C included further 10 healthy controls and 134 882 dutch COVID-19 samples, which makes a total of 2,143 samples. Sample information is listed 883 in Supplementary Tables 2 and 6 We previously demonstrated that ML on PBMC transcriptomes can be utilized to predict 927 AML 47 . Based on this experience, we generated sample sets within three independent 928 transcriptome datasets (dataset A1-A3, see above) to assess different scenarios in a three-929 node setting for training with a fourth node only used for testing. As indicated in Fig. 2 , six 930 scenarios with varying numbers of samples per node and varying ratios between cases and 931 controls at each node where defined. For predicting AML, all samples derived from AML 932 patients were classified as cases, while all other samples were labeled controls. When 933 predicting ALL, all samples derived from ALL patients were classified as cases and all others 934 as controls. For each scenario (Fig. 2) and each dataset we permuted the sample distribution 935 100 times, resulting in a total of 5,594 individual predictions. The different scenarios were 936 chosen to address the influence of sample numbers per node, the case control ratio, study 937 design-related batch effects, and transcriptome technologies used on classifier performance 938 at the nodes, but more importantly on swarm learning performance. In line with the experience we gained from the prediction of AML, we used dataset B to 943 generate scenarios for the prediction of tuberculosis in various settings, again using different 944 scenarios in a three-node setting for training with a fourth node only used for testing. 
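A condensed sketch of the permutation procedure used for all of these scenarios is given below. It assumes the split_scenario helper sketched earlier and caller-supplied training routines for the individual nodes and for SL; the names are placeholders, and the actual runs were performed with the Swarm Learning library.

from sklearn.metrics import classification_report

def evaluate_scenario(samples, scenario, train_fns, n_permutations=100):
    # train_fns: dict mapping a name (e.g. 'node1', 'swarm') to a callable that takes the
    # list of per-node training DataFrames and returns a fitted binary classifier.
    results = []
    for p in range(n_permutations):
        nodes, test = split_scenario(samples, scenario["node_sizes"],
                                     scenario["case_fractions"], seed=p)
        X_test, y_test = test.drop(columns="is_case"), test["is_case"]
        for name, train_fn in train_fns.items():
            model = train_fn(nodes)
            y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
            results.append((p, name,
                            classification_report(y_test, y_pred, output_dict=True)))
    return results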
In one 945 scenario, all patients with tuberculosis (Tb) including patients with latent and acute Tb were 946 treated as cases, while all others were defined as controls (Extended Data Fig. 6b ). In all 947 other scenarios, cases were restricted to acute Tb patients' samples, while patients with latent 948 Tb were defined as controls together with all other non-Tb samples. Here, the question to be 949 answered is, whether the classifiers can identify patients with acute Tb and can distinguish 950 them from latent Tb and other conditions. 951 In one scenario (Fig. 3c-d) , we added three additional training nodes to test dependency of 952 classifier performance by the number of nodes. As indicated in Fig. 3 , three scenarios with 953 varying numbers of samples per node and varying ratios between cases and controls at each 954 node where defined. For scenarios described within Fig. 3e,g and Fig. 3i ,k, we tested two 955 prevalence scenarios in the test set. For each scenario (Fig. 3) we permuted the sample 956 distribution 5-10 times, resulting in a total of 325 individual predictions. To mimic an outbreak 957 scenario, we reduced cases also at the training nodes to determine the effects on Swarm 958 Learning performance. Sample distributions for all permutations within all scenarios are listed 959 Based on the promising results obtained with tuberculosis, we next intended to simulate 963 classifier building and testing for the prediction of COVID-19 in a SL setting. We used dataset 964 B and added 144 additional samples, of which 139 samples were derived from COVID-19 965 patients (see above). We applied a three-node setting for training with a fourth node only used 966 for testing. 967 In one scenario (Extended Data Fig. 8) , we kept cases (n=30) and controls (n=30) evenly 968 distributed among the three training nodes and tested three different prevalence scenarios at 969 the test node (22:25; 11:25; 1:44). In a second scenario (Extended Data Fig. 9a-c) we 970 changed the ratio of cases and controls at each node (node 1: 40:60, node 2: 30:70, node 3: 971 20:80) and tested two prevalence scenarios at the test node (22:25; 11:25). In a third scenario 972 (Extended Data Fig. 9a-c) we further reduced the number of cases at the training nodes 973 further (node 1: 30:70, node 2: 20:80, node 3: 10:90) and tested two prevalence scenarios at 974 the test node (37:50; 37:75). 975 Lastly, we tested an outbreak scenario (Fig. 4) with very few cases at the outbreak node 1 976 (20:80), an early secondary node (10:90) and a later secondary node (5:95) and three 977 prevalence scenarios at the test node (1:1, 1:2, 1:10), resulting in a total of 220 individual 978 predictions Sample distributions for all permutations within all scenarios are listed in 979 Supplementary Table 1 . The classification report and confusion matrix was generated with scikit-learn APIs for each 1198 permutation. Measurements of sensitivity, specificity and accuracy of each permutation run 1199 was read into a Data and software availability: 1205 Processed data can be accessed via the SuperSeries GSE122517 or via the individual SubSeries GSE122505 (dataset A1), GSE122511 (dataset A2) and GSE122515 (dataset A3) Dataset B consists of the following series which can be accessed at GEO: GSE101705 This dataset is not publicly available because of 1210 data protection regulations. 
Access to data can be provided to scientists in accordance with 1211 Requests for further information or to 1212 access the Rhineland Study's dataset should be directed to RS-DUAC@dzne.de. Dataset C 1213 contains dataset B and additional samples for COVID-19. These datasets are made available 1214 at the European Genome-Phenome Archive (EGA) under accession number 1215 EGAS00001004502 The code for preprocessing and for predictions can be found at GitHub 1217 Supplementary Table 1: Overview over all sample numbers and scenarios 1222 Supplementary Table 2: Dataset annotations of Dataset A, B and C 1223 Supplementary Table 3: Prediction results for all scenarios and permutations 1224 Supplementary Table 4: Summary statistics on all prediction scenarios 1225 Supplementary Table 5: Statistical tests comparing single node vs Classification, ontology, and precision 1229 medicine Building the foundation for genomics in precision 1231 medicine Deep learning-based classification of mesothelioma improves 1233 prediction of patient outcome Evaluation and accurate diagnoses of pediatric diseases using artificial 1235 intelligence Do no harm: a roadmap for responsible machine learning for health 1237 care The practical implementation of artificial intelligence technologies in 1239 medicine Algorithms on regulatory lockdown 1241 in medicine Privacy in the age of medical big data Viral and host factors related to the clinical outcome of COVID-19 Severe Covid-19 Mild or Moderate Covid-19 AI systems aim to sniff out coronavirus outbreaks Machine Learning for COVID-19 needs global collaboration 1253 and data-sharing Artificial intelligence cooperation to support the global response 1255 to COVID-19 The challenges of deploying artificial intelligence models in a rapidly 1257 evolving pandemic Improved protein structure prediction using potentials from deep 1259 learning A data-driven drug repositioning framework discovered a potential 1261 therapeutic agent targeting Digital Smartphone Tracking for COVID-1265 19: Public Health and Civil Liberties in Tension Population flow drives spatio-temporal distribution of COVID-19 in China Artificial intelligence-enabled rapid diagnosis of patients with COVID-19 Clinically Applicable AI System for Accurate Diagnosis, Quantitative 1272 Measurements, and Prognosis of COVID-19 Pneumonia Using Computed 1273 Tomography Overview of artificial intelligence in 1275 medicine Artificial 1277 intelligence in radiology Deep learning Secure, privacy-preserving 1280 and federated machine learning in medical imaging Machine learning in medicine Predicting the future-big data, machine learning, and 1285 clinical medicine Machine learning: Calculating disease Biomedical informatics 1288 on the cloud: A treasure hunt for advancing cardiovascular medicine WELCOME -Innovative integrated care platform using wearable 1291 sensing and smart cloud computing for COPD patients with Comorbidities Annual International Conference of the IEEE Engineering in Medicine and Biology 1293 A vision for a biomedical cloud Implementing machine learning in health care ' 1298 addressing ethical challenges The battle for ethical AI at the world's biggest machine-learning conference Adversarial attacks on medical machine learning On the responsible use of digital data to tackle the COVID-19 1304 pandemic Federated Optimization: 1306 Distributed Machine Learning for On-Device Intelligence Federated Learning: Strategies for Improving Communication 1308 Efficiency Communication-Efficient Learning of Deep Networks 
from Decentralized Data Privacy-Preserving Deep Learning | Proceedings of the 22nd 1313 ACM SIGSAC Conference on Computer and Communications Security Privacy preserving probabilistic inference with 1316 ICASSP, IEEE International Conference on Acoustics, 1317 Speech and Signal Processing -Proceedings Assessing the human immune system 1320 through blood transcriptomics Assessment of immune status using blood transcriptomics and 1322 potential implications for global health Large sample size, wide variant spectrum, and advanced machine-1324 learning technique boost risk prediction for inflammatory bowel disease Big data: Astronomical or genomical? Genomic cloud computing: Legal and ethical points to consider. Eur Scalable Prediction of Acute Myeloid Leukemia Using High-1330 Dimensional Machine Learning and Blood Transcriptomics A blood RNA signature for tuberculosis disease risk: a prospective 1332 cohort study Existing blood transcriptional classifiers accurately discriminate active 1334 tuberculosis from latent infection in individuals from south India Transcriptomic biomarkers for tuberculosis: Evaluation of DOCK9 EPHA4, and NPC2 mRNA expression in peripheral blood Tuberculosis in advanced HIV infection is associated with increased 1339 expression of IFNγ and its downstream targets Host blood RNA signatures predict the outcome of tuberculosis 1341 treatment Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-1343 PCR Suppressive myeloid cells are a hallmark of severe 1345 International evaluation of an AI system for breast cancer 1347 screening Identifying Medical Diagnoses and Treatable Diseases by Image-1349 Based Deep Learning End-to-end lung cancer screening with three-dimensional deep learning 1351 on low-dose chest computed tomography Dermatologist-level classification of skin cancer with deep neural 1353 networks A machine learning model for the prediction of survival and tumor 1355 subtype in pancreatic ductal adenocarcinoma from preoperative diffusion-weighted 1356 imaging A machine learning algorithm predicts molecular subtypes in 1358 pancreatic ductal adenocarcinoma with differential response to gemcitabine-based 1359 versus FOLFIRINOX chemotherapy Predicting the ISUP grade of clear cell renal cell carcinoma with 1361 multiparametric MR and multiphase CT radiomics A mathematical-descriptor of tumor-mesoscopic-structure from computed-1363 tomography images annotates prognostic-and molecular-phenotypes of epithelial 1364 ovarian cancer Multicenter study demonstrates radiomic features derived from 1366 magnetic resonance perfusion images identify pseudoprogression in glioblastoma Partially Encrypted 1369 Machine Learning using Functional Encryption Utilizing Transfer Learning and Homomorphic 1371 Encryption in a Privacy Preserving and Secure Biometric Recognition System Private Machine Learning in TensorFlow using Secure Computation Deep Learning-Based Quantitative Computed Tomography Model in 1376 Predicting the Severity of COVID-19: A Retrospective Study in 196 Patients Gene Expression Omnibus: NCBI gene 1379 expression and hybridization array data repository Exploration, Normalization, and Summaries of High Density 1382 Oligonucleotide Array Probe Level Data Affy--Analysis of Affymetrix GeneChip Data at the Probe Level Moderated estimation of fold change and dispersion 1386 for RNA-seq data with DESeq2 STAR: Ultrafast universal RNA-seq aligner Transforming RNA-Seq data to improve the 1390 performance of prognostic gene signatures Bootstrap Methods: Another Look at 
the Jackknife Scikit-learn: Machine Learning in Python Acknowledgments: We leveraged a deep neural network with a sequential architecture as implemented in the 994 keras library (Keras, https://keras.io/, 2015). Briefly, the neural network consists of one input 995 layer, eight hidden layers and one output layer. The input layer is densely connected and 996 consists of 256 nodes, a rectified linear unit activation function and a dropout rate of 40%. 997From the first to the eighth hidden layer, nodes are reduced from 1024 to 64 nodes, and all 998 layers contain a rectified linear unit activation function, a kernel regularization with an L2 999 regularization factor of 0.005 and a dropout rate of 30%. The output layer is densely connected 1000 and consists of 1 node and a sigmoid activation function. The model is configured for training 1001 with Adam optimization and to compute the binary cross-entropy loss between true labels and 1002 predicted labels. 1003The model has been translated from R to Python in order to make it compatible with the swarm 1004 learning library. This model is used for training both the individual nodes as well as swarm 1005 learning. The model is trained over 100 epochs, with varying batch sizes. Distributed ML is leveraged to train a common model across multiple nodes with a subset of 1053 the data located at each node -commonly known as the data parallel paradigm in ML -1054 though without a central parameter server. Blockchain lends the decentralized control, 1055 scalability, and fault-tolerance aspects to the Swarm Network system to enable the framework 1056 to work beyond the confines of a single enterprise. 1057The Swarm Learning library is a framework to enable decentralized training of ML models 1058 without sharing the data. The Swarm Learning framework is designed to make it possible for 1059 a set of nodes -each node possessing some training data locally -to train a common ML 1060 56 model collaboratively without sharing the training data itself. This can be achieved by individual 1061 nodes sharing parameters (weights) derived from training the model on the local data. This 1062 allows nodes to maintain the privacy of their raw data. Importantly, in contrast to many existing 1063 federated learning models, a central parameter server is omitted in Swarm Learning. 1064The nodes that participate in Swarm Learning, register themselves with the Swarm Network 1065 implicitly using the callback API. Here, the Swarm Network interacts with other peers using 1066 blockchain for sharing parameters and for controlling the training process. On each node, a 1067 simple Swarm callback API has to be used to enable the ML model with Swarm Learning 1068 capacities (see also code presented below). The Swarm container has to be configured to The Swarm Network container includes 1) software to setup and initialize the Swarm Network, 10822) management commands to control the Swarm Network, and 3) start/stop Swarm Learning 1083 tasks. This container also encapsulates the blockchain software. 1084The Swarm ML container includes software to support 1) decentralized training, 2) integration 1085 This API is incorporated into the existing ML code to quickly transform a stand-alone ML node 1146 into a Swarm Learning participant in a non-intrusive way. It offers a set of commands (APIs) 1147 to manage the Swarm Network and control the training. 
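From this description, the network can be reconstructed in Keras as follows. The overall layout (densely connected 256-unit input layer with ReLU and 40% dropout, eight ReLU hidden layers shrinking from 1024 to 64 units with L2 = 0.005 and 30% dropout, a single sigmoid output unit, Adam optimization and binary cross-entropy) is taken directly from the text; the exact widths of the intermediate hidden layers are not stated, so the halving schedule used below is an assumption.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2

def build_model(n_features, hidden=(1024, 512, 512, 256, 256, 128, 128, 64)):
    # 'hidden' is an assumed schedule for the eight hidden layers ("reduced from 1024 to 64").
    model = Sequential()
    # Densely connected input layer: 256 units, ReLU, 40% dropout.
    model.add(Dense(256, activation="relu", input_shape=(n_features,)))
    model.add(Dropout(0.4))
    # Eight hidden layers: ReLU, L2 kernel regularization (factor 0.005), 30% dropout.
    for units in hidden:
        model.add(Dense(units, activation="relu", kernel_regularizer=l2(0.005)))
        model.add(Dropout(0.3))
    # Output layer: one sigmoid unit; trained with Adam on binary cross-entropy.
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

To turn such a stand-alone model into a Swarm Learning participant, the Swarm callback is added to the Keras training call. Because the callback API itself is not reproduced here, the class below is only an illustrative stand-in showing where the Synchronization Interval acts; the real enrollment, parameter exchange and merging are performed by the SL library.

import tensorflow as tf

class SwarmCallbackSketch(tf.keras.callbacks.Callback):
    # Illustrative stand-in for the Swarm callback API: after every 'sync_interval' training
    # batches a node would share its parameters with its peers and continue from the merged
    # weights (the actual exchange and merge are handled by the Swarm Learning library).
    def __init__(self, sync_interval=128):   # the interval value here is arbitrary
        super().__init__()
        self.sync_interval, self._batches = sync_interval, 0

    def on_train_batch_end(self, batch, logs=None):
        self._batches += 1
        if self._batches % self.sync_interval == 0:
            pass  # placeholder: exchange parameters via the Swarm API, load merged weights

# Usage sketch: model.fit(X_train, y_train, epochs=100, batch_size=32,
#                         callbacks=[SwarmCallbackSketch(sync_interval=128)])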
The Swarm Learning process is as follows: it begins with the enrollment of nodes with the Swarm Network, which is done implicitly by the Swarm callback function when the callback is constructed. During this process, the relevant attributes of the node are stored in the blockchain ledger. This is a one-time process. Nodes then train the local copy of the model iteratively using private data over multiple epochs. During each epoch, the node trains its local model using one or more data batches for a fixed number of iterations. It regularly shares its learnings with the other Swarm nodes and incorporates their insights. Users can control the periodicity of this sharing by defining a Synchronization Interval in the Swarm callback API. This interval specifies the number of training batches after which the nodes will share their learnings. At the end of every synchronization interval, when it is time to share the learnings from the individual models, one of the Swarm nodes is elected as a "leader" using the leader election logic. This leader node collects the model parameters from each peer node and merges them. The framework supports multiple merge algorithms such as mean, weighted mean, median, and so on. Each node then uses these merged parameters to calculate various validation metrics. These results are compared against the stopping criterion; if it is met, the Swarm Learning process is halted, otherwise the nodes use the merged parameters to start the next training batch. We evaluated binary classification model performance with sensitivity, specificity, accuracy and F1-score metrics. Sensitivity, specificity, accuracy and F1-score were determined for every test run. The 95% confidence intervals of all performance metrics were estimated using the bootstrapping approach 74. For AML and ALL, 100 permutations were run per scenario. For Tb, the performance metrics were collected by running 10 permutations for scenarios 1 to 4 and 5 permutations for scenarios 5 to 10. For COVID-19, the performance metrics were collected by running 20 permutations for each scenario. All metrics are listed in Supplementary Tables 3 and 4. Differences in performance metrics were tested using the Wilcoxon signed rank test with continuity correction (Individual Comparisons by Ranking Methods, Frank Wilcoxon, https://sci2s.ugr.es/keel/pdf/algorithm/articulo/wilcoxon1945.pdf). All test results are provided in Supplementary Table 5. To run the experiments, we used Python version 3.6.9 with Keras version 2.3.1 and TensorFlow version 2.2.0-rc2. We used scikit-learn library version 0.23.1 75 to calculate values for the metrics. Summary statistics and hypothesis tests were calculated using R version 3.5.2 (R: A language and environment for statistical computing, http://www.R-project.org/, 2015). Calculation of each metric was done as follows:
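In standard form, with TP, FP, TN and FN denoting true/false positives and negatives, the reported metrics, together with a minimal version of the percentile bootstrap used for the confidence intervals, can be written as below; the paired Wilcoxon comparison of swarm versus single-node metrics corresponds to scipy.stats.wilcoxon(swarm_values, node_values, correction=True) (shown here only as a comment, since the study computed it in R).

import numpy as np

def binary_metrics(tp, fp, tn, fn):
    # Standard definitions of the reported performance metrics.
    sensitivity = tp / (tp + fn)                   # true-positive rate (recall)
    specificity = tn / (tn + fp)                   # true-negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    balanced_accuracy = (sensitivity + specificity) / 2
    return dict(sensitivity=sensitivity, specificity=specificity, accuracy=accuracy,
                f1=f1, balanced_accuracy=balanced_accuracy)

def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=0):
    # Percentile bootstrap 95% confidence interval over per-permutation metric values
    # (a minimal sketch of the bootstrapping approach cited above).
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    boot_means = [rng.choice(values, size=len(values), replace=True).mean()
                  for _ in range(n_boot)]
    return np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])

# Paired swarm-vs-node comparison (illustrative call, values are per-permutation metrics):
# from scipy.stats import wilcoxon
# stat, p = wilcoxon(swarm_values, node_values, correction=True)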