key: cord-0147938-ey17rupt authors: Park, Sangjoon; Ye, Jong Chul title: Multi-Task Distributed Learning using Vision Transformer with Random Patch Permutation date: 2022-04-07 journal: nan DOI: nan sha: 38bef6f0b4de2f9809b4fd44e8b61f9bebb14db9 doc_id: 147938 cord_uid: ey17rupt

The widespread application of artificial intelligence in health research is currently hampered by limitations in data availability. Distributed learning methods such as federated learning (FL) and split learning (SL) have been introduced to solve this problem as well as data management and ownership issues, with their different strengths and weaknesses. The recent proposal of federated split task-agnostic (FeSTA) learning tries to reconcile the distinct merits of FL and SL by enabling multi-task collaboration between participants through the Vision Transformer (ViT) architecture, but it suffers from higher communication overhead. To address this, here we present a multi-task distributed learning method using ViT with random patch permutation. Instead of using a CNN-based head as in FeSTA, p-FeSTA adopts a randomly permuting simple patch embedder, improving multi-task learning performance without sacrificing privacy. Experimental results confirm that the proposed method significantly enhances the benefit of multi-task collaboration, communication efficiency, and privacy preservation, shedding light on practical multi-task distributed learning in the field of medical imaging.

ARTIFICIAL intelligence (AI) has been gaining unprecedented popularity thanks to its potential to revolutionize various fields of data science. Specifically, deep neural networks have attained expert-level performance in various applications of medical imaging [1], [2]. To enable AI models to offer precise decision support with robustness, an enormous amount of data is indispensable. However, data collected from the volunteer participation of only a few institutions cannot fully meet the amount required to guarantee robust performance. Even large public datasets may inevitably include unquantifiable biases stemming from limited geographic regions and patient demographics such as ethnicities and races, resulting in performance instability in real-world applications. Especially for newly emerging diseases like coronavirus disease 2019 (COVID-19), this limitation can be exacerbated, as it is hard to build a large, well-curated dataset with sufficient diversity promptly. Therefore, the ability to collaborate between multiple institutions is critical for the successful application of AI in medical imaging, but the rigorous regulations and ethical restrictions on sharing patient data are another obstacle to multi-institutional collaborative work. Several formal regulations and guidelines, such as the United States Health Insurance Portability and Accountability Act (HIPAA) [3] and the European General Data Protection Regulation (GDPR) [4], state strict rules regarding the storage and sharing of patient data. Accordingly, distributed learning methods, which perform learning tasks at edge devices in a distributed fashion, can be effectively utilized in healthcare research. Specifically, distributed learning was introduced to enable model training with data that reside on the source devices without sharing. Federated learning (FL) is one such method that enables distributed clients to collaboratively learn a shared model without sharing their training data [5].
However, it still has several limitations in that it is heavily dependent on client-side computation resources for parallel computation and is not completely free from privacy concerns due to gradient inversion attacks [6], [7]. Another distributed learning method, split learning (SL) [8], which splits the network into parts between the clients and the server, is a promising method that puts low computational loads on the edge devices; however, it has the disadvantage of high communication overhead between the clients and the server [9], and is also limited in privacy preservation, as the private data can be recovered by a malicious attack using feature hijacking and model inversion [10]. In addition, SL shows significantly slower convergence compared with FL and suboptimal performance under significantly skewed data distributions between clients [11].

Inspired by the modular decomposition structure of the Vision Transformer (ViT), a novel distributed learning method dubbed Federated Split Task-Agnostic learning (FESTA) was recently proposed for distributed multi-task collaboration using the ViT architecture [12]. The FESTA framework, equipped with a shared task-agnostic ViT body on the server side and multiple task-specific convolutional neural network (CNN) heads and tails on the client side, was able to balance the merits of FL and SL, thereby improving the performance of individual tasks under a distributed multi-task collaboration setting to a level even better than a single-task expert model trained in a data-centralized manner. Nevertheless, there remain several critical limitations of the FESTA framework. First, the communication overhead is higher than that of SL and FL, as the model should continuously share features and gradients as well as the head and tail parts of the network, which may impose difficulties in practical implementation. Second, we found that the large head and tail parts in the original FESTA tend to reduce the role of the shared body, resulting in a small improvement compared to single-task learning despite the ViT's potential for multi-task learning (MTL). Finally, the FESTA framework was not free from privacy issues, as the features transmitted to the server body can be hijacked and reverted to the original data by outside malicious attackers or an "honest-but-curious" server, in the same manner as in SL.

To alleviate these drawbacks, here we introduce the p-FESTA framework, a Federated Split Task-Agnostic learning with permuting pure ViT, which enables communication-efficient MTL with privacy preservation. Although the overall composition of p-FESTA is similar to that of FESTA, instead of using a CNN-based head, p-FESTA adopts a simple and task non-specific patch embedder like a vanilla ViT, enforcing the self-attention within the transformer architecture to improve the MTL performance. For privacy preservation, we introduce a Permutation module that randomly shuffles the order of all patch features before sending them to the server, to prevent either an outside attacker or an "honest-but-curious" server from reverting the features into the original data containing private information. This architectural change gives p-FESTA several unique advantages. First, the communication overhead is reduced significantly by saving the features to be used throughout the entire learning process. Furthermore, the benefit of MTL is enhanced by enforcing the head to play a small role and the multi-task body to do the heavy lifting.
In addition, data privacy is also enhanced with a simple but effective permutation module that exploits an intrinsic property of the ViT.

ViT [13], a recently introduced deep learning model equipped with an exquisite attention mechanism inspired by its successful application in natural language processing (NLP), has demonstrated impressive performance across many vision tasks. The multi-head self-attention in ViT can flexibly attend to a sequence of patches of the image to encode the cue, enabling the model to be robust to nuisances like occlusion, spatial permutation, and adversarial perturbation, and thereby making the model more shape-biased, like humans, than CNN-based models [14]. In addition, the modular design of the ViT is straightforward, implying that the components can be easily decomposed into parts: a head to project the image patches into embeddings, a transformer body to encode the embeddings, and a tail to yield the task-specific output. This easily decomposable design offers the possibility of its application to MTL. Recall that the motivation of MTL originates from attempts to mitigate the data insufficiency problem, where the numbers of data for individual tasks are limited. MTL can offer the advantages of improving data efficiency, reducing overfitting through shared representation, and faster convergence by leveraging auxiliary knowledge. Specifically, MTL with transformer-based models has emerged as a popular approach to improve the performance of closely related tasks in NLP [15], [16]. In this approach, a shared transformer learns several related tasks simultaneously, such as sentence classification and word prediction, and a task-specific module yields the outcome for each task. As shown in previous literature [16], a model trained with an MTL strategy generally shows improved performance in a wide range of tasks. Although not as well studied as in language, the decomposable design of ViT has enabled the application of MTL to vision transformer models. In an early approach [17], the ViT was divided into task-specific heads and tails and a transformer structure shared across the tasks, and it was possible to attain similar generalization performance with fewer training steps by sharing the transformer model among the related tasks.

The main motivation of the existing FESTA framework, as described in Fig. 1A and Fig. 2B, was to devise a framework to maximally exploit the distinct strengths of the FL and SL methods and to improve the performance of individual tasks through collaboration between clients performing various tasks. Let $\mathcal{C} = \bigcup_{k=1}^{K} \mathcal{C}_k$ be a group of client sets with different tasks, where $K$ denotes the number of tasks and the client set $\mathcal{C}_k$ includes one or more clients having different data sources for the $k$-th task, i.e., $\mathcal{C}_k = \{c_1^k, c_2^k, \ldots, c_{N_k}^k : N_k \geq 1\}$. Clients in each client set for the $k$-th task have their own task-specific model architecture for a head $H_c$ and a tail $T_c$, while the server-side transformer body $B$ is shared. For training, the server and each client initialize the weights of each sub-network with random initialization or from pre-trained parameters. For learning rounds $i = 1, 2, \ldots, R$, each client $c$ performs a forward pass on its task-specific head $H_c$ using the local training data $\{(x_c, y_c)\}$ to obtain the intermediate features $h_c = H_c(x_c)$, which are sent to the server; the server-side body computes the encoded features $B(h_c)$ and returns them to the client, whose tail produces the estimate $\bar{y}_c = T_c(B(H_c(x_c)))$, and the forward pass finishes (a minimal sketch of this split forward pass is given below). Backpropagation is performed exactly the opposite way, in the order of tail, body, and head.
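The split forward pass can be summarized with a minimal PyTorch sketch. The module definitions below are illustrative stand-ins rather than the authors' implementation: in FESTA the head is a full task-specific CNN, whereas a single strided convolution is used here for brevity, and the shapes are assumed for a 224 × 224 input.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the FESTA sub-networks (not the authors' actual architectures).
head = nn.Sequential(nn.Conv2d(3, 768, kernel_size=16, stride=16),  # client-side head H_c
                     nn.Flatten(start_dim=2))                       # (B, 768, 196)
body = nn.TransformerEncoder(                                       # server-side shared body B
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=12)
tail = nn.Linear(768, 3)                                            # client-side task-specific tail T_c

x = torch.randn(4, 3, 224, 224)            # a local mini-batch on the client
h = head(x).transpose(1, 2)                # client: patch features (4, 196, 768), sent to the server
b = body(h)                                # server: encoded features, sent back to the client
y_hat = tail(b.mean(dim=1))                # client: task-specific estimate, here a 3-class logit
# Backpropagation then flows in the reverse order: tail -> body -> head.
```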
The loss is then calculated at the tail as $\ell_c(y_c, \bar{y}_c) = \ell_c\big(y_c, T_c(B(H_c(x_c)))\big)$, where $\ell_c(y, \bar{y})$ denotes the task-specific loss for client $c$ between the target $y$ and the estimate $\bar{y}$. Then, the gradients are passed from the tail to the body to the head, in reverse order of the forward propagation, using the chain rule. For the multi-task body update, the optimization is performed by fixing the heads and tails. For the task-specific head and tail updates, the optimization problem is solved by fixing the transformer body. In addition, every "UnifyingRounds", the server aggregates, averages, and distributes the head and tail parameters between clients participating in the same task, as in FedAvg [18].

In the previous study, FESTA along with MTL was shown to improve the individual performance of the collaborating clients, while resolving the data governance and ownership issues as well as eliminating the need to transmit the huge weights of the transformer body [12]. Nonetheless, the FESTA framework still has several drawbacks. First, the communication cost can be higher, since the features and gradients should be continuously exchanged between the server and clients as in SL, but the head and tail weights should also be aggregated between the clients as in FL. Accordingly, the total communication costs are inevitably higher than in SL, and even higher than in FL depending on the network size. Second, as shown in our ablation study without the transformer body, the CNN head and tail themselves already have strong representation capacity, which may diminish the role of the transformer body between the head and tail. Third, privacy concerns may arise as there is no privacy-preserving mechanism against model inversion attacks on the features transmitted from client to server.

The proposed p-FESTA is a framework devised to mitigate these shortcomings. As shown in Fig. 1B and Fig. 2B, the overall composition of p-FESTA is similar to that of FESTA, which decomposes the networks into a head $H$, body $B$, and tail $T$. However, unlike the previous FESTA, we do not use a CNN head tailored for each task. Having a CNN head powerful enough to play a major role in the task hinders the shared transformer from being an important component, as there remains little room to improve with this additional module. Instead, we adopted a simple and task non-specific patch embedder like a vanilla ViT, enforcing the self-attention within the transformer architecture to do the heavy lifting. Unfortunately, the use of patch embedding in a vanilla ViT may be prone to outside attackers who attempt to invert the patch embedder to obtain the original images. To address this, here we propose a novel Permutation module, as depicted in Fig. 1B, to prevent either an outside attacker or an "honest-but-curious" server from reverting the features into the original data containing private information. Specifically, this Permutation module randomly shuffles the order of all patch features before sending them to the server, and stores the key to reverse the permutation on the client side. Then, the transformer body $B$ in the server performs a forward pass with the permuted patch features and sends the encoded features back to the clients. Finally, the client reverses the permutation with the saved key and yields the final output by passing the reverted features to the task-specific tail $T_k$. The backpropagation is performed in the exact opposite way, where the same Permutation module as in the forward propagation is utilized.
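A minimal sketch of such a Permutation module is given below. The class and method names are assumptions for illustration, not the released implementation: each sample receives its own random permutation of the patch sequence, the inverse key never leaves the client, and the same key is used to restore the order of the encoded features returned by the server.

```python
import torch

class PatchPermuter:
    """Client-side module: shuffle patch tokens per sample and remember the inverse key."""

    def permute(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim) patch embeddings, position embedding already added
        b, n, _ = tokens.shape
        perm = torch.stack([torch.randperm(n) for _ in range(b)])   # an independent key per sample
        self.inverse_key = perm.argsort(dim=1)                      # kept on the client only
        return torch.gather(tokens, 1, perm.unsqueeze(-1).expand_as(tokens))

    def unpermute(self, tokens: torch.Tensor) -> torch.Tensor:
        # restore the original patch order with the saved key
        return torch.gather(tokens, 1, self.inverse_key.unsqueeze(-1).expand_as(tokens))

permuter = PatchPermuter()
h = torch.randn(2, 196, 768)               # patch embeddings from the (frozen) patch embedder
h_sent = permuter.permute(h)               # transmitted to the server in shuffled order
encoded = h_sent                           # placeholder for the server body's reply in this sketch
restored = permuter.unpermute(encoded)     # client restores the order before the task-specific tail
assert torch.equal(restored, h)            # holds here because 'encoded' is just the shuffled input
```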
The applicability of the Permutation module is attributable to an intriguing property of ViT: all the components composing the transformer body, such as multi-head self-attention, the feed-forward network, and layer normalization, are fundamentally "permutation equivariant" [14]. They are processed independently in a patch-based manner, and the order of the patches does not affect the outcome; therefore, the transformer body can be trained without any performance degradation (a small numerical check of this property is sketched below). In addition, as the orders of the patches are completely shuffled, it is infeasible for a malicious attacker to successfully revert the original image. How the Permutation module can provide privacy protection from a malicious attacker is described in more detail in the following.

For FL, privacy is improved by the ephemeral and focused nature of the federated aggregation, averaging, and distribution of the model updates, assuming that the model updates are less informative than the original data. However, recent studies have cast doubt on this sense of security, showing that private data can be uncovered faithfully using only these local model updates [19]-[21]. In detail, given access to the global model W and the client's model update ∆W, the attacker can optimize an input image from a prior to produce a gradient that matches the client's model update, as illustrated in Fig. 3A. However, this type of attack is infeasible for the proposed p-FESTA method, since only the tail part of the entire model is aggregated and distributed by the server to the clients. For instance, for COVID-19 classification, the task-specific tail is a simple linear classifier, from which the original private data cannot be uncovered.

SL protects privacy in a different way. As the name suggests, SL splits the entire model into client-side and server-side sub-networks and does not send the models between the server and clients. Instead, the features and gradients are transmitted back and forth between the server and clients, and they can fall prey to a malicious attacker [10]. As described in Fig. 3B, when clients send the intermediate features $f$ to the server, the attackers may hijack the features and, instead of running the remainder of the SL model, train three components with their own data: an encoder $\hat{F}$, a decoder $G$, and a discriminator $D$. The discriminator $D$ is trained to discriminate between the hijacked feature $f$ and the feature $\hat{f}$ encoded by $\hat{F}$, which enforces $\hat{f}$ to lie in the same feature space as $f$. Simultaneously, $G$ learns to decode $\hat{f}$ into the image with minimal error. Then, the well-trained $G$ can also be used to faithfully decode the hijacked feature $f$ into the original data.

The feature space hijacking is also possible for our p-FESTA. To make matters worse, the head part of our model is relatively simple and can be easy prey for an attacker. This is why we introduce a novel Permutation module to protect privacy, as described in Fig. 4A. The Permutation module randomly shuffles the order of all patch features. In the implementation, the permutations for each data sample of each client are all different without any regularity, as shown in Fig. 4B, resulting in innumerable patterns over all the data. With these random permutations, even if a malicious attacker or the server steals patch features to uncover private data, the parameters of the position embedding, an unknown learnable variable, cannot be inferred, as they have no information about the original order of the patches.
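The permutation equivariance underlying this design can be checked numerically with a stock PyTorch encoder layer standing in for the ViT body (an illustrative sanity check, not part of the proposed framework): permuting the input tokens and then encoding gives the same result as encoding first and permuting the output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# One encoder layer (self-attention + feed-forward + layer norm), dropout disabled via eval().
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True).eval()

x = torch.randn(1, 196, 768)                     # a sequence of 196 patch tokens
perm = torch.randperm(196)

with torch.no_grad():
    encode_then_permute = layer(x)[:, perm, :]   # encode, then permute the output tokens
    permute_then_encode = layer(x[:, perm, :])   # permute the input tokens, then encode

# Without positional information inside the layer, the two agree up to numerical precision.
print(torch.allclose(encode_then_permute, permute_then_encode, atol=1e-5))   # expected: True
```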
It is also infeasible to invert the patch features to image patches, since the added position embedding, which is unknown to the attacker, should be subtracted first for the inversion. This creates a contradiction in which the attacker should already know one "unknown" to infer the other "unknown", making the inversion attack a type of underdetermined problem.

The learning process of p-FESTA is akin to that of the original FESTA, but dissimilar in several aspects. Instead of a task-specific head $H_k$ for each task $k$, the task non-specific patch embedder $H$ prepares the patch embeddings $h_c$ for each client $c$ at the beginning and sends them to the server after passing them through the Permutation module. The server then saves the received patch embeddings $h_c$ on its side and uses them throughout the remainder of the learning process to update the body $B$ and tail $T_k$ parts of the model. Consequently, the overall communication costs can be significantly reduced compared to the original FESTA, as the communications to send the intermediate features $h_c$ or to update the head $H$ are no longer required. As can be seen, the head part of the model, the patch embedder, cannot be updated in this configuration. However, fixing the parameters of the patch embedder did not bring about any performance degradation, thanks to its simple structure that merely embeds the image patches into the same vector space. Making it trainable instead slightly decreased the performance by introducing a discrepancy in embeddings between tasks. The experimental results are provided in the ablation study of Section III-H. The detailed process of the proposed p-FESTA is formally described in Algorithm 1.

As for the head part, we used the task non-specific patch embedder consisting of a convolution layer with a kernel size of 16 × 16, a stride of 16, an input channel of 3, and an output channel of 768. For the server-side body, the transformer encoder of the ViT-base model, consisting of 12 encoder layers and 12 attention heads, was used. For the tail part, network architectures specialized to yield the task-specific outputs were adopted. For the COVID-19 classification task, we used a simple linear classifier. For severity prediction, the mapping module with five up-sizing convolution layers was adopted, as proposed in [22]. For pneumothorax segmentation, the decoder part of U-Net [23] was used.

We simulated distributed MTL between institutions participating in three different CXR tasks: classification and severity prediction of COVID-19, and pneumothorax segmentation. As in [12], the model was first initialized with weights pre-trained on the CheXpert dataset. We minimized the binary cross-entropy (BCE) losses for each class for the classification task. The severity of COVID-19 was predicted and evaluated in an array-based manner, as suggested in [24]. Specifically, BCE losses for each of the six location arrays of the lung were used for the optimization in severity prediction. Finally, for pneumothorax segmentation, we minimized the binary cross-entropy loss combined with dice and focal losses. The SGD optimizer was used for the classification and severity prediction tasks, while the Adam optimizer was utilized for the segmentation task, with a learning rate of 0.0001 and a warm-up constant learning rate scheduler for all tasks. The batch size was 4 per client, and the warm-up step was 500. The total number of training rounds was 6,000 for all clients, and the tail weights were averaged every 100 local iterations.
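For concreteness, the task non-specific patch embedder described above amounts to a single strided convolution kept frozen on every client; a minimal sketch follows, together with a simple FedAvg-style parameter averaging helper of the kind used for the periodic tail aggregation (both written as assumptions about the implementation, not the released code).

```python
import torch
import torch.nn as nn

# Shared, task non-specific patch embedder: 16x16 patches of a 3-channel CXR mapped to 768-dim tokens.
patch_embedder = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)
for p in patch_embedder.parameters():
    p.requires_grad = False                 # the embedder stays fixed throughout training

x = torch.randn(4, 3, 224, 224)             # batch size of 4 per client, assumed 224x224 inputs
tokens = patch_embedder(x).flatten(2).transpose(1, 2)   # (4, 196, 768) tokens for the ViT-base body

# FedAvg-style averaging of tail weights across clients on the same task (e.g., every 100 iterations).
def average_state_dicts(state_dicts):
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
            for k in state_dicts[0]}
```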
To adjust the scale of the gradients, 1:2:10 gradient scaling was applied to classification, severity prediction, and segmentation, respectively. FL, SL, FESTA, and p-FESTA were simulated on a modified version of the Flower framework (licensed under an Apache-2.0 license) [25]. All experiments were performed with Python 3.8 and PyTorch 1.8 on Nvidia RTX 3090 and 2080 Ti GPUs.

One of the paramount motivations for FL in medical imaging is to build a robust model leveraging the dispersed and small-sized datasets from multiple institutions while avoiding data governance issues. Therefore, we assume an FL scenario in which the data of several clients are scanty. For COVID-19 classification and severity prediction, we used both publicly available datasets and private data collected from local institutions. Overall, 1093 CXR images from a local hospital were included; the detailed data compositions are summarized in Table III. For a practical simulation of collaboration between hospitals, we allocated non-overlapping data sources to each client, except for the pneumothorax segmentation task, where the exact sources of the data cannot be estimated. Overall, six clients participated in the MTL scenario, two clients per task. For this study, Institutional Review Board approvals of each participating hospital were obtained and informed consent was waived.

Considering the sizes and compositions of each client, the collaboration for the COVID-19 classification task can be regarded as a collaboration between clients all having small data with a substantial imbalance in data distribution. Likewise, the collaboration for COVID-19 severity can be considered to simulate an imbalance in data size between the participants, in which one client has scanty data while the other client has relatively sufficient data, in addition to differences in data composition. Finally, the clients for pneumothorax segmentation emulate the situation in which each participating client has relatively sufficient and homogeneous data of similar size. When viewed in terms of the relevance between tasks, the COVID-19 classification and severity prediction tasks can be considered highly correlated tasks, while the pneumothorax segmentation task may be regarded as a less relevant task.

To evaluate the classification performance, the area under the receiver operating characteristic curve (AUC) was used. For the severity prediction task, the mean squared error (MSE) of the prediction was used, as in the previous work [22]. To evaluate the segmentation accuracy, the Dice coefficient was calculated to measure the intersection of the segmentation results and the ground truth annotations. All experiments were repeated with three different seeds to exclude the possibility of coincidentally over- or underestimated results.

Table IV shows a comparison of the proposed p-FESTA with data-centralized training, other distributed learning methods, and the original FESTA method. For a fair comparison, all other methods underwent the same pre-training step as the proposed method. The same model architectures were used for all other distributed learning methods except for the original FESTA method, in which the task-specific CNN head is a key part of the method. For the original FESTA method, DenseNet-121 equipped with the PCAM operation [29], tailored for CXR classification, was used as the head instead of the simple patch embedder, as in our previous work [22]. The single-task models trained with either FESTA or p-FESTA showed performances of a similar order of magnitude on the three tasks compared with data-centralized training, surpassing those of FL and SL.
The improvements with p-FESTA over FL and SL were noticeable, especially for the classification and severity prediction tasks, where the data insufficiency and imbalance problems are prominent. Note that the slightly better performance of the single-task model trained with the original FESTA than with p-FESTA for the severity quantification task, which is even better than data-centralized learning, may be attributable to the more expressive task-specific CNN head tailored for CXR tasks. As shown in Table IV, the model obtained with MTL between the three tasks using p-FESTA significantly outperforms the single-task counterparts and all other distributed learning methods. Note that the performance gain with MTL over the single-task model is more prominent with p-FESTA than with the previous FESTA. Even when compared with the MTL model obtained with FESTA, p-FESTA showed similar or slightly better performance in severity prediction and segmentation, but substantially outperformed the previous method in the classification task, providing generally superior performance. The fact that the benefit of MTL is substantial in classification and severity prediction is intriguing, as these are the tasks in which scanty data with skewed distributions are problematic. On the contrary, the performance improvement was modest for pneumothorax segmentation, where each participating client has a relatively large amount of evenly distributed data. Moreover, the close relevance between COVID-19 classification and severity prediction might have further enhanced the benefit of MTL for those tasks, compared with the relatively less relevant task of pneumothorax segmentation.

In this section, we provide the estimated communication costs between the server and the clients. Given the number of data as $D$, the batch size as $B$, the number of rounds between aggregation and distribution by the server as $n$, and the transmission costs of the features, the gradients, and the head, body, and tail parameters as $F$, $G$, and $P_h$, $P_b$, $P_t$, respectively, the communication cost $T$ of each distributed learning strategy for a total of $R$ rounds between the server and one client can be formulated accordingly, where a constant factor of 2 is multiplied to account for the two-way transmissions between the server and the client. Note that the costs for the feature and gradient transmissions are not multiplied by 2 in p-FESTA, to reflect that no features or gradients are transmitted to the head during the learning process. Numerically, the communication cost for each distributed learning method in our experimental setting can be calculated as in Table V. As one of its critical drawbacks, the communication cost of FESTA is larger than that of SL and even higher than that of FL. On the contrary, the proposed p-FESTA substantially lessens the communication burden by saving the head features at the beginning and using them throughout the entire learning process on the server side. In our experimental setting, the total communication overhead of the proposed p-FESTA is less than half of that of the previous FESTA, and also significantly lower than that of SL as well as FL.

The results of the ablation studies are presented in Table VI. 1) Fixed Head: We first performed an ablation to verify that fixing the head parameters, i.e., the patch embedder, does not harm the performance. Compared with the proposed method with a fixed head, the same model trained with a learnable head showed similar or even slightly worse performance for severity prediction and segmentation, which may be attributed to overfitting to the training data.
Therefore, we concluded that the patch embedder can be fixed during all the learning rounds without concerns of performance degradation. 2) Permutation Module: We next ablated the Permutation module to verify whether our method is indeed "permutation equivariant". For this proposition to be true, the performance should be the same regardless of the presence of the Permutation module. As expected, the performances with and without the Permutation module were of the same order of magnitude, with the differences falling within the standard deviations, confirming that the permutation does not affect the performance of the transformer model. 3) Position Embedding: In the proposed method, the position embedding plays two roles: first, it provides position information to yield the final output in the tail, and second, it adds an unknown parameter to prevent an attacker from inverting patch features into image patches. We performed an ablation study to confirm that the position embedding is necessary for optimal performance in addition to privacy preservation. The model trained without the position embedding showed slightly lower performance than that with the position embedding in all tasks, suggesting that the position embedding is indispensable for the best performance as well as for privacy preservation.

In this work, we introduced a significantly improved federated split task-agnostic learning framework with permuting pure ViT, dubbed p-FESTA, which resolves the major drawbacks of our previous FESTA framework by leveraging the intrinsic properties of the ViT. The newly proposed p-FESTA substantially reduces the communication overhead between the server and clients as well as enhances the performance with authentic multi-task training in the same embedding space, while offering better privacy preservation. Table VII summarizes the comparison between the proposed p-FESTA, the original FESTA, and other distributed learning methods. One of the most challenging problems of the previous FESTA method was the communication overhead between the server and clients, since it requires feature and gradient transmission as in SL, as well as the server-side aggregation and distribution of the head and tail parameters for each client. This configuration increases the communication cost to be inevitably larger than that of SL, and even larger than that of FL depending on the network sizes of each model component. To mitigate the problem, we configured the head part to be a simple structure like a patch embedder, so that the pre-trained head can show the best performance without additional training of the head that would require communications back and forth between the server and clients. Consequently, the features from the head could be stored on the server side at the beginning and used throughout the entire learning process, reducing the overall communication cost to approximately half of that of the other distributed learning methods (a rough per-client accounting is sketched below). Moreover, having the head part be a common patch embedder provides another advantage: the image features of different tasks are embedded in the same embedding space. This increases the role of the following multi-task transformer and facilitates learning a better shared representation, compared with our previous method, in which task-specific CNN heads embed the features into different embedding spaces for each task and consequently confine the role of the transformer.
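As a rough back-of-the-envelope illustration of this accounting, the sketch below tallies approximate per-client traffic under the transmission patterns described in the text; the exact formulas and the measured numbers in Table V come from the paper itself, so the expressions and placeholder sizes here are assumptions for intuition only.

```python
def approx_comm_cost(R, n, F, G, P_h, P_b, P_t):
    """Approximate per-client communication totals; assumed patterns, not the paper's exact formulas."""
    fl      = 2 * (R / n) * (P_h + P_b + P_t)              # full model up and down at every aggregation
    sl      = 2 * R * (F + G)                              # features/gradients both ways at every round
    festa   = 2 * R * (F + G) + 2 * (R / n) * (P_h + P_t)  # SL-style traffic plus head/tail aggregation
    p_festa = F + R * (F + G) + 2 * (R / n) * P_t          # one embedding upload, one-way per-round
                                                           # traffic, and tail aggregation only
    return {"FL": fl, "SL": sl, "FESTA": festa, "p-FESTA": p_festa}

# Placeholder sizes (e.g., in MB); these are NOT the measured values reported in Table V.
print(approx_comm_cost(R=6000, n=100, F=2.5, G=2.5, P_h=30.0, P_b=330.0, P_t=1.0))
```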
Concerns may arise that using a simpler head structure and saving the features in the server-side memory could lead to privacy leakage, which is especially important for medical data. To alleviate this concern, we utilized an intriguing property of the ViT, the "permutation equivariance" of the self-attention mechanism. The patch features are randomly permuted before transmission to the server, making the inversion problem underdetermined for the attacker while not deteriorating the performance of the transformer.

The merit of our method can be maximized in "data-hungry" collaboration. As shown in the experiments, the performance gain was more prominent for the classification and severity prediction tasks, where the data of each client are scanty. Given that one of the important motivations of distributed learning is to enable building a robust model without data centralization with the participation of many clients having limited data, this potential gain in data-hungry collaboration will further incentivize its widespread application.

Nevertheless, our study is not free of limitations. First, even though we simulated practical collaboration between hospitals on the customized Flower framework, robustness to other challenging factors such as stragglers was not verified [6], [30]. Considering that connection instability is a common problem in online learning, it should be resolved technically ahead of real-world implementation. Second, we did not consider other types of attacks on distributed learning, such as model poisoning or data poisoning [31]-[33], which are beyond the scope of this work. For defense against these types of malicious attacks, existing methods [34]-[36] can be utilized along with our framework. Future work might verify the robustness against these types of malicious attacks.

In this paper, we proposed the novel p-FESTA framework with pure ViT, which elicits the synergy of MTL among heterogeneous tasks and reduces the communication overhead significantly compared to the existing FESTA. In addition, we also enhanced the privacy using the Permutation module in a way specific to ViT. We believe that our work is a step toward facilitating distributed learning among institutions wanting to participate in different tasks, mitigating the major drawbacks of the existing methods.
[1] Artificial intelligence in radiology
[2] Digital pathology and artificial intelligence
[3] Health insurance portability and accountability act
[4] The European Union general data protection regulation: what it is and what it means
[5] Federated learning: Strategies for improving communication efficiency
[6] Federated learning: Challenges, methods, and future directions
[7] Federated learning: Opportunities and challenges
[8] Split learning for health: Distributed deep learning without sharing raw patient data
[9] SplitFed: When federated learning meets split learning
[10] Feature space hijacking attacks against differentially private split learning
[11] End-to-end evaluation of federated learning and split learning for internet of things
[12] Federated split vision transformer for COVID-19 CXR diagnosis using task-agnostic training
[13] An image is worth 16x16 words: Transformers for image recognition at scale
[14] Intriguing properties of vision transformers
[15] Multi-task learning in natural language processing: An overview
[16] Multi-task deep neural networks for natural language understanding
[17] Pre-trained image processing transformer
[18] Communication-efficient learning of deep networks from decentralized data
[19] Inverting gradients - how easy is it to break privacy in federated learning?
[20] Deep leakage from gradients
[21] iDLG: Improved deep leakage from gradients
[22] Multi-task vision transformer using low-level chest X-ray feature corpus for COVID-19 diagnosis and severity quantification
[23] U-Net: Convolutional networks for biomedical image segmentation
[24] Clinical and chest radiography features determine patient outcomes in young and middle-aged adults with COVID-19
[25] Flower: A friendly federated learning research framework
[26] BIMCV COVID-19+: a large annotated dataset of RX and CT images from COVID-19 patients
[27] BS-Net: Learning COVID-19 pneumonia severity on a large chest X-ray dataset
[28] SIIM-ACR Pneumothorax Segmentation
[29] Weakly supervised lesion localization with probabilistic-CAM pooling
[30] Straggler-resilient federated learning: Leveraging the interplay between statistical accuracy and system heterogeneity
[31] Threats to federated learning: A survey
[32] Data poisoning attacks against federated learning systems
[33] How to backdoor federated learning
[34] Privacy and robustness in federated learning: Attacks and defenses
[35] PDGAN: A novel poisoning defense method in federated learning using generative adversarial network
[36] Learning to detect malicious clients for robust federated learning