title: Federated Cycling (FedCy): Semi-supervised Federated Learning of Surgical Phases
authors: Kassem, Hasan; Alapatt, Deepak; Mascagni, Pietro; AI4SafeChole Consortium; Karargyris, Alexandros; Padoy, Nicolas
date: 2022-03-14

Abstract — Recent advancements in deep learning methods bring computer assistance a step closer to fulfilling promises of safer surgical procedures. However, the generalizability of such methods is often dependent on training on diverse datasets from multiple medical institutions, which is a restrictive requirement considering the sensitive nature of medical data. Recently proposed collaborative learning methods such as Federated Learning (FL) allow for training on remote datasets without the need to explicitly share data. Even so, data annotation still represents a bottleneck, particularly in medicine and surgery where clinical expertise is often required. With these constraints in mind, we propose FedCy, a federated semi-supervised learning (FSSL) method that combines FL and self-supervised learning to exploit a decentralized dataset of both labeled and unlabeled videos, thereby improving performance on the task of surgical phase recognition. By leveraging temporal patterns in the labeled data, FedCy helps guide unsupervised training on unlabeled data towards learning task-specific features for phase recognition. We demonstrate significant performance gains over state-of-the-art FSSL methods on the task of automatic recognition of surgical phases using a newly collected multi-institutional dataset of laparoscopic cholecystectomy videos. Furthermore, we demonstrate that our approach also learns more generalizable features when tested on data from an unseen domain.
The rise of such applications can also be largely attributed to algorithms developed and benchmarked on large, publicly available labeled datasets reflecting a high level of surgical expertise. While this approach has demonstrated significant value, indicative of the potential of surgical data science to disrupt clinical practice, concerns regarding the scalability and generalizability of these methods are frequently echoed through much of the literature in the field [2]. Rightly, these concerns must be better understood, quantified, and addressed as we move towards the deployment of deep learning models in real-world settings. Firstly, the scalability of label-intensive approaches is severely hindered by their dependence on scarce annotations that are manually and laboriously generated by overburdened clinical professionals [2]. This bottleneck has limited the size of the few datasets that drive large parts of research in the field, spurring the development of methods that use fewer labels to supervise the training of deep learning models. In practice, this has meant using unlabeled data or easier-to-obtain labels in semi- or weakly-supervised settings [3]-[6], respectively. While significant progress has been made, inching towards fully-supervised performance with progressively fewer labels, potential biases towards the demographics and medical institutions represented in public datasets still remain. This is in part due to prohibitive restrictions on the transfer and publication of sensitive medical data arising from privacy and medico-legal concerns. Recently, a decentralized learning technique known as federated learning [7] has been gaining popularity, allowing models to be trained on remote edge devices or servers and circumventing the need to explicitly exchange data. Our work falls squarely at the intersection of federated and semi-supervised learning.
We present FedCy, a method for federated semi-supervised learning of surgical phases and, to our knowledge, the first work applying federated learning to surgical videos. Considering the practical value of being able to scale to private unlabeled medical datasets, we further show that FedCy can learn from completely unlabeled datasets using a single completely labeled dataset by leveraging temporal patterns that occur in all of them. To the best of our knowledge, FedCy is the first work using distinct but complementary training objectives with labeled and unlabeled data, outperforming state-of-the-art results for federated semi-supervised learning on labeled, unlabeled, and held-out datasets. To perform our analysis, we introduce a large, international multicenter dataset for surgical phase recognition containing 180 video recordings collected from 5 hospitals. To evaluate our method, we make use of a newly introduced annotation protocol for this dataset, designed to represent the highly variable workflow across different hospitals. This further highlights the difficulty of generating consistent labels and the need to minimize the annotations required from different medical institutions.

Surgical phase recognition is the coarse recognition of surgical workflow and is often formulated as the classification of video frames into one of several predefined phases. Several works [8]-[11] have been proposed to address this task due to its importance in building context-aware systems and improving surgical safety [12]. Whereas surgical phase recognition models have become increasingly performant over the years, these methods were neither trained nor evaluated on videos from multiple hospitals, limiting their deployment in the real world. This is mainly due to the sensitive nature of medical data, which impedes data sharing and, by consequence, data aggregation and training on centralized datasets.
Recently, less invasive, collaborative alternatives have been proposed for leveraging data from multiple sources in a privacy-preserving fashion. Federated learning has recently emerged as a promising collaborative learning method [13] that mitigates data privacy concerns and the dependence of deep learning methods on large centralized data lakes [7]. In its vanilla setting, federated learning is a process in which a central server coordinates multiple data owners and allows them to train collaboratively without having to explicitly expose their data. Data owners can be medical institutions or data centers and are hereinafter referred to as clients. Learning in a federated setting involves several clients iteratively training local models on their data, which are then aggregated by a central server to form a single global model that each client uses as the initialization for the next round of local training. Given the sensitive nature of medical data, federated learning has naturally been utilized in a variety of healthcare applications [14]. For instance, it has been used for detecting COVID-19 lung abnormalities [15] and for brain tumor segmentation [16]. In these studies [15], [16], federated learning demonstrated performance comparable to conventional data-sharing approaches while mitigating privacy concerns. To the best of our knowledge, no federated learning approach has yet been applied to surgical videos, which is the application context of this work. While federated learning bypasses the need to share data and facilitates collaborative training, the availability of labeled data still constitutes a major bottleneck, especially in healthcare applications. Participating in a supervised federated learning network requires each client to label their data. This is often prohibitively expensive in the medical domain, where data labeling commonly requires clinical expertise.
Even assuming the availability of this expertise, review and coordination to ensure consistency across multiple clients are not always feasible without shared data. This is a major concern given the complexity and inherent ambiguities that several medical applications present. Straightforward solutions may involve naive adaptations of generic semi-supervised methods [17]-[19] or of those proposed for surgical phase recognition [3]-[6]. However, these methods were primarily designed for centralized datasets, and hence such adaptations may not be suitable if the labeled and unlabeled data come from different distributions. Still, the few federated learning works that address label deficiency [20]-[22] demonstrate the feasibility and value of exploiting unlabeled data from multiple sources. Federated unsupervised representation learning was studied in [23], [24], in which multiple clients collaborate to learn useful data representations on unlabeled data before leveraging this learned knowledge in supervised downstream tasks, such as classification. Correspondingly, federated unsupervised representation learning has also demonstrated value in medical imaging, specifically for detecting COVID-19 from chest x-ray scans [25], cardiac MRI segmentation [26], and brain MR anomaly segmentation [27]. Unsupervised domain adaptation has also been formulated in a federated setting [28]-[30] to preserve privacy, in which clients with labeled data collaborate with clients with unlabeled data to enhance training for the latter. A similar training scenario involving both labeled and unlabeled data, under which our work falls, is Federated Semi-Supervised Learning (FSSL). In FSSL, clients with labeled and/or unlabeled data collaborate to enhance training for all clients. FSSL has been addressed in several works, most of which utilize pseudo-labeling, a technique to automatically generate artificial labels from model predictions.
[31] first introduced a simple framework utilizing pseudo-labeling in a federated setting. Other works proposed methods to enhance pseudo-labels. [32] utilized a dynamic thresholding strategy for selecting pseudo-labels based on model confidence, along with having multiple clients vote to generate pseudo-labels. [33] utilized peer learning and ensemble averaging across multiple clients. The aforementioned works primarily addressed the scenario where clients have partially labeled datasets, i.e., all clients have both labeled and unlabeled data. In contrast, our work addresses the scenario where a client with a fully labeled dataset and multiple clients with fully unlabeled datasets collaborate for FSSL. This formulation decoupling labeled and unlabeled datasets has only recently been garnering attention [34]-[40], but could have crucial practical implications for the scalability of models in healthcare applications by facilitating the participation of clients with completely unlabeled data. This scenario is referred to as FSSL with global semi-supervision [40]. In other applications where large labeled datasets are more readily available or can be transferred, the roles of the aggregating server and the labeled data owner may overlap; this scenario is instead referred to as the labels-at-server scenario [34]-[38]. [34] introduced an inter-client consistency loss for mitigating domain shifts and a model decomposition for decoupling supervised and unsupervised learning. [35] proposed a method to minimize gradient diversity across client models by replacing batch normalization with group normalization [41] and introducing a new model averaging alternative to federated averaging. [36] proposed an adaptive layer-wise parameter selection method for uploading models for aggregation. Other works that also achieve competitive performance on benchmark datasets include [37]-[39].
[37] adapted a combination of two state-of-the-art semi-supervised methods, FixMatch [17] and MixMatch [19], to a federated setting. [38] and [39] proposed methods based on contrastive learning and knowledge distillation, respectively. Apart from classification, FSSL has also been used for the task of COVID-19 region segmentation in chest computed tomography scans [40]. A crucial aspect of our work is the importance of learning relevant temporal information from surgical videos, an issue that has not been explicitly addressed in any of the above works. Temporal modeling in federated learning and, more specifically, in FSSL is still relatively unexplored. Only a few applications exist, such as traffic flow forecasting [42], human activity recognition [43], [44], audio recognition [45], and machine fault diagnosis [46]. Finally, it is worth noting that in all the mentioned FSSL methods except [40], data distribution shifts across clients were simulated artificially based on label distributions. Aside from label distribution differences, our data are also characterized by a heterogeneity originating from the different hospitals, such as variability in surgical workflow, demographics, and hardware. This practical difference highlights the need for further study on real-world data, which has not been well investigated in previous works.

In this section, we describe our proposed federated semi-supervised learning method for surgical phase recognition. Surgical phase recognition is a single-label multiclass classification problem over surgical video frames. Though distinct federated learning scenarios using varying amounts of labeled data at different locations have been described in previous works [33], [34], [40], our work specifically assumes the presence of one fully labeled private dataset and several other completely unlabeled private datasets.
Whereas our proposed method can be extended to utilize multiple labeled datasets, we choose to tackle this instance of label deficiency due to the practical limitations of generating consistent annotations for complex tasks without shared data and review processes. This represents the real-world use-case of scaling models to clients that do not have the technical and clinical bandwidth to generate consistently labeled datasets. As discussed in Section II and empirically shown in our results, previous FSSL approaches may be suboptimal for the task of phase recognition. We therefore design our method to learn temporal patterns found in videos while effectively harnessing the task knowledge that can be learned from the labeled data. We adapt Temporal Cycle Consistency [47], a self-supervised learning technique for learning temporal patterns, by guiding it to learn more task-relevant features through concurrently optimizing a contrastive loss on the labeled data. In the rest of this section, we first describe two core components of FedCy - Temporal Cycle Consistency and contrastive learning - and finally the federated training process.

Temporal Cycle Consistency Learning (TCC) [47] is a self-supervised method for learning temporal correspondences between videos. TCC has been shown to be useful as a self-supervised pretraining task for learning spatio-temporal representations that can boost performance on downstream tasks such as activity recognition in videos [47]. In this subsection, we briefly describe the concept of cycle consistency learning, which we later adapt for the task of surgical phase recognition. Consider two sequences of video frames, S = {s_k}_{k=1}^{n} and T = {t_j}_{j=1}^{m}, and let U = {u_k}_{k=1}^{n} and V = {v_j}_{j=1}^{m} denote their respective sequences of frame embeddings. Finally, let Q denote a similarity metric between two feature vectors (e.g. cosine similarity). For an embedding u_k ∈ U, let:
• v_r denote the nearest neighbor of u_k in V;
• u_l denote the nearest neighbor of v_r in U.
An embedding u_k is said to be cycle consistent if l = k.
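The hard cycle-consistency check above can be sketched in a few lines (a minimal NumPy illustration under our notation, not the authors' implementation; cosine similarity plays the role of Q, and indices are 0-based):

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def is_cycle_consistent(U, V, k):
    """Check whether embedding U[k] cycles back to itself through V.

    v_r is the nearest neighbor of u_k in V; u_l is the nearest
    neighbor of v_r in U; u_k is cycle consistent iff l == k.
    """
    r = int(np.argmax(cosine_sim(U, V)[k]))  # nearest neighbor of u_k in V
    l = int(np.argmax(cosine_sim(V, U)[r]))  # nearest neighbor of v_r in U
    return l == k
```

This hard check is non-differentiable (argmax), which motivates the soft reformulation described next.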
[47] reformulated this constraint as a regression task with a differentiable loss function, described below, that can be used as a learning objective to train deep neural networks.

2) Consistency loss for a pair of sequences (S, T): For an embedding u_k ∈ U, we compute its soft nearest neighbor

ṽ = Σ_{i=1}^{m} α_i v_i,  with α_i = exp(Q(u_k, v_i)/τ) / Σ_{j=1}^{m} exp(Q(u_k, v_j)/τ).  (1)

Here, we use the softmax function to compute α = {α_i}_{i=1}^{m}, a vector representing the similarity of u_k to each of the embeddings of V, with τ the softmax temperature used to scale the logits fed into the softmax function. Similarly, we compute β = {β_i}_{i=1}^{n}, the vector representing the similarities of ṽ to each of the embeddings of U. For u_k to be cycle consistent with respect to V, β would need to show peak-like behavior at its k-th entry. The loss function described below enforces this using a Gaussian prior imposed on β:

L(S, T, k) = (k − μ)² / σ² + λ_σ log(σ),  where μ = Σ_{i=1}^{n} β_i · i and σ² = Σ_{i=1}^{n} β_i (i − μ)²,  (2)

and λ_σ is the weight for the variance regularization loss term that forces β to be sharper around its k-th entry. The final consistency loss for two sequences of frames S and T is thus:

L_TCC(S, T) = Σ_{k=1}^{n} L(S, T, k).  (3)

3) Clip Sampling: A straightforward implementation of the cycle consistency loss would involve training on multiple full videos. However, this approach is not easily scalable, particularly when dealing with long videos such as surgical recordings, due to memory constraints. Short clips are therefore sampled from videos and used for training instead. We define a clip as a set of frame IDs c chosen per sampling strategy. Let V be a video of n frames, whose IDs are defined by the set {1, 2, 3, ..., n}, and let k be the size of the clip to be sampled. In [47], two sampling strategies were used in the case of long videos, albeit for much shorter durations than the surgical videos used in this work:

1) Uniformly Strided Sampling with Offset: c = {o + is : i = 0, 1, ..., k − 1}, where s > 0 is the fixed stride and o is a randomly (uniformly) chosen offset constrained by o + (k − 1)s ≤ n.
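The soft cycle-back regression above can be sketched as follows (an illustrative NumPy version under our reading of TCC [47], assuming a dot-product similarity for Q; function names and 0-based indexing are ours, not from the paper's code):

```python
import numpy as np

def softmax(x, tau):
    """Temperature-scaled softmax over a 1-D array of similarities."""
    z = x / tau
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def cycle_back_regression_loss(U, V, k, tau=0.05, lam_sigma=1.0):
    """Soft cycle-consistency loss for embedding U[k] (cf. equations 1-2).

    alpha: similarities of u_k to V -> soft nearest neighbor v~.
    beta:  similarities of v~ back to U, treated as a distribution over
    frame indices; a Gaussian prior penalizes deviation of its mean
    from k and regularizes its variance.
    """
    u_k = U[k]
    alpha = softmax(V @ u_k, tau)        # similarity of u_k to each v_i
    v_tilde = alpha @ V                  # soft nearest neighbor in V
    beta = softmax(U @ v_tilde, tau)     # similarity of v~ to each u_i
    idx = np.arange(len(U))
    mu = float(beta @ idx)               # mean index of beta
    var = float(beta @ (idx - mu) ** 2)  # variance of beta
    return (k - mu) ** 2 / var + lam_sigma * np.log(np.sqrt(var))
```

A cycle-consistent embedding yields a β peaked at index k (small regression error, small variance), so the loss is low; when the cycle breaks, the regression and variance terms grow.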
2) Random Sampling with Offset: c = {s_1, s_2, ..., s_k}, where the s_i are randomly (uniformly) chosen frame IDs constrained by a fixed offset o: s_i ≥ o ∀i and s_i > s_j ∀i > j.

Laparoscopic cholecystectomy videos are usually characterized by highly variable phase durations across different videos [8]. We hypothesize that using a fixed stride (sampling strategy 1) may thus not be efficient in our case. Additionally, since some phases are much shorter than others, sampling strategy 2 may not efficiently represent entire procedures during training. We therefore introduce a clip sampling strategy that mitigates the above concerns, depicted in Fig. 1. Each video is divided into k equal partitions, k being the size of the clip to be sampled. Then, 1 frame is randomly selected from each partition to generate a clip of size k. We sample ⌊L/k⌋ non-overlapping clips from each video to be used in each epoch of training, where L is the corresponding video length.

Contrastive learning was first introduced in [48]. Given a frame d_a (the anchor), a positive frame d_p (e.g. a frame that belongs to the same class as d_a, or an augmented version of d_a), and a negative frame d_n, the principle is to learn a feature representation φ that maps d_a, d_p, d_n to feature vectors f_a, f_p, f_n while maximizing the similarity of (f_a, f_p) and minimizing the similarity of (f_a, f_n). In this work, we use the NT-Xent loss function for contrastive learning, as termed in [49]. For an anchor f_a, a positive f_p, and a set of negatives F_n:

ℓ(f_a, f_p, F_n) = −log [ exp(Q(f_a, f_p)/τ) / (exp(Q(f_a, f_p)/τ) + Σ_{f_n ∈ F_n} exp(Q(f_a, f_n)/τ)) ].  (4)

C. Learning Objective

1) Problem formulation: Given a single client C_L with a private labeled local dataset D_L and a set of M clients {C_U^j}_{j=1}^{M} with private unlabeled datasets {D_U^j}_{j=1}^{M}, our aim is to learn a single global model that is effective on all the clients for phase recognition.
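The partition-based sampling strategy described above can be sketched as follows (an illustrative helper with 0-based frame IDs, not the authors' code; in training, ⌊L/k⌋ such clips would be drawn per video per epoch):

```python
import random

def sample_clip(num_frames, k, rng=random):
    """Sample one clip of k frame IDs from a video of num_frames frames.

    The video is divided into k (approximately) equal partitions and one
    frame is drawn uniformly at random from each, so every part of the
    procedure is represented regardless of phase-duration variability.
    """
    bounds = [round(i * num_frames / k) for i in range(k + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(k)]
```

Because each frame comes from a distinct partition, the sampled IDs are strictly increasing and the clip spans the whole video, unlike a fixed-stride or purely random draw.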
The labeled dataset D_L can be represented by D_L = {(x_i, y_i)}_{i=1}^{n}, with n denoting the total number of frames, x_i denoting a frame, and y_i a one-hot label corresponding to one of the P phases (classes) represented in the dataset. Each unlabeled dataset can be represented by D_U^j = {c_i}_{i=1}^{n_c^j}, where c_i is a clip of k frames and n_c^j is the total number of clips in D_U^j, sampled according to the strategy defined in Section III-A.3. All the considered clients collaboratively train, in a federated setting coordinated by a central server, a global model comprising a feature extractor φ_G parameterized by ω_G and a classifier H_G parameterized by θ_G that produces softmax outputs corresponding to phase probabilities. The clients C_U^j will learn spatio-temporal information which, when supported by the supervised discriminative knowledge learned by the client C_L, can be useful for phase classification without the need for client-specific fine-tuning. As illustrated in Fig. 2, the training procedure is divided into three main steps, detailed below and executed repeatedly as follows:
• In parallel:
  - Unsupervised training by each client C_U^j.
  - Supervised training by the client C_L.
• Model aggregation by the central server.

2) Unsupervised Training: Each client C_U^j locally trains a feature extractor φ_U^j parameterized by ω_U^j by minimizing the temporal cycle consistency loss on the clips of its dataset using equations 1-3. For a given batch of clips B ⊂ D_U^j, each client optimizes the following:

L_U^j = λ_t Σ_{S, T ∈ B, S ≠ T} L_TCC(S, T),

where λ_t is a weight.

3) Supervised Training: Due to the complexity of the phase recognition task, unsupervised learning of temporal correspondences may not necessarily yield spatio-temporal representations that are directly useful for phase recognition. For example, recurring events such as unexpected bleeding that are not specific to any phase of the procedure may serve as confounding factors during training.
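One training round of the procedure above can be sketched schematically (a simplified simulation for intuition, not the actual distributed system: parameters are abstracted to flat vectors, local supervised/unsupervised training to update callables, and aggregation to a FedAvg-style weighted average by data fraction):

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Average client parameter vectors, weighted by data fraction."""
    total = sum(client_sizes)
    return sum((s / total) * p for p, s in zip(client_params, client_sizes))

def federated_round(global_params, clients):
    """One round: each client trains locally starting from the global
    model, in parallel, then the server aggregates the results.

    `clients` is a list of (local_update_fn, dataset_size) pairs;
    local_update_fn stands in for supervised training at the labeled
    client and TCC training at the unlabeled clients.
    """
    local_params, sizes = [], []
    for update_fn, size in clients:
        local_params.append(update_fn(global_params.copy()))
        sizes.append(size)
    return fedavg(local_params, sizes)
```

The returned global parameters would be broadcast back to all clients as the initialization for the next round.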
The objective function of the client C_L is thus designed so that it can guide the unsupervised training performed by the other clients. To do this, in addition to minimizing the supervised cross-entropy loss, we force representations of different classes to be more distant in the feature space by adding a contrastive loss term to the final objective of this client. We hypothesize that propagating this learned knowledge through federated learning will drive the network towards finding more task-specific temporal correspondences in the unlabeled data. Therefore, the client C_L will train a feature extractor φ_L of parameters ω_L and a classifier H_L of parameters θ_L by minimizing a contrastive loss term and a cross-entropy loss term. A batch of frames B ⊂ D_L can be represented as B = {J_1, J_2, ..., J_P}, where J_i contains the frames belonging to class i. Let {X_1, X_2, ..., X_P} denote the sets of the corresponding feature vectors extracted from B by φ_L. Then, the contrastive loss on B can be calculated using equation 4 as follows:

L_con(B) = Σ_{i=1}^{P} Σ_{f_a ∈ X_i} ℓ(f_a, f_p, F_n),

where f_p is a positive drawn from X_i \ {f_a} and F_n = ∪_{j≠i} X_j is the set of negatives for class i. The final objective of the client C_L will thus be:

L_L = L_ent(B) + λ_c L_con(B),

where L_ent is the cross-entropy loss function and λ_c is the weight of the contrastive loss term.

Aggregation at the Server: After each round of local training, the server aggregates the trained local models using FedAvg [7] into a global model (φ_G, H_G), which is then sent back for another round of local training:

ω_G = p_L ω_L + Σ_{j=1}^{M} p_U^j ω_U^j,

where p_L and p_U^j represent, respectively, the fractions of data contributed by clients C_L and C_U^j during each round.

To evaluate FedCy, we introduce Multicenter Cholecystectomy 2022 (MultiChole2022): a large multicenter dataset comprising 180 laparoscopic cholecystectomy (LC) videos.
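The supervised contrastive term can be sketched as follows (NT-Xent per anchor as in equation 4 with cosine similarity for Q, same-class frames as positives and all other classes as negatives; the averaging and positive-selection details here are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def nt_xent(f_a, f_p, F_n, tau=0.1):
    """NT-Xent loss [49] for one anchor: attract f_p, repel F_n."""
    def q(a, b):  # cosine similarity
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.exp(q(f_a, f_p) / tau)
    neg = sum(np.exp(q(f_a, f_n) / tau) for f_n in F_n)
    return float(-np.log(pos / (pos + neg)))

def supervised_contrastive(class_features, tau=0.1):
    """Average NT-Xent over anchors, with features grouped by phase label.

    class_features: list of arrays X_i of shape (n_i, d), one per class.
    For each anchor, positives are other features of the same class and
    negatives are all features of the other classes.
    """
    losses = []
    for i, X in enumerate(class_features):
        negatives = [f for j, Xj in enumerate(class_features)
                     if j != i for f in Xj]
        for a in range(len(X)):
            for p in range(len(X)):
                if p != a:
                    losses.append(nt_xent(X[a], X[p], negatives, tau))
    return sum(losses) / len(losses)
```

The loss decreases as same-phase features cluster together and different-phase features spread apart, which is exactly the geometry intended to guide the unlabeled clients.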
MultiChole2022 is composed of the 80 videos of the public dataset Cholec80 [8], which were collected at the University Hospital of Strasbourg, France, along with 4 sets of 25 videos collected in the following Italian hospitals: Policlinico Universitario Agostino Gemelli, Rome; Azienda Ospedaliero-Universitaria Sant'Andrea, Rome; Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico, Milan; and Monaldi Hospital, Naples. Participating hospitals only shared anonymized endoscopic videos through encrypted hard drives. No clinical data were harvested or shared. The datasets of these hospitals will be anonymously denoted by D_1-4. The inherent data diversity represented in MultiChole2022 could facilitate research on a variety of topics such as domain adaptation and federated learning for laparoscopic surgery. The LC workflow has been divided into phases in several ways [11]. In [8], an LC is divided into 7 phases: Preparation, Calot Triangle Dissection, Clipping and Cutting, Gallbladder Dissection, Gallbladder Packaging, Cleaning and Coagulation, and Gallbladder Extraction. To annotate the workflow in LC videos, an annotation protocol describing the visual cues signaling the start and end of each surgical phase must be defined. A robust annotation protocol ensures the reproducibility and consistency of annotations across annotators and hospitals [50].

1) The MultiChole2022 phase annotation protocol: Surgical workflows may differ across hospitals, and hence annotation protocols designed for videos from a specific hospital might not generalize well to videos from another center. For instance, the protocol used to annotate Cholec80 made large use of instrument presence to define the end of a phase corresponding to the beginning of the next, assuming a consistent use of instruments and a fixed order of phases across procedures performed in the same surgical department.
These assumptions do not hold when annotating a multicenter dataset, as the order of phases and instrument usage are more likely to vary across hospitals. In addition, some phases might be performed differently across hospitals. Fig. 3 illustrates the case of Gallbladder Extraction. In some hospitals, the gallbladder is extracted through the same trocar used to insert the endoscopic camera, whereas in others a lateral trocar is used. Consequently, in the first case the gallbladder extraction phase largely takes place out of sight, whereas in the second case strong visual cues can be defined that consistently mark the beginning of this phase. With these considerations in mind and a focus on capturing surgical semantics, we designed a new, robust annotation protocol, the MultiChole2022 protocol, and used it to annotate all 180 MultiChole2022 videos. Detailed specifications of this protocol are available in the supplementary material. To evaluate how well FedCy performs on data from hospitals not participating in training, we also annotate, following the same protocol, a subset of 6 videos from the TUM LapChole dataset [51]. These 6 videos were recorded at the Klinikum Rechts der Isar hospital in Munich, Germany.

We establish several baselines to contextualize FedCy with respect to various learning paradigms. In all cases, our aim is to train a ResNet-50 [52] model initialized from ImageNet [53] pretrained weights for the task of surgical phase recognition. In all semi-supervised experiments, we use Cholec80 [8] as the labeled dataset and refer to the 4 unlabeled datasets as D_1-4. Our first category of baselines comprises various fully- and semi-supervised methods assuming access to a single, fully labeled dataset and 4 completely unlabeled datasets, corresponding to 5 independent clients.
These methods were chosen to demonstrate the superiority of our approach over both naïve federated approaches and state-of-the-art designs that use the same amount of labeled data. The first baseline, FullSup-Cholec80, is trained on the fully labeled Cholec80 training set. This represents the standard non-federated approach that excludes external unlabeled data from training. The next baselines, FedFixMatch, FedUDA, and FedTCC, represent naïve adaptations of FixMatch [17], UDA [18], and TCC pretraining [47], respectively, using federated averaging [7]. Our two final baselines in this category, FedMatch [34] and FedRGD [35], are state-of-the-art FSSL approaches. Here, FedRGD proposes several changes to improve model performance, including a change to the network design that replaces batch normalization layers with group normalization [41]. To perform a fair comparison with FedRGD, we also implement our final FedCy model with group normalization, denoted FedCy-GN. Our second category of baselines includes only fully supervised approaches assuming that each of the included datasets is fully labeled. These baselines, which depend on the presence of additional labeled data, are included to provide the target performance that we would like to achieve through this and future semi-supervised work. In this category, FullSup-Each is a model trained with full supervision only on the dataset it is evaluated on. This represents the strictest scenario in terms of privacy, where no sharing of either data or models is involved in training. In contrast, FullSup-All represents the most relaxed scenario, where a single model is trained on all the data collected together and treated as a single, centralized dataset. Finally, federated averaging [7] represents the middle ground, where models but not data can be transferred freely. All presented models were trained on NVIDIA A100 GPUs. Videos were subsampled at 1 frame per second.
Soft data augmentation (shift, rotate, scale) was used in all experiments following [9]. Each experiment, except the pretraining phase of FedTCC, was run for a minimum of 6 epochs and stopped when the validation F1-score had not improved for 3 consecutive epochs. FedTCC was pretrained for 30 epochs. The Cholec80 dataset was divided into splits of 40-8-32 (training-validation-testing) videos following state-of-the-art usage. Each of the other 4 datasets in MultiChole2022 was divided into splits of 13-6-6 (training-validation-testing) by stratified random sampling. In all semi-supervised experiments, only the Cholec80 validation split was used. The validation splits of all the datasets were only used for the fully supervised baselines. We use a fixed learning rate of 5 × 10^-5 and a weight decay of 5 × 10^-5 for all models, chosen empirically based on the FullSup-Cholec80 experiment. Similarly, we fix a batch size of 64 for all models. Specifically, for FedCy and FedTCC, where unlabeled training happens on clips and not individual images, we set the clip size and batch size to 16 and 2, respectively. We use the Adam optimizer for all experiments, and a ResNet-50 [52] feature extractor pretrained on ImageNet [53] with a 2048-dimensional feature vector output. The classifier used on top of the ResNet-50 is a fully connected layer with a softmax activation. In FedRGD and FedCy-GN, the batch normalization layers of the ResNet-50 are replaced with group normalization [41] using 32 channel groups. In these two models, the group normalization layer weights are initialized with those of the corresponding batch normalization layers from the pretrained ImageNet weights. Finally, we also empirically set the TCC softmax temperature, the NT-Xent loss temperature, λ_c, and λ_t to 0.05, 0.1, 10, and 10, respectively, based on validation results.
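The batch-to-group normalization swap used in FedRGD and FedCy-GN relies on group normalization computing statistics per sample rather than per batch, making them independent of client batch composition. Its forward pass, with the 32 channel groups used here, can be sketched in NumPy (illustrative only; in practice the layers are replaced inside the ResNet-50):

```python
import numpy as np

def group_norm(x, gamma, beta, num_groups=32, eps=1e-5):
    """Group normalization over a batch of feature maps.

    x: array of shape (N, C, H, W); channels are split into `num_groups`
    contiguous groups and normalized per sample per group, so the
    statistics do not depend on the batch size.
    """
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    x = g.reshape(n, c, h, w)
    # gamma/beta are per-channel affine parameters, as in batch norm,
    # which is why they can be initialized from pretrained BN weights.
    return gamma.reshape(1, c, 1, 1) * x + beta.reshape(1, c, 1, 1)
```

Because the per-channel affine parameters have the same shape as batch normalization's, copying the pretrained BN scale and shift into GN, as done for FedRGD and FedCy-GN, is a direct assignment.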
In all federated experiments, clients communicate with the server after one epoch of local training per round, and each client uses a local optimizer whose state is excluded from model aggregation. For the state-of-the-art implementations, we refer the reader to the supplementary material for a complete list of the hyperparameters used. The F1-score is used to evaluate all models. We present the test results per client (Cholec80, D_1-4), macro-averaged across all unlabeled datasets (Overall_unlabeled), and macro-averaged across all datasets (Overall_all). Each of the presented results shows the mean and standard deviation of the model performance over 3 reruns.

In Table II, we compare FedCy to several baselines using varying amounts of supervision, as described in Section V-A. We first observe from FullSup-Cholec80, which is trained only on Cholec80 in a standard fully supervised setting, that there are significant disparities in performance across client datasets. In particular, the 21.5% gap in F1-score between testing on videos from the same client and the average performance on videos from the other 4 clients highlights a major limitation of the generalizability of deep learning models. The development of principled, systematic, and accessible approaches to both train and evaluate on data sourced from varied and independent clients represents a significant step towards the clinical translation of such applications. Federated learning-based approaches that can leverage the availability of unlabeled data from the other 4 clients expectedly show boosts in performance. For example, FedTCC, which uses a ResNet-50 backbone pretrained on all the datasets using a federated averaging based implementation of TCC, demonstrates an average boost of 5.7% F1 on the unlabeled datasets over FullSup-Cholec80.
Our proposed approach, FedCy, demonstrates markedly and consistently superior performance over all the presented FSSL approaches and over the model trained only on Cholec80. Notably, this equates to average increases of 3.4% and 9% F1 on the unlabeled datasets over the next best baseline in this category (FedTCC) and over FullSup-Cholec80, respectively. We also see approximately the same overall performance of our approach whether it uses batch normalization or group normalization. Interestingly, despite Cholec80 being a large, fully labeled dataset, FedCy also boosts performance on the Cholec80 test set, with an increment of 3.9% F1 over fully supervised training on Cholec80 alone. This may be attributed to a need for even more labeled samples or to a regularizing effect provided by the training procedure on the unlabeled data. When comparing against the models trained using labeled data from every client, we see that FedCy goes a long way towards bridging the performance gap with fully supervised approaches while significantly mitigating privacy and annotation concerns. Of note, the non-collaborative approach FullSup-Each, where each client trains a tailored model on its own dataset, performs 5.9% worse on average than a global model trained on a single centralized dataset (FullSup-All). While this reflects the need for large and diverse datasets, this number may in fact understate that need, because this baseline is predicated on the generation of independently but consistently annotated data. Impressively, on two of the five considered datasets, Cholec80 and D2, FedCy even outperforms the FullSup-Each baseline by up to 1.6%. Given that FedCy demonstrates the significant value of semi-supervised learning of temporal correspondences, we carefully study different components of our formulation of TCC on unlabeled datasets. 1) Clip Size and Batch Size: In this experiment, we analyze the role of clip size and batch size.
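The clip-based sampling underlying this ablation can be sketched as follows. This illustrates only the simple strategy of randomly drawing multiple fixed-length clips per video (the defaults match the clip size of 16 and batch of 2 clips used above); it is not the exact sampler proposed in this work:

```python
import random

def sample_clips(video_len, clip_size=16, clips_per_video=2, seed=None):
    """Randomly sample fixed-length clips of consecutive frame indices
    from a video of `video_len` frames.

    Returns `clips_per_video` lists, each containing `clip_size`
    consecutive frame indices drawn from a uniformly random start point.
    """
    rng = random.Random(seed)
    clips = []
    for _ in range(clips_per_video):
        start = rng.randrange(0, video_len - clip_size + 1)
        clips.append(list(range(start, start + clip_size)))
    return clips
```

At each unsupervised training iteration, a batch would then consist of such clips drawn from the client's videos, between which temporal correspondences are learned.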
Here, clip size reflects the granularity of temporal correspondences that we are trying to learn on the unlabeled data, and batch size represents the number of clips we are trying to find correspondences between at each training iteration. In the heatmap presented in Fig. 4, we see large gains in performance up to a clip size of 8, after which the performance tends to saturate. Concerning the batch size, we see that FedCy is largely robust to variations in this parameter. 2) Sampling Strategies: Here, we compare our proposed sampling strategy against the approaches used in [47] and described in Section III. In Table III, we see that the choice of sampling strategy can greatly affect results. In particular, simply extending the strategies proposed in [47] to sample multiple clips from each video results in a marked increase in performance. Our proposed sampling approach is ~1% better, on average, than the next best strategy (random sampling using multiple clips). 3) Supervised Contrastive Loss: In this ablation study, we aim to illustrate the crucial role that the supervised contrastive loss plays in driving the training process on the unlabeled data. Table IV shows a clear improvement in F1-score across the board with the addition of the contrastive loss. Quantitatively, this translates to increases of 3.3% and 4.5% in F1-score overall and on average across the unlabeled datasets, respectively. Notably, we also see that even without the contrastive loss, FedCy performs on par with or significantly better than the proposed alternative methods for federated semi-supervised learning. Whereas the previously presented results are based on a ResNet-50 feature extractor, state-of-the-art methods for supervised surgical phase recognition [9], [10] often additionally train temporal modules, such as multi-stage temporal convolutional networks [54], using the features extracted by the feature extractor.
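As a toy illustration of the kind of operation such a temporal module performs, the following sketches a single dilated causal 1-D convolution, the building block of MS-TCN-style networks [54], over a sequence of scalar per-frame features. This is a didactic stand-in only: the actual modules operate on 2048-dimensional feature vectors with learned, multi-channel kernels.

```python
def dilated_causal_conv(features, kernel, dilation=1):
    """Causal 1-D convolution with dilation over a scalar feature sequence.

    Each output at time t depends only on frames at or before t, spaced
    `dilation` steps apart; frames before the start of the video
    contribute 0 (zero padding on the left).
    """
    k = len(kernel)
    out = []
    for t in range(len(features)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - (k - 1 - j) * dilation  # causal, dilated receptive field
            if idx >= 0:
                acc += w * features[idx]
        out.append(acc)
    return out
```

Stacking such layers with exponentially increasing dilations, as in MS-TCN [54], grows the temporal receptive field without pooling; in pipelines like TeCNO [9], this is applied on top of the extracted frame features.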
We study the quality of the features learned by FedCy and the other baselines by training a two-stage temporal convolutional neural network (TeCNO) [9]. We use Cholec80 to fine-tune TeCNO for all the semi-supervised baselines, and the corresponding client data for each of the FullSup-Each baselines. For FedAvg and FullSup-All, we train TeCNO using federated averaging and a centralized dataset, respectively. In Table V, we see that the previously noted improvements in the feature extractor (Table II) also translate to gains in performance with the addition of the temporal module. Using the unlabeled data, we see improvements on both Cholec80 and the unlabeled datasets over state-of-the-art phase recognition performance (FullSup-Cholec80). Whereas the substitution of batch normalization with group normalization (FedCy-GN) does not significantly improve the predictive quality on single images (see Table II), when used in tandem with the temporal context of a frame, the learned representational space seems to be much more expressive. FedCy-GN outperforms all other semi-supervised baselines and even the much more label-intensive baseline, FullSup-Each. Note that here, FullSup-Cholec80 corresponds to [9] but uses the newly generated phase labels and does not use instrument usage annotations.

We studied the task of surgical phase recognition in laparoscopic cholecystectomy (LC) videos in a federated semi-supervised learning setting, highlighting the feasibility and efficacy of training on unlabeled datasets without the need to explicitly share data. To this end, we proposed a novel FSSL method, FedCy, which efficiently leverages the temporal knowledge found in labeled videos using contrastive learning to guide unsupervised training on unlabeled videos. Comparisons with state-of-the-art FSSL methods showed significant improvements by our method on the task of surgical phase recognition on labeled, unlabeled, and held-out datasets.
To conduct this study, we generated a new, diverse, and large multicenter dataset of LC videos, annotated with 6 phases according to a newly introduced annotation protocol that is robust to workflow variations and reflective of surgical semantics. We believe that such diverse datasets can advance multicenter research and evaluation, contributing to the clinical translation of tools for surgical video analysis. In the context of our application, several limitations remain that we plan to address in future work. Firstly, the annotation protocol used to annotate the labeled dataset may still not be applicable to hospitals not participating in this study. Both this work and this limitation emphasize the need for wider consensus on the annotation protocols used to generate the precious datasets that drive work in the field. Besides that, a crucial property of FedCy is the critical role that the labeled dataset plays in guiding the unsupervised training. In this context, using a more representative dataset than Cholec80 [8] for supervised training could add another significant performance boost. Furthermore, in FSSL scenarios where labeled datasets originate from sources different from those of the unlabeled ones, the validation split used for hyperparameter tuning and model selection still poses a problem. Most of the FSSL work mentioned in Section II simulates the federated learning setup by gathering data, reserving validation and test splits, and then splitting the training data into non-identically distributed datasets. In practice, and in our experiments, validation splits correspond only to the labeled dataset, and hence hyperparameter tuning and model selection might be biased towards it. Such practical issues could be addressed in future work, for example, by using weak labels. Finally, future work might investigate how temporal modules can be adapted to FSSL methods, including FedCy.
Common practices [9], [10] in non-federated supervised settings involve using the learned features from the feature extractor as inputs to train a temporal module. Adapting such multi-stage training pipelines to an FSSL setting could boost performance, but is challenging due to memory constraints.

Table I presents the newly introduced annotation protocol: the starting signal that defines each phase, as well as additional notes concerning the whole surgical workflow.

Table I. Annotation protocol: starting signal of each phase.

Preparation: The first endoscopic frame (i.e., camera inserted in the abdominal cavity).
Hepatocystic Triangle Dissection: The main operator grasps the gallbladder with the left instrument (e.g., grasper) and incises the peritoneal reflection on the hepatocystic triangle with the right instrument (e.g., hook).
Clipping and Cutting: A sealing device (e.g., clipper) is inserted to seal (e.g., clip) the cystic artery or the cystic duct.
Gallbladder Dissection: A dissecting tool (e.g., hook) starts detaching the gallbladder from the liver bed after both the cystic duct and the cystic artery have been divided.
Gallbladder Packaging: The specimen bag is seen in the frame after the gallbladder is completely detached from the liver.
Cleaning and Coagulation: After the gallbladder has been detached from the liver, any of the following cues can mark the start of this phase: the presence of a cleaning tool (e.g., suction and irrigation device, drainage, gauze), or a coagulation tool (e.g., bipolar) entering the scene for the first time after the gallbladder is detached from the liver.

Notes:
- Incomplete recordings can fail to show certain phases, such as Preparation.
- Cleaning and Coagulation can happen more than once (before and after Gallbladder Packaging), but never before Gallbladder Dissection.
- In normal, retrograde cholecystectomy, the first 4 phases are always consecutive and in the same order. The last two phases can happen in any order.
- In fundus-first, anterograde cholecystectomy, the first 4 phases are not necessarily consecutive, and Clipping and Cutting can happen twice: once for the cystic artery before Gallbladder Dissection, and once for the cystic duct after Gallbladder Dissection.

Table II presents the hyperparameters used for the state-of-the-art baselines when reporting our results. These parameters are the final ones chosen after tuning. Note that several parameters, such as ψ and σ, reference variables described in the FedMatch public code. For these four methods, we implement hard data augmentation using the RandAugment library (https://pypi.org/project/randaugment/) with the same parameters as the ones used in FedMatch.

Table II. Hyperparameters used for state-of-the-art baselines.

FedMatch:
- Sparsity threshold for the ψ weights during inference: 10⁻⁵
- λ_L1 (weight of L1 regularization on the ψ weights): 10⁻³
- λ_L2 (weight of L2 regularization on the ψ − σ difference): 10

FedRGD:
- S (the number of groups): 1
- λ_u (weight of the unsupervised loss term): 0.1
- τ_thresh (confidence threshold for pseudo-labels): 0.25
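The ordering rules for standard retrograde cholecystectomy listed in the notes of Table I (first four phases in protocol order, Cleaning and Coagulation never before Gallbladder Dissection) can be expressed as a small validity check. This helper is purely illustrative and not part of the annotation tooling; the second phase name, Hepatocystic Triangle Dissection, is inferred from its description in the protocol:

```python
PHASES = ["Preparation", "Hepatocystic Triangle Dissection",
          "Clipping and Cutting", "Gallbladder Dissection",
          "Gallbladder Packaging", "Cleaning and Coagulation"]

def valid_retrograde(sequence):
    """Check a video-level phase sequence against the retrograde rules:
    the first 4 phases appear in protocol order (some may be missing in
    incomplete recordings), and Cleaning and Coagulation never precedes
    Gallbladder Dissection."""
    order = {p: i for i, p in enumerate(PHASES)}
    first_four = [order[p] for p in sequence if order[p] < 4]
    if first_four != sorted(first_four):
        return False
    if "Cleaning and Coagulation" in sequence and "Gallbladder Dissection" in sequence:
        # first occurrence of Cleaning and Coagulation must come after
        # Gallbladder Dissection has started
        if sequence.index("Cleaning and Coagulation") < sequence.index("Gallbladder Dissection"):
            return False
    return True
```

A fundus-first (anterograde) procedure would need a relaxed variant of this check, since there the first four phases are not necessarily consecutive.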
References

[1] Surgical data science for next-generation interventions.
[2] Surgical data science: from concepts toward clinical translation.
[3] Less is more: surgical phase recognition with less annotations through self-supervised pre-training of CNN-LSTM networks.
[4] Automated surgical activity recognition with one labeled sequence.
[5] Learning from a tiny dataset of manual annotations: a teacher/student approach for surgical phase recognition.
[6] Semi-supervised learning with progressive unlabeled data excavation for label-efficient surgical workflow recognition.
[7] Communication-efficient learning of deep networks from decentralized data.
[8] EndoNet: a deep architecture for recognition tasks on laparoscopic videos.
[9] TeCNO: surgical phase recognition with multi-stage temporal convolutional networks.
[10] Trans-SVNet: accurate phase recognition from surgical videos via hybrid embedding aggregation transformer.
[11] Machine learning for surgical phase recognition: a systematic review.
[12] A computer vision platform to automatically locate critical events in surgical videos: documenting safety in laparoscopic cholecystectomy.
[13] Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data.
[14] The future of digital health with federated learning.
[15] Federated deep learning for detecting COVID-19 lung abnormalities in CT: a privacy-preserving multinational validation study.
[16] Multi-institutional deep learning modeling without sharing patient data: a feasibility study on brain tumor segmentation.
[17] FixMatch: simplifying semi-supervised learning with consistency and confidence.
[18] Unsupervised data augmentation for consistency training.
[19] MixMatch: a holistic approach to semi-supervised learning.
[20] Towards utilizing unlabeled data in federated learning: a survey and prospective.
[21] Emerging trends in federated learning: from model fusion to federated X learning.
[22] Concepts, key challenges and open problems of federated learning.
[23] Federated unsupervised representation learning.
[24] Collaborative unsupervised visual representation learning from decentralized data.
[25] Federated contrastive learning for decentralized unlabeled medical images.
[26] Federated contrastive learning for volumetric medical image segmentation.
[27] FedDis: disentangled federated learning for unsupervised brain pathology segmentation.
[28] Federated adversarial domain adaptation.
[29] Privacy-preserving unsupervised domain adaptation in federated setting.
[30] Federated multi-target domain adaptation.
[31] Exploiting unlabeled data in smart cities using federated edge learning.
[32] FedTriNet: a pseudo labeling method with three players for federated semi-supervised learning.
[33] FedPerl: semi-supervised peer learning for skin lesion classification.
[34] Federated semi-supervised learning with inter-client consistency & disjoint learning.
[35] Improving semi-supervised federated learning by reducing the gradient diversity of models.
[36] FedSiam: towards adaptive federated semi-supervised learning.
[37] SemiFL: communication efficient semi-supervised federated learning with unlabeled clients.
[38] FedCon: a contrastive framework for federated semi-supervised learning.
[39] Federated semi-supervised medical image classification via inter-client relation matching.
[40] Federated semi-supervised learning for COVID region segmentation in chest CT using multi-national data from China, Italy, Japan.
[41] Group normalization.
[42] Cross-node federated graph neural network for spatio-temporal data modeling.
[43] Semi-supervised federated learning for activity recognition.
[44] Semi-supervised methodologies to tackle the annotated data scarcity problem in the field of HAR.
[45] Federated self-training for semi-supervised audio recognition.
[46] Federated learning for machinery fault diagnosis with dynamic validation and self-supervision.
[47] Temporal cycle-consistency learning.
[48] Dimensionality reduction by learning an invariant mapping.
[49] A simple framework for contrastive learning of visual representations.
[50] Artificial intelligence for surgical safety: automatic assessment of the critical view of safety in laparoscopic cholecystectomy using deep learning.
[51] The TUM LapChole dataset for the M2CAI 2016 workflow challenge.
[52] Deep residual learning for image recognition.
[53] ImageNet large scale visual recognition challenge.
[54] MS-TCN: multi-stage temporal convolutional network for action segmentation.