key: cord-0139824-ykyyrxhw
authors: Yang, Luyu; Gao, Mingfei; Chen, Zeyuan; Xu, Ran; Shrivastava, Abhinav; Ramaiah, Chetan
title: Burn After Reading: Online Adaptation for Cross-domain Streaming Data
date: 2021-12-08
journal: nan
DOI: nan
sha: b257e1ecdf123d54a37ad07eb2bfb02bc0f10dff
doc_id: 139824
cord_uid: ykyyrxhw

In the context of online privacy, many methods propose complex privacy- and security-preserving measures to protect sensitive data. In this paper, we argue that not storing any sensitive data is the best form of security. Thus we propose an online framework that "burns after reading", i.e., each online sample is immediately deleted after it is processed. Meanwhile, we tackle the inevitable distribution shift between the labeled public data and the unlabeled private data as a problem of unsupervised domain adaptation. Specifically, we propose a novel algorithm that aims at the most fundamental challenge of the online adaptation setting: the lack of diverse source-target data pairs. Therefore, we design a Cross-Domain Bootstrapping approach, called CroDoBo, to increase the combined diversity across domains. Further, to fully exploit the valuable discrepancies among the diverse combinations, we employ a training strategy of multiple learners with co-supervision. CroDoBo achieves state-of-the-art online performance on four domain adaptation benchmarks.

With the onslaught of the pandemic, the internet has become an even more ubiquitous presence in all of our lives. Living in an enormous web connecting us to each other, we now face a new reality: it is very hard to escape one's past on the Internet now that every photo, status update, and tweet lives forever in the cloud [19, 67, 81, 85]. Moreover, recommender systems that actively mine user data [22, 100] for data-driven algorithms have intensified the debate about the right to privacy versus convenience. Fortunately, we have the Right to Be Forgotten (RTBF), which gives individuals the right to ask organizations to delete their personal data. Recently, many solutions [69, 90, 101, 106] have been proposed that try to preserve privacy in the context of deep learning, mostly focused on Federated Learning [31, 82, 92, 96]. Federated Learning allows asynchronous updates of multiple nodes, in which sensitive data is stored only on a few specific nodes. However, recent studies [30, 97, 105] show that private training data can be leaked through the gradient-sharing mechanism deployed in distributed models. In this paper, we argue that not storing any sensitive data is the best form of security. This requires us to delete the user data after use, which necessitates an online framework. However, existing online learning frameworks [42, 60, 71] cannot meet this need without addressing the distribution shift from the public data, i.e., the source domain, to the private user data, i.e., the target domain. Therefore, in this paper we propose an online domain adaptation framework in which the target domain streaming data is deleted immediately after it is adapted. Although this task seemingly extends unsupervised domain adaptation (UDA), it cannot simply be solved by running offline UDA methods online. We explain why with an analysis of existing domain adaptation methods.
To begin with, existing offline UDA methods rely heavily on rich combinations of cross-domain mini-batches that gradually adjust the model for adaptation [18, 37, 58, 74, 77, 83, 86, 98, 102], which the online streaming setting cannot afford to provide. In particular, many domain adversarial-based methods [5, 18, 26, 35, 57, 88] depend on a slowly annealing adversarial mechanism that requires discriminating a large number of source-target pairs to achieve the adaptation. Recently, state-of-the-art offline methods [32, 40, 41] have shown promising results by exploiting target-oriented clustering, which requires offline access to the entire target domain. Therefore, the online UDA task needs new solutions that succeed under scarcity of target-domain data.

We aim straight at the most fundamental challenge of the online task, the lack of diverse cross-domain data pairs, and propose a novel algorithm based on cross-domain bootstrapping for online domain adaptation. At each online query, we increase the data diversity across domains by bootstrapping the source domain to form diverse combinations with the current target query. To fully exploit the valuable discrepancies among the diverse combinations, we train a set of independent learners to preserve the differences. Inspired by [99], we then integrate the knowledge of the learners by exchanging their predicted pseudo-labels on the current target query to co-supervise the learning on the target domain, but without sharing the weights, so as to maintain the learners' divergence. We obtain a more accurate prediction on the current target query by an average ensemble of the diverse expertise of all the learners. We call our method CRODOBO: Cross-Domain Bootstrapping for online domain adaptation; an overview is shown in Figure 1.

Figure 1. At each timestep, only the current j-th query is available from the target domain. CRODOBO bootstraps source domain batches to combine with the current j-th query in order to increase the cross-domain data diversity. The learners $w^u$ and $w^v$ exchange the generated pseudo-labels $\hat{y}^u_j$ and $\hat{y}^v_j$ as co-supervision. Once the current query has been adapted on in the training phase, it is tested immediately to make a prediction. Each target query is deleted after being tested. Example source-domain images are from Fashion-MNIST [94], adapting to the target domain DeepFashion [43]. Best viewed in color.

We conduct extensive evaluations of our method on the classic UDA benchmark VisDA-C [59], a practical medical imaging benchmark COVID-DA [103], and the Camelyon subset of the large-scale distribution shift benchmark WILDS [34]. Moreover, we propose a new adaptation scenario in this paper, from Fashion-MNIST [94] to DeepFashion [43]. On all the benchmarks, our method outperforms the state-of-the-art UDA methods that are eligible for the online setting. Further, without reusing any target sample, our method achieves performance comparable to the offline setting. We summarize the contributions as follows. (1) We propose an online domain adaptation framework to implement the right to be forgotten. (2) We propose a novel online domain adaptation algorithm that achieves new state-of-the-art online results, and results comparable to the offline setting. (3) Despite being simple, the performance comparable to the offline setting suggests that our method is an excellent choice even just for time efficiency.
The Right to Be Forgotten [4, 19, 55, 81, 85], also referred to as the right to vanish, the right to erasure, and courtesy vanishing, is the right given to each individual to ask organizations to delete their personal data. RTBF is part of the General Data Protection Regulation (GDPR). As a legal document, the GDPR outlines the specific circumstances under which the right applies in Article 17 GDPR. The first item is: the personal data is no longer necessary for the purpose an organization originally collected or processed it. Yet, the exercise of this right has become a thorny issue in applications. Politou et al. [62] discussed that the technical challenges of aligning modern systems and processes with the GDPR provisions are numerous and in most cases insurmountable. In [63] they specifically examined the implications of erasure requests on current backup systems and highlighted a number of challenges pertaining to the widely known backup standards, data retention policies, backup mediums, search services, and ERP systems [1]. In the context of machine learning, Villaronga et al. [85] argued that the core issue of the AI and Right to Be Forgotten problem is the dearth of interdisciplinary scholarship supporting privacy law and regulation. Graves et al. [21] proposed three defense mechanisms against a general threat model to enable deep neural networks to forget sensitive data while maintaining model efficacy. In this paper, we focus on how to obtain model efficacy while erasing data online to protect the user's right to be forgotten.

Online Adaptation to Shifting Domains was first investigated in signal processing [14] and later studied in natural language processing [13] and vision tasks [7, 10, 15, 29, 48, 51, 65, 95]. Jain et al. [29] assumed the original classifier outputs a continuous score that a threshold converts into a class, and reclassified points near the original boundary using a Gaussian process regression scheme; the procedure is presented as a Viola-Jones cascade of classifiers. Moon et al. [51] proposed a four-stage method that assumes a transformation matrix between the source subspace and the mean-target subspace embedded in the Grassmann manifold; the method is designed for handcrafted features. In the context of deep neural networks, we argue that one transformation matrix might not be sufficient to describe the correlation between source and target deep representations [53]. Taufique et al. [80] approached the task by selectively mixing the online target samples with samples saved in a buffer. Without further discussion of which samples may be saved in the buffer, we find this method limited for exercising the right to be forgotten.

Active Domain Adaptation [6, 11, 47, 49, 61, 64, 66, 70] also benefits the online learning of shifting domains, but it bears a different setting: the target domain can actively acquire labeled data online. Rai et al. [66] presented an algorithm that harnesses the source domain data to learn an initializer hypothesis, which is later used for active learning on the target domain. Ma et al. [47] allowed a small budget of target data for the categories that appear only in the target domain and presented an algorithm that jointly trains two sub-networks with different learning strategies. Chen et al. [6] proposed an algorithm that adaptively deals with interleaving spans of inputs from different domains via a tight trade-off that depends on the duration and dimensionality of the hidden domains.
Ensemble Methods for Online Learning [9, 50, 89], such as bagging and boosting, have shown advantages in handling concept drift [46] and class imbalance, which are common challenges in online learning. Minku et al. [50] addressed the importance of ensemble diversity for improving accuracy in changing environments and proposed a measurement of ensemble diversity. Han et al. [24] proposed a regularization for online tracking using a randomly selected subset of branches in the neural network. Although online learning and online domain adaptation share a similar streaming form of data input, we argue that the two tasks face fundamentally different challenges. For online learning, the challenge is to select the most trustworthy supervision from the streaming data by differentiating informative vs. misleading data points, also known as the stability-plasticity dilemma [28]. For online domain adaptation, the streaming data of the target domain naturally comes unlabeled, and the challenge is the scarcity of supervision. Thus the goal is to maximize the utilization of the supervision from a different but related labeled source domain.

Given the labeled source data $\mathcal{D}_S = \{(x^s_i, y^s_i)\}_{i=1}^{N_S}$ and the unlabeled target data $\mathcal{D}_T = \{x^t_i\}_{i=1}^{N_T}$, where $N_S$ and $N_T$ represent the number of source and target samples, both offline and online adaptation aim at learning a classifier that makes accurate predictions on $\mathcal{D}_T$. The offline adaptation assumes access to every data point in $\mathcal{D}_S$ or $\mathcal{D}_T$, synchronously [18, 74, 77, 83, 98] or asynchronously [40] domain-wise. The inference on $\mathcal{D}_T$ happens after the model is trained on both $\mathcal{D}_S$ and $\mathcal{D}_T$ entirely. For online adaptation, we assume access to the entire $\mathcal{D}_S$, while the data from $\mathcal{D}_T$ arrives in a random streaming fashion of mini-batches $\{\mathcal{T}_j\}_{j=1}^{M_T}$ with $|\mathcal{T}_j| = B$, where $B$ is the batch size and $M_T$ is the total number of target batches. Each mini-batch $\mathcal{T}_j$ is first adapted on, then tested, and then erased from $\mathcal{D}_T$ without replacement, as shown in Figure 1. We refer to each online batch as a query in the rest of the paper.

The fundamental challenge of the online task is the limited access to training data at each inference query, compared to the offline task. Without loss of generality, let us assume there are $10^3$ source and target batches, respectively. In an offline setting, the model is tested after training on at most $10^6$ combinations of source-target data pairs, while in an online setting, a one-stream model can have seen at most $10^3{+}500$ combinations at the 500-th query. Undoubtedly, the online adaptation faces a significantly smaller data pool and data diversity, and the training process of the online task suffers from two major drawbacks. First, the model is prone to underfitting [84] on the target domain due to the limited exposure, especially at the beginning of training. Second, due to the erasure of "seen" batches, the model lacks the diverse combinations of source-target data pairs that enable a deep network to find the optimal cross-domain classifier [38]. The goal of the proposed method is to minimize these drawbacks of the online setting. We first propose to increase the data diversity by cross-domain bootstrapping, whose diverse combinations are preserved in multiple independent learners. Then we fully exploit the valuable discrepancies of these learners by exchanging their expertise on the current target query to co-supervise each other. The diversity of cross-domain data pairs is crucial for most prior offline methods [18, 58, 74] to succeed.
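To make the protocol concrete, the adapt-test-erase loop formalized above can be summarized in a few lines. This is a minimal sketch of our own, not the authors' code: the helper names (`adapt_step`, `predict`, `source_loader`, `target_stream`) are illustrative assumptions.

```python
import torch

def online_adapt_test_erase(model, optimizer, source_loader, target_stream,
                            adapt_step, predict):
    """'Burn after reading' protocol: each target query T_j is adapted on,
    tested immediately, and then deleted, never to be revisited."""
    correct, total = 0, 0
    source_iter = iter(source_loader)
    for target_x, target_y in target_stream:    # streaming target mini-batches (queries)
        try:
            source_batch = next(source_iter)    # the entire source set stays accessible
        except StopIteration:
            source_iter = iter(source_loader)
            source_batch = next(source_iter)
        adapt_step(model, optimizer, source_batch, target_x)  # train on {T_j, S_j}
        with torch.no_grad():
            preds = predict(model, target_x)    # test the query right after adaptation
        correct += (preds == target_y).sum().item()
        total += target_y.numel()
        del target_x                            # erase the query: no buffer, no replay
    return correct / total                      # online average accuracy
```

Note that the target labels appear only to score the prediction; they are never used for adaptation.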
Since target samples cannot be reused in the online setting, we propose to increase the data diversity across domains by bootstrapping the source domain to form diverse combinations with the current target domain query. Specifically, for each target query $\mathcal{T}_j$, we randomly select a set of $K$ mini-batches of the same size from the source domain with replacement. Correspondingly, we define a set of $K$ base learners $\{w^k\}_{k=1}^{K}$. At each iteration, a learner $w^k$ makes a prediction for query $\mathcal{T}_j$ after being trained on $\{\mathcal{T}_j, \mathcal{S}^k_j\}$, and updates via $w^k \leftarrow w^k - \eta\,\nabla_{w^k}\mathcal{L}(\mathcal{T}_j, \mathcal{S}^k_j)$, where $\eta$ is the learning rate, $c$ is the number of classes, $p^k_j \in \mathbb{R}^{B \times c}$ is the probability predicted by the $k$-th learner, and $\mathcal{L}(\cdot,\cdot)$ is the objective function. The predicted class for $\mathcal{T}_j$ is the average of the $K$ base learners' predictions. We justify our design choice from the perspective of uncertainty estimation in the following discussion.

Theoretical Insights. As mentioned in Sec. 3.1, we aim at the best estimation of the current target query. We first consider a single-learner situation. At the j-th query, the learner faces a fundamental trade-off: by minimizing the uncertainty of the j-th query, the learner can attain the best current estimation, yet the risk of fully exploring the uncertainty is to spoil the existing knowledge from the previous $j{-}1$ target domain queries. However, if we do not treat the uncertainty, the single observation on the j-th query is less informative for current query estimation. Confronting this dilemma, we should not ignore that the uncertainty captures the variability of a learner's posterior belief, which can be resolved through statistical analysis of the appropriate data [3, 54]. This gives us hope for a more accurate model via uncertainty estimation. One popular suggestion for resolving uncertainty is Dropout [16, 17, 73] sampling, where individual neurons are independently set to zero with some probability. As a sampling method on the neurons, Dropout works in a form similar to bagging [75, 91] multiple decision trees. It might equally reduce the overall noise of the network regardless of domain shift, but it does not address the core problem of our task, which is the lack of diverse cross-domain combinations. Alternatively, we employ another pragmatic approach, the bootstrap, for uncertainty estimation on the target domain that offsets the source dominance. Given the scarcity of target samples, we propose to bootstrap source-target data pairs for a more balanced cross-domain simulation. At a high level, the bootstrap simulates multiple realizations of a specific target query given the diversity of source samples; the bootstrapped source batches approximate a distribution over the current query $\mathcal{T}_j$. The bootstrapping brings multi-view observations on a single target query in two ways. First, given $K$ sampled subsets from $\mathcal{D}_S$, let $F$ be the ideal estimate of $\mathcal{T}_j$, $\hat{F}$ be the practical estimate from the dataset, and $\hat{F}^{*k}$ be the estimate from the $k$-th bootstrapped source batch paired with the target query; the final estimate $\hat{F}^{*} = \frac{1}{K}\sum_{k=1}^{K}\hat{F}^{*k}$ is the average of the $K$ multi-view estimates. Second, besides the learnable parameters, the Batch-Normalization layers of the $K$ learners result in a set of different means and variances $\{\mu^k, \sigma^k\}_{k=1}^{K}$ that serve as $K$ different initializations affecting the learning of $\hat{F}^{*}$.
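A minimal sketch of the two ingredients just described, bootstrapped source sampling and ensemble prediction, assuming a map-style dataset of `(image, label)` tensors; the function names are our own, and the paper uses K=2 in practice.

```python
import random
import torch

def bootstrap_source_batches(source_dataset, batch_size, K):
    """Draw K source mini-batches of the same size, sampled with replacement,
    one per base learner, to pair with the current target query T_j."""
    batches = []
    for _ in range(K):
        idx = random.choices(range(len(source_dataset)), k=batch_size)  # with replacement
        xs, ys = zip(*(source_dataset[i] for i in idx))
        batches.append((torch.stack(xs), torch.tensor(ys)))
    return batches

@torch.no_grad()
def ensemble_predict(learners, target_x):
    """Average the K learners' softmax outputs; the class is the argmax of the mean."""
    probs = torch.stack([torch.softmax(w(target_x), dim=1) for w in learners])
    return probs.mean(dim=0).argmax(dim=1)
```

Each learner keeps its own weights and Batch-Normalization statistics; only the pseudo-labels (next section) are exchanged.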
After the independent learners have preserved the valuable discrepancies of the cross-domain pairs, the question now is how to fully exploit these discrepancies to improve the online predictions on the target queries. On the one hand, we want to integrate the learners' expertise into one better prediction on the current target query; on the other hand, we hope to maintain their differences. Inspired by [99], we train the $K$ learners jointly by exchanging their knowledge on the target domain as a form of co-supervision. Specifically, the $K$ learners are trained independently with bootstrapped source supervision, but they exchange the pseudo-labels generated for target queries. We follow FixMatch [78] to compute pseudo-labels on the target domain. In this paper, we consider $K{=}2$ for simplicity, and denote the learners as $w^u$ and $w^v$ for $k=1$ and $k=2$, respectively. Given the current target query $\mathcal{T}_j$, the loss function $\mathcal{L}$ consists of a supervised loss term $\ell_s$ from the source domain with the bootstrapped samples, and a self-supervised loss term $\ell_t$ from the target domain with pseudo-labels $\hat{y}_b$ from the peer learner. We denote the cross-entropy between two probability distributions as $H(\cdot\,;\cdot)$. $p^u_b$ and $p^v_b$ are the predicted probabilities of $t_b$ by $w^u$ and $w^v$, respectively; $\tilde{t}_b$ is a strongly-augmented version of $t_b$ using RandAugment [8], and $\tau$ is the threshold for pseudo-label selection. To further exploit the supervision from the limited target query, from $p^u_b$ and $p^v_b$ we compute a standard entropy minimization term $\ell_{ent}$ and a class-balancing diversity term $\ell_{div}$, which are widely used in prior domain adaptation works [40, 72, 86]. Finally, we update the learners by minimizing $\mathcal{L} = \ell_s + \ell_t + \ell_{ent} + \lambda\,\ell_{div}$, where $\lambda$ is a hyperparameter that scales the weight of the diversity term.
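The following sketch illustrates one co-supervision step for learner $w^u$ (the step for $w^v$ is symmetric). It is our reading of Eq. (3) under stated assumptions, not the authors' exact implementation: `weak_aug` and `strong_aug` stand in for the weak and RandAugment-style strong augmentations, and the precise forms of $\ell_{ent}$ and $\ell_{div}$ follow common practice in [40, 72, 86].

```python
import torch
import torch.nn.functional as F

def co_supervision_losses(w_u, w_v, src_batch, target_x, weak_aug, strong_aug,
                          tau=0.95, lam=0.4):
    """Loss for learner w_u: bootstrapped source supervision, peer pseudo-label
    supervision from w_v, entropy minimization, and class-balancing diversity."""
    xs, ys = src_batch
    loss_s = F.cross_entropy(w_u(xs), ys)                       # supervised source term

    with torch.no_grad():                                       # peer w_v gives pseudo-labels
        probs_v = torch.softmax(w_v(weak_aug(target_x)), dim=1)
        conf, pseudo_y = probs_v.max(dim=1)
        mask = conf.ge(tau).float()                             # keep confident labels only
    logits_u = w_u(strong_aug(target_x))                        # strong view for the student
    loss_t = (F.cross_entropy(logits_u, pseudo_y, reduction="none") * mask).mean()

    probs_u = torch.softmax(w_u(weak_aug(target_x)), dim=1)
    loss_ent = -(probs_u * probs_u.clamp_min(1e-8).log()).sum(dim=1).mean()  # entropy min.
    mean_p = probs_u.mean(dim=0)                                # batch class marginal
    loss_div = (mean_p * mean_p.clamp_min(1e-8).log()).sum()    # encourage class balance

    return loss_s + loss_t + loss_ent + lam * loss_div
```

In this sketch, $\ell_{div}$ is the negative entropy of the batch-averaged class marginal, so minimizing it pushes the marginal toward a uniform, class-balanced distribution.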
We consider two metrics for evaluating online domain adaptation methods: online average accuracy and one-pass accuracy. The online average is an overall estimate of the streaming effectiveness. The one-pass accuracy measures how much the online model has deviated from the beginning after training on the finite sample [52]. A one-pass accuracy much lower than the online average indicates that the model might have overfitted to the fresh queries but compromised its generalization ability on the early queries.

Dataset. We use VisDA-C [59], a classic benchmark adapting from synthetic images to real ones; we follow the data split used in prior offline settings [40, 59, 74]. We also use COVID-DA [103], adapting CT-image diagnosis from common pneumonia to the novel disease. This is a typical scenario where online domain adaptation is valuable in practice: when a novel disease breaks out, without any prior knowledge, one has to exploit a different but correlated domain to assist the diagnosis of the new pandemic in a time-sensitive manner. We also evaluate on a large-scale medical dataset, Camelyon17 from WILDS [34], a histopathology image dataset with a patient population shift from source to target; Camelyon17 has 455k samples of breast cancer patients from 5 hospitals. Another practical scenario is online fashion, where the user-generated content (UGC) might be time-sensitive and cannot be saved for training purposes. Due to the lack of a cross-domain fashion prediction dataset, we propose to evaluate adapting from Fashion-MNIST [94] to the DeepFashion [43] category prediction branch. We select 6 fashion categories shared between the two datasets, and design the task as adapting from 36,000 grayscale samples of Fashion-MNIST to 200,486 real-world commercial samples from DeepFashion.

Implementation details. We implement our method in PyTorch [56]. We follow [40, 41] to use a ResNet-101 [25] pretrained on ImageNet [12, 68] for VisDA-C. We follow [103] to use a pretrained ResNet-18 [25] on COVID-DA. We follow the leaderboard of the WILDS challenge [34] to use a DenseNet-121 [27] with random initialization on Camelyon17, and we use the official WILDS codebase for the data split and evaluation. We use a pretrained ResNet-101 [25] on Fashion-MNIST-to-DeepFashion. The confidence threshold $\tau = 0.95$ and the diversity weight $\lambda = 0.4$ are fixed throughout the experiments.

Baselines. We compare CRODOBO with eight state-of-the-art domain adaptation approaches, including DAN [44], CORAL [79], DANN [18], ENT [20, 72], MDD [102], CDAN [45], SHOT [40] and ATDOC [41]. ATDOC has multiple variants of the auxiliary regularizer; we compare with Neighborhood Aggregation (ATDOC-NA), the best-performing variant in [41]. Among the compared approaches, SHOT and ATDOC-NA require a memory module that collects and stores information on all the target samples, and thus apply only to the offline setting. For the other six approaches, we compare both offline and online results. Each offline model is trained for 10 epochs, and each online model takes the same randomly-perturbed target queries to make a fair comparison.

Main results. We summarize the results on VisDA-C [59] in Table 1, and plot the online streaming accuracy in Figure 2. To make a fair comparison, we follow [32, 40, 41, 59] to report the VisDA-C one-pass accuracy as a class average. Among the offline methods, SHOT [40] and ATDOC-NA [41] largely outperform the other approaches, showing the effectiveness of target domain clustering. Regarding the online setting, the Source-Only baseline loses 2.4% in the online average and 7.2% in the one-pass accuracy, which indicates that data diversity is also important for domain generalization. We observe that ENT [72], an entropy regularizer on the posterior probabilities of the unlabeled target samples, has a noticeable performance drop in the online setting and shows clearly imbalanced results over the categories (superior on class "knife" but poor on "person" and "truck"). We consider it a typical example of a bad objective choice for the online setting when the dataset is imbalanced: without sufficient rounds to provide data diversity, entropy minimization can easily overfit the current target query. The 2.5% drop from online to one-pass further confirms that the model has deviated from the beginning. CRODOBO outperforms the other methods in the online setting by a large margin, and is comparable to the offline result of the state-of-the-art approach ATDOC-NA [41]. For the sake of time efficiency, CRODOBO is superior to the other approaches, achieving high accuracy in only one epoch.

Table 1. Accuracy on VisDA-C (%) using ResNet-101. In the online setting, each individual class reports its accuracy after one pass; one-pass is the class average. Best offline (italic bold), best online (bold).

The results on the two medical imaging datasets, COVID-DA [103] and WILDS-Camelyon17 [34], are summarized in Table 2 and Table 3, respectively. The online streaming accuracy is presented in Figure 3. COVID-DA* is the method proposed along with the dataset in [103]. Meanwhile, we reprint the Domain Generalization results from the WILDS leaderboard just for reference. Despite bearing a different setting, the online adaptation does not overwhelm the system with much more computational cost, but can improve the precision of the model. The results on the newly proposed large-scale fashion benchmark, from Fashion-MNIST [94] to the DeepFashion [43] category prediction branch, are summarized in Table 4.
We also plot the online streaming accuracy in Figure 4. To the best of our knowledge, we are the first to report results on this adaptation scenario. The offline Source-Only baseline achieves merely 23.1% accuracy, only a 6.5% gain over the probability of random guessing, which indicates that the benchmark is challenging. The sharp drop from the Source-Only online accuracy to the one-pass accuracy (-6.8%) indicates the large domain gap and how easily the model is dominated by the source domain supervision. A similar observation is made on the WILDS-Camelyon17 Source-Only results (-11.6% from online to one-pass); this usually happens when the source domain is less challenging than the target domain and the distributions of the two domains are far from each other. Faced with this challenging benchmark, CRODOBO improves the online performance to a remarkable 49.1%, outperforming the best result in the offline setting. Our one-pass accuracy is slightly shy of CDAN [45], but better on the online metric.

Ablation study. We conduct an ablation study on the impact of cross-domain bootstrapping in Table 5. Following Table 1, we report the VisDA-C one-pass accuracy as a class average. This study evaluates whether the improvement is introduced by cross-domain bootstrapping or simply by the strong baseline with the objectives on the target domain (see Sec. 3.2.2). Thus, we devise a baseline by removing only the cross-domain bootstrapping, called w/o CRODOBO. The baseline model has one learner that is optimized by minimizing the objective of Eq. (3) without exchanging the pseudo-labels.

Table 5. Ablation study of cross-domain bootstrapping on four datasets (%): VisDA-C, COVID-DA, Camelyon17, and Fashion. VisDA-C one-pass accuracy is per-class.

In Table 5, we observe that w/ CRODOBO is consistently better than w/o in the online average accuracy on all the datasets. Regarding one-pass accuracy, the effectiveness of cross-domain bootstrapping is unapparent on the smaller datasets VisDA-C and COVID-DA, yet it clearly outperforms w/o on the large-scale WILDS-Camelyon17 and Fashion-MNIST-to-DeepFashion. We further conduct an ablation study on the objective terms (see Sec. 3.2.2) and report the results in Table 6.

Figure 5. Qualitative results of a randomly selected target query (size 24). We compare CRODOBO with two essential baselines, Source-Only and CDAN [45]. We show the bootstrapped source samples (top two rows under each benchmark), the target samples (third row under each benchmark), and the prediction result for each target sample. Best viewed in color.

To eliminate the benefit of cross-domain bootstrapping, our default setting here is the model w/o CRODOBO. We leave out $\ell_{ent}$ and observe a significant performance drop. Without $\ell_{div}$, the performance decreases slightly on the online metric, but far more sharply on the one-pass metric (which is calculated per class). We analyze that the diversity term is important for an imbalanced dataset like VisDA-C to achieve high class-average accuracy. We also report the results of replacing $\ell_{ent}$ and $\ell_{div}$ with pseudo-labeling [36]. We replace either $\{\ell_{ent}, \ell_{div}\}$ or $\ell_t$ with MixMatch, and observe decent performance when it is employed together with $\{\ell_{ent}, \ell_{div}\}$ (see Table 6, row 6). Applying RandAugment [8] to the entropy and diversity terms does not enhance the performance.

In the context of the right to be forgotten, we propose an online domain adaptation framework in which the target data is erased immediately after prediction.
A novel online UDA algorithm is proposed to tackle the lack of data diversity, which is a fundamental drawback of the online setting. The proposed method achieves state-of-the-art online results and results comparable to offline domain adaptation approaches. We would like to extend CRODOBO to more tasks, such as semantic segmentation [39, 104].

Potential negative impact. The proposed method is at risk of privacy leakage if one purposefully exploits the memorization effect of the deep neural network weights to restore private information, which is a common vulnerability of all neural networks.

A. Prior Online UDA approaches

In the main paper, we propose a novel cross-domain framework to implement the right to be forgotten. However, we do not claim to have proposed the task of online unsupervised domain adaptation, which existed before the emergence of deep learning [13, 29, 51]. The more recent works are either engineered for a specific task and lack generality [10, 15, 23, 48, 93], or are more general but unpublished [80]. Nevertheless, we compare to the unpublished approach [80] despite its limited availability. The setting of [80] is different from ours: with a memory module that buffers previous target queries the model can re-access, [80] is certainly less challenging compared to "burn after reading". Meanwhile, [80] bears a continual setting, in which the model is pretrained on the source domain and then adapted to the target domain. Without source code, we compare to the results presented in their paper with the same backbone, HRNet [87]. We adapt CRODOBO to a continual setting to make it comparable: without simultaneous access to the source domain, cross-domain bootstrapping is not an option, so we employ the objectives on the target domain (see main paper Sec. 3.2.2); we call this Continual CRODOBO. The comparison results are in Table 7. We observe that, without any buffer mechanism or re-access to previous queries, Continual CRODOBO still outperforms ConDA [80].

As mentioned in the main paper (Sec. 4), each online model takes the same randomly-perturbed target queries. Here, we discuss whether the performance of the model can be influenced by different random sequential orders. We perturb the original target sequence (arranged in the order of category) using 5 different random seeds, and report the results of each seed on VisDA-C [59] and the large-scale Fashion-MNIST-to-DeepFashion [43] benchmark. We compare the randomness using CDAN [45] and CRODOBO; we choose CDAN [45] since it is the state-of-the-art adversarial approach and is essentially different from our proposed CRODOBO. The results are in Table 9. We observe that on VisDA-C the variance among different sequential orders is rather small (< 0.25). On the more challenging Fashion benchmark, the variance of CRODOBO is larger but manageable (< 2.0). We analyze that CRODOBO relies more on the target-oriented supervision (see main paper Sec. 3.2.2) than CDAN [45], which makes it more sensitive to changes in the target samples. This is a drawback of CRODOBO that we will try to address in future work. To conclude, the randomness in forming the order of target queries does not influence the evaluation of the method's effectiveness.

The co-supervision in the proposed method (cf. main paper Eq. (3)) can be replaced with any other pseudo-labeling approach.
With the constant proposal of new pseudo-labeling techniques, one can simply replace the term on either or both of $\{w^u, w^v\}$ to achieve better performance. We replace the term on either or both learners with another popular semi-supervised approach, MixMatch [2], and report the results in Table 8. We observe that FixMatch [78] provides better co-supervision, and the online performance drops by ~8% when it is replaced with MixMatch.

Sensitivity to Diversity Weight and Threshold. We have two hyperparameters in the proposed approach: $\lambda$ for weighting the term $\ell_{div}$ and $\tau$ for the pseudo-label selection (cf. main paper Eq. (3)). We used $\lambda = 0.4$ and $\tau = 0.95$ in all our experiments; here we report results for more settings of these hyperparameters. The results for $\lambda \in \{0.1, 0.4, 0.5, 0.8, 1.0\}$ are shown in Table 10. As the results suggest, CRODOBO is not sensitive to the hyperparameter $\lambda$; we observe similar performance when $\lambda$ is larger than 0.4. The sensitivity to $\tau$ is shown in Table 11. When $\tau$ is smaller, more samples in each target query are selected as pseudo-labels to co-supervise the peer learner; however, the quality of these pseudo-labels is compromised since the model is less confident about the predictions, so the co-supervision is less accurate to depend on. We observe a performance drop when the threshold $\tau$ is smaller than 0.6. Therefore, we suggest a larger threshold $\tau$ for a more effective model. The online accuracy of the above settings is shown in Figure 6 and Figure 7.

We follow the network architecture in [40, 41]: a feature backbone followed by a bottleneck layer with dimension 256, and a linear layer as the output layer. For the experiments on VisDA-C [59], COVID-DA [103] and Fashion-MNIST-to-DeepFashion [43, 94], the feature backbone is pretrained on ImageNet [12]. For the WILDS-Camelyon17 benchmark, we follow the leaderboard to use a randomly initialized DenseNet-121 [27]. We use Adam [33] with an initial learning rate of 8e-4. The query size in our experiments is set to 64. We have not observed any major performance change using different batch sizes.

References

- ERP system implementation in large enterprises: a systematic literature review
- MixMatch: A holistic approach to semi-supervised learning
- Optimal adaptive policies for Markov decision processes
- The European Right to be Forgotten: The First Amendment Enemy
- Domain adversarial reinforcement learning for partial domain adaptation
- Active online learning with hidden shifting domains
- Domain adaptation for visual applications: A comprehensive survey
- RandAugment: Practical automated data augmentation with a reduced search space
- A boosting-like online learning ensemble
- Online domain adaptation for person re-identification with a human in the loop
- Active multi-kernel domain adaptation for hyperspectral image classification
- ImageNet: A large-scale hierarchical image database
- Online methods for multi-domain learning and adaptation
- Frequency-domain adaptation of causal digital filters
- Online domain adaptation for multi-object tracking
- Dropout as a Bayesian approximation: Representing model uncertainty in deep learning
- Domain-adversarial training of neural networks
- Formalizing data deletion in the context of the right to be forgotten
- Semi-supervised learning by entropy minimization
- Does AI remember? Neural networks and the right to be forgotten
- An embedding learning framework for numerical features in CTR prediction
- Online domain adaptation for continuous cross-subject liver viability evaluation based on irregular thermal data
- BranchOut: Regularization for online ensemble tracking with convolutional neural networks
- Deep residual learning for image recognition
- Discriminative partial domain adversarial network
- Densely connected convolutional networks
- Online learning: Searching for the best forgetting strategy under concept drift
- Online domain adaptation of a pre-trained cascade of classifiers
- CAFE: Catastrophic data leakage in vertical federated learning
- Secure, privacy-preserving and federated machine learning in medical imaging
- Contrastive adaptation network for unsupervised domain adaptation
- Adam: A method for stochastic optimization
- WILDS: A benchmark of in-the-wild distribution shifts
- Domain-adversarial neural networks to address the appearance variability of histopathology images
- Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks
- Drop to adapt: Learning discriminative features for unsupervised domain adaptation
- Learning overparameterized neural networks via stochastic gradient descent on structured data
- Bidirectional learning for domain adaptation of semantic segmentation
- Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation
- Domain adaptation with auxiliary target domain-oriented classifier
- A survey of deep neural network architectures and their applications
- DeepFashion: Powering robust clothes recognition and retrieval with rich annotations
- Learning transferable features with deep adaptation networks
- Conditional adversarial domain adaptation
- Learning under concept drift: A review
- Active universal domain adaptation
- Kitting in the wild through online domain adaptation
- SVM-based boosting of active learning strategies for efficient domain adaptation
- The impact of diversity on online ensemble learning in the presence of concept drift
- Multistep online unsupervised domain adaptation
- The deep bootstrap framework: Good online learners are good offline generalizers
- Handcrafted vs. non-handcrafted features for computer vision classification
- Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout
- Human rights and the right to be forgotten
- Automatic differentiation in PyTorch
- Multi-adversarial domain adaptation
- Moment matching for multi-source domain adaptation
- VisDA: A synthetic-to-real benchmark for visual domain adaptation
- DeepWalk: Online learning of social representations
- Active learning for domain adaptation in the supervised classification of remote sensing images
- The "right to be forgotten" in the GDPR: Implementation challenges and potential solutions
- Backups and the right to be forgotten in the GDPR: An uneasy relationship
- Active domain adaptation via clustering uncertainty-weighted embeddings
- Two-dimensional multilabel active learning with an efficient online adaptation model for image classification
- Domain adaptation meets active learning
- The right to be forgotten
- ImageNet large scale visual recognition challenge
- A generic framework for privacy preserving deep learning
- Active supervised domain adaptation
- Online deep learning: Learning deep neural networks on the fly
- Semi-supervised domain adaptation via minimax entropy
- Maximum classifier discrepancy for unsupervised domain adaptation
- Tactile object recognition using deep learning and dropout
- Gradient matching for domain generalization
- A DIRT-T approach to unsupervised domain adaptation
- FixMatch: Simplifying semi-supervised learning with consistency and confidence
- Return of frustratingly easy domain adaptation
- Continual unsupervised domain adaptation
- Reconsidering the 'right to be forgotten': memory rights and the right to memory in the new media era
- A hybrid approach to privacy-preserving federated learning
- Adversarial discriminative domain adaptation
- Process mining: a two-step approach to balance between underfitting and overfitting
- Humans forget, machines remember: Artificial intelligence and the right to be forgotten
- ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation
- Deep high-resolution representation learning for visual recognition
- Unsupervised domain adaptation via domain adversarial training for speaker recognition
- Resampling-based ensemble methods for online class imbalance learning
- Privacy-enhanced data collection based on deep learning for internet of vehicles
- An empirical analysis of dropout in piecewise linear networks
- Federated learning with differential privacy: Algorithms and performance analysis
- Online and offline domain adaptation for reducing BCI calibration effort
- Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms
- Hierarchical online domain adaptation of deformable part-based models
- HybridAlpha: An efficient approach for privacy-preserving federated learning
- Information leakage by model weights on federated learning
- Curriculum manager for source selection in multi-source domain adaptation
- Deep co-training with task decomposition for semi-supervised domain adaptation
- AliCG: Fine-grained and evolvable conceptual graph construction for semantic search at Alibaba
- DeepPAR and DeepDPA: privacy preserving and asynchronous deep learning for industrial IoT
- Bridging theory and algorithm for domain adaptation
- Deep domain adaptation from typical pneumonia to COVID-19
- Multi-source domain adaptation for semantic segmentation
- Deep leakage from gradients
- Medical imaging deep learning with differential privacy

Acknowledgments. We thank Caiming Xiong for his valuable insights into the project, and Junnan Li and Shu Zhang for their help in improving the experiments.