title: Cryo-Shift: Reducing domain shift in cryo-electron subtomograms with unsupervised domain adaptation and randomization
authors: Bandyopadhyay, Hmrishav; Deng, Zihao; Ding, Leiting; Liu, Sinuo; Uddin, Mostofa Rafid; Zeng, Xiangrui; Behpour, Sima; Xu, Min
date: 2021-11-17
journal: Bioinformatics
DOI: 10.1093/bioinformatics/btab794

Cryo-Electron Tomography (cryo-ET) is a 3D imaging technology that enables the visualization of subcellular structures in situ at near-atomic resolution. Cellular cryo-ET images help in resolving the structures of macromolecules and determining their spatial relationships within a single cell, which has broad significance in cell and structural biology. Subtomogram classification and recognition constitute a primary step in the systematic recovery of these macromolecular structures. Supervised deep learning methods have proven to be highly accurate and efficient for subtomogram classification, but suffer from limited applicability due to the scarcity of annotated data. While generating simulated data for training supervised models is a potential solution, a sizeable difference between the image intensity distribution of generated data and that of real experimental data causes the trained models to perform poorly when predicting classes on real subtomograms. In this work, we present Cryo-Shift, a fully unsupervised domain adaptation and randomization framework for deep learning-based cross-domain subtomogram classification. We use unsupervised multi-adversarial domain adaptation to reduce the domain shift between features of simulated and experimental data. We develop a network-driven domain randomization procedure with "warp" modules that alter the simulated data and help the classifier generalize better on experimental data. We do not use any labeled experimental data to train our model, whereas some of the existing alternative approaches require labeled experimental samples for cross-domain classification. Nevertheless, Cryo-Shift outperforms the existing alternative approaches in cross-domain subtomogram classification in the extensive evaluation studies demonstrated herein, using both simulated and experimental data.

Complex macromolecules participate in a large number of biochemical processes that sustain the cellular environment and govern cellular activities. An accurate analysis of the cellular processes enabled by the interaction of these macromolecules requires an in-depth analysis of their spatial organizations and native structures within the cell. Such an analysis needs the observation of macromolecules in situ, which has recently been made possible with cryo-Electron Tomography (cryo-ET). Cellular cryo-ET is an imaging technology that helps to visualize the 3D structures of subcellular components from a series of 2D projections generated through an electron microscope at cryogenic temperature, where the native structure and orientation of subcellular components in the cell are preserved (Lučić et al. [2013]). With the advent of cryo-ET, many 3D structures in the form of tomograms can be swiftly generated for the analysis of the macromolecules they contain. The difficulty, however, lies in decoding the structural information of the macromolecular complexes from these tomograms.

Figure 1: Domain shift illustrated between source and target distributions.
These reconstructed 3D tomograms have complexities in the form of very low Signal-to-Noise Ratio (SNR) and missing wedge effects, making the structural recovery of these macromolecules inherently difficult. Classifying the macromolecules contained in tomograms therefore goes a long way towards their structural recovery from tomogram-level data. To classify macromolecules in a tomogram, we first extract subtomograms: sub-volumes of the tomogram in which a single macromolecule is most likely to be present. These subtomograms are then classified based on the macromolecule they contain. CNN-based methods for supervised and semi-supervised classification of subtomograms that excel in accuracy and speed were proposed by Xu et al. [2017], Che et al. [2018], Gao et al. [2020] and Liu et al. [2019]. The classified subtomograms can then be aligned and averaged (Briggs [2013]) to obtain a higher-resolution average of the corresponding structure.

While obtaining large-scale experimental data in the form of tomograms for classification is not an issue, annotating this data with the corresponding macromolecule identifiers requires a considerable amount of time and compute. Common annotation methods such as template matching, proposed by Best et al. [2007], Beck et al. [2009] and Kunz et al. [2015], are extremely time-consuming and require quality control in the form of inspection by experts, effectively making the entire process long-drawn and laborious. This apparent lack of scalability in data processing makes large-scale data of macromolecular complexes in the form of labeled subtomograms hard to obtain. To get around this issue, unsupervised template-free subtomogram classification methods have been proposed by Xu et al. [2012], Bartesaghi et al. [2008], and Chen et al. [2014], while automatic annotation procedures using neural networks have been proposed by Chen et al. [2017]. Active learning-based methods that make use of minimal labeled data have also been proposed by Du et al. [2021].

A potential solution to the scarcity of annotated data is the development of classification algorithms that can classify subtomograms assisted by domain adaptation and randomization algorithms. Domain randomization methods work by training a classification algorithm on labeled simulated data, which is easily available, and randomizing this data so that the network trained on it can generalize more easily to real data, as proposed by Che et al. [2019]. In contrast to changing (randomizing) the simulated data, domain adaptation-based methods train the deep learning-based classifier in such a way that it can classify both the simulated and the real data, corresponding to the source and target domains. Domain in this context refers to data distributions that are related but different. The difference in data distributions typically manifests as a domain shift in the extracted feature space. The concept of domain shift is illustrated in Fig. 1. One of the first adversarial domain adaptation methods for the classification of subtomograms was proposed by Lin et al. [2019], while a few-shot method for adaptation was proposed by Yu et al. [2021].

All of these classification methods depend largely on the data simulation procedure used, as it provides the source dataset on which the base network is trained. While some cryo-ET data simulation methods simulate isolated macromolecules, the method proposed by Pei et al.
[2016] can dynamically simulate complete tomograms by packing multiple macromolecular complexes, with additional factors in the form of simulated noise, contrast transfer function, and missing wedge effects.

In this work, we propose the use of a network-driven domain randomization procedure in conjunction with an unsupervised domain adaptation algorithm for the classification of subtomograms. Fundamentally differing from Che et al. [2019], which focuses on randomizing hyper-parameters and then performing simulations, our network-driven domain randomization method, Cryo-Shift, gathers information from the already simulated data to add noise and distortions, allowing the model to generalize to the target domain. Furthermore, we propose the use of a multi-adversarial unsupervised domain adaptation framework, inspired by Pei et al. [2018], that takes in minimal unannotated data from the target domain to bridge the domain shift between the simulated and real datasets. We follow the same data generation procedure as Che et al. [2019] in our experiments. Our primary contributions can thus be summed up as:

• The development of a network-driven domain randomization algorithm, inspired by Zakharov et al. [2019], along with "warp" modules that distort and alter the simulated data to help the model generalize better.
• The addition of an adversarial loss function with a discriminator to help the warp network produce "realistic" warps.
• The addition of an unsupervised domain adaptation algorithm that helps reduce the domain shift between real and simulated subtomograms, improving cross-domain classification. Being completely unsupervised, the algorithm provides an efficient solution to the problems faced in cryo-ET subtomogram annotation.

Our methods primarily serve to reduce the domain gap between simulated and experimental data via network-guided domain randomization and unsupervised domain adaptation. Our overall pipeline consists of a warp network W, a "step zero network" M, a feature discriminator F_d, and a domain discriminator D. Our data simulation module S takes in a set of hyper-parameters h to simulate data from the corresponding density maps (d_map) of n classes (in our analysis, n = 4). As the first step of our pipeline, we train the base network M, consisting of a feature extractor F_e and a classifier C, with simulated data S^(h)(d_map) to predict class labels ŷ. The feature extractor F_e denotes the convolutional blocks in M, while the classifier C refers to the linear layers after F_e in M.

As the second step of the pipeline, we train the warp network W, the feature discriminator F_d, and the domain discriminator D, and retrain the base network M simultaneously. The warp network W is fed with simulated data S^(h)(d_map), which it distorts to return a warped set. The domain discriminator D and the warp network W are trained adversarially, where D tries to classify the outputs of W as simulated and W tries to fool D into making wrong decisions. The feature extractor F_e from the base network M is fed with a concatenated dataset consisting of the output of the warp network, W(S^(h)(d_map)), the simulated data S^(h)(d_map), and a subset d_r1 of the real data D_R. While the labels of the simulated data and of the output from the warp network are automatically available, the labels of d_r1 are not used here. Additionally, the output from the warp network is detached from the computational graph before being fed to F_e.
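As a rough sketch of how the second-stage batch described above could be assembled in PyTorch, the snippet below concatenates warped, simulated and real subtomograms and detaches the warp output from the computational graph. All names (warp_net, sim_x, real_x, and so on) are placeholders introduced for illustration; this is a simplified sketch, not the exact implementation.

```python
import torch

def build_second_stage_batch(warp_net, sim_x, sim_y, real_x):
    """Assemble one training batch for the feature extractor F_e.

    warp_net: the warp network W
    sim_x:    simulated subtomograms S^(h)(d_map), shape (b, 1, 40, 40, 40)
    sim_y:    class labels for sim_x
    real_x:   unlabeled experimental subtomograms from the subset d_r1
    """
    # Warp (randomize) the simulated data; detach so gradients from the
    # downstream losses do not flow back into the warp network here.
    warped_x = warp_net(sim_x).detach()

    # Concatenated input for F_e: warped + simulated + real.
    x = torch.cat([warped_x, sim_x, real_x], dim=0)

    # Class labels exist only for the warped and simulated samples.
    y = torch.cat([sim_y, sim_y], dim=0)
    labeled_mask = torch.zeros(x.size(0), dtype=torch.bool)
    labeled_mask[: y.size(0)] = True

    # Domain labels for the feature discriminators: 0 = simulated/warped, 1 = real.
    domain = torch.cat([torch.zeros(2 * sim_x.size(0)), torch.ones(real_x.size(0))])
    return x, y, labeled_mask, domain
```

The classification loss is then computed only on the samples selected by labeled_mask, while the domain labels feed the adversarial feature discriminators.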
The output from F_e is fed to the feature discriminator F_d and the classifier C. Here, F_d and F_e are trained adversarially, while C is trained only on the labeled subset of the data passed through the feature extractor F_e. While F_d tries to classify the features from F_e into real and simulated domains, C tries to classify the features into classes accurately. An important difference between the domain discriminator D and the feature discriminator F_d is that the former works directly on subtomograms while the latter works on features. Thus, while the former pushes the warped subtomograms to look more like real subtomograms, the latter ensures that the features learned by the network are similar in nature across the two domains. The overall algorithm is described in Algorithm 1, and Fig. 2 provides a diagrammatic representation of our method architecture.

The step zero network, or base network, is a simple classification network that we pre-train on the simulated data S^(h)(d_map). This network forms the baseline upon which further training takes place with the help of the warp modules and the adaptation algorithm. To establish the consistent nature of our algorithm, we make use of multiple architectures in our ablation experiments (Section 2.7), while proceeding with a single architecture, CB3D from Che et al. [2018], in the remaining experiments.

The warp network W is built of a set of warp modules that distort or alter the simulated data and help the base network generalize better to the experimental data during inference. Domain randomization is somewhat of a misnomer in the context of the warp modules we use: while domain randomization modules randomize the source domain to help the model generalize to any target domain, we specialize our warp network to work on a particular target domain. Although this restricts the portability of our model to arbitrary target datasets, it enhances the performance of our network on the single target dataset we train it on. Thus, while Zakharov et al. [2019] propose using a gradient reversal layer (GRL) to oppose the classifier when training the randomization network, we propose the use of a discriminator with a GRL to oppose our warp network. Here, we pass the output of the warp network and randomly sampled data from the validation set in equal proportions to the discriminator, which is tasked with classifying the data as real or fake. The presence of the GRL between the warp network and the discriminator makes the warp network inclined to fool the discriminator into classifying the warped data as real. The concept of the GRL, inspired by Ganin and Lempitsky [2014], is that gradients are reversed during backpropagation, thereby allowing the discriminator and the warp network to train in opposing manners.

The warp network consists of an encoder-decoder architecture in which the output of a single encoder E is fed to multiple decoders, each emulating one warping module. The encoder consists of 3 convolutional blocks with skip connections that connect to the decoder layers, where they are concatenated. The input to the encoder is batch-wise 3D cryo-ET subtomogram data I_n ∈ R^(b×1×40×40×40), with b denoting the batch size. The output from the encoder, the bottleneck B_e, can be expressed as O_e ∈ R^(b×256×5×5×5). This bottleneck is split into two equal parts channel-wise and fed to the two decoders, which act as warp modules.
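For reference, the gradient reversal layer mentioned above can be written in a few lines of PyTorch; this is a generic formulation following Ganin and Lempitsky [2014], with an illustrative scaling factor lam, rather than our exact layer.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Acts as the identity in the forward pass and multiplies the gradient
    by -lam in the backward pass, so the networks on either side of the
    layer are optimized in opposing directions."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and optionally scale) the gradient flowing back through the layer.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Example: the domain discriminator D sees the warped subtomograms through the GRL,
# so that minimizing D's loss simultaneously pushes the warp network W to fool D.
# logits = domain_discriminator(grad_reverse(warp_net(sim_x)))
```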
The warping modules we make use of are:
• Noise Module (W_n)
• Distortion Module (W_d)

The noise module (W_n) adds noise to the input subtomogram I_n as the first randomization step. Its encoder split is passed through a simple decoder with 3 convolutional blocks, where the skip connections from the encoder are concatenated at their respective layers. The output from the last decoder block is passed through a scaled sigmoid activation function, restricting its values to (-0.1, 0.1). The final output d_0 ∈ R^(b×1×40×40×40) is then summed with the input to the module, I_n, and is represented as O_n ∈ R^(b×1×40×40×40).

The distortion module (W_d) takes in two inputs: O_n from the noise module and a channel-wise split of the encoder output, E_dec. E_dec is fed to two subnetworks: the first consists of a single convolutional and a single linear layer, while the second is a decoder with 3 convolutional blocks with the corresponding skip connections from the encoder concatenated. The first subnetwork provides an output in the form of a single value per sample per channel, α ∈ R^(b×3), restricted to (0, 1), while the second decoder outputs a 3D dense grid G ∈ R^(b×3×40×40×40). This dense grid is then convolved with a 7 × 7 Gaussian 3D kernel with σ = 1 and multiplied channel-wise with α, where the 3 channels store the values of the x, y and z coordinates. The values in this new dense grid are then mapped onto the coordinates of O_n, and the corresponding deformed output subtomogram O_d is created.

The domain discriminator D consists of an encoder module acting as a binary classifier and is pre-trained on simulated data and the unlabeled subset d_r1 of the real data D_R. The discriminator predicts whether the data it is supplied with comes from the real dataset or the simulated dataset. While it strives to reduce the discriminative loss L_dd and classify the domains correctly, the warp network, trained adversarially alongside it, tries to fool it into classifying its output as real.

We propose an unsupervised domain adaptation paradigm inspired by Pei et al. [2018]. This paradigm consists of n similar feature discriminators, where n is the number of classes our classifier has. These discriminators are simply binary classifiers that try to classify the features into their respective domains (simulated or real). A concatenated set consisting of the output from the warp network, simulated subtomograms, and real subtomogram data (d_r1) in equal proportions is fed to the feature extractor F_e of the base model M at each iteration. The features from F_e are fed to each of the n feature discriminators. The domain labels for the simulated and warped data are set to zero, while the domain label for the experimental data is set to one. A weighted gradient reversal layer (GRL) is set up between the discriminators and the feature extractor, forming an adversarial relationship between the networks during training. During the calculation of the mean discriminative loss for training the feature discriminators, a weighting system is used in which the predicted class probabilities from the classifier C (detached from the computational graph) for the same features act as weights, as demonstrated in Equation 5. The pre-trained classifier, which can already predict real subtomograms with considerable accuracy, comes into use here: the discriminative loss for a particular discriminator is multiplied with the probability of the subtomogram belonging to that particular class, as predicted by the classifier.
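A simplified sketch of this class-probability weighting is given below. It illustrates the multi-adversarial idea of Pei et al. [2018] with placeholder names and a binary cross-entropy term; it is not our exact loss implementation.

```python
import torch
import torch.nn.functional as F

def weighted_feature_disc_loss(features, domain_labels, class_logits, feat_discriminators):
    """Multi-adversarial discriminative loss: each per-class discriminator's BCE loss
    is weighted by the classifier's (detached) probability for that class.

    features:            (X, d)  features from F_e (passed through the gradient reversal layer)
    domain_labels:       (X,)    0 = simulated/warped, 1 = real
    class_logits:        (X, N)  raw outputs of classifier C for the same samples
    feat_discriminators: list of N binary discriminators (one per class), each returning (X, 1)
    """
    probs = torch.softmax(class_logits, dim=1).detach()   # class confidences, no gradient to C
    total = 0.0
    for i, disc in enumerate(feat_discriminators):
        logits_i = disc(features).squeeze(1)               # domain logits from discriminator i
        per_sample = F.binary_cross_entropy_with_logits(
            logits_i, domain_labels.float(), reduction="none")
        total = total + (probs[:, i] * per_sample).mean()  # weight by confidence of class i
    return total / len(feat_discriminators)
```

Because the features reach these discriminators through a gradient reversal layer, minimizing this loss trains the discriminators while simultaneously pushing F_e towards domain-invariant features.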
The step-zero network M is trained on the domain-randomized simulated data S^(h) using L_task, a categorical cross-entropy loss, as the loss function. We make use of this pre-trained model M in the next stage, where we simultaneously train the base network M, the domain discriminator D, the warp network W, and the feature discriminator F_d. The discriminative loss L_disc is a combination of the domain discriminative loss L_dd, used to train D and W, and the feature discriminative loss L_fd, used to train F_d and F_e. As in the previous stage, a task loss L_task is also used in this stage to train the step-zero network M (F_e and C). The corresponding loss function for the entire algorithm can be expressed as the combination

    L = L_task(ŷ, y) + L_disc = L_task(ŷ, y) + L_dd(·, t_d) + L_fd(·, t_f),

where y represents the task ground truth, and t_d and t_f represent the domain ground truth for the two discriminators. The feature discriminative loss L_fd is a class-probability-weighted mean of per-discriminator cross-entropy terms,

    L_fd = (1 / X) Σ_{j=1..X} Σ_{i=1..N} prob_j^i · ℓ(F_i(y_j), t_f),

while L_dd is the binary cross-entropy between the domain discriminator's predictions and the domain ground truth t_d. Here ℓ denotes a cross-entropy term, F_i represents the individual discriminators from 1 to N, with N denoting the total number of classes (four in our case), y_j represents the one-dimensional features from F_e for the j-th sample, X denotes the total number of samples in a batch, and prob_j^i represents the class confidence of the i-th class for the j-th sample.

We follow the methodology described above and pre-train the step-zero network as a basic classification module. We use Adam as the optimizer with a learning rate of 1e-4 and betas set to 0.9 and 0.999. We make use of a similar optimizer in the second phase of training, with the learning rate set to 1e-5 for the step-zero network and the learning rates for the warp network, domain discriminator, and feature discriminator set to 5e-3, 1e-6, and 1e-5 respectively. Among these, the optimizer of the base network is monitored, with its learning rate reduced when the loss plateaus. The loss checked for this plateau condition is the task loss L_task on the validation set we constructed earlier and used as an unannotated data source. We train the step-zero network for 120 epochs in the first stage and for 50 epochs in the next stage of the pipeline.

The simulated data we use in our work is generated using a data generation process similar to that of Che et al. [2019]. The process starts with density maps obtained directly from EMDB (Electron Microscopy Data Bank) accessions or by running the Situs package (Wriggers et al. [1999]) on RCSB PDB (Protein Data Bank) files. In our experiments, density maps of four distinct classes (Ribosome, Proteasome, TRiC, and Membrane) were obtained from EMDB accessions. These maps have a shape of 40 × 40 × 40 with a voxel spacing of 1.368 nm. To simulate the missing wedge effect, the 3D structural data is projected onto a series of 2D projection images at varying tilt angles based on the specified Maximum Wedge Angle hyperparameter. This projection data is then convolved with the Contrast Transfer Function (CTF) and Modulation Transfer Function (MTF), adjusted by hyperparameters such as defocus and spherical aberration, to simulate the respective optical effects. At this stage, Gaussian noise is added to the data to simulate the corresponding SNR level, and the data is back-projected to obtain the 3D subtomogram. This 3D data is randomized by sampling hyperparameters uniformly from the following ranges, while in the specific case of the SNR the sampling is done on a logarithmic scale:

• SNR: 0.03 to 10
• MWA: 0° to 50°
• Dz: -12 to 0
• Cs: 1.5 to 3.0

where MWA stands for Maximum Wedge Angle, and Cs and Dz are the hyperparameters for spherical aberration and defocus respectively.
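As a small illustration of this sampling scheme, the following sketch draws one set of simulation hyperparameters from the ranges above (log-uniform for the SNR, uniform for the rest); the function name and dictionary keys are purely illustrative and not part of the simulation code.

```python
import numpy as np

def sample_simulation_hyperparameters(rng=np.random.default_rng()):
    """Draw one set of simulation hyperparameters from the ranges listed above."""
    return {
        # SNR is sampled on a logarithmic scale between 0.03 and 10.
        "snr": float(10 ** rng.uniform(np.log10(0.03), np.log10(10.0))),
        # Maximum wedge angle (degrees), defocus (Dz) and spherical aberration (Cs)
        # are sampled uniformly from their respective ranges.
        "mwa": float(rng.uniform(0.0, 50.0)),
        "dz": float(rng.uniform(-12.0, 0.0)),
        "cs": float(rng.uniform(1.5, 3.0)),
    }

# Example: hyperparameters h for one simulated subtomogram
# h = sample_simulation_hyperparameters()
```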
A sample from the simulated data is provided in Fig. 4. Further samples demonstrating the qualitative effect of these hyperparameters on the simulation are available in the supplementary material.

The real dataset we use in our experiments consists of 1051 subtomograms, which have been processed by template search and filtered manually from rat neuron tomograms (Guo et al. [2018]). The data has a tilt angle range of -50° to +70° and is divided into 4 classes, namely ribosome, mitochondrial membrane, TRiC, and single-capped proteasome. A summary of the class-wise distribution of the data in terms of the number of subtomograms is available in Table 4, and slices of a subtomogram belonging to each class are shown in Fig. 3. This data is further split in a 1:9 ratio, where the smaller part of the split is fed to the network as unlabeled real data for randomization and adaptation. Since this smaller subset of the data is passed as unlabeled and does not actively participate in reducing the classification loss, we reuse it as a validation set in our experiments.

We perform ablation experiments for the warp network, as the feature discriminator part of our algorithm does not involve multiple modules. For our ablation studies, we use multiple neural networks varying in depth to demonstrate the efficacy of our methods on more than one network architecture. The network architectures we use, along with their trainable parameter counts, are listed in Table 1. Table 3 presents the performance of the warp module along with the performance of the corresponding step-zero networks (in terms of macro-average accuracy across 4 classes). The notation "(W)" next to a neural network architecture denotes that the model has been trained with the warp network, while the absence of this notation implies that the performance recorded is that of the pre-trained step-zero network. An increase in average accuracy when using the warp modules is expected here, as we are not only using the domain randomization algorithm to fool the discriminator but are also training the classifier to keep up with it. Table 2 presents further ablation studies using CB3D as the base network, with modules zeroed out at places to demonstrate their individual contributions to the network. Table 2 also includes a study where we remove the domain discriminator and force the warp modules to train against the classification loss, as proposed in Zakharov et al. [2019]. Our ablation studies demonstrate the efficacy of each module in the warp network along with the individual components of the loss function we use. Additional ablation experiments on simulated data only are provided in the supplementary material.

The extensive results of randomization and adaptation on the CB3D model are presented in Table 5. We compare our results with relevant domain adaptation methods, FSADA (Yu et al. [2021]) and FADA (Motiian et al. [2017]), in Table 6. While these are domain adaptation methods that make use of labels in limited amounts, we propose a completely unsupervised procedure and outperform these methods by a large margin (7%). To further ensure a fair comparison with these methods, we train all of them on the same architecture as proposed by Yu et al. [2021].
We also include t-SNE plots in Fig. 4 that visualize the feature embeddings of the simulated and experimental data.

The study of in situ macromolecular complexes enabled by cryo-electron tomography is essential for the analysis of various cellular processes. Shortcomings in the imaging procedure, in conjunction with the high structural complexity of tomograms, make this analysis difficult to perform. The classification of subtomograms, as sub-volumes of these tomograms, forms an integral step in the structural recovery of macromolecules and aids in their analysis. Supervised classification methods, however, require large-scale annotated data that is extremely scarce. In this work, we propose an unsupervised domain adaptation and randomization framework that can train neural networks entirely on simulated data and unlabeled experimental data. We first train a base network on simulated data and then use a network-driven domain randomization framework to help the network make better predictions on the experimental dataset. Additionally, we make use of a multi-adversarial domain adaptation module in which we help the classifier generate similar features for both real and simulated data, thus mitigating the domain shift between the real and simulated data features and helping the model classify better in both of these domains. Our work can further be used for the automated annotation of subtomograms after being trained on similar simulated data, thus forming an important step towards the completely autonomous recognition and structural recovery of subtomograms with the help of neural networks, for better analysis of macromolecules and their interactions. We find that our work can be improved upon by future work aimed at further developing simulation algorithms to bridge the domain gap between real and simulated data. Unsupervised deep learning-based algorithms for the analysis of genomic data (Liu et al. [2021a]) have made considerable progress with the help of generative networks. A possibility lies in the use of novel conditional generative networks (Vahdat et al. [2021], Liu et al. [2021b]) to generate class-specific cryo-ET data to improve upon the algorithms demonstrated in this work.

An example of the simulation hyperparameters and their effect on the data being simulated is demonstrated in Figures S1-S8. To provide a qualitative analysis, we show results at the extremes of a particular hyperparameter range while maintaining the other hyperparameters at the mid-points of their ranges. Since the SNR is sampled on a logarithmic scale, we take the mid-point of the logarithmic scale for the SNR in these demonstrations. In these diagrams, C_0, C_1, C_2, C_3 refer to the four classes Ribosome, Proteasome, TRiC, and Membrane respectively.

We further provide some experiments on purely simulated data to gauge the performance of our model against the baseline step-zero network (CB3D). In these experiments, detailed in Table S1, we simulate the data at specified SNR levels and adapt our network to predict classes at a different SNR level with the help of network-driven domain randomization and adaptation. The top row in each cell of the table represents the performance of the baseline network, while the bottom row demonstrates our model's performance.
Cryo-electron tomography: the challenge of doing structural biology in situ
Deep learning-based subdivision approach for large scale macromolecules structure recovery from electron cryo tomograms
Improved deep learning-based macromolecules structure classification from electron cryo-tomograms. Machine Vision and Applications
Dilated-densenet for macromolecule classification in cryo-electron tomography
Semi-supervised macromolecule structural classification in cellular electron cryo-tomograms using 3D autoencoding classifier
Structural biology in situ: the potential of subtomogram averaging. Current Opinion in Structural Biology
Localization of protein complexes by pattern recognition
Visual proteomics of the human pathogen Leptospira interrogans
Mask-independent scoring of the reference bias
High-throughput subtomogram alignment and classification by Fourier space constrained fast volumetric matching
Classification and 3D averaging with missing wedge correction in biological electron tomography
Autofocused 3D classification of cryoelectron subtomograms
Convolutional neural networks for automated annotation of cellular cryo-electron tomograms
Active learning to classify macromolecular structures in situ for less supervision in cryo-electron tomography
Domain randomization for macromolecule structure classification and segmentation in electron cryo-tomograms
Adversarial domain adaptation for cross data source macromolecule in situ structural classification in cellular electron cryo-tomograms
Few shot domain adaptation for in situ macromolecule structural classification in cryoelectron tomograms
Efficient cryo-electron tomogram simulation of macromolecular crowding with application to SARS-CoV-2
Simulating cryo electron tomograms of crowded cell cytoplasm for assessment of automated particle picking
Multi-adversarial domain adaptation
DeceptionNet: Network-driven domain randomization
Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?
Unsupervised domain adaptation by backpropagation
Situs: a package for docking crystal structures into low-resolution maps from electron microscopy
In situ structure of neuronal C9orf72 poly-GA aggregates reveals proteasome recruitment
Few-shot adversarial domain adaptation
Simultaneous deep generative modelling and clustering of single-cell genomic data
Score-based generative modeling in latent space
Density estimation using deep generative neural networks

We thank Dr. Qiang Guo for sharing with us the experimental rat-neuron tomograms that we make use of as real data. This work was supported in part by U.S. National Institutes of Health (NIH) grants R01GM134020 and P41GM103712, U.S. National Science Foundation (NSF) grants DBI-1949629 and IIS-2007595, the AMD COVID-19 HPC Fund, and the Mark Foundation For Cancer Research 19-044-ASP. X.Z. was supported in part by a fellowship from the Center of Machine Learning and Health at Carnegie Mellon University.