key: cord-0436065-4mphx3yz authors: Wang, Siqi; Lu, Manyuan; Moshkov, Nikita; Caicedo, Juan C.; Plummer, Bryan A. title: Anchoring to Exemplars for Training Mixture-of-Expert Cell Embeddings date: 2021-12-06 journal: nan DOI: nan sha: 7408d4c6917294c6297ca42de7cce06f8c246eb8 doc_id: 436065 cord_uid: 4mphx3yz

Analyzing the morphology of cells in microscopy images can provide insights into the mechanism of compounds or the function of genes. Addressing this task requires methods that can not only extract biological information from the images, but also ignore technical variations, i.e., changes in experimental procedure or differences in the equipment used to collect microscopy images. We propose Treatment ExemplArs with Mixture-of-experts (TEAMs), an embedding learning approach that learns a set of experts that are specialized in capturing technical variations in our training set and then aggregates the specialists' predictions at test time. Thus, TEAMs can learn powerful embeddings with less technical variation bias by minimizing the noise from every expert. To train our model, we leverage Treatment Exemplars that enable our approach to capture the distribution of the entire dataset in every minibatch while still fitting into GPU memory. We evaluate our approach on three datasets for tasks like drug discovery, boosting performance on identifying the true mechanism of action of cell treatments by 5.5-11% over the state-of-the-art.

Cell images can be used to infer the effects and mechanisms of compounds in many contexts, following an approach known as image-based profiling [28], which requires a comprehensive and generic feature representation of cell morphology. This task has traditionally been addressed with hand-engineered features [29], and more recently, transfer learning with models pre-trained on natural images [1, 24]. Training models directly on cellular images holds the potential to improve the sensitivity of biological experiments [5, 28]. However, ground truth annotations are not available for supervised training, primarily because the effects (or mechanisms) of compounds are usually not known in advance in real-world conditions (e.g. drug discovery). Thus, prior work has either used a small dataset with a limited set of ground-truth mechanistic classes [9], or, more commonly, used treatment labels for classification instead of ground-truth mechanistic classes [4]. Given that treatment labels are always known (i.e. chosen by scientists), they can be used as weak labels to supervise a model with the expectation that the latent space captures relevant properties of mechanistic classes.

* Equal contribution

Figure 1. Strategies to compute single-cell embeddings. (top) Treatment classification optimizes a CNN to identify biological treatments, which are weak labels with respect to the true mechanism of action [4]. (bottom) Proposed model: a metric learning approach with a mixture of experts (to capture technical variation), treatment exemplars (to capture phenotype distributions), and a memory bank (to facilitate learning).

Figure 2. TEAMs during training. The training input has three parts: a single-cell image, technical variation group, and treatment label. First, a base feature representation is generated from a CNN backbone (ResNet-18 [10] in our experiments). Then it is transformed by a variation expert specified by the cell's technical variation group.
Note that only one expert is activated according to the group input (although different experts may be used for other cells in a minibatch). Next, the cell embedding is compared with learnable Treatment Exemplars, which model the canonical representation of the cells for each treatment. Finally, feature memoization is used to efficiently increase the batch size by reusing embeddings from recent minibatches. See Figure 3 for an illustration of using TEAMs for inference.

While using (weak) treatment labels directly in classification models has been shown to be useful, there are limitations in generalizing features that make it suboptimal for image-based profiling. The power of image-based profiling lies in modeling morphology as a continuous transition of cell states that can be compared, rather than as categorical, discrete groups. The categorical classification loss, typically used in weakly supervised learning, imposes a discrete grouping that may separate cells not by their relevant biological traits, but by other irrelevant factors of variation, including technical artifacts, which can affect the ability of these models to generalize. As illustrated in Figure 1, prior work (e.g., [4]) has typically addressed this through a post-hoc normalization approach using cells from a control group, but such an approach is ineffective if informative features were not learned in the initial training step.

In this work, we propose Treatment ExemplArs with Mixture-of-experts (TEAMs), a metric learning approach for training single-cell representations from biological images. Our approach has three components, each of which addresses a complementary challenge. First, learnable Treatment Exemplars transform the metric learning problem into a type of cluster prediction task, where single cells are encouraged to embed near the exemplars of the treatments that produced them. This provides an efficient mechanism for capturing the distribution of the entire dataset in each minibatch, ensuring that informative samples are always present. Second, we use variation experts to learn embeddings that are specialized to a specific set of technical variations. Each projection learned by an expert takes cell features from a shared general embedding space to a new subspace. Then, at test time we mix together the predictions of all of our experts. This helps us avoid overfitting to just one setting, reducing variance due to spurious correlations caused by a single set of variations. Finally, we use a cross-batch memory module [40] that reuses samples computed in recent batches to efficiently increase the batch size, boosting the information used to update model parameters. Figure 2 contains an overview of our approach, and a minimal training-step sketch follows the contribution list below. To summarize, our contributions are:

• We introduce TEAMs, a novel feature learning approach that improves performance on downstream tasks like identifying the true mechanism of action of cell treatments by 5.5-11% over the state-of-the-art.
• We show that our Mixture-of-Expert cell embeddings suffer from fewer technical variation defects; compared with baselines like adversarial alignment [7] or self-challenging [11], our method outperforms them by 1-1.5%.
• We demonstrate that our learnable Treatment Exemplars can boost downstream performance by learning more informative features due to seeing the entire distribution of the training data in each minibatch.
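To make the pipeline in Figure 2 concrete, below is a minimal PyTorch sketch of one training step, written under our own assumptions rather than taken from the paper's code: the class name `TEAMsModel` and all variable names are illustrative, and the loss line uses a standard cross-entropy (softmax over exemplar similarities) form of the exemplar objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class TEAMsModel(nn.Module):
    def __init__(self, n_treatments, n_variations, dim=512):
        super().__init__()
        backbone = resnet18(num_classes=dim)
        # Single-cell crops have 5 fluorescence channels rather than RGB.
        backbone.conv1 = nn.Conv2d(5, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        self.backbone = backbone
        # One linear "expert" per technical variation group (Sec. 3.2).
        self.experts = nn.ModuleList([nn.Linear(dim, dim)
                                      for _ in range(n_variations)])
        # Learnable treatment exemplars, one vector per treatment (Sec. 3.1).
        self.exemplars = nn.Parameter(torch.randn(n_treatments, dim))

    def forward(self, images, variation_ids):
        feats = self.backbone(images)            # shared base features (B, dim)
        # Route each cell through the expert of its own variation group.
        out = torch.stack([self.experts[int(v)](f)
                           for f, v in zip(feats, variation_ids)])
        return F.normalize(out, dim=-1)          # l2-normalized embeddings

model = TEAMsModel(n_treatments=1600, n_variations=3)
images = torch.randn(8, 5, 128, 128)             # minibatch of cell crops
variation_ids = torch.randint(0, 3, (8,))
treatments = torch.randint(0, 1600, (8,))
emb = model(images, variation_ids)
# Softmax over similarities to all exemplars pulls each cell toward its
# own treatment's exemplar (a cross-entropy form of Eq. 1 below).
logits = emb @ F.normalize(model.exemplars, dim=-1).t()
loss = F.cross_entropy(logits, treatments)
```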
Deep learning for image-based profiling. Cell morphology has traditionally been measured with classic handcrafted features in high-throughput biological experiments [3]. Early work with convolutional networks demonstrated that models pre-trained on ImageNet can be used to obtain cell morphology embeddings [1, 8, 12, 24]. This transfer learning approach can improve performance without requiring any training: only a post-hoc re-normalization is applied to calibrate the relevant factors of variation. Training feature extraction models directly on cell images is attractive given the large amounts of data acquired in high-throughput experiments (thousands of compounds are tested in parallel). However, well-curated and manually annotated datasets with ground truth labels of cellular phenotypes do not exist. Several studies have trained neural networks with cellular images using unsupervised learning, including variational auto-encoders [16], image in-painting [18], deep clustering [13], and contrastive learning [25]. Weakly supervised learning was proposed to train models using treatment labels as a pretext task [4]. Treatment labels are always known for a given high-throughput experiment, but they are considered weak labels because the true phenotypic response of cells (also known as mechanism of action) is not known for all treatments. Like the other methods we have discussed, this approach may also require post-hoc feature re-normalization to account for technical variation. That said, it has been successfully adopted to analyze treatments for diseases such as COVID-19 [6] and anti-aging treatments [42]. In contrast, our TEAMs model is designed to minimize the effect of technical variation without requiring any post-hoc processing steps.

Deep metric learning. Most relevant to our work are methods that learn multiple embedding spaces within a single model (e.g., [17, 22, 26, 27, 34, 36, 37]). However, most of these models have been developed for settings where they can assume something is known about which embedding to use at test time, whereas for our task we make no such assumptions. In contrast, we use our embedding spaces so that each part of our model becomes an expert at capturing information about cells that share similar technical variations, and then we minimize the effect of spurious correlations by aggregating predictions across experts. This goal is similar to methods that were developed to generalize across domains (e.g. [11, 14, 30]), but that line of work typically involves trying to align the feature distributions of different domains into a shared embedding space. However, what constitutes an artifact rather than an informative feature is not known beforehand, so methods focusing on alignment may unintentionally remove informative features, whereas our approach can still take advantage of them with our experts.

Given two images of single cells, our goal is to measure if they share a mechanistic class. More formally, given a pair of images $(I_a, I_p)$ our goal is to embed them nearby each other if they are from the same mechanistic class, i.e., $M_a = M_p$, or to embed them far away from each other otherwise, i.e., $M_a \neq M_p$. However, since mechanistic classes are not available during training, we use the type of treatment given to a cell as a proxy label for mechanistic classes. Then, $I_a, I_p$ are embedded according to treatment labels $T_a$ and $T_p$.
A standard way of learning this kind of embedding is using a triplet loss, but performance is often very sensitive to the size of the minibatches used for training and how the images are sampled [43]. In addition, technical variations may cause a distribution mismatch between training and inference [4], reducing performance. To address these issues, we introduce TEAMs, which has separate modules that address each problem outlined above. Specifically, Section 3.1 transforms our representation learning task into a cluster prediction task that effectively allows us to represent the entire distribution of our training data in every minibatch without requiring expensive offline sampling techniques, Section 3.2 learns ensembles of experts that minimize the effect of technical variations, and Section 3.3 improves the quality of our gradients by effectively increasing the minibatch size using a memory module.

Training with the entire dataset in a single minibatch is often not possible due to limitations in GPU memory. Thus, minibatches are employed that typically represent a very small subset of the data. Selecting informative samples can have a significant impact on performance [43]. Thus, intelligent sampling techniques often focus on finding hard negatives (e.g. [31, 39]), but this effectively introduces a bias during training due to focusing more on certain samples. Wu et al. [43] demonstrated that a sampling method that is more representative of the distribution of the dataset can improve performance. However, they relied on an expensive offline sampling approach, where pairwise comparisons between all training samples are needed. The statistics about the distribution of the dataset are collected every few epochs during training, which can be prohibitively expensive for very large datasets.

Inspired by recent work in metric learning [19, 23, 35], we reformulate our problem such that each image is compared with an exemplar for every treatment in our training set. These treatment exemplars are learnable parameters that are meant to represent the canonical representation of a cell that underwent a specific treatment in our training set. Since these exemplars are the same size as the final representations of our cells (512-D in our experiments) they are extremely memory efficient. As a result, in our experiments we are able to compare every image in a minibatch with the exemplars for treatments that represent the entire dataset, which contains images of over a million cells, thereby accurately representing the entire dataset distribution in a minibatch. Our exemplars are trained by predicting the treatment that produced each cell. Given treatment $t \in T$, the corresponding $\ell_2$-normalized image features of a cell $I_t$, and a set of $\ell_2$-normalized exemplars $C$ for each treatment (i.e., $|C| = |T|$), we minimize the cosine distance $d(\cdot)$ between image $I_t$ and its corresponding exemplar $C_t$ while maximizing its distance to all other exemplars $Z$, i.e.:

$$\mathcal{L}_{exemplar} = -\log \frac{\exp(-d(I_t, C_t))}{\sum_{C_z \in Z} \exp(-d(I_t, C_z))}. \quad (1)$$

Note that we also explored modifications such as giving exemplars a higher learning rate or using temperature scaling, which was shown to be beneficial in prior work [35], but found they did not improve performance in our experiments.
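The following is a minimal sketch of the exemplar loss in the style of the proxy-based losses cited above [23, 35]. It is an assumption-laden illustration, not the paper's implementation: variable names are ours, and the softmax here keeps the positive term in the denominator, a common variant of Eq. (1).

```python
import torch
import torch.nn.functional as F

def exemplar_loss(embeddings, exemplars, treatment_ids):
    """embeddings: (B, D) l2-normalized cell features.
    exemplars: (T, D) learnable parameters, one per treatment.
    treatment_ids: (B,) index of the treatment that produced each cell."""
    exemplars = F.normalize(exemplars, dim=-1)
    # Cosine distance d = 1 - cosine similarity, for every (cell, exemplar) pair.
    dist = 1.0 - embeddings @ exemplars.t()                  # (B, T)
    # Softmax over negative distances: pull each cell toward its own
    # exemplar and push it away from all others.
    log_probs = F.log_softmax(-dist, dim=-1)
    return -log_probs[torch.arange(len(treatment_ids)), treatment_ids].mean()

exemplars = torch.nn.Parameter(torch.randn(1600, 512))       # e.g., 1,600 treatments
emb = F.normalize(torch.randn(8, 512), dim=-1)
loss = exemplar_loss(emb, exemplars, torch.randint(0, 1600, (8,)))
```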
Technical variations in how cell images were collected can create a shift in the distribution between datasets. Working with images from differently distributed datasets can thus be cast as the problem of aligning features across domains, e.g., using an adversarial classifier [7] to resolve the mismatches between train/test distributions. Self-challenging has also been used to create models that are more robust across domains by ensuring many different kinds of features are learned [11]. This suggests that learning multiple kinds of features, or, in other words, different specialized experts, can help improve performance across domains. However, Huang et al. [11] learned these experts by forcing non-dominant features to activate according to the labels, which does not explicitly model the different distributions that can occur across different domains, limiting its expressiveness. In contrast, we use a mixture-of-experts approach that ensembles a set of variation-specific experts, each learning a different distribution separately, to minimize the technical noise.

Let $V$ be a set of training variations, where each variation is a certain technical setting used to collect the data. For each variation $v \in V$, we train a variation-specific expert to learn a projection $W_v$, which transforms the cell features from a shared network backbone into the final variation-specific representation. The projection $W_v$ is intended to suppress the noise of its specific technical variation and output only informative biological features as the variation-specific representation. The cell images from different variations can then be compared with the single set of Treatment Exemplars in Section 3.1, i.e., we compute $W_v I$, where $I$ represents the cell feature from our shared network backbone. In our experiments, an 18-layer deep residual network [10] is used, but our approach generalizes to any architecture (see supplementary for EfficientNet-B0 [33] results). Using a common network backbone encourages feature sharing across settings while also minimizing the amount of specialized knowledge that can be captured by using a single linear projection, which can reduce overfitting to a single set of technical variations.

Figure 3. TEAMs during inference. During training, the technical variation group is known to us, and hence we can choose the expert to use for each cell image. However, at test time we assume we are not provided with this information, and selecting a single expert may also result in a biased prediction. Thus, we found that averaging the predictions of all experts by concatenating their features together results in the best performance.

During inference, however, little information about the variation distribution is known. Selecting a single variation-specific expert could be challenging, and no single expert may accurately represent the new data. A single expert may also contain spurious correlations from its observed technical variations. Instead, we represent a cell image by obtaining the variation-specific representation from each of our experts and concatenating them together (illustrated in Figure 3 and sketched below). Since we use a shared network representation, obtaining features from all our experts is efficient, as each is obtained using a single linear projection. Note that we $\ell_2$-normalize each variation-specific representation separately before concatenation so that they accurately represent the features learned during training. Our mixture-of-experts approach is similar to Wang et al. [41], which addressed the task of promoting fairness in image classification, where the goal is to minimize the performance differences across a set of protected attributes. Our mixture-of-experts can be seen as an adaptation of their method to our task of learning robust representations of cell images.
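Continuing the hypothetical training sketch from earlier, a minimal sketch of expert aggregation at inference time could look as follows; the function name and structure are our own assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def infer_embedding(backbone, experts, images):
    feats = backbone(images)                        # shared base features (B, D)
    # l2-normalize each variation-specific representation separately,
    # then concatenate across all experts.
    parts = [F.normalize(w(feats), dim=-1) for w in experts]
    return torch.cat(parts, dim=-1)                 # (B, D * n_experts)

# Usage with the TEAMsModel sketch from earlier:
#   emb = infer_embedding(model.backbone, model.experts, images)
```

One nice property of this design: cosine similarity on the concatenation of separately normalized expert outputs equals the average of the per-expert cosine similarities, which is why concatenation acts as an averaging of the experts' predictions.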
Our Treatment Exemplars described in Section 3.1 provide a mechanism where each cell image is compared with the entire distribution of the dataset in a minibatch. However, the converse is not true: each exemplar may not have any representative samples from the treatment it represents, and instead is only updated with information from negative samples. If we alter our minibatches to have a balanced number of cells from each treatment, they may still not fit into GPU memory in datasets with many treatments. Wang et al. [40] observed that embeddings between minibatches in close proximity during training do not change much, and would likely provide a very similar signal if they were simply retained in the nearby subsequent training iterations rather than recomputed. Thus, we keep the K most recent embeddings (but not their underlying gradients) in a memory module $\Phi$ and simply reuse them in future minibatches to compare them with their proxies via:

$$\mathcal{L}_{memory} = -\sum_{I_t \in \Phi} \log \frac{\exp(-d(I_t, C_t))}{\sum_{C_z \in Z} \exp(-d(I_t, C_z))}. \quad (2)$$

The number of iterations a sample is retained depends on the relative size of K and the minibatch size. For example, if K = 256 and the minibatch has 128 samples, then each sample would be retained for 256/128 = 2 iterations. The total loss function for TEAMs is a linear combination of our memory and exemplar losses, i.e.,

$$\mathcal{L}_{TEAMs} = \mathcal{L}_{exemplar} + \lambda \mathcal{L}_{memory}. \quad (3)$$

By effectively increasing the minibatch size using our memory module, each exemplar is more likely to be provided with samples from its own treatment during training, making gradient updates more effective.
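A minimal sketch of the cross-batch memory, following the idea of [40] and reusing the hypothetical `exemplar_loss` from earlier; the class name is ours, and `lam` is a hypothetical mixing weight since the paper does not state one.

```python
import torch

class FeatureMemory:
    """Queue of the K most recent embeddings and their treatment labels."""
    def __init__(self, size, dim):
        self.size = size
        self.feats = torch.empty(0, dim)
        self.labels = torch.empty(0, dtype=torch.long)

    def update(self, feats, labels):
        # Detach: old embeddings are reused as constants, no gradients flow.
        self.feats = torch.cat([self.feats, feats.detach()])[-self.size:]
        self.labels = torch.cat([self.labels, labels])[-self.size:]

memory = FeatureMemory(size=2816, dim=512)   # memory size used in the paper

# Inside the training loop (emb, treatments from the current minibatch):
#   loss = exemplar_loss(emb, exemplars, treatments)
#   if len(memory.feats) > 0:
#       loss = loss + lam * exemplar_loss(memory.feats, exemplars, memory.labels)
#   memory.update(emb, treatments)
```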
Cell images for our experiments are from three Cell Painting [2] datasets, which represent large treatment screens of chemical and genetic perturbations. The BBBC022 and BBBC036 datasets are high-throughput compound screens testing 1,600 and 2,200 bioactive compounds, respectively. The BBBC037 dataset is a genetic over-expression screen of 200 genes. All three datasets were obtained by exposing U2OS cells (human bone osteosarcoma) to the treatments. Each treatment is tested in 5 replicates, using multi-well plates, and then imaged with the Cell Painting protocol [2], which is based on six fluorescent markers captured in five channels (Figure 4). The cell population in each replicate is captured with up to 9 images (typically 1080×1080 pixels) using the same magnification (20X). One image may contain hundreds of cells, and we identify each one to quantify treatment effects at single-cell resolution. Our goal is to model cell morphology features that capture similarities between the effects of treatments, which we evaluate by looking into the ground truth mechanism of action classes. The numbers of mechanistic classes in our datasets are 453, 693, and 29 for BBBC022, BBBC036, and BBBC037, respectively.

Figure 4. Cell Painting Example. Images are from three treatments and one control, sampled from a high-throughput experiment with thousands of treatments. The morphology of cells reveals their response to treatments, which can be used to identify effectiveness and relationships with other treatments, according to mechanism of action classes.

Data Pre-processing. All five-channel images were first processed with a retrospective illumination correction algorithm to fix the uneven illumination distribution under the microscope objective [32]. Next, we run CellProfiler [21] to obtain cell segmentations using the seeded-watershed algorithm [38]. The locations of cell centers were recorded and used to crop out individual cells in a 128×128 image with five channels. We save each single cell in a separate 8-bit PNG file, with all channels concatenated in a single sequential strip (illustrated in Figure 4). For each pre-processed Cell Painting dataset, we perform 5-fold cross-validation where we split the images by treatment into train/test/validation, ignoring images belonging to the control group during training following [4], resulting in an average of about 470K/210K/73K images per split, respectively.

We evaluate performance across three experiments summarized in Table 1. First, for every (anchor) image $I_a$ in the test (or validation) split, we sample two additional images $I_p, I_n$ to create a triplet of images. One of the sampled images is a cell image that shares a mechanistic class $m$ with the anchor, whereas the other sampled image shares no mechanistic classes. Both images are sampled at random from among the set that satisfies the constraints for that group. Performance is measured by how often the image that shares a mechanistic class with the anchor is predicted as more similar than the image that shares no mechanistic classes. As a reminder, the mechanistic class labels are not used during training, so this experiment is a type of transfer learning problem. The second experiment is similar, but the image that does not share a mechanistic class with the anchor is randomly selected from among the images in the control group. The third experiment is analogous to the first, but the triplet being selected is over mechanistic classes. Specifically, for each (anchor) treatment $t_a$ in the test set, we select a pair of treatments $t_p, t_n$ where one shares a mechanistic class with the anchor, and the other treatment shares no mechanistic class. For treatment triplets, the similarity is computed by averaging the pairwise similarity between all image pairs belonging to each treatment. For each split we randomly sample 1,000 treatment triplets for BBBC037 and 50,000 triplets for BBBC022 and BBBC036.

Table 1. Experiment Metrics. $I$ is a single cell image; $m$ is a mechanistic class; $M_{test}$ is the set of mechanistic classes in the test dataset; $t$ is a treatment, which is a set of images here: $t_k = \{I \mid I_{treatment} = k\}$; $T_{test}$ represents the treatments in the test dataset.

Implementation details. We train TEAMs to create a 512-D embedding using Adam [15] with a learning rate of 1e-3, which is decayed exponentially using a gamma of 0.9. We train for 40 epochs with a batch size of 768, using a memory size of 2,816, and select the epoch that performs the best on the validation set, which is evaluated after every epoch. We train our models using a single NVIDIA RTX A6000 GPU. Each Cell Painting dataset is treated as a set of technical variations (i.e., they each get their own experts).

Transfer learning. This approach uses a network pretrained on ImageNet as a feature extractor of the morphology of the cell. First, each gray-scale channel of the cell is transformed into an RGB tensor by replicating its content in the channel axis. Then, each channel is processed separately by the pretrained network and their feature vectors are concatenated in a single representation. We used EfficientNet-B0 [33] in our experiments, and keep the features from the pooling layer before the classifier, resulting in a 6,400-D vector per cell (1,280 features per channel).
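A minimal sketch of the transfer-learning baseline just described, under our own assumptions (in particular, ImageNet normalization and resizing are omitted for brevity): each of the five fluorescence channels is replicated to three channels, passed through a pretrained EfficientNet-B0 with its classifier removed, and the pooled features are concatenated (5 × 1,280 = 6,400-D).

```python
import torch
from torchvision.models import efficientnet_b0

net = efficientnet_b0(weights="IMAGENET1K_V1")
net.classifier = torch.nn.Identity()     # keep the pooled 1,280-D features
net.eval()

@torch.no_grad()
def cell_features(cell):                 # cell: (5, H, W) single-cell crop
    feats = [net(ch.unsqueeze(0).expand(3, -1, -1).unsqueeze(0))
             for ch in cell]             # one forward pass per channel
    return torch.cat(feats, dim=-1)      # (1, 6400)

emb = cell_features(torch.rand(5, 224, 224))
```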
Treatment classification [4]. This approach follows a weakly supervised strategy to representation learning where treatment labels are used to train a classification network. The network is assigned the problem of classifying single cells into one of the known treatments, assuming that all the cells in a treatment display a similar phenotypic response. In our experiments, we trained an EfficientNet-B0 backbone with one classifier for each dataset. We discard the classifier and keep features from the last pooling layer before the classifier, resulting in 1,280-D vectors.

Online Negatives. We also compare to a model trained with a margin-based loss with online negative mining (sketched after the baseline descriptions below). We assume we are provided with two pairs of images $(I_x, I_p)$ and $(I_y, I_n)$ which represent cells from the same and different treatments, respectively. Note that unlike a traditional triplet loss, where $I_x = I_y$, we found we could improve performance by relaxing this constraint to allow the case where $I_x \neq I_y$, and computing the loss over all possible pairs within a minibatch. I.e., we construct minibatches by sampling pairs of cell images from the same treatment. Then, we obtain their embeddings using the same encoder as our approach (ResNet-18 [10]) and $\ell_2$-normalize them. We then compute the cosine similarity between all possible pairs of cell images in the minibatch and separate them into a set of positive image pairs $P$ for images from the same treatment and negative image pairs $N$ for images from different treatments. Thus, our loss is computed as:

$$H(I_x, I_p, I_y, I_n) = \max(0, m + d(I_y, I_n) - d(I_x, I_p)), \quad (4)$$

$$\mathcal{L}_{online} = \sum_{(I_x, I_p) \in P} \sum_{(I_y, I_n) \in N} H(I_x, I_p, I_y, I_n), \quad (5)$$

where we set the margin $m = 0.3$.

Adversarial [7]. This approach uses an adversarial classifier to align features across different domains. This is implemented as a linear classifier that takes the features from our network backbone (i.e., those that are input into Eq. (5)) and then predicts the variation that the treatment came from. Then, the gradients that are backpropagated into the network backbone from the adversarial classifier are flipped and scaled in order to encourage the underlying CNN to learn features that cannot be used to discriminate between variations. We found a scaling factor of 1e-2 worked well in our experiments.

Self-Challenge [11]. This method develops more robust models by iteratively muting a percentage of the most important features for making a prediction during training, thereby forcing the model to learn how to align additional features to the target labels. In our experiments, we set the percentage of features dropped to 50%, which is applied to 1/3 of each minibatch.
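Below is a minimal sketch of the Online Negatives loss from Eqs. (4)-(5): every positive-pair similarity in the minibatch is contrasted against every negative-pair similarity. Names are illustrative, and we average the hinge terms rather than summing them, a common scaling choice.

```python
import torch
import torch.nn.functional as F

def online_negatives_loss(emb, treatments, margin=0.3):
    """emb: (B, D) l2-normalized embeddings; treatments: (B,) labels."""
    sim = emb @ emb.t()                                  # cosine similarities
    same = treatments.unsqueeze(0) == treatments.unsqueeze(1)
    off_diag = ~torch.eye(len(emb), dtype=torch.bool)
    pos = sim[same & off_diag]                           # P: same-treatment pairs
    neg = sim[~same]                                     # N: different-treatment pairs
    # Hinge on every (positive, negative) combination, as in Eq. (5).
    h = (margin + neg.unsqueeze(0) - pos.unsqueeze(1)).clamp(min=0)
    return h.mean()

emb = F.normalize(torch.randn(8, 512), dim=-1)
loss = online_negatives_loss(emb, torch.randint(0, 3, (8,)))
```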
Table 2. Single cell results. We compare how often a pair of images that share a mechanistic class are predicted as more similar than a pair of images that share no mechanistic class.

Table 2 reports our results separating single cells by mechanistic class as well as from controls. Comparing our online negatives baseline in the first line of Table 2(b) to prior work in Table 2(a), we see that our baseline provides a boost of 3.5-4% on average. While we show that using prior work for ignoring technical variation via methods like adversarial alignment can improve performance, we see that our Mixture-of-Experts (MoE) obtains the best performance in Table 2(b). We also report that our Treatment Exemplars, in the first line of Table 2(c), give a 2% boost when separating by mechanistic class over our online negatives baseline. Note, however, this comes at a small cost when separating from controls. This may be due to the fact that we do not explicitly model controls in our model, which may also be useful in identifying cells that have no reaction to their treatment. We believe investigating this further may be a useful direction for future work. Finally, we see through the ablations of our model in Table 2(c) that each of the three components of TEAMs provides a meaningful contribution to our best average performance across the three datasets, where we outperform prior work in Table 2(a) by 6-6.5%.

Table 3 reports the performance of training on only two of the three datasets. During training, only two variation experts are trained for the two datasets, respectively. The results on the held-out dataset (diagonal values in the table) show how the experts work together for the new technical variation. For each column, comparing the results from the first three rows to the last row shows how the held-out dataset variation could help training. Notably, performance on the BBBC036 dataset is most improved when combined with all three datasets, while not training on BBBC022 results in a significant drop in performance on that dataset. Comparing these results to the baselines in Table 2(a), we see that even when fewer images and examples of technical variation are available for training, our approach still provides up to a 3% boost in overall performance.

Table 4 shows how TEAMs compares to prior work when aggregating predictions over entire cell treatments in order to identify treatments that share mechanistic classes. Overall, TEAMs boosts performance by 5.5-11% over prior work across the three datasets. Over individual datasets, the largest boost in performance is seen on the BBBC036 dataset, which improves over the best model from prior work by 8.5%.

Table 4. Treatment-level results. We compare how often a method can accurately identify which treatments share a mechanistic class after averaging the pairwise similarity scores between all unique pairs of cells from different treatments. See Section 5.3 for discussion.

Qualitative results. Figure 5 compares the embeddings learned by our approach, visualized with UMAP [20], to those created by Treatment Classification [4]. We see in Figure 5 (top) that TEAMs is able to create more homogeneous clusters compared to Treatment Classification, although this includes creating multiple clusters of cells that share mechanistic classes. We note that some treatments that are labeled as having different mechanistic classes may actually have some undiscovered similarities in mechanistic classes as well. However, as can be seen in Figure 5 (bottom), much of the separation within the same mechanistic class can be accounted for by coloring the clusters by treatment. This suggests that TEAMs is far more capable of learning a good embedding for the training task, but that the features learned do not always translate perfectly to the target task. In addition, there may be some effect due to technical variation that could be further improved in future work. Our quantitative results still demonstrate that despite these limitations our approach is more effective than prior work.
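For concreteness, here is a small sketch of the evaluation protocols used above, under our own naming: triplet accuracy for the single-cell experiments, and treatment-level similarity as the mean pairwise cosine similarity between cells of two treatments, as used for the treatment triplets in Table 4.

```python
import torch

def triplet_accuracy(anchor, positive, negative):
    """Each argument: (N, D) l2-normalized embeddings for N sampled triplets.
    Returns the fraction where the positive is closer than the negative."""
    pos_sim = (anchor * positive).sum(dim=-1)
    neg_sim = (anchor * negative).sum(dim=-1)
    return (pos_sim > neg_sim).float().mean()

def treatment_similarity(cells_a, cells_b):
    """Average pairwise cosine similarity between all cell pairs
    of two treatments (each: (Ni, D) l2-normalized embeddings)."""
    return (cells_a @ cells_b.t()).mean()
```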
This study was conducted with biological images of human bone osteosarcoma cells, an immortalized cell line used for research purposes only. The images and data in this study do not contain patient information of any kind. The use of these images, and the algorithms to analyze them, is to test the effects of treatments. Automating drug discovery has positive impacts on society, specifically the potential to help find cures for diseases of pressing need around the world in less time and with fewer resources. The proposed methods could be used to optimize drugs that harm people; we do not intend that as an application, and we expect regulations in biological labs to prevent such uses.

We provide a novel metric learning approach, Treatment ExemplArs with Mixture-of-experts (TEAMs), to capture biological features from single-cell images. Learnable treatment exemplars, technical variation experts, and feature memoization are the three key modules of the model. Learnable treatment exemplars represent the entire distribution of the dataset in every minibatch, variation experts minimize the noise from technical artifacts, and the memory module effectively increases the minibatch size without extra computation. TEAMs outperforms the state-of-the-art by 5.5-11% on downstream tasks like identifying the true mechanism of action of cell treatments. Identifying cells that have no reaction to treatment and translating features from the training task to the test task could be directions for future work.

In the main paper we performed experiments with a ResNet-18 [10], but in Table 5 we report performance with an EfficientNet-B0 backbone [33] to match the backbone used by prior work. Since the difference in performance between the two backbones is negligible, we can conclude that our performance gains cannot be attributed to a difference in the network backbone architecture.

To validate our choice of averaging over all experts in our Mixture-of-Experts (MoE) approach at inference, we provide two points of comparison:
1. Random Expert. When computing similarity between two cell images in a pair, we select a single expert at random.
2. Oracle Expert. We use the ground truth expert at test time, which is typically not known, but it can provide a comparison against the supposedly "optimal" choice.

Table 6 reports the performance of the different methods of using our trained experts. We first note that both the oracle and the averaging approach used by our TEAMs model significantly outperform selecting an expert at random. Although the oracle expert has the highest accuracy for the image triplets considering only treatments, it performs significantly worse when separating from controls, making the averaging approach overall better when summing the two scores (i.e., TEAMs = 62.6 + 72.5 = 135.1). This suggests that aggregating across experts can minimize the bias due to technical variations that is prevalent when using just a single expert on our task. We note that a similar observation was made by Wang et al. [41] when creating image classification models that make fairer predictions across common subgroups in a dataset.

We perform 5-fold cross-validation on each pre-processed Cell Painting dataset (BBBC037, BBBC036, BBBC022). The results for each split can be found in Table 7 for BBBC037, Table 8 for BBBC036, and Table 9 for BBBC022. The results show that TEAMs not only has much higher accuracy scores but also reduces variance across splits for two of the three datasets, with only a minor increase in variance on the third dataset. This demonstrates that TEAMs provides more stability in the results across splits in addition to higher performance.
Table 6. Single cell results with different methods of selecting an expert. We note that the oracle expert overfits to the training data, improving performance on separating by treatment slightly, but significantly hurting performance on separating from controls. This suggests that by aggregating all the experts we are able to avoid some bias due to technical variation.

Table 9. Results on 5 splits of the BBBC022 dataset.

[1] Improving phenotypic measurements in high-content imaging screens
[2] Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes
[3] Data-analysis strategies for image-based cell profiling
[4] Weakly supervised learning of single-cell feature embeddings
[5] Image-based profiling for drug discovery: due for a machine-learning upgrade?
[6] Functional immune mapping with deep-learning enabled phenomics applied to immunomodulatory and COVID-19 drug discovery. bioRxiv
[7] Unsupervised domain adaptation by backpropagation
[8] Democratized image analytics by visual programming through integration of deep models and small-scale machine learning
[9] A multi-scale convolutional neural network for phenotyping high-content cellular images
[10] Deep residual learning for image recognition
[11] Self-challenging improves cross-domain generalization
[12] Phenotypic profiling of high throughput imaging screens with generic deep convolutional features
[13] Fully unsupervised deep mode of action learning for phenotyping high-content cellular images. bioRxiv
[14] MULE: Multimodal Universal Language Embedding
[15] Adam: A method for stochastic optimization
[16] Capturing single-cell phenotypic variation via unsupervised representation learning
[17] Fashion outfit complementary item retrieval
[18] Learning unsupervised feature representations for single cell microscopy images with paired cell inpainting
[19] Personalized outfit recommendation with learnable anchors
[20] UMAP: Uniform manifold approximation and projection
[21] CellProfiler 3.0: Next-generation image processing for biology
[22] Effectively leveraging attributes for visual similarity
[23] No fuss distance metric learning using proxies
[24] Automating morphological profiling with generic deep convolutional networks. bioRxiv
[25] Contrastive learning of single-cell phenotypic representations for treatment classification
[26] Give me a hint! Navigating image databases using human-in-the-loop feedback
[27] Conditional image-text embedding networks
[28] Image-based cell phenotyping with deep learning
[29] Capturing single-cell heterogeneity via data fusion improves image-based profiling
[30] Maximum classifier discrepancy for unsupervised domain adaptation
[31] FaceNet: A unified embedding for face recognition and clustering
[32] Pipeline for illumination correction of images for high-throughput microscopy
[33] Rethinking model scaling for convolutional neural networks
[34] Learning similarity conditions without explicit supervision
[35] ProxyNCA++: Revisiting and revitalizing proxy neighborhood component analysis
[36] Learning type-aware embeddings for fashion compatibility
[37] Conditional similarity networks
[38] Combining intensity, edge and shape information for 2D and 3D segmentation of cell nuclei in tissue sections
[39] Learning deep structure-preserving image-text embeddings
[40] Cross-batch memory for embedding learning
[41] Towards fairness in visual recognition: Effective strategies for bias mitigation
[42] A multi-phenotype system to discover therapies for age-related dysregulation of the immune response to viral infections. bioRxiv
[43] Sampling matters in deep embedding learning