title: Image Translation using Texture Co-occurrence and Spatial Self-Similarity for Texture Debiasing
authors: Kang, Myeongkyun; Won, Dongkyu; Luna, Miguel; Chikontwe, Philip; Hong, Kyung Soo; Ahn, June Hong; Park, Sang Hyun
date: 2021-10-15

Abstract: Classification models trained on datasets with texture bias usually perform poorly on out-of-distribution samples, since biased representations are embedded into the model. Recently, various debiasing methods have attempted to disentangle biased representations, but discarding texture-biased features without altering other relevant information remains a challenging task. In this paper, we propose a novel texture debiasing approach that generates additional training images using the content of a source image and the texture of a target image with a different semantic label, explicitly mitigating texture bias when training a classifier. Our model ensures texture similarity between the target and generated images via a texture co-occurrence loss, while preserving the content details of the source image with a spatial self-similarity loss. Both the generated and original training images are combined to train an improved classifier that is robust against inconsistent texture bias representations. We employ five datasets with known texture biases to demonstrate the ability of our method to mitigate texture bias. In all cases, our method outperformed existing state-of-the-art methods.

Texture biases can be easily and unintentionally introduced during the data collection process. The source and properties of such biases are usually unknown, and they can lead to a significant decrease in performance when a model trained on biased data is applied to out-of-distribution data [3, 4]. In addition, Geirhos et al. [10] showed that common convolutional neural networks (CNNs) prioritize texture information over content features (e.g., shape), making them especially vulnerable to texture-biased images. Transferring texture features between images of different classes therefore allows a classifier to adjust its internal learned representation to be less dependent on biased information. As shown in Fig. 1(a), if a binary classifier is trained with images that have a distinct texture for each class (e.g., fives are colored and sixes are grayscale), the model will very likely treat the texture representation, rather than the actual shape, as the differentiator between the two classes. In this case, the model will perform poorly on test data that does not exhibit the textures observed in the training data.

Bias mitigation, or debiasing, has previously been addressed by several methods that extract bias-independent features through adversarial learning, enabling models to solve the intended classification task without relying on biased features (e.g., color information) [21, 45]. However, texture-biased representations are often retained after training, since it is difficult to completely disentangle biased features through adversarial learning. In addition, prior knowledge about the types of biases in the training data is essential for adversarial learning, but such knowledge is often unavailable and difficult to obtain, especially when the data has been collected from multiple sources [3, 4, 12, 28]. In this paper, we propose to debias a dataset by generating additional images from the existing training images.
Specifically, we combine the structural information of a source image with the texture of another labeled image to extend the dataset and mitigate bias. The extended dataset thus helps the classifier learn class-related features instead of relying on biased features to separate the data. As shown in Fig. 1(b), a debiased classifier can be trained without bias labels by combining the original data with generated images that carry the textures of images with different labels.

Modern generative adversarial networks (GANs) [51, 29, 15, 24, 22, 34] have shown remarkable success in transferring textures (or styles) between pairs of images, but they also tend to alter content information, since no explicit supervision is provided during training. For instance, in Fig. 1, the ideal outcome would be for the colored digit five to be translated to grayscale, yet it is instead translated into a six that matches the grayscale style. In general, transferring only texture information without content discrepancy is very challenging, and poor results will adversely affect the classifier. To address this issue, we propose a novel GAN that simultaneously optimizes spatial self-similarity between the source and generated images and texture similarity between the generated and target images, where the source and target images have different labels. The proposed GAN consists of an image generator with content and texture encoders, two discriminators that constrain texture similarity in both local and global views, and a pre-trained VGG [40] network to enforce spatial self-similarity. High-quality images with the intended properties are generated by using a spatial self-similarity loss to ensure content consistency, and a texture co-occurrence loss and a GAN loss to enforce textures similar to the target image at local and global scales. Once generated images have been obtained from the training data, a debiased classifier can be learned using all the available data. Our main contributions are listed below:

- We introduce a method that learns a debiased classifier by extending the training dataset with additional images that incorporate textures from other classes while retaining the structural information of their own class. Our method does not require any bias labeling and can effectively mitigate unknown texture biases during training.
- We propose a novel image generation method employing texture co-occurrence and spatial self-similarity losses. Our results demonstrate that using both losses jointly generates high-quality images that successfully combine structure and texture from separate images, and that optimizing both losses produces images that are effective for debiasing.
- We show that our method outperforms existing debiasing methods and produces higher-quality images than prior generative models on five distinct texture-biased datasets.

Texture-Bias in CNN: Geirhos et al. [10] argued that texture bias mitigation is necessary to ensure the reliability of a classifier, since common CNNs prioritize texture information over content (shape). Especially in the task of domain generalization [49], texture bias has gained significant interest since changes in image texture are a main cause of domain shift [44, 32, 50, 43, 25, 37]. Nuriel et al. [33] and Zhou et al.
[50] applied feature-level adaptive instance normalization (AdaIN) [14] by shuffling (or swapping) the feature statistics of training samples across source domains to improve the generalizability of the trained model. Similarly, Nam et al. [32] introduced content- and style-biased networks that randomize the content and style (texture) features between two different samples via AdaIN. To obtain features robust to style bias, they also leveraged adversarial learning to prevent the feature extractor from retaining style-biased representations. In general, prior methods focus only on generalization to inaccessible-domain samples and thus design models to learn common object features from multiple source domains. However, if the samples are biased in alignment with the labels, e.g., colored digit fives and grayscale digit sixes as in Fig. 1, existing methods may fail to learn common object features and cannot effectively ignore the texture biases present in the training dataset. In this paper, we address this by training a classifier to focus only on the intended task without using the bias information present in the training data.

Image Generation and Style Transfer Models: Zhu et al. [51] and Liu et al. [29] proposed deep generative models using a pair of generators and discriminators that translate one domain into another using cycle-consistency and a shared-latent space assumption, respectively. Huang et al. [15] proposed a multi-domain translation model that varies the style code with fixed content for diverse style image generation. For feature disentanglement, Lee et al. [24] proposed content and attribute encoders with a cross-cycle consistency loss to enforce consistency between domains. Furthermore, Kolkin et al. [22] proposed an optimization-based style transfer method using the concept of self-similarity. Park et al. [34] proposed a conditional image translation method with a patch-wise contrastive loss that maximizes the mutual information between corresponding patches, whereas Zheng et al. [48] proposed a spatially-correlative loss for consistent image translation that preserves scene structures. However, since most GAN-based methods focus on transferring image style from one domain to another rather than on maintaining image content, spatial discrepancies or texture corruption are observable in the generated images. Unlike prior generation models, our model only updates textures while preserving content by minimizing a spatial self-similarity loss [22].

Debiasing and Fairness Aware Models: To mitigate bias, Alvi et al. [2] employed a bias prediction layer to make latent features indistinguishable with respect to the bias using a confusion loss [42]. Kim et al. [21] used a gradient reversal layer [7] that minimizes the mutual information between the learned features and bias predictions to constrain the use of bias-related features for classification. In addition, a few related works [39, 5] incorporated fairness terms into GANs. However, these methods were proposed to generate a small fraction (e.g., 10%) of protected-attribute samples for a model trained on a biased dataset. Moreover, since these methods cannot generate images with a specific texture, we instead employ an image translation technique that can explicitly generate an image with the texture of a target image. On the subject of fairness-aware training, Li et al. [27] formulated bias minimization in terms of data re-sampling to balance the preference for specific representations during classification. On the other hand, Wang et al.
[45] employed an adversarial learning approach [7] to remove correspondence with protected attributes (e.g., gender) from the intermediate features of the model. Moreover, Louppe et al. [30] proposed an adversarial network to enforce a pivotal property (e.g., fairness) on a predictive model, and Zhang et al. [46] proposed three terms, i.e., primary, adversary, and projection, to improve the stability of debiased training. However, these methods require bias labels for training, and stability in adversarial learning is often hard to achieve. In contrast, our method solves the bias problem by explicitly utilizing texture-translated images and only requires images of different domains. Thus, our approach does not require laborious bias labeling and is free from intractable adversarial classifier training.

Fig. 2: Our proposed image generation model. The model consists of a content encoder E_c, a texture encoder E_t, a StyleGAN2 generator G, an ImageNet pre-trained VGG model, and two discriminators D and D_patch. The encoders extract the content feature c_i from an image X_i and the texture feature t_j from an image X_j with a different label, respectively. G generates an image X'_i from c_i and t_j under a spatial self-similarity loss L_spatial, a texture co-occurrence loss L_texture, and a GAN loss L_GAN.

Given an image dataset D = {X_1, ..., X_n} with binary classification labels y_i ∈ {0, 1} and a bias property b_i ∈ {0, 1} aligned with the labels, i.e., y_i = b_i, our main goal is to build an augmented dataset D' = {X'_1, ..., X'_n} that reduces the importance of b_i during classification. In this work, we consider texture as the dominant property that creates bias during classification and prevents the model from using the shape information of the object of interest. Therefore, we propose a generative data augmentation framework that updates an input image X_i with the texture of a randomly selected image X_j with a different label, i.e., y_i ≠ y_j, while retaining the content information of X_i.

Our framework consists of three main components, as shown in Fig. 2. First, X_i and X_j are encoded by a content encoder E_c and a texture encoder E_t, respectively. Then, the encoded features are used to generate an augmented image X'_i via G, and a combination of two discriminators and an ImageNet pre-trained VGG model [40] ensures that the generated image X'_i has the texture of X_j while retaining the content information of X_i. Modifying texture information while maintaining content requires additional terms beyond the standard adversarial objective. Thus, we add a texture co-occurrence loss to enable correct texture transfer and a spatial self-similarity loss to ensure the original content is unchanged. The generator G and discriminator D follow the architecture proposed in StyleGAN2 [20] and are used to compute the adversarial loss between X'_i and X_j. The texture co-occurrence loss between X'_i and X_j (with y_i ≠ y_j) is computed by the patch discriminator D_patch, and the spatial self-similarity loss between X_i and X'_i is computed with the VGG model. Finally, the original dataset D is combined with its augmented version D' to train a classifier that is robust to inconsistent texture bias representations and can avoid the texture bias present in the training data.
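To make the overall procedure concrete, the following is a minimal PyTorch-style sketch of this augment-then-train pipeline. It is an illustration under simplifying assumptions rather than the authors' released code: the generator interface G(x_content, x_texture), the in-memory tensors, and the optimizer settings are hypothetical.

```python
import random
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset


@torch.no_grad()
def build_augmented_dataset(G, images, labels):
    """Create D' by re-texturing every image with a sample of a different label.

    images: float tensor (N, C, H, W); labels: long tensor (N,).
    G(x_content, x_texture) is the trained generator (hypothetical interface).
    """
    aug_images = []
    for idx in range(len(images)):
        y = labels[idx].item()
        # pick a random texture source whose class label differs
        candidates = (labels != y).nonzero(as_tuple=True)[0]
        j = candidates[random.randrange(len(candidates))].item()
        aug_images.append(G(images[idx:idx + 1], images[j:j + 1]).squeeze(0))
    # a generated image keeps the label of its content (source) image
    return TensorDataset(torch.stack(aug_images), labels.clone())


def train_debiased_classifier(classifier, G, images, labels, epochs=10):
    """Train the classifier on the union of the original and augmented data."""
    combined = ConcatDataset([TensorDataset(images, labels),
                              build_augmented_dataset(G, images, labels)])
    loader = DataLoader(combined, batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            criterion(classifier(x), y).backward()
            optimizer.step()
    return classifier
```

In practice, the generator is first trained with the losses described next, and D' is produced once from the training data before classifier training.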
Generating a realistic image using the content information of X_i and the texture of X_j requires the extraction of specific types of features. To this end, we use two different encoders E_c and E_t to extract a content-encoded tensor c_i and a texture-encoded vector t_j, respectively. Then, G generates a texture-transferred image X'_i by taking c_i and t_j as the constant input and the style vector, following Karras et al. [20], and discriminator D enforces a non-saturating adversarial loss [11] for generative training. The adversarial loss is defined as:

L_GAN = E_{X_j}[log D(X_j)] + E_{X_i, X_j}[log(1 − D(G(c_i, t_j)))].

However, this setting often fails to retain content information, since the discriminator heavily pushes the generator toward images that reproduce the entangled content features of the target domain. Thus, we add additional constraints to ensure that generator G retains the content information in c_i and uses the texture information in t_j. These constraints take the form of two additional modules and loss terms, i.e., a texture co-occurrence loss and a spatial self-similarity loss.

To encourage the transfer of the texture information t_j from X_j to X'_i, i.e., G(c_i, t_j), we employ a texture co-occurrence loss with a patch discriminator D_patch [35] that measures the texture difference between X'_i and X_j. The texture co-occurrence loss and patch discriminator were initially proposed for image editing [35] to disentangle texture information from structure. D_patch encourages the joint feature statistics of patches to appear perceptually similar [16, 17, 8, 35]; this is achieved by cropping multiple random patches sized between 1/8 and 1/4 of the full image and feeding them to D_patch. In particular, we average the features of the X_j patches, concatenate them with the features of the X'_i patches, and feed the result to the last layers of D_patch to calculate the discriminator loss. Consequently, G is enforced to satisfy the joint statistics of low-level features for consistent texture transfer. Formally, following [35],

L_texture = E_{X_i, X_j}[−log D_patch(crops(X'_i), crops(X_j))],

where crops(·) denotes a set of random patches.

To retain the content information of the source image, we employ spatial self-similarity as a domain-invariant content constraint. The self-similarity loss has been used to maintain the structure of content images in artistic style transfer [22]. Formally, we consider a spatial self-similarity map

S_{X_i} = f_{X_i}^T · f_{X_i},

where f_{X_i} ∈ R^{C×HW} denotes the spatially flattened features extracted from VGG, with channel C, height H, and width W. By applying the dot product, S_{X_i} ∈ R^{HW×HW} captures the spatial correlation from each location (a vector in R^C) to all others across the feature maps. Thus, our domain-invariant content constraint (spatial self-similarity loss) between X'_i and X_i is calculated as

L_spatial = 1 − cos(S_{X_i}, S_{X'_i}),

where cos denotes the cosine similarity. In conventional generative models, reconstruction losses (e.g., L1, MSE) or perceptual losses are used to constrain content discrepancy using entangled (content and texture) features (Fig. 3). Since spatial self-similarity decouples content information from texture information, we can explicitly control the content features in G and thus successfully preserve the content of X_i while updating its texture.

The objective function of our framework is defined as:

L = λ_g L_GAN + λ_t L_texture + λ_s L_spatial,

where the hyper-parameters λ_g, λ_t, and λ_s balance the GAN, texture, and spatial loss terms, respectively. We set λ_g = 0.1, λ_t = 1.0, and λ_s = 100. The objective and network design of the discriminators D and D_patch closely follow StyleGAN2 [20], and 8 patches are used in the texture co-occurrence loss. E_c and E_t downsample their inputs 2× and 6×, respectively, to extract c_i and t_j (1× and 4× for small-resolution inputs, e.g., digits).
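As a companion to the loss definitions above, the snippet below sketches the random patch sampling used for the texture co-occurrence term and a cosine-based spatial self-similarity loss. It is a simplified illustration, not the paper's implementation: it assumes VGG features are already extracted and roughly square inputs, and the patch resizing and loss aggregation details are our assumptions.

```python
import random
import torch
import torch.nn.functional as F


def sample_patches(x, n_patches=8):
    """Random square crops sized between 1/8 and 1/4 of the image x: (B, C, H, W).

    Crops are resized to a common size so they can be batched for D_patch.
    """
    _, _, h, w = x.shape
    out_size = (h // 8, w // 8)
    patches = []
    for _ in range(n_patches):
        s = random.randint(h // 8, h // 4)
        top = random.randint(0, h - s)
        left = random.randint(0, w - s)
        crop = x[:, :, top:top + s, left:left + s]
        patches.append(F.interpolate(crop, size=out_size, mode='bilinear',
                                     align_corners=False))
    return torch.cat(patches, dim=0)


def self_similarity(feat):
    """(B, C, H, W) VGG features -> (B, HW, HW) spatial self-similarity map."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)            # flatten spatial dimensions
    return torch.bmm(f.transpose(1, 2), f)   # f^T . f


def spatial_self_similarity_loss(feat_src, feat_gen):
    """1 - cosine similarity between the rows of the two self-similarity maps."""
    s_src = self_similarity(feat_src)
    s_gen = self_similarity(feat_gen)
    cos = F.cosine_similarity(s_src, s_gen, dim=-1)  # one value per location
    return (1.0 - cos).mean()
```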
For the texture code t_j, we explicitly discard spatial information by applying global average pooling. To reduce computational cost, we select 256 random rows from f_{X_i}^T to obtain f̂_{X_i}^T, thus reducing the size of the self-similarity map, i.e., Ŝ_{X_i} = f̂_{X_i}^T · f_{X_i}. Herein, Ŝ_{X_i} ∈ R^{256×HW} is used to calculate L_spatial instead of S_{X_i} ∈ R^{HW×HW} (generally 256 < HW). To construct D', we randomly select texture sources from images with a different label.

To debias a multi-domain biased dataset, the number of models required for texture updates grows factorially with the number of domains in the set. In other words, it is challenging to use the proposed method on datasets with a large number of labels (domains). Hence, we introduce a conditional version of the proposed method, constructed with 2D embedding layers that simply change the statistics of an intermediate feature map F of the CNNs. We place the 2D embedding layers at the second and second-to-last layers of E_c, E_t, and D. The domain label is fed into a 2D embedding layer, which returns 1D vectors (a weight e_w and a bias e_b matching the dimensions of F) used to modify F. The updated feature F' is calculated as F × e_w + e_b and is fed to the next layer instead of F. Consequently, the model is provided with a domain condition.
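The following is a minimal sketch of such a conditioning layer; the per-channel scale-and-shift reading of the 2D embedding, its initialization, and the usage shown are our assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn


class LabelConditioning(nn.Module):
    """Scales and shifts an intermediate feature map F with label-specific
    vectors: F' = F * e_w + e_b (one weight/bias pair per domain label)."""

    def __init__(self, num_labels, num_channels):
        super().__init__()
        # "2D embedding": one row per label, one weight and one bias vector each
        self.weight = nn.Embedding(num_labels, num_channels)
        self.bias = nn.Embedding(num_labels, num_channels)
        nn.init.ones_(self.weight.weight)   # start as an identity transform
        nn.init.zeros_(self.bias.weight)

    def forward(self, feat, label):
        # feat: (B, C, H, W), label: (B,) long tensor of domain labels
        e_w = self.weight(label).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        e_b = self.bias(label).unsqueeze(-1).unsqueeze(-1)
        return feat * e_w + e_b


# Usage: one such module would sit after, e.g., the second layer of E_c, E_t, or D.
cond = LabelConditioning(num_labels=4, num_channels=64)
feat = torch.randn(2, 64, 32, 32)
label = torch.tensor([0, 3])
out = cond(feat, label)   # same shape as feat, now label-conditioned
```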
Existing datasets often have similar distributions between training and testing data [3, 4], making them unsuitable for direct bias mitigation analysis. Thus, we constructed five texture-biased datasets following previous works [2, 21], where the training and testing sets have opposite distributions, allowing us to focus our evaluation solely on texture bias mitigation (Fig. 4). Note that in a balanced setting a model will perform well on samples from the same distribution but may err on the remainder; our evaluation therefore uses the inverse distribution to quantify performance in the most extreme case. In addition, to avoid indirectly learning about the existing bias during training, the validation set has the same distribution as the training set, so high accuracy on the validation set does not reveal any robustness to bias.

Five vs. Six: This set is based on the MNIST [23] and MNIST-M [7] datasets, where only the digits five and six are used to construct the training split. Images of a five are taken solely from MNIST-M, whereas those of a six are from MNIST; the opposite assignment is used for the testing split.

Dogs vs. Cats: This dataset was originally proposed by Kim et al. [21] by aligning the hair color differences of the animals. Images of bright dogs and dark cats are used in the training split, whereas dark dogs and bright cats are employed for testing. Consequently, the dataset exhibits hair color bias.

COVID-19 vs. Bacterial pneumonia: Pneumonia caused by different pathogens requires specific treatments. However, images obtained by computed tomography (CT) might share similar properties, making accurate diagnosis challenging. One source of ambiguity for automatic diagnostic systems is the choice of CT protocol during image acquisition, i.e., using a contrast agent vs. another protocol. To evaluate our method on a real-world texture bias, CT scans were collected at an anonymized medical center. In this dataset, we selected non-contrast COVID-19 scans for training and contrast scans for testing; the opposite applies to the bacterial pneumonia scans.

Multi-class Biased Datasets: To evaluate our method on multi-class texture-biased data, we constructed the Digit and Biased PACS datasets. In the Digit dataset, samples with labels zero to four are taken from MNIST [23] and those with labels five to nine from MNIST-M [7]. The Biased PACS dataset was constructed from the PACS dataset [25], which consists of four domains (Photo, Art, Cartoon, Sketch) and seven classes. We selected the top four classes (Dog, Elephant, Giraffe, Horse) by number of images. Each class is taken from a different domain (e.g., Dog - Cartoon, Elephant - Sketch) for the training dataset, and the remaining domain images are used for the test set. Additionally, we constructed an Inverse Biased PACS dataset that uses the Biased PACS dataset inversely, i.e., the test split samples become the training samples and the train/val split samples become the testing samples. Compared to the binary-class biased datasets, this dataset has access to samples from three domains during training and is evaluated on samples from a single domain that does not match the training data.

For our baseline, we train an ImageNet pre-trained ResNet50 classifier using only the biased training data without any debiasing method. For COVID-19 vs. Bacterial pneumonia, we aggregated slice predictions via majority voting to obtain patient-level diagnoses. We evaluated accuracy using the macro F1-score, which treats all labels equally. For consistency, training was repeated three times, and we report each method's average performance with its standard deviation as the final result.

We compare our method against five conventional debiasing methods [30, 2, 46, 45, 21], two domain generalization methods [33, 50], and five generative models [51, 29, 15, 24, 34]. For the non-generative debiasing methods, bias information (e.g., data source, color, and CT protocol) was used to train the classifier. For the generative models, we constructed augmented datasets using each generative model and employed them to train classifiers. For a fair evaluation, the classifiers of all compared methods were trained with the same backbone (ResNet50) under the same training settings. In the multi-domain scenario, we did not run experiments for the methods [30, 46] that address only binary classification, nor for the generation methods [51, 29, 15, 24, 34] that would have to learn models for all domain labels.

Table 1: Classification performance of non-generative debiasing methods (first sub-row), domain generalization methods (second sub-row), image generative models (third sub-row), and the proposed model (fourth sub-row) on the five datasets. The F1-score with ± std is used as the metric. The second and third sub-columns report the binary- and multi-class biased dataset classification performance, respectively.

Five vs. Six: In this task, none of the compared models yielded satisfactory results (see Table 1(a)). Non-generative models report extremely low scores, leading us to conclude that texture features were still used for classification instead of digit shape. Domain generalization methods also show poor F1-scores. We observed that the generative models mainly transferred texture and shape jointly (see Fig. 5); consequently, the resulting augmented dataset contained several instances with the shape of a different category. As the classifier then has no clear cue to distinguish the digits, this led to poor performance. Results on this dataset clearly show the benefit of our method, which transfers texture features while retaining the underlying shape information needed for successful bias mitigation.
Dogs vs. Cats: Classifying real-world animals requires more complex features than color alone. In Table 1(b), we observe an increase in performance when a debiasing or domain generalization method is applied. Overall, the generative models outperform the non-generative counterparts by roughly 10% F1-score. While the performance of the generative models was reasonable, several limitations were noted. In the case of CycleGAN, CUT, and DRIT++, unsatisfactory images were obtained, i.e., texture was not entirely translated and distortions were present. Moreover, UNIT and MUNIT generate images with only small texture updates, so bias remains in the generated images (see Fig. 5). In contrast, our model shows improved performance together with high-quality image generation. We believe this is mainly due to the proposed texture translation strategy: leveraging both the texture co-occurrence and spatial self-similarity losses enables our model to generate consistent and natural images, leading to improved classification performance.

COVID-19 vs. Bacterial Pneumonia: In contrast to the Dogs vs. Cats results, the non-generative models report improved performance over the rest (Table 1(c)). Domain generalization methods obtain lower F1-scores than the non-generative approaches. The non-generative models mitigate distinctly recognizable biases in the image, such as color, better than the generative models, since they do not impose any structural changes on the image. Generative models, on the other hand, tend to modify patterns in regions of diagnostic interest, thereby losing the properties that identify the disease and leading to lower performance (see Fig. 5). For example, CycleGAN tends to erase (rows 10 and 12) or create (row 9) lesions in the CT scans. UNIT and MUNIT show only minor texture updates and are hence insufficient to mitigate bias in the classifier during training (all rows). DRIT++ and CUT create artifacts such as non-existent lesions and checkerboard patterns and change the images' properties (rows 9, 10, and 12), resulting in the lowest classification accuracy among the generative models. Meanwhile, our method jointly optimizes the texture co-occurrence and spatial self-similarity losses, which impose texture and structural constraints, respectively, on image generation. Hence, our method can successfully update texture without introducing artifacts into the original CT scan. Note that while contrast CT is a standard protocol for diagnosing common lung diseases, curating contrast CT is challenging since extra procedures such as contrast agent injection and disinfection are required [38, 19], especially during the pandemic. Ultimately, we believe texture biases will be unexpectedly, or sometimes unavoidably, introduced during data collection; thus, a debiasing method that can maintain the key structures in the images is vital in the medical domain.

Multi-class Biased Datasets: On the multi-class Digit dataset, none of the non-generative models or domain generalization methods reports satisfactory performance (see Table 1(d)). In contrast, our method achieved the best score among all compared methods (68.82%). On the Biased PACS dataset, all comparison methods report F1-scores around 10%, while our method improves significantly to 36.68% (Table 1(e)). The images translated from Photo to the other domains are satisfactory, as shown in Fig. 6(a), whereas the translations from Sketch to Photo (and Art) still differ from realistic Photo (and Art) images, as shown in Fig. 6(b).
We found that this is due to the small scale of the Biased PACS dataset (only a few hundred images). Despite this, our method achieves reasonable generation quality, retains content, and accurately transfers textures for debiasing. On the Inverse Biased PACS dataset, even though the non-generative models and domain generalization methods utilize samples from multiple domains for training, they perform poorly on out-of-distribution samples, as shown in Table 1.

Ablation Study: We performed ablation studies to evaluate the impact of the content loss L_spatial and the texture loss L_texture by either removing one of them or replacing it with a different loss function. It is essential to validate that texture co-occurrence and self-similarity are the key components for improving image generation quality in biased settings. Thus, we replaced L_texture and L_spatial with style [26] and perceptual [9] losses, respectively, as these are the most common techniques used in style transfer methods for matching texture and retaining content. The adversarial loss L_GAN, on the other hand, is an essential part of our image generation pipeline and produced unrealistic results when omitted, so we keep it in all ablation settings.

Results on the Five vs. Six dataset in Table 2(a) show that the texture co-occurrence loss is key to enabling the classifier to correctly differentiate digits. The qualitative results in Fig. 7 indicate that the absence of L_texture leads to a failure to change the texture and thus to correctly mitigate the bias. Likewise, replacing L_texture with a style loss overrides the content constraints and modifies both texture and shape. In addition, while the absence of L_spatial lets the classifier achieve higher performance than most methods, the qualitative results show that even though the texture is correct, the shape resembles neither a five nor a six, which can be exploited by the binary classifier and lead to incorrect predictions. Moreover, replacing L_spatial, or both L_spatial and L_texture, results in poor image quality with ambiguous shapes. Consequently, the combination of the texture co-occurrence and self-similarity losses leads to higher image quality and better mitigation of the underlying texture bias.

For Dogs vs. Cats, the model is less susceptible to a drop in performance when a loss term is changed and usually generates good-quality images, as reflected by the good classification results. The only exception is the absence of L_spatial, which reports a lower score (Table 2(b)); Fig. 7 shows that the shape is lost in a similar fashion to the Five vs. Six results. In contrast, the COVID-19 vs. Bacterial pneumonia results show a high dependency on the loss functions, with decreased performance across all ablation settings. We believe this is because the explicit content and texture constraints are important for reducing the risks of unintended distortion and insufficient texture translation, respectively.

In summary, the ablation studies show that our loss design choices play an essential role in alleviating the texture bias inherited from the training data, and that our image generation strategy can mitigate texture bias across multiple datasets with high performance. Since the replacement experiments share the same settings as previous style transfer methods [9, 26], these ablations also serve as a comparison against style transfer approaches.
Our method not only achieved higher quantitative performance than these style transfer competitors, but also showed better visual quality (Fig. 7).

Discussion: We analyzed the ability of image translation models to mitigate biases on multiple datasets. Our choice to generate additional training images by transferring texture between images of different classes proved to be an effective way to train robust classifiers. Moreover, the ablation studies support the choice of the spatial and texture losses to correctly translate the texture of images without modifying their content, leading to high debiasing performance. This is a vital requirement for most medical and industrial domains, which have strict structural constraints on their images and cannot simply rely on the web or social media to collect large numbers of samples. In such settings, data collection is often limited and biased toward local conditions (e.g., acquisition equipment or protocols), where the labels may be aligned with features unrelated to the classes' intrinsic properties. Notably, we empirically demonstrated the effectiveness of our method in a real-world scenario by mitigating biases induced by acquisition protocols in the COVID-19 vs. Bacterial pneumonia experiments.

In this work, we have proposed a novel strategy to augment an initially texture-biased dataset with new instances that mitigate the biased properties using a generative model. We have demonstrated that a dataset extended by our method can be used to train a debiased classifier, and that transferring texture while retaining content information is a valid strategy for addressing data bias in multiple types of datasets. In particular, our model was able to handle simple biases such as color, as well as more complex and realistic texture biases, e.g., those induced by different CT scanning protocols. Our results show that using both the texture co-occurrence and spatial self-similarity losses to constrain our generative model is key, as the two losses are complementary. Finally, we report high accuracy with significant margins across all evaluated datasets and generate higher-quality images than prior state-of-the-art methods.

Image Translation using Texture Co-occurrence and Spatial Self-Similarity for Texture Debiasing - Supplementary Material

We show additional results on three datasets, comparing our method with the baseline generative methods CycleGAN [51], UNIT [29], MUNIT [15], DRIT++ [24], and CUT [34]. Figures S.1, S.2, and S.3 show the images generated for each pair of images (one pair per two rows). Since our proposed method creates an image from the content of the source image and the texture information of the paired target image, only the texture is transformed to match the target while the content of the source image is maintained. In contrast, since the other methods translate an image from one domain into the textures of the other domain as a whole, the image is transformed according to various texture characteristics of that domain, and texture conversion between specific image pairs cannot be performed.

We also show additional results on the multi-class biased datasets. Figures S.4 and S.5 show images generated by our proposed method. The bias information and dataset statistics of the Five vs. Six, Dogs vs. Cats, and COVID-19 vs.
Bacterial pneumonia datasets are summarized in the supplementary tables.

References:
[1] Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach
[2] Turning a blind eye: Explicit removal of biases and variation from deep neural network embeddings
[3] Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models
[4] Debiasing skin lesion datasets and models? Not so fast
[5] Fair generative modeling via weak supervision
[6] Deformable convolutional networks
[7] Unsupervised domain adaptation by backpropagation
[8] Texture synthesis using convolutional neural networks
[9] Image style transfer using convolutional neural networks
[10] Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness
[11] Generative adversarial nets
[12] Making the V in VQA matter: Elevating the role of image understanding in visual question answering
[13] Deep residual learning for image recognition
[14] Arbitrary style transfer in real-time with adaptive instance normalization
[15] Multimodal unsupervised image-to-image translation
[16] Visual pattern discrimination
[17] Textons, the elements of texture perception, and their interactions
[18] COVID-19 CT lung and infection segmentation dataset
[19] Chest CT practice and protocols for COVID-19 from radiation dose management perspective
[20] Analyzing and improving the image quality of StyleGAN
[21] Learning not to learn: Training deep neural networks with biased data
[22] Style transfer by relaxed optimal transport and self-similarity
[23] MNIST handwritten digit database
[24] DRIT++: Diverse image-to-image translation via disentangled representations
[25] Deeper, broader and artier domain generalization
[26] Demystifying neural style transfer
[27] REPAIR: Removing representation bias by dataset resampling
[28] RESOUND: Towards action recognition without representation bias
[29] Unsupervised image-to-image translation networks
[30] Learning to pivot with adversarial networks
[31] MosMedData: Chest CT scans with COVID-19 related findings dataset
[32] Reducing domain gap by reducing style bias
[33] Permuted AdaIN: Reducing the bias towards global statistics in image classification
[34] Contrastive learning for unpaired image-to-image translation
[35] Swapping autoencoder for deep image manipulation
[36] PyTorch: An imperative style, high-performance deep learning library
[37] Moment matching for multi-source domain adaptation
[38] Role of computed tomography in COVID-19
[39] Fairness GAN: Generating datasets with fairness properties using a generative adversarial network
[40] Very deep convolutional networks for large-scale image recognition
[41] A large annotated medical image dataset for the development and evaluation of segmentation algorithms
[42] Simultaneous deep transfer across domains and tasks
[43] Deep hashing network for unsupervised domain adaptation
[44] Learning robust representations by projecting superficial statistics out
[45] Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations
[46] Mitigating unwanted biases with adversarial learning
[47] ResNeSt: Split-attention networks
[48] The spatially-correlative loss for various image translation tasks
[49] Domain generalization: A survey
[50] Domain generalization with MixStyle
[51] Unpaired image-to-image translation using cycle-consistent adversarial networks

For the domain generalization models, the best hyper-parameters reported in their papers were used for training (e.g., the alpha in [50] and the probability in [33]); other settings were left the same as for the baseline classifier.
For the generative models, all experiments were trained for approximately 200 epochs with a batch size of 8; however, UNIT and MUNIT used a batch size of 4 because of hardware (CUDA 9) compatibility. In the classification step, we applied random affine transformations (rotation, translation, and scale) and horizontal flips to the Dogs vs. Cats, COVID-19 vs. Bacterial pneumonia, Biased PACS, and Inverse Biased PACS datasets.

To integrate the non-generative debiasing models, we prioritized the authors' implementations; when an authors' implementation was not available, we employed publicly available code as an alternative. For Learning-to-pivot [30] and Adv debias [46], we used publicly available code. For BlindEye [2], we used the authors' implementation of the confusion loss. For Learning-not-to-learn [21] and Not Enough [45], we used the authors' implementations. For the domain generalization methods, Permuted AdaIN [33] and MixStyle [50], the authors' implementations are available and were used. Likewise, for the generative models the authors' implementations are available, and we used them for CycleGAN [51], UNIT [29], MUNIT [15], DRIT++ [24], and CUT [34].
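For reference, the classification-stage augmentation described above can be written with torchvision as follows; the exact parameter ranges are not given in the text, so the values below are illustrative assumptions.

```python
from torchvision import transforms

# Random affine (rotation, translation, scale) plus horizontal flip, as applied to
# Dogs vs. Cats, COVID-19 vs. Bacterial pneumonia, and the PACS variants.
# The specific ranges here are illustrative assumptions, not the paper's values.
train_transform = transforms.Compose([
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])
```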