key: cord-0500559-v92ufckc
authors: Shimovolos, Stanislav; Shushko, Andrey; Belyaev, Mikhail; Shirokikh, Boris
title: Adaptation to CT Reconstruction Kernels by Enforcing Cross-domain Feature Maps Consistency
date: 2022-03-28
journal: nan
DOI: nan
sha: 2d869c8ef4ad45fbe8c2a60967fe686d6ce52bce
doc_id: 500559
cord_uid: v92ufckc

Deep learning methods provide significant assistance in analyzing coronavirus disease (COVID-19) in chest computed tomography (CT) images, including identification, severity assessment, and segmentation. Although the earlier developed methods address the lack of data and specific annotations, the current goal is to build a robust algorithm for clinical use, having a larger pool of available data. With larger datasets, the domain shift problem arises, affecting the performance of methods on unseen data. One of the critical sources of domain shift in CT images is the difference in reconstruction kernels used to generate images from the raw data (sinograms). In this paper, we show a decrease in the COVID-19 segmentation quality of a model trained on smooth and tested on sharp reconstruction kernels. Furthermore, we compare several domain adaptation approaches to tackle the problem, such as task-specific augmentation and unsupervised adversarial learning. Finally, we propose an unsupervised adaptation method, called F-Consistency, that outperforms the previous approaches. Our method exploits a set of unlabeled CT image pairs which differ only in reconstruction kernels within every pair. It enforces the similarity of the network's hidden representations (feature maps) by minimizing the mean squared error (MSE) between paired feature maps. We show our method achieving a 0.64 Dice Score on the test dataset with unseen sharp kernels, compared to the 0.56 Dice Score of the baseline model. Moreover, F-Consistency scores a 0.80 Dice Score between predictions on the paired images, which almost doubles the baseline score of 0.46 and surpasses the other methods. We also show F-Consistency to better generalize on unseen kernels and without the specific semantic content, e.g., the presence of COVID-19 lesions.

After the coronavirus disease (COVID-19) outbreak, a wide spectrum of automated algorithms has been developed to provide assistance in the clinical analysis of the virus (Shoeibi et al., 2020). Among others, we consider the analysis of chest computed tomography (CT) images. Firstly, CT imaging has de facto become one of the reliable clinical pretests for COVID-19 diagnosis (Rubin et al., 2020). Secondly, well-developed deep learning techniques for volumetric CT processing allow precise and efficient analysis of the different COVID-19 markers. The latter include identification (Song et al., 2021), prognosis (Meng et al., 2020), severity assessment (Lassau et al., 2021), and detection or segmentation of consolidation or ground-glass opacity. One of the easiest to interpret and most clinically useful markers is segmentation. Segmentation provides classification, severity estimation (Shan et al., 2020), or differentiation from other pathologies in a straightforward manner, by evaluating the output mask. However, training a segmentation model requires a huge voxel-wise annotation effort. The earlier developed models have faced the lack of publicly available data annotated with segmentation masks.
To achieve high segmentation quality, more sophisticated methods have been designed, e.g., solving a multi-task problem and merging datasets with different annotations (Goncharov et al., 2021). Now, a larger pool of COVID-19 segmentation datasets is available, e.g., (Tsai et al., 2021); therefore, the current goal is to build a robust algorithm for clinical use.

In merging a larger pool of data, the problem of domain shift arises. Domain shift is one of the most salient problems in medical computer vision (Choudhary et al., 2020). A model trained on data from one distribution might yield poor results on data from a different distribution. In CT imaging, one of the main sources of domain shift is the difference in reconstruction kernels, a parameter of the filtered back-projection (FBP) reconstruction algorithm (Schofield et al., 2020). One could perceptually compare the same image reconstructed with two different kernels in Fig. 3, e.g., B1 and B2. For the kernel-caused domain shift, several works have shown a deterioration of model quality in lung cancer segmentation and in emphysema segmentation.

In this paper, we show that the domain shift induced by the difference in reconstruction kernels decreases the quality of COVID-19 segmentation algorithms. To do so, we construct two domains from the publicly available data: the source domain with smooth reconstruction kernels and the target domain with sharp reconstruction kernels; see the detailed data description in Sec. 3. We train the segmentation model on the source domain and test it on the target domain. Having observed the decrease in test score, we then validate the most relevant domain adaptation methods. In our comparison, we include an augmentation approach (Saparov et al., 2021), unsupervised adversarial learning (Ganin and Lempitsky, 2015), and our proposed feature maps consistency regularization. We describe all methods in Sec. 2. Although the augmentation can work only with the source data, the adversarial and consistency-based methods require additional unlabeled data from the target domain. For this task, a large pool of unlabeled chest CT image pairs, which differ only in reconstruction kernels within every pair, is publicly available. The intuition here is that the adaptation methods should outperform the augmentation one when a broader range of real-world data is available.

Furthermore, we propose enforcing cross-domain feature maps consistency between paired images; we call our method F-Consistency. One could find its schematic representation in Fig. 1. F-Consistency minimizes the mean squared error (MSE) between the network's hidden representations (feature maps) of paired images. We expect that explicitly enforcing consistency on the paired images should outperform adversarial learning, which emulates similar behaviour by minimizing the adversarial loss. We also note that our method could be scaled to other tasks, such as classification, detection, or multi-task learning, without any restrictions.

Below, we discuss the works most relevant to our method and then summarize the contributions. We begin by discussing the task-specific augmentation approaches, since they are a straightforward solution to the domain shift problem. Contrary to the classical augmentation techniques for CT images, like windowing (Lee et al., 2018) or filtering (Ohkubo et al., 2011), the authors of (Saparov et al., 2021) proposed FBPAug, an augmentation that directly approximates our domain shift.
The authors also showed that FBPAug outperforms other augmentations. However, this method has several drawbacks, which we discuss in Sec. 2.2. We consider FBPAug one of the solutions and compare it with the other methods.

With the unlabeled target data, we can apply unsupervised domain adaptation methods to improve the model's performance on the target domain. The adversarial approaches are shown to outperform other methods (Ganin and Lempitsky, 2015). Considering the paired nature of our data, we divide the adversarial methods into two groups: (i) image-to-image translation and (ii) feature-level adaptation.

The first group of methods aims to translate an image from the source to the target domain. The authors of (Zhu et al., 2017) used CycleGAN for unpaired image-to-image translation. Further, this method was applied to both MRI and CT (Sandfort et al., 2019) image translation. The paired image-to-image translation is closer to our setup: such a method requires image pairs that have a different style but the same semantic content, and CT images that differ only in reconstruction kernel correspond to this setup. Here, prior works proposed a convolutional neural network (CNN) to translate images reconstructed with one kernel into another kernel. Despite the reasonable performance, image translation methods lack generalization ability: if the data consists of more than two domains, we need to train a separate model for every pair. They also do not address the cases with unseen domains. Therefore, we leave the image translation approaches without consideration.

The other group of methods is independent of the number of domains. Mostly, the feature-level adaptation methods are based on adversarial learning, as in (Ganin and Lempitsky, 2015). The latter approach also finds several successful applications in medical imaging, e.g., (Kamnitsas et al., 2017), (Dou et al., 2018). Since these methods are conceptually close to each other, we stick with implementing the domain-adversarial neural network, DANN, from (Ganin and Lempitsky, 2015).

The idea of F-Consistency is conceptually close to self-supervised learning, where the unlabeled data is used to pretrain a model using pseudo-labels. In (Taleb et al., 2020), the authors described different pretext tasks in medical imaging. Similarly to our approach, prior work enforces model consistency between the initial and augmented images at the prediction and feature levels. Furthermore, the authors of (Melas-Kyriazi and Manrai, 2021) extended the self-supervised methodology to solve a domain adaptation problem. However, the goal of the self-supervised methods is to use a large collection of images without annotations to improve the model's performance on the source domain. Contrary to self-supervised learning, our goal is to achieve the highest possible performance on the target domain.

Finally, we note that there is no standardized benchmark for the COVID-19 segmentation task (Roberts et al., 2021). We also solve an isolated problem of domain adaptation that could be extended to classification or detection tasks. Therefore, we leave the comparison with the other COVID-19 segmentation approaches out of scope.

Our work highlights a domain shift problem in the COVID-19 segmentation task and suggests an efficient solution to this problem. We summarize our three main contributions as follows:

• Firstly, we demonstrate that the difference in CT reconstruction kernels affects the segmentation quality of COVID-19 lesions.
The model without adaptation achieves only a 0.56 Dice Score on the unseen domain, while the best adaptation method scores 0.64. In terms of similarity between predictions on the paired images, the baseline Dice Score is 0.46, which is almost two times lower than the 0.80 achieved by our method.

• Secondly, we adopt a series of adaptation approaches to solve the highlighted problem and extensively compare their performance under different conditions.

• Thirdly, we propose a flexible adaptation approach that outperforms the other considered methods. We also show that our method generalizes better to unseen CT reconstruction kernels and is less sensitive to the semantic content (COVID-19 lesions) of the unlabeled data.

In this paper, we consider solving a binary segmentation task, where the positive class consists of the voxels of a volumetric chest CT image with consolidation or ground-glass opacity. All methods are built upon a convolutional neural network, which we detail in Sec. 2.1. We train these methods using the annotated dataset $S_s = \{(x_i, y_i)\}_{i=1}^{N_s}$, where $x$ is a volumetric CT image, $y$ is the corresponding binary mask, and $N_s$ is the total size of the training dataset. The dataset $S_s$ consists of images reconstructed with smooth kernels, and we call it the source dataset. We test all methods using the annotated dataset $S_t = \{(x_i, y_i)\}_{i=1}^{N_t}$. The testing dataset $S_t$ consists of $N_t$ images reconstructed with sharp kernels, and we call it the target dataset. Although $S_t$ contains annotations, we use them only to calculate the final score.

In Sec. 2.2, we describe the only adaptation method, FBPAug, that uses no data except the source dataset. The other methods use an additional paired dataset $S_2 = \{(x_i, \tilde{x}_i)\}_{i=1}^{N_2}$, which has no annotations. However, every image $x \in S_2$ has a paired image $\tilde{x}$ reconstructed from the same sinogram but with a different kernel. Here, we assume that $x$ belongs to the source domain and $\tilde{x}$ belongs to the target domain. Now, the problem can be formulated as unsupervised domain adaptation, and we detail the corresponding adversarial training approach in Sec. 2.3. We also propose to explicitly enforce the similarity between feature maps of paired images; see Sec. 2.4. In Sec. 2.5, we detail enforcing the similarity between predictions.

In all COVID-19 segmentation experiments, we use the same 2D U-Net architecture (Ronneberger et al., 2015) trained on axial slices. We do not use a 3D model for two reasons. Firstly, as we show in Sec. 3.1, the images have a large difference in the inter-slice distances (from 0.6 to 8.0 mm), which can affect the performance of a 3D model. Secondly, the authors of (Goncharov et al., 2021) have shown 2D and 3D models yielding similar results in the same setup with various inter-slice distances. Moreover, we note that all considered methods are independent of the architecture choice. We also introduce a standard architectural modification, replacing every convolution layer with a Residual Block (He et al., 2016). To train the segmentation model, we use the binary cross-entropy (BCE) loss. Other training details are given in Sec. 4.1.

The first adaptation method that we consider is a task-specific augmentation called FBPAug (Saparov et al., 2021). FBPAug emulates the CT reconstruction process with different kernels; thus, it might be a straightforward solution to the domain shift problem caused by the difference in kernels. However, FBPAug gives us only an approximate solution, which is also restricted by the choice of kernel parameterization. We describe FBPAug as a three-step procedure for a given image $x \in S_s$. Firstly, it applies a discrete Radon transform to the image. Secondly, it convolves the transformed image with a reconstruction kernel randomly sampled from the predefined parametric family of kernels on every iteration. Thirdly, it applies the back-projection operation to the result and outputs the augmented image $FBP(x)$. A complete description of the method could be found in (Saparov et al., 2021).
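To make the three steps concrete, here is a minimal sketch of such kernel-style augmentation built on scikit-image's Radon routines. It is an illustration only: the function name `fbp_augment` and its one-parameter sharpening kernel are our simplification, whereas the original FBPAug uses the two-parameter $(a, b)$ kernel family defined in (Saparov et al., 2021).

```python
import numpy as np
from skimage.transform import radon, iradon

def fbp_augment(image, strength, n_angles=180):
    """Sketch of kernel-style augmentation: Radon transform, convolution of the
    sinogram with a sharpening kernel, and filtered back-projection.
    `image` is a 2D axial slice; `strength` is a toy sharpness parameter."""
    theta = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    sinogram = radon(image, theta=theta, circle=False)  # step 1: discrete Radon transform
    # step 2: high-boost kernel along the detector axis; it sums to one, so
    # smooth content is preserved while edges are emphasized
    kernel = np.array([-strength, 1.0 + 2.0 * strength, -strength])
    sinogram = np.apply_along_axis(np.convolve, 0, sinogram, kernel, mode="same")
    # step 3: back-projection of the filtered sinogram
    return iradon(sinogram, theta=theta, filter_name="ramp", circle=False)
```

Both discrete approximations named in the next paragraph are visible in the sketch: `radon` discretizes the projection geometry, and `iradon` only approximates the continuous inverse transform.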
We outline two weak spots in the FBPAug pipeline that motivate us to use the other domain adaptation approaches. As described above, FBPAug applies two discrete approximations, a discrete Radon transform and back-projection, leading to information loss. Furthermore, the original convolution kernels used by CT manufacturers are unavailable, and the parametric family of kernels proposed in (Saparov et al., 2021) is also an approximation. Thus, we expect FBPAug to perform worse than the other adaptation methods when a wider range of paired data is available to the latter. Nevertheless, FBPAug improves the consistency scores in (Saparov et al., 2021), and we consider it one of the main adaptation approaches. We give the experimental details in Sec. 4.2.

Further, we detail the methods that work with the unlabeled pool of paired data $S_2$. As mentioned at the beginning of the section, the problem can be reformulated as unsupervised domain adaptation. The most successful approaches to this problem are based on adversarial training. Therefore, we adopt the approach of (Ganin and Lempitsky, 2015) and build a domain-adversarial neural network (DANN). The reasons for choosing this method are discussed in Sec. 1.1. DANN includes an additional domain classifier, or discriminator, which aims to classify images between the source and target domains using their feature maps. We train the model to minimize the loss on the primary task (segmentation) and simultaneously maximize the discriminator's loss. Thus, the segmentation part of the model learns domain features that are indistinguishable to the discriminator, which should improve the performance of the model on the target domain.

For the segmentation task, we modify the architectural design of DANN; see Fig. 1 (6). It consists of three parts: (i) the feature extractor $H_f$, the part of the segmentation model that maps input images $x$ into the feature space; (ii) the segmentation head $H_p$, the complementary part of the segmentation model that predicts the binary mask $\hat{y} = H_p(H_f(x))$; and (iii) the discriminator $H_d$, a separate neural network that predicts the domain label $\hat{d} = H_d(H_f(x))$. In Fig. 1, $H_f$ and $H_p$ correspond to the encoder and decoder parts of the model, respectively. $H_d$ is denoted with the dashed green arrow that passes the aggregated features to the adversarial loss. Following (Ganin and Lempitsky, 2015), our optimization target is

$$E(\theta_f, \theta_p, \theta_d) = L_s(\theta_f, \theta_p) - \lambda L_d(\theta_f, \theta_d),$$
$$(\hat{\theta}_f, \hat{\theta}_p) = \arg\min_{\theta_f, \theta_p} E(\theta_f, \theta_p, \hat{\theta}_d), \qquad \hat{\theta}_d = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_p, \theta_d),$$

where $L_s$ is the segmentation loss (BCE) and $L_d$ is the domain classification loss (BCE); $\theta_f$, $\theta_p$, $\theta_d$ are the parameters of $H_f$, $H_p$, $H_d$, respectively, and $\hat{\theta}_f$, $\hat{\theta}_p$, $\hat{\theta}_d$ are the solutions we seek. The parameter $\lambda$ regulates the trade-off between the adversarial and segmentation objectives.

The goal of the discriminator is to classify the kernel that was used to reconstruct the image. To aggregate the features before the discriminator, we use 1 × 1 convolutions and interpolation to equalize the number of channels and the spatial size, and then concatenate the result. The discriminator consists of a sequence of fully convolutional layers followed by several fully connected layers. We also use Leaky ReLU activations (Xu et al., 2015) and average pooling to avoid sparse gradients. However, there is no consensus in the literature on how to connect the discriminator to a segmentation network (Zakazov et al., 2021), and our experiments show a strong dependence of model performance on the connection implementation. Therefore, we consider two strategies of connecting the discriminator: aggregating features from the earlier (encoder) or later (decoder) layers. We denote these approaches DANN (Enc) and DANN (Dec), respectively; the DANN (Enc) version is presented in Fig. 1. We describe the experimental details in Sec. 4.3.
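A minimal PyTorch sketch of the adversarial branch is given below. It uses the standard gradient reversal trick from (Ganin and Lempitsky, 2015); the feature aggregation (1 × 1 convolutions, interpolation, concatenation) and the discriminator architecture described above are assumed to be given and are not reproduced here.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies the gradient by -lambda on the
    backward pass, which turns the min-max objective into a single minimization."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def dann_loss(features, mask_logits, masks, domain_labels, discriminator, lam):
    # segmentation loss (BCE) on the labeled source slices
    seg_loss = F.binary_cross_entropy_with_logits(mask_logits, masks)
    # adversarial branch: the discriminator learns to classify the kernel,
    # while the reversed gradient pushes H_f towards kernel-invariant features
    domain_logits = discriminator(GradReverse.apply(features, lam))
    adv_loss = F.binary_cross_entropy_with_logits(domain_logits, domain_labels)
    return seg_loss + adv_loss
```

Minimizing `seg_loss + adv_loss` then trains the discriminator on the kernel labels while the reversed gradients implement the maximization over $\theta_d$ and the minimization over $\theta_f$, $\theta_p$ in a single backward pass.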
Similarly to the adversarial approach, we propose to remove style-specific kernel information from the feature maps. However, we additionally exploit the paired nature of the unlabeled dataset $S_2$: instead of the adversarial loss, we minimize the distance between feature maps of paired images. We use the same notation as in Sec. 2.3 and denote the feature maps of an image $x$ by $f = H_f(x)$; for the paired image $\tilde{x}$, we use the similar notation $\tilde{f}$. During training, we minimize the sum of the segmentation loss and the distance between the paired features ($f$ and $\tilde{f}$). Thus, the optimization problem is

$$(\hat{\theta}_f, \hat{\theta}_p) = \arg\min_{\theta_f, \theta_p} \; L_s(\theta_f, \theta_p) + \alpha L_c(f, \tilde{f}),$$

where $L_s$ is the segmentation loss (BCE) and $L_c$ is the consistency loss. For the consistency loss, we use the mean squared error (MSE) between paired feature maps. The parameter $\alpha$ regulates the trade-off between the two objectives. We call this method F-Consistency, since it enforces the consistency between paired feature maps. Along with DANN, we present our method schematically in Fig. 1 (5). Note that we do not need any additional model, e.g., the discriminator $H_d$, in the case of F-Consistency. However, the same question of choosing the feature maps to aggregate arises. We consider the same two strategies as in Sec. 2.3: aggregating encoder or decoder feature maps. We further call these implementations of our method F-Consistency (Enc) and F-Consistency (Dec), respectively. All experimental details are given in Sec. 4.4.
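A sketch of the consistency term is shown below; the assumption that the encoder returns a list of per-stage feature maps is ours and only illustrates the interface, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def f_consistency_loss(encoder, x, x_tilde):
    """MSE between the hidden feature maps of a kernel-paired couple of slices."""
    feats = encoder(x)              # features of the smooth-kernel image
    feats_tilde = encoder(x_tilde)  # features of its sharp-kernel counterpart
    losses = [F.mse_loss(f, f_t) for f, f_t in zip(feats, feats_tilde)]
    return torch.stack(losses).mean()
```

The total training objective is then the BCE on the labeled source slices plus $\alpha$ times this term computed on the unlabeled pairs.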
A special case of F-Consistency (Dec) is enforcing the consistency of paired predictions, since the predictions are de facto the feature maps of the last network layer. This approach was proposed in (Orbes-Arteaga et al., 2019), also in the context of medical image segmentation. Further, we denote this method P-Consistency. Visually, it could be compared with DANN and F-Consistency in Fig. 1 (4). The optimization problem is the same as for F-Consistency above, except that $L_c$ is the Dice Loss (Milletari et al., 2016) and $f$ and $\tilde{f}$ are the last-layer features, i.e., the predictions. The experimental details are described in Sec. 4.5.

In our experiments, we use a combination of different datasets with chest CT images. The data could be divided into two subsets according to the experimental purposes. The first collection of datasets consists of images with annotated COVID-19 lesions, i.e., with binary masks of ground-glass opacity and consolidation. It serves to train the COVID-19 segmentation algorithms. We detail every dataset of the segmentation collection in Sec. 3.1. The second collection consists of chest CT images which are reconstructed with different kernels. We filter the data to contain pairs of images reconstructed with the smooth and sharp kernels. This data is further used to adapt the models in an unsupervised manner. We detail the second collection in Sec. 3.2. Most of these datasets are publicly available, yielding the reproducibility of our experiments. We summarize the description in Tab. 1 and Tab. 2.

Figure 1: Schematic representation of the proposed F-Consistency (5) and its competitors, P-Consistency (4) and DANN (6). All methods build upon the same U-Net architecture, which we train to segment the COVID-19 binary mask (1) from the axial slices of chest CT images (2). These adaptation methods use unlabeled paired data (3) to improve the model performance on the target domain. We show the flow and the different usage of the paired data in different methods in green. In the image, DANN and F-Consistency operate with the encoder layers but can be easily extended to the decoder versions.

We use three publicly available datasets to train and test the segmentation algorithms: Mosmed-1110 (Morozov et al., 2020), MIDRC (Tsai et al., 2021), and Medseg-9. We ensure that the selected datasets contain original 3D chest CT imaging studies without third-party preprocessing artifacts. The images from the Mosmed-1110 and MIDRC datasets are reconstructed using smooth kernels, whereas Medseg-9 images have sharp reconstruction kernels. This allows us firstly to identify and then to address the domain adaptation setup. Therefore, we split the segmentation data into the source (COVID-train) and target (COVID-test) domains and describe them in Sec. 3.1.1 and Sec. 3.1.2, respectively. A summary of the segmentation datasets is presented in Tab. 1.

Mosmed-1110 (Morozov et al., 2020). Scanning was performed on Canon (Toshiba) Aquilion 64 units using the standard scanner protocol: an inter-slice distance of 0.8 mm and, in particular, smooth reconstruction kernels. However, the public version of Mosmed-1110 contains every 10th slice of the original series, which makes the resulting inter-slice distance equal to 8.0 mm. Additionally, 50 series have annotated binary masks depicting COVID-19 lesions (ground-glass opacity and consolidation). Further, we use only these 50 images in our experiments. Also, as one of the preprocessing steps, we crop images to the lung masks. However, lungs are not annotated in the dataset; we obtain the lung masks using a standalone algorithm, see details in Sec. 4.1.

MIDRC. MIDRC-RICORD-1a is a publicly available dataset that contains 120 chest CT studies (Tsai et al., 2021). The total number of volumetric series is 154. According to the DICOM entries, most images have smooth reconstruction kernels. The dataset contains at least 12 paired images (without considering the studies that contain more than two series). However, we do not use these pairs to enforce consistency, since both images of every pair have smooth kernels. Also, the original dataset does not contain annotated lung masks. Therefore, similarly to Mosmed-1110, we predict the lung masks with a separate model (Sec. 4.1). Finally, we use only the images that have non-empty annotations. The images that have empty binary masks of COVID-19 are discarded both from the Mosmed-1110 and MIDRC datasets. The resulting training dataset consists of 112 volumetric images with smooth kernels.

Medseg-9. The MedSeg website 1 shares a publicly available dataset with 9 annotated chest CT images from here 2. Although there is no information about the reconstruction kernels, we perceptually identify these images as having sharp kernels. The latter assumption also finds experimental confirmation in the significantly lower scores when the model is trained on the images with smooth kernels. Contrary to the COVID-train dataset, Medseg-9 contains annotated lung masks. However, we find the masks predicted by our algorithm more precise and use the predicted ones. Also, we ignore the other dataset with 20 annotated CT images from the MedSeg website (Jun et al., 2020).
The preprocessing of these images is unknown, inconsistent, and diverges from our default preprocessing pipeline. Therefore, our testing dataset consists of 9 images with sharp kernels.

To train and evaluate the consistency of the segmentation algorithms, we use two sources of paired data. The first source is the publicly available Cancer-500 dataset. However, the Cancer-500 dataset does not contain COVID-19 cases. Therefore, to properly evaluate the consistency of COVID-19 segmentation algorithms, we use a second source of private data that contains COVID-19 cases. From Cancer-500, we build the Paired-public dataset (Sec. 3.2.1) and use it only to train the segmentation algorithms in an unsupervised manner. Then, we build the Paired-private dataset from our private data (Sec. 3.2.2). Besides training, we use this dataset to evaluate the consistency scores, since it contains the segmentation target (COVID-19 lesions). The summary of the paired datasets is presented in Tab. 2. Neither dataset contains any COVID-19 or lung annotations. Note that we do not need COVID-19 annotations, since we use these datasets in unsupervised training. However, we need lung masks to preprocess the images; thus, we use the same lung segmentation model (Sec. 4.1) as for the other datasets.

We build the Paired-public dataset using a publicly available dataset, Cancer-500. The data was collected from 536 randomly selected patients of Moscow clinics in 2018. All original images were obtained using a Toshiba scanner and reconstructed with the FC07, FC51, or FC55 kernels. Here, FC07 is a smooth reconstruction kernel, whereas FC51 and FC55 are sharp kernels. From the 536 studies, we extracted 120 pairs by comparing the shape and acquisition time of the corresponding DICOM series and filtering out contrast-enhanced cases. As a result, the Paired-public dataset consists of 98 FC07/FC51 and 22 FC07/FC55 pairs (Tab. 2). We use this dataset to train the domain adaptation algorithms on paired images. However, the Paired-public dataset does not contain COVID-19 cases (it was collected before the pandemic). The latter observation limits using this dataset to evaluate the consistency; otherwise, we would evaluate the quality of COVID-19 segmentation algorithms on images with no COVID-19 lesions, i.e., measure the consistency of noisy, false positive predictions. For the same reason, one should also be careful using this data to enforce the consistency in the last network layers, e.g., in P-Consistency (Sec. 2.5): data without COVID-19 lesions can force the network to output trivial predictions. Therefore, we introduce a private dataset for the extended consistency evaluation and robust training of some of the domain adaptation algorithms.

From a private collection of chest CT images, we select 181 pairs to create the Paired-private dataset. These images were initially collected from Moscow outpatient clinics during the year 2020. Scanning was performed on Toshiba and GE Medical Systems units using diverse settings. We select the four most frequent kernel pairs with a total of six unique reconstruction kernels: FC07, FC51, FC55, LUNG, SOFT, and STANDARD. We detail the distribution of kernel pairs in Tab. 2. Due to the purpose of collecting the data, these images contain COVID-19 lesions. Therefore, we use the Paired-private dataset both for training and evaluation. The wider variety of kernels also allows us to test the generalization of the algorithms to unseen kernels; see further experimental details in Sec. 4.
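For illustration, here is a hypothetical sketch of how such pairs can be mined from DICOM metadata with pydicom. All names and the matching key are our assumptions: we match one representative slice per series, and the filtering of contrast-enhanced cases mentioned above is assumed to happen beforehand.

```python
import pydicom
from collections import defaultdict

SMOOTH = {"FC07", "SOFT", "STANDARD"}
SHARP = {"FC51", "FC55", "LUNG"}

def find_kernel_pairs(slice_paths):
    """Match series that share a study, acquisition time, and in-plane shape
    but were reconstructed with a smooth and a sharp kernel, respectively."""
    groups = defaultdict(dict)
    for path in slice_paths:
        ds = pydicom.dcmread(path, stop_before_pixels=True)
        key = (ds.StudyInstanceUID, ds.AcquisitionTime, ds.Rows, ds.Columns)
        kernel = str(ds.ConvolutionKernel)
        if kernel in SMOOTH:
            groups[key]["smooth"] = path
        elif kernel in SHARP:
            groups[key]["sharp"] = path
    return [(g["smooth"], g["sharp"]) for g in groups.values() if len(g) == 2]
```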
The main focus of the experiments below is to compare our method to the other unsupervised domain adaptation techniques. To achieve an objective comparison, a fair and unified experimental environment should be created. Therefore, we firstly describe the common preliminary steps that build up every method. This description includes the preprocessing, lung segmentation, and COVID-19 segmentation steps in Sec. 4.1. Further, we detail every domain adaptation method: filtered back-projection augmentation (FBPAug) in Sec. 4.2, the domain-adversarial neural network (DANN) in Sec. 4.3, cross-domain feature maps consistency (our proposed method) in Sec. 4.4, and cross-domain predictions consistency in Sec. 4.5.

In all the experiments below, we use the same data split and evaluation metrics. Firstly, we split the COVID-train dataset (the source domain with smooth kernels) into 5 folds. Then, we perform a standard cross-validation procedure, training on the data from four folds and calculating the score on the remaining fold. Here, we calculate the Dice Score between the predicted and ground truth COVID-19 masks for every 3D image and average these scores over the whole fold. Also, for every validation fold, we calculate the average Dice Score on the COVID-test dataset, which is the target domain with sharp kernels. Finally, we report the mean and standard deviation of these five scores on the cross-validation and target domain data. Besides the Dice Score on the source and target data, we also report the Dice Score between predictions on the paired images. To do so, we split the pairs of the Paired-private dataset into training and testing folds stratified by the type of kernel pair. The size of the test fold is approximately 30% of the dataset size. Then, we supplement the source domain training data (the four current folds of the cross-validation) with the fixed training part of the paired data. The average Dice Score is calculated on the test part in a similar fashion.

Preprocessing. We use the same preprocessing steps for all experiments. Firstly, we rescale all CT images to a 1.75 × 1.75 mm axial resolution. Then, the intensity values are clipped to a minimum of −1000 Hounsfield units (HU) and a maximum of 300 HU. The resulting intensities are min-max-scaled to the [0, 1] range.

Lung segmentation. Further, we crop the CT images to the bounding box of the lung mask. We obtain the latter mask by training a standalone CNN segmentation model. The training procedure and architecture are reproduced from (Goncharov et al., 2021). The training of the lung segmentation model involves two external chest CT datasets: LUNA16 (Jacobs et al., 2016; Armato III et al., 2011) and NSCLC-Radiomics (Kiser et al., 2020; Aerts et al., 2015). These datasets have an empty intersection with the other datasets used to train the COVID-19 segmentation models; thus, there is no leakage of the test data.

COVID-19 segmentation. In all COVID-19 segmentation experiments, we use the same 2D U-Net architecture described in Sec. 2.1. We train all models for 25k iterations using the Adam (Kingma and Ba, 2014) optimizer with the default parameters and an initial learning rate of $10^{-4}$. Every 6k batches, the learning rate is multiplied by 0.2. Each iteration contains 32 randomly sampled 2D axial slices. Training the segmentation model takes approximately 12 hours on an NVIDIA GTX 1080 (8 GB). We further call the model trained only on the source data and without any pipeline modifications the baseline. For the baseline, we also calculate the test scores on the COVID-test dataset and the consistency scores on the Paired-private dataset. We refer to it as the starting point for all other methods.
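The intensity preprocessing described above reduces to a few lines; a minimal sketch is given below (the lung-mask cropping is performed by the standalone model and is omitted here).

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess(volume, spacing_xy, target=1.75, hu_min=-1000.0, hu_max=300.0):
    """Rescale to 1.75 x 1.75 mm axial resolution, clip to [-1000, 300] HU,
    and min-max scale to [0, 1]. `volume` is a (z, y, x) array in HU."""
    factors = (1.0, spacing_xy[0] / target, spacing_xy[1] / target)
    volume = zoom(volume, factors, order=1)          # in-plane resampling only
    volume = np.clip(volume, hu_min, hu_max)         # intensity clipping
    return (volume - hu_min) / (hu_max - hu_min)     # min-max scaling
```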
The first method that we consider as a solution to our domain shift problem is FBPAug (Saparov et al., 2021). One can find the method description and the motivation to use it in Sec. 2.2. In our experiments, we use the original implementation of FBPAug from (Saparov et al., 2021) and preserve the augmentation parameters. However, we sample the parameters from the interval that corresponds to sharp reconstruction kernels ($a \in [10.0, 40.0]$, $b \in [1.0, 4.0]$), since our goal is to adapt the model to sharp kernels. We also reduce the probability of augmenting an image from 0.5 to 0.1. The latter change does not affect performance (tested on a single validation fold) and reduces the experiment time (saving about 90 hours per experiment). Note that the experimental setup remains the same as in the baseline (Sec. 4.1). FBPAug is the only adaptation method that does not use paired data.

The next step is to use unlabeled paired data to build an algorithm robust to domain shift. Here, we adopt the DANN approach (Ganin and Lempitsky, 2015) for the COVID-19 segmentation task; we detail this approach in Sec. 2.3. In our experiments, we use the scheduling of the parameter λ as in (Ganin and Lempitsky, 2015). The baseline training procedure is extended to sample from the unlabeled data: at every iteration, we additionally sample 16 pairs of axial slices (the batch size is 32) from either the Paired-private or the Paired-public dataset (depending on the data setup). Then, we make a second forward pass through the discriminator and sum the segmentation and adversarial losses. The rest of the pipeline remains the same as in the baseline.

For this method, we select two parameters that can drastically change its behaviour in terms of consistency and segmentation quality. Firstly, we evaluate different λ values, where λ determines how strongly the adversarial loss contributes to the total loss (see Sec. 2.3). With λ close to zero, we expect DANN to behave similarly to the baseline. With larger λ, we expect our segmentation model to start fooling the discriminator, making the features of different kernels indistinguishable to it. However, this consequence does not guarantee an increase in consistency or segmentation quality. Therefore, we manually search over $\lambda \in \{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$ and choose the best value. Secondly, as discussed in Sec. 2.3, we compare two approaches to connecting the discriminator to the segmentation model: (i) to the encoder and (ii) to the decoder part.

Finally, we evaluate how well DANN generalizes in the presence of different kernel pairs. To do so, we exclude the SOFT/LUNG and STANDARD/LUNG kernel pairs from training and compare the results with the model trained on all available kernel pairs from Paired-private. We also test the sensitivity of the DANN approach to the presence of COVID-19 lesions in the unlabeled data. In this case, we train DANN on the Paired-public dataset, which does not contain COVID-19 targets.

Our proposed F-Consistency also uses unlabeled paired data. Therefore, the training procedure is the same as for DANN (Sec. 4.3), except that we do not use any scheduling for the parameter α. Our method is detailed in Sec. 2.4.
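For reference, a short sketch of the λ schedule published in (Ganin and Lempitsky, 2015) and of the hyperparameter grids used here and in the next subsection; the helper names are ours.

```python
import numpy as np

def dann_lambda(progress, gamma=10.0):
    """Schedule from (Ganin and Lempitsky, 2015): lambda ramps smoothly
    from 0 to 1 as the training progress goes from 0 to 1."""
    return 2.0 / (1.0 + np.exp(-gamma * progress)) - 1.0

# grid searches described in the text
lambdas = [10.0 ** k for k in range(-5, 1)]         # {1e-5, ..., 1e-1, 1}
alphas = 3.0 ** np.linspace(-10.0, 0.0, num=10)     # ten values from 3^-10 to 1
```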
Similarly to DANN's experimental setup, we select two parameters to evaluate: different α values and the features that contribute to the consistency regularization. Here, α controls the trade-off between consistency and segmentation quality. However, contrary to the discriminator's λ in Sec. 4.3, large α values for the consistency regularization ensure feature alignment. We show this trade-off for ten α values in log-space from $3^{-10}$ to 1. Then, we compare two approaches to enforcing feature consistency: regularizing the encoder's and the decoder's features. Finally, we evaluate the generalization of F-Consistency to different kernel pairs, similarly excluding the SOFT/LUNG and STANDARD/LUNG kernel pairs from training. As with DANN, we train our method on the Paired-public dataset, which does not contain COVID-19 lesions, and show its generalization to kernel styles regardless of the semantic content.

One special case of F-Consistency is enforcing the consistency of paired predictions, which was independently evaluated in (Orbes-Arteaga et al., 2019). We call this case P-Consistency and detail it in Sec. 2.5. We follow the same experimental setup as for F-Consistency (Sec. 4.4) and show the trade-off between the target and consistency scores for the same values of α. However, there are no experiments on regularizing specific features (in the encoder or decoder), since predictions consistency can be regularized only at the output layer of the network. For P-Consistency, we draw attention to the experiment on the Paired-public dataset. Since this dataset does not contain COVID-19 cases, enforcing predictions consistency on empty-target images can result in trivial predictions. Thus, we expect P-Consistency to generalize worse to the target domain in terms of the target Dice Score. However, for the datasets with COVID-19 lesions, we show generalization to the unseen kernels, as for the other methods.

We structure our experimental results as follows. We firstly compare the final versions of all methods in Sec. 5.1, directly supporting our main message. Secondly, we compare the generalization of all methods trained on less data in Sec. 5.2. Then, we visualize the trade-off between consistency and COVID-19 segmentation quality in Sec. 5.3. The experimental methodology follows Sec. 4. We compare some of the key results statistically using a one-sided Wilcoxon signed-rank test. We report p-values in two testing setups: $p_1$, comparing the five mean values after cross-validation, and $p_2$, comparing the Dice Score on every image as an independent sample.
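Both setups reduce to a paired one-sided test; a minimal sketch with SciPy (the helper name is ours) is shown below. For $p_1$ the inputs are the five per-fold means, and for $p_2$ the per-image Dice Scores.

```python
from scipy.stats import wilcoxon

def one_sided_p(scores_ours, scores_other):
    """One-sided Wilcoxon signed-rank test that our paired scores are greater."""
    return wilcoxon(scores_ours, scores_other, alternative="greater").pvalue
```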
Table 3: Comparison of the best versions of all considered methods from Sec. 2. The adaptation methods are trained using all training kernel pairs of the Paired-private dataset. F-/P-Cons stand for F-/P-Consistency, where F-Consistency is our proposed method. Enc and Dec correspond to the implementations of the methods operating with the encoder and decoder network layers, respectively. All results are Dice Scores in the format mean ± std calculated from 5-fold cross-validation. We highlight the best scores in every column in bold.

To begin with, we show the existence of the domain shift problem within the COVID-19 segmentation task. The Dice Score of the baseline model on the COVID-test dataset is lower than the cross-validation score on the COVID-train dataset, 0.56 against 0.60. Also, this score is significantly lower than the 0.64 achieved by our adaptation method ($p_1 < 0.05$, $p_2 < 10^{-4}$). One could find the results in Tab. 3, comparing the Baseline row to the others. For all adaptation methods, we observe an increase in the consistency score and in the segmentation quality on the target domain. Moreover, all methods maintain their quality on the source domain compared to Baseline.

Further, we compare FBPAug to the best adaptation methods, since it is a straightforward solution to the domain shift problem caused by the difference in the reconstruction kernels (Sec. 2.2). Although FBPAug achieves comparable results on the target domain, our method outperforms it in terms of the average consistency score, 0.80 Dice Score against 0.76 ($p_1 < 0.05$, $p_2 < 10^{-5}$). The results are also in Tab. 3, row FBPAug. We highlight two possible reasons for FBPAug performing worse than its competitors. Firstly, the augmented data is synthetic, since the method emulates but does not reproduce the CT reconstruction process. Secondly, the range of augmentation is limited compared to the diversity of the real reconstructed images. Therefore, the lower quality and diversity of the augmented data might affect the performance of FBPAug, resulting in lower consistency scores.

Finally, we compare the models that depend on paired regularization data: DANN, F-Consistency (our proposed method), and P-Consistency; the last five rows in Tab. 3. The implementations of the methods that operate with the encoder layers outperform those with the decoder layers. For the adversarial approach, DANN (Dec)'s consistency score of 0.52 is significantly lower than the 0.77 of DANN (Enc) ($p_1 < 0.05$, $p_2 < 10^{-10}$). The same holds when comparing F-Cons (Dec) and F-Cons (Enc): the consistency score of 0.58 is significantly lower than 0.80 ($p_1 < 0.05$, $p_2 < 10^{-10}$). We also note that P-Consistency operates with the last decoder layer; thus, we compare it with F-Cons (Enc) and show that it results in a significantly lower consistency score, 0.63 against 0.80 ($p_1 < 0.05$, $p_2 < 10^{-10}$). Thus, our experimental evidence aligns with the message of (Zakazov et al., 2021) that the earlier (encoder) layers contain more domain-specific information than the later (decoder, output) ones.

Besides F-Consistency outperforming P-Consistency, we also show our method to outperform DANN (the encoder versions). Although both methods score similarly in terms of the target Dice Score, F-Consistency has an advantage over DANN in the consistency score: 0.80 against 0.78 ($p_1 < 0.05$, $p_2 < 10^{-2}$). Our intuition here is that F-Consistency explicitly enforces feature alignment, whereas DANN only enforces the features to be indistinguishable to the discriminator. The two objectives impact the consistency score differently, and F-Consistency performs better.

We conclude the comparison of the methods with a qualitative analysis. In Fig. 2, one could find examples of the Baseline, FBPAug, DANN (Enc), and F-Cons (Enc) predictions on the COVID-test dataset and compare them with the ground truth. Although all adaptation methods perform similarly to the ground truth with minor inaccuracies, Baseline outputs massive false positive predictions on the unseen domain.

Figure 3: Examples of CT slices from the Paired-private dataset with the corresponding predictions on the paired images. Four doublets, denoted A, B, C, and D, contain corresponding slices from the smooth and sharp images. The doublets B and D are coronal and sagittal slices, respectively; we also pad them to visually align with the axial slices. Every slice contains the predictions of the four methods named in the legend.
DANN and F-Consistency correspond to DANN (Enc) and F-Cons (Enc) from Tab. 3, respectively.

In addition to the quantitative analysis above, the latter observation highlights the relevance of the domain adaptation problem in the COVID-19 segmentation task. In Fig. 3, we visualize the predictions of the same four methods on the paired images from the Paired-private dataset. For the Baseline, we observe extreme inconsistency (Fig. 3, A) and massive false positive predictions in healthy lung tissues (Fig. 3, D) and even outside the lungs (Fig. 3, B). For the adaptation methods, the predictions are visually more consistent within every pair, which aligns with the consistency scores in Tab. 3. Despite the high consistency scores, FBPAug and DANN output perceptually more aggressive predictions. FBPAug predicts motion artifacts near the body regions (Fig. 3, A) and triggers, similarly to the baseline, on most likely healthy lung tissues (Fig. 3, B). DANN is more conservative but triggers on consolidation-like tissues (Fig. 3, C and D). However, without ground truth annotations on the paired data, we refer to this analysis as a discussion.

Below, we investigate the generalization of the models trained with less data and the trade-off between consistency and segmentation quality. Firstly, we show how DANN, P-Consistency, and F-Consistency generalize to unseen reconstruction kernels. We remove the SOFT/LUNG and STANDARD/LUNG pairs of the Paired-private dataset from training, so we train the models using the FC07/FC51 and FC07/FC55 pairs. The results on the removed kernel pairs are shown in Tab. 4. The methods preserve their segmentation quality on the COVID-train and COVID-test datasets even though we train them with limited data. Moreover, all three methods score considerably higher than Baseline in the consistency scores for the unseen kernel pairs (SOFT/LUNG and STANDARD/LUNG). The latter means that the adaptation methods manage to align stylistic features even from a limited number of training examples. However, we highlight a decrease in the average consistency scores compared to the versions trained on the full Paired-private dataset. At this point, FBPAug (Tab. 3) outperforms the adaptation methods, which indicates that the range of the synthetically augmented data overlaps the range of the reduced Paired-private.

Further, we evaluate the models regularized using paired images from the Paired-public dataset. This dataset contains only the FC07/FC51 and FC07/FC55 kernel pairs. Unlike the previous setup, the data does not contain COVID-19 lesions. Thus, we demonstrate that some methods depend on the semantic content and poorly generalize to kernel styles. The results are shown in Tab. 5. We highlight two main findings. Firstly, the consistency of the methods that operate with the decoder layers decreases to the level of Baseline; see the DANN (Dec), P-Cons, and F-Cons (Dec) rows in Tab. 5. Our intuition here is that the decoder versions of the models can be more easily forced to output trivial predictions than the encoder ones, while images without COVID-19 lesions induce exactly such trivial predictions. Therefore, it might be easier for these models to distinguish the paired dataset from the source dataset by the semantic content and fail to align the stylistic features. Secondly, we observe that our method, F-Cons (Enc), outperforms the other adaptation methods when training only on the publicly available data.

The main problem with maximizing the consistency score is converging to the trivial solution (empty predictions).
Following the intuition from the previous section (Sec. 5.2), the problem is more acute for the adaptation methods that operate with the decoder layers. Thus, we vary α, the parameter that balances the consistency and segmentation losses, for P-Consistency and F-Consistency (Dec); see Fig. 4. The resulting trade-off follows the expected trend: the consistency increases towards 1 and the target Dice Score decreases towards 0 with increasing α. We use the Dice Score on the COVID-train dataset as a perceptual criterion to choose α and stop at the largest α value before the Dice Score starts to drop: $10^{-3}$ for P-Cons, $10^{-4}$ for F-Cons (Dec), and 1 for F-Cons (Enc). Here, we also ensure that there is no overfitting on COVID-test by using the cross-validation scores. However, we use Paired-private, which participates in the final comparison, to calculate the consistency score. Firstly, we argue that using the Paired-public dataset in this setup is incorrect: Paired-public does not contain COVID-19 lesions; thus, we could only measure the consistency of false positive predictions on it. Secondly, we choose α without considering the consistency scores; therefore, we do not overfit on the consistency scores either.

For the DANN method, we choose $\lambda = 10^{-2}$ based on the best score on COVID-train. Far from the optimal λ values, DANN's prediction scores have a large standard deviation, so the trade-off cannot be observed. We also note that the adversarial approach does not explicitly enforce trivial predictions. Hence, we report the trade-off only for the F-Consistency and P-Consistency methods.

Below, we summarize our results, discuss the most important limitations of our study, and suggest possible directions for future work. We have shown that the proposed F-Consistency significantly improves the performance on the target domain compared to the baseline model. However, we do not train an oracle model, which would indicate the upper bound for the other methods in a domain adaptation task. The oracle model should be trained via cross-validation on the target domain. In our case, the target domain contains only 9 images, which leads either to lower results due to the small size of the training set or to a high dispersion of the results. Therefore, we compare the adaptation methods only with the baseline model and between each other.

In Sec. 5.1, our model achieves the highest results in terms of the consistency score. On the contrary, the authors of (Orbes-Arteaga et al., 2019) observe a tendency of models to converge to trivial solutions when using a consistency loss. They assume that the models learn to distinguish the domains for which they are penalized; thus, the models yield trivial but perfectly consistent predictions. Although we run the same setup as (Orbes-Arteaga et al., 2019), we do not observe trivial predictions for our method, as demonstrated throughout Sec. 5. Our intuition here is that the inner structure of our domains and the semantic content of the images are more diverse, preventing the model from overfitting to a specific domain.

One may argue that we could use the Paired-public dataset (Sec. 3.2.1) to calculate the consistency scores in our experiments, since it also contains paired images. Here, we highlight that the Paired-public data was collected before the COVID-19 pandemic; thus, it does not contain COVID-19 lesions. Consequently, calculating the consistency scores on Paired-public, we would measure the consistency of false positive predictions, i.e., noise. Therefore, we compare the models using the Paired-private dataset (Sec. 3.2.2) instead.
We highlight that the adversarial and consistency-based methods depend on a diverse unlabeled pool of data; see Sec. 5.2. On the other hand, FBPAug does not require additional data, since it augments the images from the source dataset. One could think of this method as enforcing the consistency between the predictions on the original and augmented images, using the ground truth as a proxy. Simultaneously, we have shown that the models operating with the earlier layers perform better. Therefore, we could train F-Consistency on pairs of original and FBPAug-augmented images to achieve even better results. We leave the latter idea for future work.

We have proposed an unsupervised domain adaptation method, F-Consistency, to address the difference in CT reconstruction kernels. Our method uses a set of unlabeled CT image pairs and enforces the similarity between the feature maps of paired images. We have shown that F-Consistency outperforms the other adaptation and augmentation approaches in the COVID-19 segmentation task. Finally, through extensive evaluation, we have shown our method to better generalize on unseen reconstruction kernels and without the specific semantic content.

References
Data from NSCLC-Radiomics. The Cancer Imaging Archive.
The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans.
A simple framework for contrastive learning of visual representations.
Deep learning-based image conversion of CT reconstruction kernels improves radiomics reproducibility for pulmonary nodules or masses.
Advancing medical imaging informatics by deep learning-based domain adaptation.
Unsupervised cross-modality domain adaptation of ConvNets for biomedical image segmentations with adversarial loss.
Unsupervised domain adaptation by backpropagation.
CT-based COVID-19 triage: deep multitask learning improves joint identification and severity quantification.
Deep residual learning for image recognition.
Lung nodule analysis.
COVID-19 CT Lung and Infection Segmentation Dataset.
Unsupervised domain adaptation in brain lesion segmentation with adversarial networks.
Adam: A method for stochastic optimization.
Data from the thoracic volume and pleural effusion segmentations in diseased lungs for benchmarking chest CT processing pipelines [dataset]. The Cancer Imaging Archive.
Integrating deep learning CT-scan model, biological and clinical variables to predict severity of COVID-19 patients.
Practical window setting optimization for medical image deep learning.
CT image conversion among different reconstruction kernels without a sinogram by using a convolutional neural network.
PixMatch: Unsupervised domain adaptation via pixelwise consistency training.
A deep learning prognosis model help alert for COVID-19 patients at high-risk of death: a multi-center study.
V-Net: Fully convolutional neural networks for volumetric medical image segmentation.
MosMedData: data set of 1110 chest CT scans performed during the COVID-19 epidemic.
A simplified cluster model and a tool adapted for collaborative labeling of lung cancer CT scans.
Image filtering as an alternative to the application of a different reconstruction kernel in CT imaging: feasibility study in lung cancer screening.
Multi-domain adaptation in brain MRI through paired consistency and adversarial learning. In: Domain Adaptation and Representation Transfer and Medical Image Learning with Less Labels and Imperfect Data.
Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans.
U-Net: Convolutional networks for biomedical image segmentation.
The role of chest imaging in patient management during the COVID-19 pandemic: a multinational consensus statement from the Fleischner Society.
Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks.
Zero-shot domain adaptation in CT segmentation by filtered back projection augmentation.
Image reconstruction: Part 1: understanding filtered back projection, noise and image acquisition.
Lung infection quantification of COVID-19 in CT images with deep learning.
Review of artificial intelligence techniques in imaging data acquisition, segmentation, and diagnosis for COVID-19.
Automated detection and forecasting of COVID-19 using deep learning techniques: A review.
Deep learning enables accurate diagnosis of novel coronavirus (COVID-19) with CT images.
Lippert, C., 2020. 3D self-supervised methods for medical imaging.
The RSNA International COVID-19 Open Radiology Database (RICORD).
Empirical evaluation of rectified activations in convolutional network.
The domain shift problem of medical image segmentation and vendor-adaptation by UNet-GAN.
Anatomy of domain shift impact on U-Net layers in MRI segmentation.
Unpaired image-to-image translation using cycle-consistent adversarial networks.

Acknowledgments. The work was supported by the Russian Science Foundation grant 20-71-10134.