key: cord-0700734-f84w99hi
authors: Ding, Weiping; Abdel-Basset, Mohamed; Hawash, Hossam
title: RCTE: A Reliable and Consistent Temporal-ensembling Framework for Semi-supervised Segmentation of COVID-19 Lesions
date: 2021-07-21
journal: Inf Sci (N Y)
DOI: 10.1016/j.ins.2021.07.059
sha: 4d2ce4f6511005eebcc85ea329769c36b78b528b
doc_id: 700734
cord_uid: f84w99hi

The segmentation of COVID-19 lesions from computed tomography (CT) scans is crucial to develop an efficient automated diagnosis system. Deep learning (DL) has shown success in different segmentation tasks. However, an efficient DL approach requires a large amount of accurately annotated data, which is difficult to aggregate owing to the urgent situation of COVID-19. Inaccurate annotation can easily occur without experts, and segmentation performance is substantially worsened by noisy annotations. Therefore, this study presents a reliable and consistent temporal-ensembling (RCTE) framework for semi-supervised lesion segmentation. A segmentation network is integrated into a teacher-student architecture to segment infection regions from a limited number of annotated CT scans and a large number of unannotated CT scans. The network generates both reliable and unreliable targets, and treating these targets equally can degrade performance. To address this, a reliable teacher-student architecture is introduced, in which the teacher network is the exponential moving average (EMA) of the student network, and the student's contribution to the EMA is restrained when its loss is larger. We also present a noise-aware loss, based on improvements to the generalized cross-entropy loss, to make segmentation robust to noisy annotations. Comprehensive analysis validates the robustness of RCTE over recent cutting-edge semi-supervised segmentation techniques, with a 65.87% Dice score.

By the end of December 2019, the world was gripped by an epidemic of a new coronavirus, scientifically known as SARS-CoV-2, which causes an acute viral pneumonia disease (COVID-19) [1]. The epidemic spread rapidly owing to human-to-human transmission, with a reported 96,801,177 patients positively confirmed as COVID-19 and 2,069,763 deaths as of January 01, 2021. The epidemic has become a worldwide threat to health and economic infrastructures. Hence the quick identification of COVID-19 is essential for the treatment of infected patients and global infection containment.

In clinical practice, the reverse-transcription polymerase chain reaction (RT-PCR) is an important technique for screening COVID-19. However, obtaining test results can take up to two days due to insufficient resources and rigorous situational needs, which limits the reliability and speed of screening patients. Moreover, the RT-PCR test exhibits a high ratio of false-negative samples. Clinicians and researchers have adopted the computed tomography (CT) scan as an efficient diagnostic tool, and demonstrated its efficiency at COVID-19 detection in terms of sensitivity and accuracy. Thus, leveraging clinical findings and lung CT manifestations is the most appropriate way to realize a rapid and active diagnosis in terms of follow-up assessment and disease progression monitoring [2].

Advances in artificial intelligence (AI) technologies have facilitated disease forecasting [3] , [4] , [5] and AI-enabled computer-aided diagnosis (CAD) applications in the healthcare field [6] . Among these AI techniques, deep learning (DL) has proved efficient at different tasks in automated medical image analysis, especially for lung disease diagnosis [7] , [8] . However, current supervised DL approaches are known to be data-hungry, as they require a vast number of annotated scans to realize accurate performance. Trustworthy pixel-level annotation of chest CT scans is typically time-consuming, and it requires the laborious effort of an experienced radiological specialist. Owing to the rapid spread of COVID-19, annotating such enormous numbers of CT scans is impractical due to tight timelines and the intense workloads of the healthcare community [3] .

To tackle these limitations, annotation-efficient DL for medical image segmentation has attracted growing research interest as a way to relax the requirement of a large-scale, pixel-level annotated CT dataset for training, inspiring training techniques with partial or no supervision, such as unsupervised domain adaptation (UDA) [9], weakly supervised learning (WSL) [10], semi-supervised learning (SSL) [1], [11], [12], self-supervised learning (S-SL) [13], and active learning [14]. It is imperative that these techniques prevent overfitting of a network with scant annotated data. The present study investigates the semi-supervised segmentation (SS-seg) of COVID-19 lesions, where a large amount of unannotated CT data can be aggregated effortlessly.

SSL has been widely used to improve learning performance in scarce-annotation situations, which is a vital and challenging task with a marked effect on medical applications. Solutions of the SS-seg problem incrementally include segmentation maps from unannotated samples in the training data to improve segmentation performance. Other SSL approaches employ generative adversarial networks (GANs) [15], variational autoencoders (VAEs) [16], and ensemble learning for segmentation or classification purposes [17]. Recently, several SSL approaches have implemented a consistency-enforcing paradigm [18] that benefits from the unannotated data by regularizing model predictions to remain consistent when various perturbations are applied to the inputs and parameters of the model. Consistency-based approaches have emphasized improving the quality of consistency targets. For instance, the method of temporal ensembling (TE) [19], [20] computes consistency targets as exponential moving averages (EMAs) of predictions over several epochs. However, this requires maintaining a large prediction matrix throughout model training. The mean teacher (MT) [21] approach solves this issue through an ensembled teacher architecture that provides the consistency targets for training, and it has shown robust performance in a number of studies [10] [11] [12] [13] [14]. However, this approach ignores information about relations between samples, which the self-ensembling (SE) framework [22] addresses through a sample relation consistency matrix.

Despite recent improvements in automated DL-based segmentation from CT scans, most current studies employ standard convolutional networks (e.g., U-Nets) that are trained normally, ignoring noisy annotations. Several studies have investigated learning from noisy labels in medical image classification problems [23, 24]. In view of this, developing a noise-aware cost function has great potential in many medical applications and has motivated much research, as it lessens the need for image refinement techniques and complex models, and can be easily integrated with any learning strategy [25]. However, applying these functions to segmentation may lead to poor efficiency because of the typical imbalance between the numbers of background and foreground pixels [18]. Therefore, this study investigates a noise-aware semi-supervised TE framework to efficiently reduce the effect of noisy annotations during COVID-19 lesion segmentation.

While several AI models have been developed to facilitate the automation of COVID-19 diagnosis, there has been little study of COVID-19 lesion segmentation. Detecting regions of interest (ROIs) from CT scans is an interesting and challenging task for several reasons. (i) Large divergence in the characteristics of lesions in terms of scope, location, shape, and quality makes them difficult to classify (see Fig. 1). For instance, consolidations are small and precise, and often cause erroneous detection. (ii) Small inter-class divergence means that the margins of ground-glass opacity (GGO) predominantly exhibit clouded manifestation and low contrast, which complicates the detection process. This makes it difficult to aggregate a satisfactory amount of annotated data; hence domain experts must devote much time and effort to generate reliable annotations.

(iii) Noisy annotation is inevitable for rare or new diseases (e.g., COVID-19), which decreases segmentation efficiency. However, the current DL literature focuses on noisy labels in the context of the classification problem. (iv) SS-seg techniques often generate both reliable and unreliable pseudo-annotations and treat them equally when computing the unsupervised loss, which likely degrades performance due to unreliable targets.

To address the above challenges, this study presents a DL approach to use both annotated and unannotated CT images to segment COVID-19 infection lesions.

The primary contributions of our study are summarized below. 1) We introduce a reliable, consistent temporal-ensembling (RCTE) framework that leverages unannotated CT data for efficient semi-supervised segmentation of COVID-19 lesions from limited numbers of annotated CT scans. RCTE can be generalized for SS-seg applications from 2D medical images.

2) A noise-aware loss function is introduced to mitigate the impact of noisy annotations on segmentation performance. The new loss is an improvement of GCE loss [26], which is robust to noisy annotations and less sensitive to the imbalance between background and foreground.

3) Comprehensive experiments on public datasets of COVID-19 CT images demonstrate the ability of RCTE to realize more efficient segmentation over recent cutting-edge SS-seg approaches while avoiding the effect of noisy annotations.

The remainder of this paper is structured as follows. Section 2 discusses associated work involving semi-supervised DL approaches and noise-aware DL models. In Section 3, we present a detailed description of our methodology of COVID-19 lesion segmentation. Results, comparisons, analysis, and discussion are presented in Section 4. Section 5 points out the main limitations of this work. Section 6 summarizes our conclusions and identifies future research directions.

Despite the criticality of lesion segmentation for numerical evaluations in the task of COVID-19 diagnosis, few studies have investigated efficient and automated segmentation of COVID-19 lesions from CT scans.

Fan et al. [11] developed a segmentation network called Inf-Net for automated segmentation of infected areas in lung CT slices. It learns a high-level representation using a concurrent partial decoder, and employs reverse and edge attention to enhance boundary detection capability. Wang et al. [1] introduced a DL approach called COPLE-Net to segment infection areas of different sizes and manifestations. Zhou et al. [8] reduced the complexity of segmentation by decomposing the 3D segmentation task into three 2D segmentation tasks, whose results are averaged. Adel et al. [2] presented an end-to-end segmentation method based on CT scan enhancement. Gao et al. [27] introduced a dual-stream approach with an in-between lesion attention module for the classification and segmentation of COVID-19 lesions from input CT slices.

Mahmud et al. [28] segmented COVID-19 lesions using an attention-enabled segmentation model (TA-SegNet), repeatedly employing a tri-level attention module at different network positions to aggregate pixel-, channel-, and spatial-wise representations during training. All these supervised approaches were validated on private or small-scale data, making them unreliable for the real world. Their performance must be investigated on sufficient amounts of labeled data. Despite the ease of obtaining a large amount of unlabeled CT data, aggregating a large set of pixel-level annotated CT scans was impractical at the time of the outbreak, which motivates us to tackle the limited annotation problem through this study.

Owing to difficulties in aggregating large-scale annotated radiological data, SSL techniques have experienced wide adoption, as they enable the improvement of the performance of DL models even with a limited number of annotated images and a large number of unannotated images [17] , [29] , [30] .

Semi-supervised SE approaches have been commonly employed to train DL networks by minimizing the supervised and regularization losses on annotated and unannotated images, respectively [1, 20, 31]. Other studies have proposed regularization improvement through examining learned expertise, including TE reliance and pseudo-annotations [32]. However, these approaches neglect the relations among images, which provide valuable semantic information from unannotated data. Liu et al. [22] addressed this limitation and introduced an SSL approach for medical image classification that exploits an SE model to generate a target of superior consistency for unannotated images using relation information fused by a sample relation consistency mechanism. Similarly, Shi et al. [20] introduced an SE framework to leverage the semantic information of annotated and unannotated images to classify histopathology images. It maps the ensemble targets of each class to the same cluster for further enhancement. The above studies concentrate on the classification task, and modeling the relationships between samples requires a large relation matrix or clustering map, which can slow the training process.

Fu et al. [31] developed an SS-Seg technique for medical images using a teacher-student model that seeks to minimize the weighted mixture of supervised loss using annotated inputs and an unsupervised loss that uses both unannotated and annotated images. Their model encourages consistent predictions for the same input under diverse perturbations (e.g., rotation, scale, dropout, and Gaussian noise) using a transformation-consistent paradigm to improve the regularization outcome for pixel-level inferences.

Similarly, Wang et al. [1] designed a segmentation architecture, COPLE-Net, and integrated it into the SE framework, where the input was simply perturbed with Gaussian noise. A major shortcoming of the above techniques is that they disregard the reliability of pseudo-annotations. The generated pseudo-targets of the unannotated samples might be noisy and erroneous, which can have a negative effect on training the segmentation network. These methods also apply simple random transformations to ensure consistency without considering the choice of the best transformations for a specific dataset.

Most DL studies address the problem of noisy labels in the context of medical image classification [18] .

Shi et al. [20] adopted a graph TE-based classification approach and showed empirically that it remains superior under a small ratio of noisy labels. Karim et al. [18] investigated techniques to solve noisy-label problems and showed that the mean absolute error (MAE) and generalized cross-entropy (GCE) were effective for noisy annotation. They also investigated data reweighting methods to eliminate erroneously labeled samples. Other studies have employed iterative training, ensemble techniques, and consistency regularization to learn from noisy labels. Only some of these techniques have been investigated for segmentation.

For medical image segmentation, Tajbakhsh et al. [25] reviewed and categorized the main techniques to solve the noisy-annotation problem using noise-resilient loss functions. Wang et al. [33] introduced a label denoising technique to iteratively train a DL network for male pelvic organ segmentation. Wang et al. [1] introduced a noise-robust Dice loss function that can improve lesion segmentation from weakly annotated CT scans. These techniques must set aside part of the training data for clean annotations, and the experiments employ fake noisy annotation, making real-world noisy annotations a critical challenge in medical image segmentation. Lui et al. [34] addressed the noisy-annotation problem with a three-stage image quality assessment technique that employs a hierarchical residual model to provide slice-, volume-, and subject-level assessments for diffusion magnetic resonance images. Lin et al. [35] introduced a synergistic grouping loss to increase the tolerance of a DL model to noisy annotations, including fuzzy or analogous lesions.

This section describes the proposed RCTE for COVID-19 infection lesion segmentation. We randomly sample data from the training set of labeled and unlabeled CT slices, and task-driven transformations are applied to these slices. Our model uses a student architecture that learns to minimize a certain loss value, and a teacher architecture that is an average of successive student architectures. The training of the student architecture uses the augmented images as an input. Its output is compared with the ground truth (GT) mask using the noise-aware loss, as shown in Fig. 2, and also compared with the output of the teacher using the consistency regularization loss. Once the RCTE starts training, the parameters of the teacher architecture are updated with the exponential moving average (EMA) after gradient descent updates the parameters of the student architecture. Thus the GT characteristics are propagated to the unannotated CT slices by ensuring the consistency of model outputs on the unannotated CT slices.

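To facilitate the model description, we formulate our task of COVID-19 lesion segmentation as an SS-Seg problem. In this context, the model training set contains $N$ CT slices, which comprise $M$ annotated CT images and $N - M$ unannotated CT images. The annotated set is $D_L = \{(x_i, y_i)\}_{i=1}^{M}$, and the unannotated set is $D_U = \{x_i\}_{i=M+1}^{N}$, where $x_i \in \mathbb{R}^{W \times H}$ are the input CT slices and $y_i \in \{0, 1\}^{W \times H}$ are the GT masks.

Hence the overall SS-Seg task can be trained to optimize the parameter $\theta$ by minimizing the loss,

$$\min_{\theta} \; \sum_{i=1}^{M} L_s\big(f(x_i; \theta), y_i\big) + \lambda \sum_{i=1}^{N} L_u\big(f(x_i; \theta)\big),$$

where $L_s$ is the function that calculates the supervised loss, $L_u$ is the function that computes the unsupervised loss, $f(\cdot)$ is the segmentation network with weight $\theta$, and $\lambda$ is the weighting factor that determines the regularization strength.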

In SS-Seg, the smoothness assumption means that neighboring data points in the CT image are expected to be adjacent in the corresponding GT mask [32]. Techniques built on this assumption employ SE and leverage various perturbations, aiming to refine the quality of the target through input augmentation and network perturbation (e.g., dropout and noise).

Thus the unsupervised loss improves the prediction consistency and the model's predictive performance. The unsupervised loss is often formulated as a mean squared loss, i.e.,

$$L_u = \sum_{i=1}^{N} \big\| f(x_i; \theta, \eta) - f(x_i; \theta, \eta') \big\|^2,$$

where $f(x_i; \theta, \eta)$ is the network prediction for slice $x_i$ under transformation $\eta$, and $\eta$ and $\eta'$ represent transformations applied to the input CT image. This study adopts an equivalent concept by applying a variety of augmentations to the input CT images. That is, a consistency loss based on the regularization term is employed to improve segmentation outputs when applying a variety of transformations (e.g., geometric and intensity transformations, dropout layers) to the same data.

Many approaches evaluate the classification/segmentation network twice to obtain two predictions under distinct transformations. In this way, the segmentation network has roles of both teacher and student. In the student role, it learns normally, as previously mentioned; as a teacher, it creates the targets to be utilized by itself in student learning. The fact that the network creates the targets by itself can be problematic and erroneous, particularly when large weights are assigned to newly created targets. To address this issue and to enable the creation of more reliable targets, we employ the architecture design of the mean teacher (MT) [21], in which the teacher architecture uses the EMA of the parameters of the student architecture,

$$\theta'_t = \alpha \, \theta'_{t-1} + (1 - \alpha) \, \theta_t,$$

where $\theta$ and $\theta'$, respectively, represent the parameters of the student and teacher architecture at training step $t$, and $\alpha$ is a smoothing coefficient that determines to what extent the teacher architecture depends on the parameters of the existing student architecture. The larger the value of $\alpha$, the more reliance there is on the preceding teacher architecture. This choice follows experimental evidence [21] indicating that the best model performance was achieved when $\alpha = 0.999$. Hence we set $\alpha = 0.999$ in our experiments. Here, $TD(\cdot)$ represents the task-driven transformation applied to the images.
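The EMA update of the teacher parameters can be illustrated with a minimal sketch. It assumes, purely for illustration, that parameters are stored as plain Python dictionaries of floats; a real implementation would iterate over network tensors. The names `ema_update`, `student`, and `teacher` are hypothetical and not taken from the paper.

```python
def ema_update(teacher, student, alpha=0.999):
    """Return new teacher parameters: alpha * teacher + (1 - alpha) * student.

    `teacher` and `student` are dicts mapping parameter names to floats
    (an illustrative stand-in for network weights).
    """
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}


# Toy parameters: with alpha close to 1 the teacher moves only slightly
# toward the student after one update.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, alpha=0.999)
```

With a smoothing coefficient of 0.999, a single student update changes the teacher by only 0.1% of the gap between them, which is why the teacher behaves like a slowly varying ensemble of past students.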

Early-phase COVID-19 lesions frequently occupy a small CT scan area, which might bias the network prediction to the background, especially if the training utilizes traditional classification models and typical objective functions. The Dice loss (DIL),

$$L_{DIL} = 1 - \frac{2 \sum_{i} p_i g_i}{\sum_{i} p_i^2 + \sum_{i} g_i^2} = \frac{\sum_{i} (p_i - g_i)^2}{\sum_{i} p_i^2 + \sum_{i} g_i^2},$$

where $p_i$ and $g_i$ represent the model prediction and GT, respectively, at pixel $i$ of the input, has been shown to overcome this limitation by indirectly establishing a balance between background and foreground regions.

The Dice loss (DIL) can be considered a variant of the mean squared error (MSE), whose numerator allocates larger scores to pixels with more estimation errors. MSE has been demonstrated to be ineffective for noisy annotations, and mean absolute error (MAE) can be more effective than cross-entropy loss (CEL) and MSE because it treats all data points more equally [18]. However, MAE leads to poor performance of deep CNNs because of this equal treatment [21]. It also deals inefficiently with the foreground-background imbalance in segmentation [36].
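The relation between the Dice loss and MSE can be checked numerically. The sketch below assumes the squared-denominator form of the soft Dice loss, under which the loss equals a normalized squared error; the exact variant used by the authors is not recoverable from the text, so this form, the `eps` stabilizer, and the function names are assumptions.

```python
def dice_loss(pred, gt, eps=1e-8):
    """Soft Dice loss on flattened per-pixel probabilities and binary labels."""
    inter = sum(p * g for p, g in zip(pred, gt))
    denom = sum(p * p for p in pred) + sum(g * g for g in gt)
    return 1.0 - (2.0 * inter) / (denom + eps)


def normalized_squared_error(pred, gt, eps=1e-8):
    """Squared error divided by the same denominator as the Dice loss."""
    num = sum((p - g) ** 2 for p, g in zip(pred, gt))
    denom = sum(p * p for p in pred) + sum(g * g for g in gt)
    return num / (denom + eps)


# Hypothetical prediction for a slice with a small lesion (two foreground pixels).
pred = [0.9, 0.6, 0.1, 0.0, 0.2, 0.0]
gt = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
loss_dice = dice_loss(pred, gt)
loss_nse = normalized_squared_error(pred, gt)
```

The two values agree up to the small `eps` term, because the sum of squared differences expands to the denominator minus twice the intersection. Background pixels that are predicted correctly contribute nothing to either numerator or denominator, which is how this loss sidesteps the foreground-background imbalance.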

Additionally, the stochastic characteristic of DL optimizers causes the MAE to down-weight hard samples with accurate annotations, which significantly increases training times and decreases test accuracy [36] .

Charoenphakdee et al. [37] tried to address this with symmetric learning loss (SLL) that integrates CEL and reverse CEL (RCEL),

$$L_{SLL} = L_{CEL} + L_{RCEL}, \qquad L_{CEL} = -\sum_{i} g_i \log p_i, \qquad L_{RCEL} = -\sum_{i} p_i \log g_i,$$

where $g_i$ and $p_i$ are the GT and model predictions, respectively. Motivated by this, we introduce a noise-aware loss (NAL) that takes advantage of DIL and generalized SLL to form a composite loss function.

NAL is employed to implement the supervised loss $L_s$ and the unsupervised loss $L_u$.

Most SE techniques apply different input transformations for semi-supervised medical classification. However, applying this to obtain reliable lesion segmentation is a challenge, especially for COVID-19 diagnosis, because the task is transformation-invariant in the case of classification and transformation-equivariant in the case of segmentation [39]. In other words, in classification, a CNN only distinguishes the existence or non-existence of an entity; hence the classification decision is unchanged regardless of the image transformations applied.

In contrast, in the segmentation task, transformations applied to the input image must be applied to the corresponding GT mask. However, convolutions typically are not transformation-equivariant, which means that applying these transformations to the input of a CNN does not necessarily transform the convolutional maps in the same way [41]. This restricts the unsupervised impact of input augmentation on segmentation performance [21]. Several studies have tried to improve the regularization for efficient utilization of unannotated images using consistency-based SE models. However, they simply employ standard random augmentation techniques (scaling, rotation, and Gaussian noise), which does not guarantee the best segmentation performance at all times [31].
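To make the notion of equivariance concrete, the following toy sketch contrasts a pixel-wise operation, which commutes with a horizontal flip, with an asymmetric one-dimensional filter, which does not. Both operations and the sample row are hypothetical stand-ins chosen for illustration; they are not components of RCTE.

```python
def hflip(row):
    """Horizontal flip of a 1-D row of pixels."""
    return row[::-1]


def pointwise(row, level=0.5):
    """Pixel-wise thresholding: trivially flip-equivariant."""
    return [1 if v > level else 0 for v in row]


def left_diff(row):
    """Asymmetric filter: each output depends on the left neighbour (zero padded)."""
    return [row[i] - (row[i - 1] if i > 0 else 0.0) for i in range(len(row))]


row = [0.0, 1.0, 0.0, 0.0]

# Equivariant: transforming the input transforms the output in the same way.
pointwise_commutes = pointwise(hflip(row)) == hflip(pointwise(row))

# Not equivariant: filtering the flipped input differs from flipping the filtered output.
filter_commutes = left_diff(hflip(row)) == hflip(left_diff(row))
```

Here `pointwise_commutes` is true and `filter_commutes` is false, which mirrors the point in the text: a consistency loss has to account for the fact that learned convolutional filters are generally not equivariant to the applied transformation.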

To tackle this problem, we propose to augment the input data by automated task-driven augmentation ($TD$) [38], which enables the selection of the best transformations according to the data characteristics [39], [40].

In particular, two conditional generator networks are designed and trained to generate intensity and deformation (i.e., non-affine) transformations for the input image. The task-driven deformation generator receives CT images as an input, plus a randomly sampled vector from a uniform distribution, to generate a dense pixel-wise deformation map. Then the input slice and its GT mask are warped by applying bilinear interpolation according to the map to produce the augmented slice-mask pair. In the same way, the task-driven intensity generator is trained to perform additive intensity augmentation. It takes a CT image as an input, combined with a randomly sampled vector from a uniform distribution, to yield an additive intensity mask that is subsequently added to the input image to generate the transformed image.

Under the RCTE framework, every CT image is passed into the model twice to obtain two outputs. The RCTE framework comprises three operations, as shown in Fig. 2. In the first evaluation, the function $TD(\cdot)$ is applied to the input, and in the second evaluation it is applied to the output. During the two evaluations, random perturbations are applied in the model. The model is trained to be transformation-consistent by diminishing the gap between the two outputs (through the unsupervised loss), which regularizes the model to be more consistent and thus improves the generalization performance. Of note, the regularization loss is estimated on both annotated and unannotated CT images. To use an annotated sample $x_i \in D_L$, the same augmentation is applied to its mask, and the model is trained with the NAL. Finally, the RCTE framework is trained with the final consistency loss,

$$L = L_s\big(TD(f_\theta(x_i)), \tilde{y}_i\big) + \lambda(k) \, L_u\big(TD(f_\theta(x_i)), f_{\theta'}(TD(x_i))\big), \quad (9)$$

where $\tilde{y}_i$ is the GT mask after the same augmentation, and $L_s$ and $L_u$ represent the supervised and unsupervised segmentation loss terms, respectively, which are implemented using the NAL computed with equation (8).

$\lambda(k)$ represents a weighting factor that balances both losses. It is modeled as a Gaussian ramp-up curve and calculated as

$$\lambda(k) = \lambda_{\max} \cdot e^{-5 (1 - k)^2},$$

where $k$ is the training epoch, and $\lambda_{\max}$ scales the maximum value of the weighting function. We adopt $\lambda_{\max} = 1.0$ [31]. When the model starts training, the value of $\lambda(k)$ is small, and the supervised loss dominates the training using annotated images. The model effectively learns to fuse important and precise information from annotated images. The model's reliability progressively increases during training, and it becomes capable of generating reliable output for the unannotated CT images.
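The Gaussian ramp-up of the regularization weight can be sketched as follows. The sketch assumes the form commonly used with temporal ensembling, a maximum weight multiplied by exp(-5(1 - t)^2), and it further assumes that the epoch is normalized by the total number of epochs so that t lies in [0, 1]; the source only states that the argument is the training epoch. The function and argument names are hypothetical.

```python
import math


def rampup_weight(epoch, total_epochs, max_weight=1.0):
    """Gaussian ramp-up: max_weight * exp(-5 * (1 - t)^2), t = epoch / total_epochs.

    The normalization of the epoch to [0, 1] is an assumption of this sketch.
    """
    t = min(max(epoch / total_epochs, 0.0), 1.0)
    return max_weight * math.exp(-5.0 * (1.0 - t) ** 2)


# The weight starts near zero and rises monotonically to max_weight.
schedule = [rampup_weight(e, total_epochs=10) for e in range(0, 11)]
```

At the first epoch the weight is about exp(-5), so the supervised term dominates; the unsupervised term reaches its full weight only at the end of the ramp-up, when the targets are more trustworthy.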

In earlier SE approaches [20] for SSL, the values of the weighting coefficients were manually assigned a static number or were altered progressively based on the training iteration, regardless of factors such as the performance of the networks. Moreover, approaches that neglect samples with high loss are likely to discard complex images. This may cause the model to learn just from simple samples, which would be undesirable. To solve this issue, a conventional strategy is adopted to enable the consideration of complex samples during the training of the network [5].

Therefore, we develop a reliable network by suppressing the influence of the student network $\theta$ on the teacher network $\theta'$ if the student network exhibits poorer performance than the teacher network. This is realized by estimating the value of the weighting coefficient according to the student/teacher network performance, where the coefficient is set to 0.01 [31] when the student network outperforms the teacher network, and 0.1 is a minimum value that suppresses the parameters of the regularization term when the student network does not perform better than the teacher network.

Conventional SE techniques generate pseudo-annotations by running the model on the training set without updating the model parameters [18], [22]. Predictions from a single model evaluation might be unreliable or noisy. The TE-based classification approach mitigates this issue by accumulating the predictions of several earlier model evaluations into the ensemble output. Thus the reliability of the generated labels is not influenced by a specific prediction [22]. We improve the design of RCTE to consider generated ensemble targets from the earlier training history based on a momentum factor.

Algorithm 1 shows the steps of the proposed RCTE approach, where $N$, $W$, and $H$ represent the count of training instances, input width, and input height, respectively. Throughout the training process, CT slices of every mini-batch are initially transformed by $TD(\cdot)$ augmentation. The augmented inputs are passed to the segmentation model $f(\cdot)$, as previously described. The objective/loss function is computed, and the model parameters are updated using the Adam optimizer [41]. Following every training epoch, the generated targets $z$ are aggregated into the ensemble targets $Z$ as

$$Z \leftarrow \beta Z + (1 - \beta) \, z, \quad (13)$$

where $\beta$ is a momentum factor that regulates the amount of history to be considered by the ensemble during training. $Z$ is a matrix of dimension $N \times W \times H$ that comprises the generated ensemble targets of all images of the training data. $z$ is the mini-batch in $Z$ with dimensions $B \times W \times H$, where $B$ is the count of CT slices per mini-batch. Based on equation (13), $Z$ is updated from the initial value of zero, and progressively includes a weighted mean of all preceding generated targets throughout the training, with lower weights for early epochs and higher weights for later epochs. Updating $Z$ from an initial value of zero can be problematic because of the bias toward the start-up value. A bias adjustment scheme [34] is adopted to rectify the bias and produce the pseudo-targets $\tilde{z}$,

$$\tilde{z} = \frac{Z}{1 - \beta^{k}},$$

where $k$ is the training epoch. Each mini-batch of targets can be drawn from $\tilde{z}$ to compute the unsupervised loss.
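The accumulation of ensemble targets with start-up bias correction can be sketched in a few lines. The sketch follows the standard temporal-ensembling recipe (momentum-weighted accumulation, then division by one minus the momentum raised to the epoch index); flat Python lists stand in for the prediction matrix, and the momentum value 0.6 and all names are illustrative assumptions.

```python
def update_ensemble(ensemble, preds, momentum, epoch):
    """Accumulate `preds` into `ensemble` and return (new_ensemble, targets).

    new_ensemble = momentum * ensemble + (1 - momentum) * preds
    targets      = new_ensemble / (1 - momentum ** epoch)   # start-up bias correction
    `epoch` counts from 1.
    """
    new_ensemble = [momentum * e + (1.0 - momentum) * p
                    for e, p in zip(ensemble, preds)]
    correction = 1.0 - momentum ** epoch
    targets = [e / correction for e in new_ensemble]
    return new_ensemble, targets


# Two pixels predicted identically in two consecutive epochs (toy values).
ensemble = [0.0, 0.0]
for epoch, preds in enumerate([[0.8, 0.2], [0.8, 0.2]], start=1):
    ensemble, targets = update_ensemble(ensemble, preds, momentum=0.6, epoch=epoch)
```

Without the correction, the accumulated values after the first epoch would be only 40% of the prediction because the ensemble starts at zero; dividing by the correction factor recovers the prediction itself when the predictions are constant over epochs.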

MosMedData [42] , a public dataset of 1,100 COVID-19 chest CT scans with seriousness labels and COVID-19-associated outcomes from Moscow, Russia, was used in experiments to evaluate the segmentation performance of RCTE. Every CT volume was obtained from distinct subjects, with 30-46 slices per volume. Among them, 50 CT volumes were manually annotated by an experienced radiologist for COVID-19 infection lesions. Samples of the training set were annotated according to the human-in-the-loop scheme, where a preliminary architecture was trained on the manually annotated subset, and later employed to generate elementary annotations for training samples that were subsequently purified via inexperienced researchers as the training ground truth [43] . Hence the inter-and intra-observer variations, vague lesion boundaries, and prospective tendency toward the early model are plentiful reasons to consider these as noisy annotations. The GT masks of the images from the validation and testing sets were constructed using the Algorithm 1: Pseudocode of RCTE framework Input: ∈ ( + ), ∈ 1:

Teacher segmentation network '( • ) ← 3:

Task-driven augmentation generator ( • ) ← 4:

Smoothing factor. ←

Regularization weight ramp-up function ( ) ← 6:

Momentum factor ← 7: for to the number of epochs do: = 1 8:

for every training batch B do: 9:

effectuate a suitable update ( ) 10:

∈ ← '( ( ∈ )) 12:

= + ( ) 13:

Upgrade via Adam optimizer 15:

Upgrade ' = . ' -1 + ( 1 -) .

Terminate for 17:

= + ( 1 -) 18:

← /(1 -) 19: Terminate for 20: Return manually annotated 50 CT scans by an expert radiologist, where the validation set contained 20 CT scans and the test set contained 30 CT scans (cross-validated). To estimate the degree of noise in the training masks, an arbitrary sample of 100 images was manually annotated by professionals to obtain their actual GT masks; the similarity between the actual and noisy masks was measured using the Dice score and found to have an average value of 0.87±0.15.

All implementations were performed using the Python PyTorch library, Windows 10, and an Nvidia Quadro GPU. The details of experimental implementations are as follows. We employed the Dense U-Net model [49] as the backbone of the teacher and student architecture models. The architecture of the adopted Dense U-Net is described in Table 1. In all experiments, the model was trained for 6,000 steps with 0.0001 as the initial learning rate. In the testing phase, we removed the transformation procedures and conducted a single evaluation on the original images to ensure fair comparisons. Once the model's prediction map was obtained, a thresholding operation with a constant value of 0.5 was applied to produce the binary segmentation outcome, followed by morphological processing to acquire the final segmentation outcome.

RCTE had a total training time of 23.7 hours, and an average inference time of 16.24±5.79 s per single scan. It is worth mentioning that the full radiological CT diagnosis and RT-PCR test took around 21.5 minutes and 4 hours, respectively, whether the underlying patient was infected or not.

Given the true-positive (TP), false-negative (FN), false-positive (FP), and true-negative (TN) samples, the evaluation indicators employed in this paper can be defined as follows.

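i. Sensitivity (SE):

$$\mathrm{SE} = \frac{TP}{TP + FN} \qquad (14)$$

ii. Specificity (SP):

$$\mathrm{SP} = \frac{TN}{TN + FP} \qquad (15)$$

iii. Dice similarity coefficient (DSC): To estimate the commonality between the segmentation results, denoted by the set $A$, and the ground truth, denoted by the set $B$, the DSC was calculated as

$$\mathrm{DSC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|} \qquad (16)$$

iv. Region-based Jaccard index (JI):

$$\mathrm{JI}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$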
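The indicators listed above can be computed directly from flattened binary masks. The sketch below is illustrative only: the function names are hypothetical, and returning 0.0 when a denominator is empty is an assumed convention that the text does not specify.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two equally long sequences of 0/1 labels."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(pred, truth):
    """Return SE, SP, DSC, and JI computed from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(pred, truth)

    def ratio(num, den):
        # Assumed convention: an empty denominator yields 0.0.
        return num / den if den else 0.0

    return {
        "SE": ratio(tp, tp + fn),
        "SP": ratio(tn, tn + fp),
        "DSC": ratio(2 * tp, 2 * tp + fp + fn),  # equals 2|A∩B| / (|A| + |B|)
        "JI": ratio(tp, tp + fp + fn),           # equals |A∩B| / |A∪B|
    }


# Hypothetical six-pixel example: TP=2, FP=1, FN=1, TN=2.
scores = segmentation_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
```

Expressing DSC and JI through TP, FP, and FN is equivalent to the set form, because the predicted set has TP + FP elements and the ground-truth set has TP + FN elements.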

We discuss the performance of RCTE when trained using 20% of the training set as an annotated subset and 80% as an unannotated set (Table 2).

We compare the performance of RCTE to recent cutting-edge SS-Seg approaches (Table 3). UA-MT [36] achieved notable improvements over the MT model, which shows that measuring the uncertainty of the output helps segmentation performance. The Inf-Net architecture showed good segmentation performance; however, the multistage training of Semi-Inf-Net [11] prevented it from reaching optimal performance. Compared with MT [21], SE-COPLE-Net showed large improvements owing to the adaptive knowledge transfer from the teacher network to the student network. TCSM_v2 [31] also performed much better than MT [21] because the imposed transformations strengthen the regularization effect. More importantly, RCTE attained robust segmentation performance, with improvements over the competing approaches (JI: 2.95%; DSC: 2.66%; AC: 1.35%; SE: 1.26%; SP: 1.05%). Figure 3 shows a graphical comparison of the segmentation results on different real-world COVID-19 axial slices.

To assess the performance of the new loss function, noisy annotated training images were obtained by having non-specialist researchers delineate the infection lesions [1]. These annotations were unavoidably noisy because of the vague lesion borders, inter- and intra-observer inconsistencies, and the possible tendency of the preliminary pattern. An additional comparative analysis contrasted the introduced NAL with three well-known noise-aware loss functions, i.e., noise-robust Dice (NR-Dice), GCE loss, and MAE loss [44]. NAL was also compared with the standard Dice and CE loss functions. The quantitative findings of these comparative experiments are presented in Table 4. CE and Dice loss had the lowest segmentation performance. The noise-robust losses (i.e., MAE, GCE, NR-Dice) were more effective than the standard losses. NAL in turn surpassed the noise-robust losses (JI: 1.1%; DSC: 1.3%), thus validating the effectiveness of NAL at reducing the effect of noisy labeled areas.

A paired-sample t-test analysis was conducted to investigate the statistical significance of the results of RCTE against the competing SS-Seg approaches. Two paired-sample t-test experiments were performed on the test set using the measures of JI, DSC, accuracy, sensitivity, and specificity. These experiments were implemented using the SciPy scientific computing library [46]. The significance threshold was set to 0.05, where a p-value less than 0.05 indicates statistical significance of the results. Table 5 presents the p-values computed for the test set. Most p-values were less than 5.00E-03, implying that the results from RCTE were distinct from those of the competing SS-Seg approaches. This validates the effectiveness of RCTE.
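To make the paired-sample t-test concrete, the sketch below computes the test statistic and degrees of freedom from per-scan scores of two methods, using only the standard library. The function name and the score values are hypothetical. In practice, SciPy's `scipy.stats.ttest_rel(a, b)` computes the same statistic and also returns the p-value that is compared against the significance threshold.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired-sample t-test.

    scores_a and scores_b hold one score per test case for two methods,
    in the same case order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    n = len(scores_a)
    if n < 2:
        raise ValueError("at least two pairs are required")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = sum(diffs) / n
    # Unbiased sample variance of the per-case differences.
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    std_err = math.sqrt(var_diff / n)
    if std_err == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff / std_err, n - 1


# Hypothetical per-scan differences of 4, 2, 3, and 2 score points.
t_stat, dof = paired_t_statistic([4.0, 2.0, 3.0, 2.0], [0.0, 0.0, 0.0, 0.0])
```

A large absolute t value for the given degrees of freedom corresponds to a small p-value, which is the quantity reported per metric and per competing method.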

For a deeper analysis of the performance of RCTE, ablation experiments were performed to enable understanding of the behavior of the model under different settings. In these experiments, the Dense U-Net optimized with Dice loss was selected as the baseline for RCTE. 

The main purpose of this experiment was to validate the selection of the mean-teacher architecture for RCTE. According to the results presented in Table 7, deploying the baseline model in the MT architecture improved the segmentation performance by 14%, 1.53%, and 0.43% in JI, DSC, and AC, respectively. This further justifies leveraging both the annotated and unannotated data to maintain the consistency of segmentation.

An experiment was performed to explore the influence of implementing the MT architecture with a reliable teacher network. The results are reported in the third row of Table 7. This resulted in notable performance improvements (JI: 0.59%; DSC: 2.46%; AC: 0.51%) over the baseline model.

Compared with the standard MT design, the segmentation performance also improved (JI: 0.45%; DSC: 0.93%; AC: 0.74%), which demonstrates the effectiveness of the reliable teacher in restraining the student's contribution to the EMA as soon as the student exhibits a higher training loss, thus limiting the possible negative impact of noisy annotations.
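The reliable-teacher idea can be illustrated with a loss-gated EMA update. This is a minimal sketch under stated assumptions, not the authors' implementation: weights are represented as flat lists of floats, the function names are hypothetical, and "restraining" the student is modeled as skipping the update entirely when the student's loss exceeds the teacher's, which is only one possible reading of the text.

```python
def ema_update(teacher, student, alpha):
    """Standard mean-teacher update: teacher = alpha * teacher + (1 - alpha) * student."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def reliable_ema_update(teacher, student, alpha, student_loss, teacher_loss):
    """Loss-gated EMA update (assumed gating rule).

    If the student currently has the higher loss, its weights are treated as
    unreliable and the teacher is left unchanged; otherwise the usual EMA
    update is applied.
    """
    if student_loss > teacher_loss:
        return list(teacher)
    return ema_update(teacher, student, alpha)


# Hypothetical two-parameter model.
teacher_w = [1.0, 2.0]
student_w = [0.0, 4.0]

accepted = reliable_ema_update(teacher_w, student_w, 0.5,
                               student_loss=0.3, teacher_loss=0.4)
rejected = reliable_ema_update(teacher_w, student_w, 0.5,
                               student_loss=0.9, teacher_loss=0.4)
```

A softer variant would shrink the student's contribution instead of dropping it; the hard gate is used here only because it is the simplest rule consistent with the description.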

An experiment was carried out to investigate the impact of redesigning the MT model using the reliable student architecture. The results are presented in the fourth row of Table 7. In this design, the student learns from the teacher when the teacher's loss is lower than the student's loss, which suppresses the impact of unreliable and noisy annotations. More importantly, implementing the MT with both a reliable teacher and a reliable student network results in improvements of 0.84%, 2.5%, and 1.5% in JI, DSC, and AC, respectively.

Motivated by the recent TCSM_v2, which randomly applies transformations to the input images, we investigated the impact of such transformations in the proposed RCTE. From Table 7, it can be noted that applying the random transformations improved the segmentation performance of MT by 2.46%, 2.98%, and 2.4% in JI, DSC, and AC, respectively. This observation indicates that augmenting the input image strengthens the regularization of pixel-level segmentation.

The proposed RCTE employs task-driven augmentations to automatically generate intensity or deformation transformations. We performed three experiments to evaluate the impact of each kind of transformation separately and together. In Table 7, it can be noted that intensity augmentation outperformed random augmentation by 0.88%, 0.94%, and 1.09% in JI, DSC, and AC, respectively.

Deformation augmentation led to larger improvements (JI: 1.48%; DSC: 2.49%; AC: 1.73%) over the random augmentations. More importantly, applying intensity and deformation augmentations together yielded better performance than applying either separately.

This experiment analyzed the impact of NAL on segmentation performance, as shown in the last row of Table 7. It can be seen that training RCTE with NAL resulted in substantial performance improvements (JI: 2.0%; DSC: 1.07%; AC: 0.4%), which validates the effectiveness of NAL in dealing with noisy annotated COVID-19 images.
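Because NAL is described as building on generalized cross-entropy (GCE), it may help to see how the standard GCE loss sits between CE and MAE. The sketch below shows only the well-known GCE form for the probability `p` assigned to the labeled class; it does not reproduce NAL, whose exact form is not given in this section, and the function names are illustrative.

```python
import math


def cross_entropy(p):
    """CE for the probability p assigned to the labeled class."""
    return -math.log(p)


def mae_like(p):
    """1 - p, which is proportional to the MAE between a one-hot label and the prediction."""
    return 1.0 - p


def gce(p, q):
    """Generalized cross-entropy: (1 - p**q) / q for 0 < q <= 1.

    As q approaches 0 this tends to cross-entropy; at q = 1 it equals 1 - p.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    return (1.0 - p ** q) / q


# For a poorly fit pixel (p = 0.5), GCE with q = 0.7 penalizes less than CE
# but more than the MAE-like loss.
p = 0.5
losses = {"CE": cross_entropy(p), "GCE": gce(p, 0.7), "MAE": mae_like(p)}
```

Smaller `q` makes the loss behave more like CE (faster learning, more sensitivity to label noise), while `q` near 1 makes it behave more like MAE (more robust, slower to fit).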

The advantages of RCTE can be summarized as follows: 1) it enables efficient segmentation of complex COVID-19 lesions from a limited amount of annotated data; 2) it enables the leveraging of unannotated training data to improve the segmentation performance of the model; 3) it prevents unreliable targets from negatively affecting training performance; and 4) it offers noise-aware loss that enables effective learning of lesion features from noisy annotated CT scans.

This study has some limitations. RCTE considers only two kinds of task-driven transformations to be applied to input images, which might lead to suboptimal transformations, and possibly to suboptimal performance. RCTE was not tested for multi-class segmentation owing to the availability of binary masks only. RCTE does not consider relationships between input instances that could help extract valuable semantic representations from unannotated images, as noted in recent studies [29]. RCTE is designed based on the hypothesis that the data come from a single domain with a shared data distribution. Therefore, incorporating out-of-distribution data might negatively affect performance.

We introduced a reliable and consistent temporal-ensembling (RCTE) framework for efficient semi-supervised segmentation of COVID-19 lesions from 2D lung CT scans. A reliable teacher-student architecture was employed to improve the quality and reliability of ensemble predictions and to mitigate the effect of defective and unreliable pseudo-annotations on the segmentation loss. Noise-aware loss (NAL) was introduced to deal with noisy annotated COVID-19 CT scans. RCTE was trained to minimize a weighted combination of supervised and regularization losses, which was implemented using NAL. Empirical evaluations showed that NAL outperforms existing noise-aware loss functions and that RCTE achieves superior performance over cutting-edge self-ensembling-based medical segmentation techniques.

Our future work will investigate the effectiveness of RCTE as a general segmentation approach for similar problems in the medical domain. Another future direction is to improve RCTE to address issues stemming from cross-modality data; domain adaptation techniques can offer promising solutions to these issues [47].

Motivated by the success of RCTE, the development of semi-supervised multiple task/instance learning [29] is essential to provide the ultimate diagnosis framework for COVID-19 and similar pandemics. We also intend to investigate the diagnosis of COVID-19 from lung ultrasound frames/videos in our future work. 

A Noise-Robust Framework for Automatic Segmentation of COVID-19 Pneumonia Lesions from CT Images

Modeling and forecasting of epidemic spreading: The case of Covid-19 and beyond

Forecasting of COVID-19 time series for countries in the world based on a hybrid approach combining the fractal dimension and fuzzy logic

Modeling COVID-19 epidemic in Heilongjiang province

A new approach for classifying coronavirus COVID-19 based on its manifestation on chest X-rays using texture features and neural networks

Accurate and Machine-Agnostic Segmentation and Quantification Method for CT-Based COVID-19 Diagnosis

Collaborative Unsupervised Domain Adaptation for Medical Image Diagnosis

An interpretable classifier for high-resolution breast cancer screening images utilizing weakly supervised localization

Inf-Net: Automatic COVID-19 Lung Infection Segmentation from CT Images

Hyperspectral tissue image segmentation using semi-supervised NMF and hierarchical clustering

Self-Supervised Feature Learning via Exploiting Multi-Modal Data for Retinal Disease Diagnosis

The diagnosis of COVID-19 with deep active learning

Skin lesion segmentation via generative adversarial networks with dual discriminators

Autoencoders for unsupervised anomaly segmentation in brain MR images: A comparative study

Adaptive Semi-Supervised Classifier Ensemble for High Dimensional Data Classification

Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis

Temporal ensembling for semi-supervised learning

Graph temporal ensembling based semi-supervised convolutional neural network with noisy labels for histopathology image analysis

Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results

Semi-Supervised Medical Image Classification With Relation-Driven Self-Ensembling Model

Multiobjective Semisupervised Classifier Ensemble

Semi-Supervised Image Classification with Self-Paced Cross-Task Networks

Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation

Generalized cross entropy loss for training deep neural networks with noisy labels

Dual-branch combination network (DCN): Towards accurate diagnosis and lesion segmentation of COVID-19 using CT images

CovTANet: A Hybrid Tri-level Attention Based Network for Lesion Segmentation, Diagnosis, and Severity Prediction

Not-so-supervised: A survey of semi-supervised

Transformation-Consistent Self-Ensembling Model for Semisupervised Medical Image Segmentation

Exploring uncertainty in pseudo-label guided unsupervised domain adaptation

Iterative Label Denoising Network: Segmenting Male Pelvic Organs in CT from 3D Bounding Box Annotations

Hierarchical Nonlocal Residual Networks for Image Quality Assessment of Pediatric Diffusion MRI With Limited and Noisy Annotations

Dual-path network with synergistic grouping loss and evidence driven risk stratification for whole slide cervical image analysis

Uncertainty-Aware Self-ensembling Model for Semi-supervised 3D Left Atrium Segmentation

On symmetric losses for learning from corrupted labels

Semi-supervised task-driven data augmentation for medical image segmentation

Learning augmentation strategies from data

Adam: A method for stochastic optimization

MosMedData: data set of 1110 chest CT scans performed during the COVID-19 epidemic

Abnormal lung quantification in chest CT images of COVID-19 patients with deep learning and its application to severity prediction

Robust loss functions under label noise for deep neural networks, in: 31st AAAI Conf. Artif. Intell. AAAI 2017

V-Net: Fully convolutional neural networks for volumetric medical image segmentation

Author Correction: SciPy 1.0: fundamental algorithms for scientific computing in Python

Dual-Teacher++: Exploiting Intra-domain and Inter-domain Knowledge with Reliable Transfer for Cardiac Segmentation

Author Contributions Section

Weiping Ding and Mohamed Abdel-Basset contributed the central idea, analyzed most of the data, and wrote the initial draft of the paper.

Hossam Hawash contributed to refining the ideas, carrying out additional analyses, and finalizing this paper. All authors discussed the results and contributed to the revisions.

We would like to submit the original manuscript entitled "RCTE: A Reliable and Consistent Temporal-Ensembling Framework for Semi-supervised Segmentation of COVID-19 Lesions". We have read and have abided by the statement of ethical standards for manuscripts submitted to the Journal of Information Science.