title: DuDoTrans: Dual-Domain Transformer Provides More Attention for Sinogram Restoration in Sparse-View CT Reconstruction
authors: Wang, Ce; Shang, Kun; Zhang, Haimiao; Li, Qian; Hui, Yuan; Zhou, S. Kevin
date: 2021-11-21

While Computed Tomography (CT) reconstruction from X-ray sinograms is necessary for clinical diagnosis, the ionizing radiation involved in the imaging process induces irreversible injury, driving researchers to study sparse-view CT reconstruction, that is, recovering a high-quality CT image from a sparse set of sinogram views. Iterative models have been proposed to alleviate the artifacts that appear in sparse-view CT images, but their computation cost is too high. Deep-learning-based methods have since gained prevalence due to their excellent performance and lower computation cost. However, these methods ignore the mismatch between the CNN's local feature extraction capability and the sinogram's global characteristics. To overcome this problem, we propose the Dual-Domain Transformer (DuDoTrans), which simultaneously restores informative sinograms via the long-range dependency modeling capability of the Transformer and reconstructs the CT image with both the enhanced and raw sinograms. With such a novel design, reconstruction performance on the NIH-AAPM dataset and the COVID-19 dataset experimentally confirms the effectiveness and generalizability of DuDoTrans with fewer involved parameters. Extensive experiments also demonstrate its robustness under different noise-level scenarios for sparse-view CT reconstruction. The code and models are publicly available at https://github.com/DuDoTrans/CODE

Computed Tomography (CT) is a widely used clinical diagnostic imaging procedure aiming to reconstruct a clean CT image X from observed sinograms Y, but its accompanying radiation heavily limits its practical usage. To decrease the induced radiation dose and reduce the scanning time, Sparse-View (SV) CT is commonly applied. However, the deficiency of projection views brings severe artifacts into the reconstructed images, especially when common reconstruction methods such as analytical Filtered Backprojection (FBP) and the algebraic reconstruction technique (ART) [25] are used, which poses a significant challenge to image reconstruction.

Figure 1. Parameters versus performance of DuDoTrans and other deep-learning-based CT reconstruction methods. L1 and L2 are two light versions, while N denotes the normal version. Transformer-based reconstruction methods consistently achieve better performance with fewer parameters, and our DuDoTrans derives better results at a lower computation cost.

To tackle the artifacts, some iterative methods have been proposed that impose well-designed prior knowledge (ideal image properties) via additional regularization terms R(X), such as Total Variation (TV) based methods [23, 27], nonlocal-based methods [38], and sparsity-based methods [2, 16]. Although these models achieve better qualitative and quantitative performance, they suffer from over-smoothness. Besides, the iterative optimization procedure is often computationally expensive and requires careful case-by-case hyperparameter tuning, which makes it less applicable in practice.
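As a concrete illustration of why sparse views are problematic, the following minimal sketch simulates a sparse-view acquisition and an FBP reconstruction with scikit-image. It uses a parallel-beam geometry and the Shepp-Logan phantom purely for illustration; the experiments in this paper use a fan-beam geometry with 800 detector elements and clinical slices, so this is a toy setup rather than the paper's pipeline.

```python
# Toy sparse-view FBP illustration (parallel-beam, scikit-image).
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon, resize

phantom = resize(shepp_logan_phantom(), (256, 256))

# Dense acquisition: 360 uniformly spaced views.
theta_dense = np.linspace(0.0, 180.0, 360, endpoint=False)
sino_dense = radon(phantom, theta=theta_dense)

# Sparse acquisition: only 24 views, as in the hardest setting below.
theta_sparse = np.linspace(0.0, 180.0, 24, endpoint=False)
sino_sparse = radon(phantom, theta=theta_sparse)

# Filtered Backprojection from dense vs. sparse sinograms.
recon_dense = iradon(sino_dense, theta=theta_dense, filter_name="ramp")
recon_sparse = iradon(sino_sparse, theta=theta_sparse, filter_name="ramp")

# The sparse-view reconstruction shows pronounced streak artifacts.
err_dense = np.sqrt(np.mean((recon_dense - phantom) ** 2))
err_sparse = np.sqrt(np.mean((recon_sparse - phantom) ** 2))
print(f"RMSE dense: {err_dense:.4f}, RMSE sparse: {err_sparse:.4f}")
```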
With the success of CNNs in various vision tasks [13, 14, 34, 45, 46], CNN-based models have been carefully designed and show potential for fast and efficient CT image reconstruction [1, 7, 8, 11, 15, 35]. These methods learn a mapping from low-quality images, such as FBP reconstructions, to ground-truth images. Recently, the Vision Transformer [5, 6, 9, 18] has gained attention with its long-range dependency modeling capability, and numerous models have been proposed in medical image analysis [3, 6, 37, 39, 44]. For example, TransCT [40] is proposed as an efficient method for low-dose CT reconstruction, but it suffers from memory limitations due to the involved patch-based operations. Besides, these deep-learning-based methods ignore the informative sinograms, which makes their reconstructions inconsistent with the observed sinograms. To alleviate this problem, a series of dual-domain (DuDo) reconstruction models [21, 30, 42, 43] have been proposed to simultaneously enhance raw sinograms and reconstruct CT images with both enhanced and raw sinograms, experimentally showing that enhanced sinograms contribute to the subsequent reconstruction. Although these DuDo methods show satisfactory performance, they neglect the global nature of the sinogram's sampling process, which is inherently hard to capture with CNNs, since CNNs are known for extracting local spatial features. This motivates us to go a step further and design a more suitable architecture for sinogram restoration. Inspired by the long-range dependency modeling capability and the shifted-window self-attention mechanism of the Swin Transformer [22], we specifically design the Sinogram Restoration Transformer (SRT) by considering the time-dependent characteristics of sinograms; it restores informative sinograms and overcomes the mismatch between the global characteristics of sinograms and the local feature modeling of CNNs. Based on the SRT module, we finally propose the Dual-Domain Transformer (DuDoTrans) to reconstruct CT images. Compared with previous image reconstruction methods, we summarize the benefits of DuDoTrans as follows:

• Considering the global sampling process of sinograms, we introduce the SRT module, which combines the advantages of the Swin Transformer and CNNs. It has the desired long-range dependency modeling ability, which helps better restore the sinograms, and its benefit is experimentally verified in CNN-based, Transformer-based, and deep-unrolling-based reconstruction frameworks.

• With the powerful SRT module for sinogram restoration, we further propose the Residual Image Reconstruction Module (RIRM) for image-domain reconstruction. To compensate for the drift error between the dual-domain optimization directions, we utilize the proposed differentiable DuDo Consistency Layer to keep the restored sinograms consistent with the reconstructed CT images, which yields the final DuDoTrans. Hence, DuDoTrans not only has the desired long-range dependency and local modeling abilities, but also enjoys the benefit of dual-domain reconstruction.

• Reconstruction performance on the NIH-AAPM dataset and the COVID-19 dataset experimentally confirms the effectiveness, robustness, and generalizability of the proposed method. Besides, by adaptively employing the Swin Transformer and CNNs, our DuDoTrans achieves better performance with fewer parameters, as shown in Figure 1, and with similar FLOPs (shown in later experiments), which makes the model practical in various applications.
Human body tissues, such as bones and organs, have different X-ray attenuation coefficients µ. For a 2D CT image, the distribution of the attenuation coefficients X = µ(a, b), where (a, b) indicates position, represents the underlying anatomical structure. The principle of CT imaging is based on the fundamental Fourier Slice Theorem, which guarantees that the 2D image function X can be reconstructed from the obtained dense projections (called sinograms). When imaging, projections of the anatomical structure X are inferred from the emitted and received X-ray intensities according to the Lambert-Beer Law. Further, under a polychromatic X-ray source with an energy distribution η(E), the CT imaging process is given as

Y = − ln ∫ η(E) exp(−PX(E)) dE,

where P represents the sinogram generation process, i.e., the Radon transformation (commonly defined with a fan-beam imaging geometry). With the above forward process, CT imaging aims to reconstruct X from the obtained projections Y = PX (abbreviated for simplicity) with an estimated/learned inverse operator P†. In practical SVCT, the projection data Y is incomplete: only α_max projection views in total are sampled uniformly in a circle around the patient. This reduced sinogram information heavily limits the performance of previous methods and results in artifacts. To alleviate this phenomenon, many works have recently been proposed, which can be categorized into two groups: iterative reconstruction methods [2, 16, 23, 27, 38] and deep-learning-based reconstruction methods [13, 14, 34, 45, 46]. Different from these prevalent works, our DuDoTrans is based on deep learning, but is the only dual-domain method based on the Transformer.

Based on the powerful attention mechanism [4, 10, 26, 29, 36, 41] and patch-based operations, the Transformer has been applied to many vision tasks [9, 12, 22, 28]. In particular, the Swin Transformer [22] combines this advantage with the local feature extraction ability of CNNs. In this manner, Swin-Transformer-based models [19] have relieved the memory limitation of previous Vision-Transformer-based models. Building on these successes, and to better model the global features of medical images, the Transformer has been applied to medical image segmentation [3, 6, 20, 44], registration [39], and classification [31, 37], achieving surprising improvements. Nevertheless, few works explore Transformer structures in SVCT reconstruction. Although TransCT [40] attempts to suppress the noise artifacts in low-dose CT with a Transformer, it neglects the global sinogram characteristics in its design, which are taken into account in DuDoTrans.

As shown in Fig. 2, we build DuDoTrans with three modules: (a) the Sinogram Restoration Transformer (SRT), (b) the DuDo Consistency Layer, and (c) the Residual Image Reconstruction Module (RIRM). Assume that a sparse-view sinogram Y ∈ R^{Hs×Ws} is given. We first use FBP [25] to reconstruct a low-quality CT image X̂1. Simultaneously, the SRT module is introduced to output an enhanced sinogram Ỹ, followed by the DuDo Consistency Layer to yield another estimate X̂2. At last, these low-quality images X̂1 and X̂2 are concatenated and fed into RIRM to predict the CT image X̂, which is supervised with the corresponding clean CT image X_gt ∈ R^{H_I×W_I}. We next introduce these modules in detail.
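To make the data flow concrete, the following PyTorch-style sketch outlines the forward pass just described. The module bodies are placeholders for the architectures detailed in the next subsections, and the class and argument names (`SRT`, `DuDoConsistencyLayer`, `RIRM`, `fbp`) are our own shorthand for this sketch, not the released code.

```python
# Schematic forward pass of DuDoTrans (sketch only; module bodies are
# placeholders standing in for the architectures described below).
import torch
import torch.nn as nn

class DuDoTransSketch(nn.Module):
    def __init__(self, srt: nn.Module, consistency: nn.Module, rirm: nn.Module, fbp):
        super().__init__()
        self.srt = srt                  # Sinogram Restoration Transformer
        self.consistency = consistency  # differentiable DuDo Consistency Layer
        self.rirm = rirm                # Residual Image Reconstruction Module
        self.fbp = fbp                  # analytic FBP operator for the raw sinogram

    def forward(self, y):               # y: sparse-view sinogram, shape (B, 1, Hs, Ws)
        x1 = self.fbp(y)                # low-quality image from the raw sinogram
        y_enh = self.srt(y)             # enhanced sinogram (supervised by L_SRT)
        x2 = self.consistency(y_enh)    # second estimate, back-projected from y_enh
        x = self.rirm(torch.cat([x1, x2], dim=1))  # final CT image (supervised by L_RIRM)
        return x, y_enh
```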
Sinogram restoration is extremely challenging since the intrinsic information not only contains the spatial structures of human bodies, but also follows the global sampling process. Specifically, each line {Y_i}_{i=1}^{Hs} of a sinogram Y is sequentially sampled and carries information that overlaps with that of neighboring views. In other words, the 1-D components of sinograms are heavily correlated with each other. This global characteristic is difficult to capture with traditional CNNs, which excel at local feature extraction. For this reason, we equip this module with the Swin-Transformer structure, which provides long-range dependency modeling ability. As shown in Fig. 2, SRT consists of m successive residual blocks, and each block contains n normal Swin-Transformer Modules (STM) and a spatial convolutional layer, giving it the capacity for both global and local feature extraction. Given the degraded sinograms, we first use a convolutional layer to extract the spatial structure F_conv. Regarding it as F_0^STM, the n STM components of each residual block then produce

F_n^STM = (M_n^STM ∘ ... ∘ M_1^STM)(F_0^STM),

where {M_i^STM}_{i=1}^{n} denotes the n Swin-Transformer layers, and ∘ represents their successive application. Finally, the enhanced sinogram Ỹ is estimated from the output features of the last residual block via a final convolutional layer. As a restoration block, the SRT output is supervised by a restoration loss L_SRT, which penalizes the discrepancy between Ỹ and the ground-truth sinogram Y_gt, available during training.

Although the input sinograms have been enhanced via the SRT module, directly learning from the concatenation of X̂1 and X̂2 leaves a drift between the optimization directions of SRT and RIRM. To compensate for the drift, we make use of a differentiable DuDo Consistency Layer M_DC to back-propagate the gradients of RIRM. In this way, the optimization direction imposes the preferred sinogram characteristics on Ỹ, and vice versa. To be specific, given the restored fan-beam sinogram Ỹ, the DuDo Consistency Layer first converts it into parallel-beam geometry and then applies Filtered Backprojection to yield X̂2. To additionally keep the restored sinograms consistent with the ground-truth CT image X_gt, a consistency loss L_DC is proposed that penalizes the discrepancy between X̂2 and X_gt.

Figure 2. The framework of the proposed DuDoTrans for SV CT image reconstruction. Given under-sampled sinograms, DuDoTrans first restores clean sinograms with SRT, followed by RIRM to reconstruct the CT image with both restored and raw sinograms.

As a long-standing clinical problem, the final goal of CT image reconstruction is to recover a high-quality CT image for diagnosis. With the initially estimated low-quality images, which help rectify the geometric deviation between the sinogram and image domains, we next employ a Shallow Feature Extraction Layer M_sl to obtain shallow features F_sl of the input low-quality images. Then a series of Deep Feature Extraction Layers is introduced to extract deep features F_i^df = M_i^df(F_{i-1}^df), where F_0^df = F_sl. Finally, we utilize a Recon Layer M_re to predict the clean CT image X̂ from the deep features with residual learning. To supervise the network optimization, a reconstruction loss L_RIRM between X̂ and X_gt is used for this module. The full objective of our model combines L_RIRM, L_SRT, and L_DC. Further, by dynamically tuning the depth m and width n, the SRT module is flexible in practice depending on the desired balance between memory and performance. We will explore this balancing issue in later experiments.
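A minimal sketch of the SRT layout described above follows. To stay short and self-contained, a plain nn.TransformerEncoderLayer over flattened feature tokens stands in for the shifted-window Swin layers, and the channel width, block counts, and residual connections are illustrative assumptions rather than the paper's exact configuration; in practice, the window-based attention of the Swin Transformer is what keeps full-resolution sinograms tractable.

```python
import torch
import torch.nn as nn

class SRTBlock(nn.Module):
    """One residual block: n transformer layers followed by a spatial convolution.
    A plain TransformerEncoderLayer over flattened tokens stands in for the
    shifted-window Swin layers used in the paper."""
    def __init__(self, channels=64, n_layers=1, n_heads=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=channels, nhead=n_heads,
                                       dim_feedforward=2 * channels,
                                       batch_first=True)
            for _ in range(n_layers)])
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f):                      # f: (B, C, H, W)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W, C)
        for layer in self.layers:
            tokens = layer(tokens)
        out = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.conv(out) + f              # local residual connection (assumed)

class SRTSketch(nn.Module):
    """m residual blocks between a head and a tail convolution (sketch)."""
    def __init__(self, channels=64, m_blocks=3, n_layers=1):
        super().__init__()
        self.head = nn.Conv2d(1, channels, 3, padding=1)   # extracts F_conv
        self.blocks = nn.ModuleList(
            [SRTBlock(channels, n_layers) for _ in range(m_blocks)])
        self.tail = nn.Conv2d(channels, 1, 3, padding=1)   # maps back to a sinogram

    def forward(self, y):                       # y: (B, 1, Hs, Ws) sinogram
        f = self.head(y)
        for block in self.blocks:
            f = block(f)
        return self.tail(f) + y                 # global residual over the input (assumed)
```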
Datasets. We first train and test our model with the "2016 NIH-AAPM-Mayo Clinic Low Dose CT Grand Challenge" [24] dataset. Specifically, we choose a total of 1746 slices (resolution 512×512) from five patients to train our models, and use 314 slices from another patient for testing. We employ a fan-beam X-ray scanning geometry with 800 detector elements. There are four SV scenarios in our experiments, corresponding to α_max = [24, 72, 96, 144] views. Note that these views are uniformly distributed around the patient. The original dose data are collected from the chest to the abdomen under a protocol of 120 kVp and 235 effective mAs (500 mA / 0.47 s). To simulate the photon noise in numerical experiments, we add mixed noise to the sinograms, by default composed of 5% Gaussian noise and Poisson noise with an intensity of 5e6.

Implementation details and training settings. Our models are implemented using the PyTorch framework. We use the Adam optimizer [17] with (β1, β2) = (0.9, 0.999) to train these models. The learning rate starts from 0.0001. Models are all trained on an Nvidia 3090 GPU for 100 epochs with a batch size of 1.

Evaluation metrics. Reconstructed CT images are quantitatively measured by the multi-scale Structural Similarity Index Metric (SSIM) (with level = 5, Gaussian kernel size = 11, and standard deviation = 1.5) [32, 33] and the Peak Signal-to-Noise Ratio (PSNR).
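One common way to realize the mixed noise described above is sketched below. The exact noise injection used in the experiments is not spelled out here, so the post-log Poisson model, the clipping of zero counts, and the interpretation of "5% Gaussian noise" as a fraction of the sinogram's standard deviation are assumptions made for illustration.

```python
import numpy as np

def add_mixed_noise(sinogram, photon_intensity=5e6, gaussian_ratio=0.05, rng=None):
    """Add Poisson photon noise and Gaussian noise to a clean (post-log) sinogram.
    A standard pre-log / post-log round trip under the Beer-Lambert law is assumed."""
    rng = np.random.default_rng() if rng is None else rng
    # Pre-log transmission counts.
    counts = photon_intensity * np.exp(-sinogram)
    # Poisson photon statistics.
    noisy_counts = rng.poisson(counts).astype(np.float64)
    noisy_counts = np.clip(noisy_counts, 1.0, None)        # avoid log(0)
    noisy_sino = -np.log(noisy_counts / photon_intensity)  # back to post-log domain
    # Additive Gaussian noise, scaled to a fraction of the signal magnitude.
    sigma = gaussian_ratio * np.std(sinogram)
    noisy_sino += rng.normal(0.0, sigma, size=sinogram.shape)
    return noisy_sino

# Example: the robustness study varies the Poisson intensity.
# for intensity in [5e6, 1e6, 5e5, 1e5]:
#     noisy = add_mixed_noise(clean_sino, photon_intensity=intensity)
```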
We next verify the effectiveness of our proposed SRT module and search for the best structure for DuDoTrans. Firstly, we conduct experiments with five models: (a) FBPConvNet [15], (b) DuDoNet [21], (c) FBPConvNet+SRT, which combines (a) with our proposed SRT, (d) ImgTrans, which replaces the image-domain model in (a) with a Swin Transformer [22], and (e) our DuDoTrans. The experimental settings are the defaults with α_max = 96, and the results are shown in Table 1.

The effectiveness of SRT. Comparing models (a) and (c) in Table 1, the performance is improved by 0.66 dB, which confirms that the SRT output Ỹ indeed provides useful information for the image-domain reconstruction.

The exploration of RIRM. Inspired by the success of the Swin Transformer in low-level vision tasks, we simply replace the post-processing module of FBPConvNet with a Swin Transformer, named ImgTrans. Compared with the baseline model (a), the achieved 1 dB improvement confirms that the Transformer is skilled at characterizing deep features of images, and a thorough exploration is worthwhile.

The effectiveness of DuDoTrans. Comparing (d) and (e), the 0.18 dB boost proves the effectiveness of SRT again. Further, comparing (b) and (e), two dual-domain architectures built with CNNs and Transformers respectively, the improvement demonstrates that the Transformer is well suited to CT reconstruction.

We then investigate the impact of each sub-module on the performance of DuDoTrans.

RIRM depth and width. Similar to the SRT structure, the RIRM depth denotes the number of sub-modules of RIRM, and the RIRM width denotes the number of successive Swin-Transformer layers in each sub-module. The results in Fig. 3 (a) and (b) show the corresponding effects of the RIRM width and depth on the reconstruction performance. When increasing the RIRM depth (with the RIRM width fixed at 2), the performance improves quickly while the RIRM depth is smaller than 4; then the PSNR improvement slows down, while the computational cost keeps increasing. We then fix the RIRM depth to 3 (blue) and 4 (yellow) and increase the RIRM width, and find that the performance improves quickly up to RIRM width = 3. After balancing the computation cost and the performance improvement, we set the RIRM width and depth to 4 and 2, respectively, which yields a small model with FLOPs similar to FBPConvNet.

The SRT size. With a similar procedure, we explore the most suitable architecture for the SRT module. As Fig. 3 (c) shows, with fixed RIRM depth and width, the performance is barely influenced when we enlarge the SRT size (depth m and width n as introduced in Section 3.1). Specifically, we test five paired models whose RIRM depth and width are set to {(2, 1), (3, 1), (3, 2), (4, 2), (5, 2)}, respectively. We then increase (m, n) from (3, 1) to (4, 2), but the PSNR is sometimes even reduced. Therefore, we set the SRT depth and width to (3, 1) as the default in later experiments.

Further, we analyze the convergence, robustness, and data efficiency of DuDoTrans.

Table 3. Robustness of DuDoTrans with varied Poisson noise levels: the intensity is varied to 1e6 (H1), 5e5 (H2), and 1e5 (H3), respectively. DuDoTrans keeps the best performance except when the Poisson noise is increased to an intensity of 1e5, where it becomes too hard to restore clean sinograms.

Convergence. In Fig. 3 (d), we plot the convergence curves of FBPConvNet, ImgTrans, and DuDoTrans. Evidently, the introduction of the Transformer structure not only improves the final results, but also stabilizes the training process. Besides, our dual-domain design achieves consistently better results compared with ImgTrans.

Robustness. In practice, the photon noise in the imaging process influences the reconstructed images, so robustness to such noise is important in applications. Here, we simulate such noise with mixed noise (Gaussian & Poisson). Specifically, we train models with the default noise level and test them with varied Poisson noise levels (with fixed Gaussian noise), whose intensities correspond to [1e5, 5e5, 1e6, 5e6], and show the results in Fig. 3 (e). Evidently, our models achieve better performance except when the intensity is 1e5, where the noise is extremely hard to suppress, though DuDoTrans still remains better than the CNN-based methods, which confirms its robustness.

Training dataset scale. Vision Transformers need large-scale data to perform well, which limits their development in medical imaging. To investigate this, we train FBPConvNet, ImgTrans, and DuDoTrans with [20%, 40%, 60%, 80%, 100%] of our original training dataset, and show the performance in Fig. 3 (f). The reconstruction performance of DuDoTrans is very stable until the training dataset decreases to 20%, in which case the training data is too scarce for any model to perform well, yet DuDoTrans still achieves the best performance.

Figure 4. Visual comparison of reconstructed images: Ground Truth, FBP, FBPConvNet, DuDoNet, ImgTrans, and DuDoTrans.

We next conduct thorough experiments to test the performance of DuDoTrans in various sparse-view scenarios. Specifically, we train models with α_max set to 24, 72, 96, and 144, respectively. The results are shown in Table 3, and DuDoTrans achieves consistently better results. Besides, we observe that ImgTrans and DuDoTrans are more stable in training, and their learned parameter counts are both extremely small compared with CNN-based models. Furthermore, the improvement of DuDoTrans over ImgTrans becomes larger as α_max increases, which confirms the usefulness of the restored sinograms in reconstruction.
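For completeness, the PSNR and multi-scale SSIM values reported in the tables can be computed along the following lines. The pytorch_msssim package (whose defaults of a 5-level pyramid, an 11×11 Gaussian window, and σ = 1.5 match the settings stated earlier) is our choice of tooling for this sketch, not necessarily what was used for the paper's numbers.

```python
import torch
from pytorch_msssim import ms_ssim  # pip install pytorch-msssim

def psnr(pred, target, data_range=1.0):
    """Peak Signal-to-Noise Ratio for tensors shaped (B, 1, H, W)."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(data_range ** 2 / mse)

def evaluate(pred, target, data_range=1.0):
    """Return (PSNR, multi-scale SSIM); inputs are assumed scaled to [0, data_range].
    512x512 CT slices are large enough for the default 5-level MS-SSIM pyramid."""
    return (psnr(pred, target, data_range).item(),
            ms_ssim(pred, target, data_range=data_range).item())
```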
Qualitative comparison. We also visualize the reconstructed images of these methods in Fig. 4 with α_max = [72, 96, 144] (see more visualizations in the Appendix). In all three rows, our DuDoTrans shows better detail recovery, and the sparse-view artifacts are suppressed. Further, when α_max decreases, the raw sinograms become too corrupted to restore and global features become hard to capture from the low-quality FBP images, so Transformer-based models exhibit reduced performance. This phenomenon suggests that suitable combinations of Transformer and CNN structures should be designed for different cases.

With the decrease of the view number α_max, the input sinograms become messier, which makes SVCT more difficult. Therefore, we test the robustness of all trained models under the aforementioned Poisson noise levels when α_max = [24, 72, 96, 144], and report the performance in Table 3. The notations Noise-H1, Noise-H2, and Noise-H3 correspond to Poisson intensities of 1e6, 5e5, and 1e5, respectively.

As a practical matter, reconstruction speed is necessary when models are deployed in modern CT machines. Therefore, we compare parameters and FLOPs versus performance in Fig. 1 and Fig. 6, respectively. We find that Transformer-based methods achieve better performance with fewer parameters, and our DuDoTrans exceeds ImgTrans with only a few additional parameters. As is well known, the patch-based operations and the attention mechanism are computationally expensive, which limits their application usage. Therefore, we further compare the FLOPs of these methods. As shown, the light versions (DuDoTrans-L1, DuDoTrans-L2) achieve a 0.8-1 dB improvement with fewer FLOPs, and DuDoTrans-N with the default size enlarges the improvement to 1.2 dB. Besides, we report the inference time in Tables 2, 3, and 4; the computation time is very similar to that of CNN-based methods, with the additional overhead coming from the patch-based operations.

As shown in Table 1, we have demonstrated the effectiveness of SRT with FBPConvNet [15] and ImgTrans [22], which are two post-processing methods. Recently, deep-unrolling methods have attracted much attention in reconstruction. To further verify the SRT module's effectiveness, we combine it with PDNet [1], which is a deep-unrolling method. Results for the three paired models (with vs. without SRT) are shown in Table 5 under the default experimental settings with α_max = 96 (see the other cases with α_max = [24, 72, 144] in the Appendix). All three reconstruction methods are improved by the use of the SRT module. Furthermore, our DuDoTrans still performs the best without any unrolling design. Thus, our SRT is flexible and powerful, and likely beneficial in any existing reconstruction framework.

We propose a Transformer-based SRT module with long-range dependency modeling capability to exploit the global characteristics of sinograms, and verify it in CNN-based, Transformer-based, and deep-unrolling reconstruction frameworks. Further, by combining SRT with the similarly designed RIRM, we obtain DuDoTrans for SVCT reconstruction. Experimental results on the NIH-AAPM dataset and the COVID-19 dataset show that DuDoTrans achieves state-of-the-art reconstruction. To further equip DuDoTrans with the design advantages of deep-unrolling methods, we will explore "DuDoTrans + unrolling" in the future.
[1] Learned primal-dual reconstruction.
[2] Convolutional sparse coding for compressed sensing CT reconstruction.
[3] Swin-Unet: Unet-like pure transformer for medical image segmentation.
[4] GCNet: Non-local networks meet squeeze-excitation networks and beyond.
[5] End-to-end object detection with transformers.
[6] Pre-trained image processing transformer.
[7] Low-dose CT with a residual encoder-decoder convolutional neural network.
[8] Learned full-sampling reconstruction from incomplete data.
[9] Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.
[10] Dual attention network for scene segmentation.
[11] CNN-based projected gradient descent for consistent CT image reconstruction.
[12] Transformer in transformer.
[13] Deep residual learning for image recognition.
[14] Densely connected convolutional networks.
[15] Deep convolutional neural network for inverse problems in imaging.
[16] Sparse-view spectral CT reconstruction using spectral patch-based low-rank penalty.
[17] Adam: A method for stochastic optimization.
[18] LocalViT: Bringing locality to vision transformers.
[19] SwinIR: Image restoration using Swin Transformer.
[20] DS-TransUNet: Dual Swin Transformer U-Net for medical image segmentation.
[21] DuDoNet: Dual domain network for CT metal artifact reduction.
[22] Swin Transformer: Hierarchical vision transformer using shifted windows.
[23] Adaptive graph-based total variation for tomographic reconstructions.
[24] TU-FG-207A-04: Overview of the Low Dose CT Grand Challenge.
[25] The mathematics of computerized tomography. SIAM.
[26] Stand-alone self-attention in vision models.
[27] Image reconstruction in circular cone-beam computed tomography by constrained, total-variation minimization.
[28] Training data-efficient image transformers & distillation through attention.
[29] Attention is all you need.
[30] Improving generalizability in limited-angle CT reconstruction with sinogram extrapolation.
[31] TransPath: Transformer-based self-supervised learning for histopathological image classification.
[32] Image quality assessment: from error visibility to structural similarity.
[33] Multiscale structural similarity for image quality assessment.
[34] Supercharging imbalanced data learning with energy-based contrastive representation transfer.
[35] Low-dose CT image denoising using a generative adversarial network with Wasserstein distance and perceptual loss.
[36] Disentangled non-local neural networks.
[37] MIL-VT: Multiple instance learning enhanced vision transformer for fundus image classification.
[38] Spectral CT image restoration via an average image-induced nonlocal means filter.
[39] Learning dual transformer network for diffeomorphic registration.
[40] TransCT: Dual-path transformer for low dose computed tomography.
[41] Exploring self-attention for image recognition.
[42] DuDoDR-Net: Dual-domain data consistent recurrent network for simultaneous sparse view and metal artifact reduction in computed tomography.
[43] DuDoRNet: Learning a dual-domain recurrent network for fast MRI reconstruction with deep T1 prior.
[44] nnFormer: Interleaved transformer for volumetric segmentation.
[45] A review of deep learning in medical imaging: Imaging traits, technology trends, case studies with progress highlights, and future promises.
[46] Deep reinforcement learning in medical imaging: A literature review.

Acknowledgement. This work was supported by the National Natural Science Foundation of China under Grants No. 12001180 and 12101061.