key: cord-0531584-p3gsornk
authors: Bahrami, Abbas; Karimian, Alireza; Arabi, Hossein
title: Comparison of different deep learning architectures for synthetic CT generation from MR images
date: 2021-05-13
journal: nan
DOI: nan
sha: db4c6df62fefe277581aa58545bb6c4b5c2750dd
doc_id: 531584
cord_uid: p3gsornk

MRI-guided radiation treatment planning is widely applied because of its superior soft-tissue contrast and the absence of ionizing radiation compared to CT-based planning. In this workflow, synthetic CT (sCT) images must be generated from the patient's MRI scans if radiation treatment planning is sought. Among the different methods available for this purpose, deep learning algorithms have consistently outperformed their conventional counterparts. In this study, we investigated the performance of several of the most popular deep learning architectures, including eCNN, U-Net, GAN, V-Net, and ResNet, for the task of sCT generation. As a baseline, an atlas-based method was implemented, against which the results of the deep learning-based models were compared. A dataset consisting of 20 co-registered MR-CT pairs of the male pelvis was used to assess the performance of the different sCT generation methods. The mean error (ME), mean absolute error (MAE), Pearson correlation coefficient (PCC), structural similarity index (SSIM), and peak signal-to-noise ratio (PSNR) metrics were computed between the estimated sCT and the ground truth (reference) CT images. Visual inspection revealed that the sCTs produced by eCNN, V-Net, and ResNet, unlike those of the other methods, were less noisy and closely resembled the ground truth CT images. In the whole pelvis region, the eCNN yielded the lowest MAE (26.03 ± 8.85 HU) and ME (0.82 ± 7.06 HU), and the highest PCC values were achieved by the eCNN (0.93 ± 0.05) and ResNet (0.91 ± 0.02) methods. The ResNet model had the highest PSNR of 29.38 ± 1.75 among all models. In terms of the Dice similarity coefficient, the eCNN method showed superior performance in major tissue identification (air, bone, and soft tissue). Overall, the eCNN and ResNet deep learning methods exhibited acceptable performance with clinically tolerable quantification errors.

Computed tomography (CT) imaging plays a central role in treatment planning and dose calculation in the radiation therapy (RT) procedure by providing patient-specific 3D electron density maps (attenuation coefficients) [1]. Modern techniques, such as intensity-modulated radiation therapy (IMRT) and volumetric-modulated radiation therapy (VMAT), rely on CT imaging to accurately define the target and organs at risk (OAR) and to calculate the delivered doses [1-3]. In clinical practice, the use of magnetic resonance imaging (MRI) for treatment planning is increasing owing to its zero radiation risk and high-contrast soft-tissue images compared to CT. Clinical studies have shown that functional MRI information, such as diffusion-weighted imaging (DWI) and dynamic contrast-enhanced imaging, is highly informative for identifying active tumor sub-volumes in head and neck cancers [4, 5]. Currently, the complementary information in MR and CT images is exploited through deformable image alignment for precise delineation of the target volumes and generation of the electron density maps [2, 4]. In this context, the errors associated with MR/CT image alignment introduce a systematic uncertainty into organ delineation and dose calculation, which is more pronounced for small tumors or complex OARs [6, 7].
To fully benefit from the merits of MR imaging in the radiation therapy workflow, MRI-only RT, which relies solely on MR images for organ delineation and electron density map creation, has been introduced. MRI-only RT eliminates the need for CT imaging, which decreases the number of scans and the associated costs, as well as the radiation dose received, particularly for patients requiring multiple scans during the treatment process [2, 6, 8]. MRI-only RT faces certain challenges, including geometric distortions due to magnetic field non-uniformities, the absence of cortical bone signal in MR images, and estimation of an accurate electron density map. The primary challenge in MRI-only RT is that, unlike CT intensities, MR signals correlate with the tissue proton density and relaxation properties rather than with tissue attenuation coefficients [9-11]. The same challenge arises in simultaneous PET/MR (and SPECT/MR) systems for the task of PET attenuation correction [12-16].

Numerous approaches for generating synthetic (pseudo) CT images from MRI data have been proposed [8, 17, 18]. These methods fall into three categories: tissue segmentation-based [19], atlas-based [20, 21], and AI-based [14, 22, 23]. The first approach generates attenuation maps by segmenting the MR images into a few major tissue classes and then assigning predefined attenuation coefficients to each tissue class [16]. Discriminating bony structures from air is challenging in conventional MR imaging because of their very weak and similar signals [10]. Applying ultra-short echo time (UTE) and zero-echo-time (ZTE) sequences could address this issue; however, these MR sequences suffer from long acquisition times and low signal-to-noise ratio (SNR) [24]. The second approach relies on deformable image registration algorithms to align the target MR image with the corresponding MR atlas images. The one-to-one correspondence of the MR and CT atlas images allows the CT images in the atlas dataset to be used for estimating the synthetic CT of the target MR image [7, 25]. The major drawback of atlas-based methods is that their performance depends strongly on the availability of similar anatomical and/or pathological variations in the atlas dataset [3, 15]. The third, AI-based category learns a direct mapping from MR images to CT intensities using machine/deep learning models. A comparative study of these models would shed more light on their efficiency, strengths, and shortcomings with regard to their appropriate adoption in clinical or research settings.

The objective of this study is to compare some popular, state-of-the-art deep learning architectures (eCNN, U-Net, GAN, V-Net, and ResNet) using the same dataset and evaluation metrics. An atlas-based synthetic CT generation approach was implemented to provide a baseline for assessing the performance of the deep learning models.

A dataset of co-registered MRI and CT images of the male pelvis from 20 patients, who were referred to the department of radiation therapy for the treatment of prostate cancer, was employed in this study. The time interval between CT and MR imaging was one day. Before image registration, the intra-subject non-uniformity of the MR intensities was corrected using the N4 ITK software, followed by denoising with a bilateral edge-preserving filter [35, 36]. The inter-patient MRI intensity variation was reduced by matching each image's histogram to a common histogram template.
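To make this preprocessing pipeline concrete, the following is a minimal sketch using SimpleITK. The N4 bias-field correction, bilateral filtering, and histogram-matching calls mirror the steps described above, but the file names, the Otsu mask generation, and all filter parameters are illustrative assumptions rather than the exact settings used in this study.

```python
import SimpleITK as sitk

# Load the MR volume and a common histogram template (illustrative paths).
mr = sitk.ReadImage("subject_mr.nii.gz", sitk.sitkFloat32)
template = sitk.ReadImage("histogram_template.nii.gz", sitk.sitkFloat32)

# 1) Correct intra-subject intensity non-uniformity with N4 bias-field
#    correction, using a rough Otsu foreground mask.
mask = sitk.OtsuThreshold(mr, 0, 1, 200)
mr_n4 = sitk.N4BiasFieldCorrection(mr, mask)

# 2) Denoise with an edge-preserving bilateral filter (sigmas are assumed).
mr_denoised = sitk.Bilateral(mr_n4, domainSigma=2.0, rangeSigma=50.0)

# 3) Reduce inter-patient intensity variation by matching the histogram
#    to the common template.
mr_matched = sitk.HistogramMatching(mr_denoised, template,
                                    numberOfHistogramLevels=256,
                                    numberOfMatchPoints=7)

sitk.WriteImage(mr_matched, "subject_mr_preprocessed.nii.gz")
```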
Deep residual networks, formed by stacking several residual blocks, were introduced by He et al. to address the vanishing-gradient problem in training deep neural networks and to reduce the computational cost [37]. Residual (shortcut) connections with an identity mapping allow the signal to skip one or more layers in a network (Figure 1).

Figure 1. Residual block comprising the residual function (F) and the identity mapping [37].

The residual function F yields an output of the form

$$F(x) = W_2 \, f(W_1 x + b_1) + b_2 \qquad (1)$$

where x is the input data, f is the activation function, W_1 and W_2 are the trainable weights, and b_1 and b_2 are the corresponding biases. The identity mapping introduces no extra parameters or computational complexity into the model. The input of the residual block in Figure 1 is added directly to its output; thus, the output of the residual block in the l-th layer is expressed as

$$y_l = F(x_l) + x_l \qquad (2)$$

The residual connections enable the direct propagation of signals along the forward and backward paths from one block to another. Moreover, implementing residual connections in the training of a network reduces the border effects of the convolution process, which decreases distortion near the image borders.

The architecture of the ResNet model (Figure 2) consists of 20 convolutional layers, where every set of two convolutional layers is linked by a residual connection. Each convolutional layer is coupled with an element-wise rectified linear unit (ReLU) activation and a batch normalization (BN) layer. In the initial layers, 3×3×3 filters designed to detect/extract low-level image features are applied. To extract mid-level and high-level image features, the kernels are dilated with factors of two and four in the deeper layers. The output of the final layer, a fully connected softmax layer, has dimensions equal to those of the input image [38].

The efficient CNN (eCNN) model was developed from the conventional encoder-decoder structure of the U-Net model, with modifications made to extract discriminative image features from the input MRI for generating accurate sCTs [39]. In the eCNN model, each plain convolutional layer of the U-Net is replaced with the building structure proposed by He et al. [37]. Following the residual block in Figure 1, the building structure applies two 3×3 convolutional layers, each followed by batch normalization and SeLU activation layers to avoid the dying ReLU (dead state) problem. An identity shortcut connection inside the building structure transfers information from the upper to the lower layers (serving as a residual connection). Another set of batch normalization and SeLU activation layers completes the building structure, as illustrated in Figure 3.
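As an illustration of this kind of building block, below is a minimal PyTorch sketch of a 3D residual block: two 3×3×3 convolutions with batch normalization, an identity shortcut added to the output, and an optional dilation factor as used in the deeper ResNet layers. The channel count and layer ordering are assumptions for illustration, not the authors' implementation; the eCNN building structure would additionally replace the ReLU activations with SeLU.

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Two 3x3x3 convolutions with BN/ReLU and an identity shortcut."""

    def __init__(self, channels: int, dilation: int = 1):
        super().__init__()
        # Padding equal to the dilation keeps the spatial size unchanged,
        # so the identity shortcut can be added without resampling.
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3,
                               padding=dilation, dilation=dilation)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3,
                               padding=dilation, dilation=dilation)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y_l = F(x_l) + x_l: residual function plus identity mapping (Eq. 2).
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)

# Example: a deeper-layer block with dilation factor 4; shape is preserved.
block = ResidualBlock3D(channels=32, dilation=4)
y = block(torch.randn(1, 32, 16, 16, 16))
```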
Generative adversarial networks (GANs) were introduced by Goodfellow et al. [40]. A GAN consists of two adversarial models, a generator and a discriminator, that are trained simultaneously [22, 29]. The generative model learns to generate new data, while the discriminator model estimates the probability that its input is real rather than synthesized by the generator. Samples are generated by passing random noise through a multilayer perceptron generator, and the output is fed into the discriminator, whose output is a scalar likelihood classifying the input as real or fake. The discriminator is trained to enhance its ability to differentiate real from synthetic data, while the generator is trained to maximize the probability of the discriminator assigning the wrong (real) label to the synthetic data.

The GAN model proposed by Hu et al. [41] is implemented in this study. The generator component takes a random Gaussian noise input with zero mean and unit variance and uses a ReLU activation function. To generate synthetic images of the proper size, up-scaling layers composed of a transposed convolution with a 2×2 stride and a convolution with BN and ReLU are applied; these layers double the size of the previous feature maps and halve the number of channels. The final layer consists of two parts: a convolution kernel with BN and ReLU, and a convolutional operator with a hyperbolic tangent activation without BN to preserve the true statistical features of the data [30]. The discriminator network takes both synthetic and real images as inputs. Its first convolutional layer, with a 5×5 kernel and a leaky ReLU (LReLU) activation, forms the initial feature maps with the same size as the input image. ResNet blocks and down-scaling layers double the number of channels and halve the size of the feature maps in each layer. Each ResNet layer has two convolution kernels, both with BN and LReLU, and each down-scaling layer has a convolution with a 2×2 stride, BN, and LReLU. A logit component, consisting of a ResNet block and projections before and after a ReLU, is inserted after the final layer. Except for the first layer of the discriminator, all convolutional layers in the generator and discriminator use kernels of size 3×3 [41]. The overall structure of the GAN model is shown in Figure 5.

Next to the ResNet and GAN models, the V-Net structure [42], a 3D fully convolutional neural network, was implemented and evaluated in this study. The core of this model consists of a compression path and a decompression path that mirror each other. The compression side is divided into stages that process the input data at different resolutions, each consisting of one to three convolutional layers. Each stage learns a residual function: the input of the stage is processed through non-linear functions and then added to the output of the last convolutional layer of the same stage. This model was originally proposed for semantic image segmentation [42] and, owing to its promising performance, has since been applied to image regression tasks. In this study, the V-Net was implemented with an L2-norm loss function to synthesize CT images from MR images.

The original U-Net architecture was introduced by Ronneberger et al. in 2015 and has become one of the fundamental deep learning architectures, with robust and promising performance in many applications [43]. The model applies fully convolutional layers in its contracting and expanding paths (Figure 6).

Figure 6. The architecture of the U-Net model.

The basic U-Net model was implemented in this study for comparison with the more advanced deep learning models. An L2 loss function yielded the peak performance of this model in synthesizing CT images from MR images; a sketch of this supervised training step is given below.
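This is a minimal PyTorch sketch of one paired MR-to-CT training step under the L2 (mean-squared-error) objective described for the U-Net and V-Net models. The stand-in network, tensor shapes, and optimizer settings are illustrative assumptions; any of the U-Net or V-Net style regressors described above could take the model's place.

```python
import torch
import torch.nn as nn

def l2_train_step(model: nn.Module,
                  optimizer: torch.optim.Optimizer,
                  mr: torch.Tensor,
                  ct: torch.Tensor) -> float:
    """One optimization step minimizing ||model(MR) - CT||^2."""
    model.train()
    optimizer.zero_grad()
    sct = model(mr)                          # predicted synthetic CT
    loss = nn.functional.mse_loss(sct, ct)   # L2-norm loss
    loss.backward()
    optimizer.step()
    return loss.item()

# Tiny stand-in for an MR->CT regressor that preserves the volume shape.
model = nn.Sequential(nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv3d(8, 1, 3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
mr_batch = torch.randn(2, 1, 16, 16, 16)  # (batch, channel, D, H, W)
ct_batch = torch.randn(2, 1, 16, 16, 16)
print(l2_train_step(model, optimizer, mr_batch, ct_batch))
```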
The atlas-based method is considered a robust and effective approach for synthetic CT generation from MRI [3]. An atlas-based sCT generation method was implemented in this study to provide a baseline for assessing the performance of the deep learning approaches. In this method, a combination of the atlas CT images deformably mapped to the target MR images was calculated to create an sCT image for the target MR image.

In evaluating the synthetic CT images generated by the different approaches, the patients' CT images constitute the reference. All CT images (reference and synthetic) were segmented into the major tissue classes of air, bone, and soft tissue. Intensity thresholds of < -400 HU and > 160 HU were used for air and bone segmentation, respectively, and the voxels within the -400 to 160 HU range were considered soft tissue. To assess the identification accuracy of the major tissue classes in the generated sCT images, the Dice similarity coefficient (DSC) was calculated for each segmented region as follows:

$$\mathrm{DSC} = \frac{2\,|V_{CT} \cap V_{sCT}|}{|V_{CT}| + |V_{sCT}|} \qquad (3)$$

where V_CT and V_sCT represent the volume of a specific tissue class in the reference CT and synthetic CT images, respectively.

The mean absolute error (MAE), mean error (ME), Pearson correlation coefficient (PCC), structural similarity index metric (SSIM), and peak signal-to-noise ratio (PSNR) metrics were computed within the different major tissue classes for the different sCTs as follows:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left| sCT(i) - CT(i) \right| \qquad (4)$$

$$\mathrm{ME} = \frac{1}{N}\sum_{i=1}^{N} \left( sCT(i) - CT(i) \right) \qquad (5)$$

$$\mathrm{PCC} = \frac{\sum_{i=1}^{N} \big(CT(i)-\mu_{CT}\big)\big(sCT(i)-\mu_{sCT}\big)}{\sqrt{\sum_{i=1}^{N} \big(CT(i)-\mu_{CT}\big)^{2}}\,\sqrt{\sum_{i=1}^{N} \big(sCT(i)-\mu_{sCT}\big)^{2}}} \qquad (6)$$

$$\mathrm{PSNR} = 10\log_{10}\!\left(\frac{MAX^{2}}{\mathrm{MSE}}\right) \qquad (7)$$

$$\mathrm{SSIM} = \frac{(2\mu_{CT}\mu_{sCT} + C_{1})(2\sigma_{CT,sCT} + C_{2})}{(\mu_{CT}^{2} + \mu_{sCT}^{2} + C_{1})(\sigma_{CT}^{2} + \sigma_{sCT}^{2} + C_{2})} \qquad (8)$$

where CT(i) and sCT(i) are the intensities of the i-th voxel in the reference CT and sCT images, respectively, and N is the number of voxels in each tissue class. In Eq. (6), µ_CT and µ_sCT are the means of the reference CT and synthetic CT images, respectively. In Eq. (7), MAX is the maximum intensity value of the reference CT or synthetic CT images, and MSE denotes the mean square error. In Eq. (8), µ_CT and µ_sCT are the mean intensities, σ²_CT and σ²_sCT are the variances, and σ_CT,sCT is the covariance of the CT images. The parameters C_1 = (k_1 L)² and C_2 = (k_2 L)², with the constants k_1 = 0.01 and k_2 = 0.02 and L the dynamic range of the intensities, were defined to avoid division by small denominators. The above-mentioned metrics were also measured over the entire pelvis area to assess the overall performance of the different synthetic CT generation approaches; a sketch of their computation is given below.
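The following is a minimal NumPy sketch of these metrics and of the tissue thresholds given above. The function names are hypothetical, the inputs are assumed to be co-registered 3D arrays in Hounsfield units, and the SSIM is computed globally over each region rather than with a sliding window.

```python
import numpy as np

def tissue_masks(img: np.ndarray) -> dict:
    """Threshold-based tissue classes: air < -400 HU, bone > 160 HU."""
    return {"air": img < -400,
            "bone": img > 160,
            "soft": (img >= -400) & (img <= 160)}

def dice(mask_ct: np.ndarray, mask_sct: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks (Eq. 3)."""
    inter = np.logical_and(mask_ct, mask_sct).sum()
    return 2.0 * inter / (mask_ct.sum() + mask_sct.sum())

def image_metrics(ct: np.ndarray, sct: np.ndarray) -> dict:
    ct = ct.astype(np.float64)
    sct = sct.astype(np.float64)
    diff = sct - ct
    mae = np.abs(diff).mean()                         # Eq. 4
    me = diff.mean()                                  # Eq. 5
    pcc = np.corrcoef(ct.ravel(), sct.ravel())[0, 1]  # Eq. 6
    mse = (diff ** 2).mean()
    peak = max(ct.max(), sct.max())
    psnr = 10.0 * np.log10(peak ** 2 / mse)           # Eq. 7
    # Global SSIM (Eq. 8) with C1 = (0.01*L)^2 and C2 = (0.02*L)^2,
    # where L is taken here as the dynamic range of the reference CT.
    L = ct.max() - ct.min()
    c1, c2 = (0.01 * L) ** 2, (0.02 * L) ** 2
    mu_x, mu_y = ct.mean(), sct.mean()
    cov = ((ct - mu_x) * (sct - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (ct.var() + sct.var() + c2))
    return {"MAE": mae, "ME": me, "PCC": pcc, "PSNR": psnr, "SSIM": ssim}
```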
It is observed that the sCT images generated by the eCNN, V-Net, and ResNet models are less noisy and correspond well with the reference CT. The means and standard deviations of the MAE, ME, PCC, SSIM, and PSNR metrics, computed on the whole pelvis region over the 20 test subjects to compare the different methods against the ground truth CT images, are tabulated in Table 1. Considering the MAE and PSNR metrics, on average the eCNN method performed best on the whole pelvis region, followed closely by the ResNet method. The axial views of the sCT and reference CT images, together with the corresponding binary masks of air cavities, bone, and soft tissue obtained from the different images, are shown in Figure 8.

The promising performance of deep learning approaches in synthesizing a pseudo-CT from MR-only images for the tasks of MR-guided attenuation correction in PET imaging and MR-only radiation planning is evident in the literature [22, 32, 44, 45]. In this article, five deep learning algorithms together with an atlas-based method were implemented and evaluated for synthetic CT generation from pelvis MR images. A comparison of the sCT images generated by these models against the ground truth CT indicates that the eCNN, V-Net, and ResNet outputs show the lowest noise and the best visual quality, while the measured MAE, ME, PCC, and PSNR metrics indicate that the eCNN and ResNet models outperform the other models. All in all, the deep learning approaches exhibit acceptable performance, within the range of errors reported in the literature.

In general, different deep learning architectures can lead to considerable differences in performance; thus, for any specific application, different deep learning algorithms should be implemented and evaluated to determine the best-performing model. In this article, the eCNN and ResNet models showed excellent performance in synthetic CT estimation from MR images. The ResNet model is equipped with dilated convolutional kernels, which enable it to process the input image at full resolution (without reducing the resolution of the input image) and to extract discriminative features from the input image(s). The new building-block structure applied in the eCNN model provided an effective update of the free parameters and a very efficient training convergence rate. For these reasons, these two models exhibited superior performance over the other deep learning models. The deep learning approaches, in general, performed better than the atlas-based method.

In conclusion, five state-of-the-art deep learning approaches together with an atlas-based method were implemented and evaluated for synthetic CT estimation from MR images. The deep learning-based methods clearly outperformed the atlas-based method. Among the deep learning approaches, the eCNN and ResNet models had the lowest quantification errors. The error levels observed for these two models would be tolerable for accurate MR-guided PET attenuation correction and MR-only radiation planning tasks.
References

1. Emerging role of MRI in radiation therapy
2. Systematic Review of Synthetic Computed Tomography Generation Methodologies for Use in Magnetic Resonance Imaging-Only Radiation Therapy
3. Comparative study of algorithms for synthetic CT generation from MRI: Consequences for MRI-guided radiation planning in the pelvic region
4. Dosimetric Impact of MRI Distortions: A Study on Head and Neck Cancers
5. Functional imaging for radiotherapy treatment planning: current status and future directions-a review
6. A review of substitute CT generation for MRI-only radiation therapy
7. Atlas-guided generation of pseudo-CT images for MRI-only and hybrid PET-MRI-guided radiotherapy treatment planning
8. Vision 20/20: Magnetic resonance imaging-guided attenuation correction in PET/MRI: Challenges, solutions, and opportunities
9. Quantifying the Effect of 3T Magnetic Resonance Imaging Residual System Distortions and Patient-Induced Susceptibility Distortions on Radiation Therapy Treatment Planning for Prostate Cancer
10. Comparison of atlas-based techniques for whole-body bone segmentation
11. Comparison of atlas-based bone segmentation methods in whole-body PET/MRI
12. Deep learning-guided joint attenuation and scatter correction in multitracer neuroimaging studies
13. Feasibility of Deep Learning-Guided Attenuation and Scatter Correction of Whole-Body 68Ga-PSMA PET Studies in the Image Domain
14. Deep learning-based attenuation correction in the image domain for myocardial perfusion SPECT imaging
15. Quantitative analysis of MRI-guided attenuation correction techniques in time-of-flight brain PET/MRI
16. Tc-99m (methylene diphosphonate) SPECT quantitative imaging: Impact of attenuation map generation from SPECT-non-attenuation corrected and MR images on the diagnosis of bone metastasis
17. One registration multi-atlas-based pseudo-CT generation for attenuation correction in PET/MRI
18. Applications of artificial intelligence and deep learning in molecular imaging and radiotherapy
19. Clinical assessment of MR-guided 3-class and 4-class attenuation correction in PET/MR
20. Whole-body bone segmentation from MRI for PET/MRI attenuation correction using shape-based averaging
21. Magnetic resonance imaging-guided attenuation correction in whole-body PET/MRI using a sorted atlas approach
22. Novel adversarial semantic structure deep learning for MRI-guided attenuation correction in brain PET/MRI
23. Deep learning-guided estimation of attenuation correction factors from time-of-flight PET emission data
24. Investigation of a method for generating synthetic CT models from MRI scans of the head and neck for radiation therapy
25. Comparison of state-of-the-art atlas-based bone segmentation approaches from brain MR images for MR-only radiation planning and PET/MR attenuation correction
26. A survey on deep learning in medical image analysis
27. Deep learning-based metal artefact reduction in PET/CT imaging
28. Automated lung segmentation from CT images of normal and COVID-19 pneumonia patients
29. Deep learning-based noise reduction in low dose SPECT Myocardial Perfusion Imaging: Quantitative assessment and clinical performance
30. Deep learning-based synthetic CT generation from MR images: comparison of generative adversarial and residual neural networks
31. Truncation compensation and metallic dental implant artefact reduction in PET/MRI attenuation correction using deep learning-based object completion
32. MR-based synthetic CT generation using a deep convolutional neural network method
33. Medical Image Synthesis with Deep Convolutional Adversarial Networks
34. Deep MR to CT synthesis using unpaired data
35. Improvement of image quality in PET using post-reconstruction hybrid spatial-frequency domain filtering
36. N4ITK: improved N3 bias correction
37. Deep residual learning for image recognition
38. On the compactness, efficiency, and representation of 3D convolutional networks: brain parcellation as a pretext task
39. A new deep convolutional neural network design with efficient learning capability: Application to CT image synthesis from MRI
40. Generative adversarial networks
41. Freehand ultrasound image simulation with spatially-conditioned generative adversarial networks
42. V-net: Fully convolutional neural networks for volumetric medical image segmentation
43. U-net: Convolutional networks for biomedical image segmentation
44. Feasibility of Deep Learning-Based PET/MR Attenuation Correction in the Pelvis Using Only Diagnostic MR Images
45. Attenuation correction for brain PET imaging using deep neural network based on Dixon and ZTE MR images
46. Monte Carlo-based estimation of patient absorbed dose in 99mTc-DMSA, -MAG3, and -DTPA SPECT imaging using the University of Florida (UF) phantoms
47. Generating synthetic CTs from magnetic resonance images using generative adversarial networks