title: Learning Perspective Deformation in X-Ray Transmission Imaging
authors: Huang, Yixing; Maier, Andreas; Fietkau, Rainer; Bert, Christoph; Putz, Florian
date: 2022-02-13

In cone-beam X-ray transmission imaging, due to the divergence of X-rays, imaged structures with different depths have different magnification factors on an X-ray detector, which results in perspective deformation. Perspective deformation causes difficulty in direct, accurate geometric assessments of anatomical structures. In this work, to reduce perspective deformation in X-ray images acquired from regular cone-beam computed tomography (CBCT) systems, we investigate learning perspective deformation, i.e., converting perspective projections into orthogonal projections. Directly converting a single perspective projection image into an orthogonal projection image is extremely challenging due to the lack of depth information. Therefore, we propose to utilize one additional perspective projection, a complementary (180-degree) or orthogonal (90-degree) view, to provide a certain degree of depth information. Furthermore, learning perspective deformation in different spatial domains is investigated. Our proposed method is evaluated on numerical spherical bead phantoms as well as patients' chest and head X-ray data. The experiments on numerical bead phantom data demonstrate that learning perspective deformation in polar coordinates has significant advantages over learning in Cartesian coordinates, as the root-mean-square error (RMSE) decreases from 5.31 to 1.40, while learning in log-polar coordinates brings no further considerable improvement (RMSE = 1.85). In addition, using a complementary view (RMSE = 1.40) is better than an orthogonal view (RMSE = 3.87). The experiments on patients' chest and head data demonstrate that learning perspective deformation using dual complementary views is also applicable to anatomical X-ray data, allowing accurate cardiothoracic ratio measurements in chest X-ray images and cephalometric analysis in synthetic cephalograms from cone-beam X-ray projections.

Cone-beam X-ray imaging with a flat-panel detector is widely used for disease diagnosis, treatment planning, and intervention guidance. It is used for direct two-dimensional (2D) imaging such as fluoroscopic angiography [1] and chest radiographs [2] as well as three-dimensional (3D) volume reconstruction in cone-beam computed tomography (CBCT) [3]-[5]. In cone-beam X-ray imaging, because of the divergence of X-rays, imaged structures with different depths have different magnification factors on the X-ray detector. As a consequence, acquired images suffer from geometric distortions, which is called perspective deformation. Perspective deformation causes difficulty in direct, accurate geometric assessments of structures of interest (SOI) in many practical applications, e.g., anatomic landmark detection [6], [7], fluoroscopic image stitching [8], fiducial marker registration [9]-[11], and dual-modality image fusion [12]. Therefore, orthogonal projections of SOI are preferred over perspective projections in many applications. To reduce perspective deformation in practice, specialized cone-beam X-ray devices are designed for certain applications.
For example, cephalometers [13] and chest X-ray systems [2] are designed specifically for cephalometric analysis and chest imaging, respectively; both have a relatively large source-to-detector distance and a short object-to-detector distance. However, in regular cone-beam X-ray imaging systems, e.g., C-arm CBCT systems, such a large source-to-detector distance is not available. In such systems, perspective deformation remains an issue for applications like chest X-ray imaging and cephalometric analysis.

As an alternative to specialized devices, digitally reconstructed radiographs (DRRs) generated from an intermediate 3D volume are commonly used to deal with perspective deformation. For example, in dental CBCT, DRRs using orthogonal projection are used as synthetic cephalograms [6], which provide higher cephalometric landmark accuracy than DRRs with perspective projection. However, obtaining such 3D CBCT volumes brings additional dose exposure to patients, since hundreds of projections are acquired for 3D reconstruction. Another potential application is in hybrid magnetic resonance imaging (MRI) and X-ray imaging [14], [15]. Obtaining 3D MRI volumes to generate DRRs is time consuming, taking around 30 min for each scan. Most importantly, pre-acquired 3D MRI volumes cannot provide accurate information on daily organ changes [16]. Therefore, a solution for fast 2D MRI/X-ray hybrid imaging in one exam without patient repositioning [14], [17] is desired for future potential applications in interventional surgery and radiation therapy [18], [19].

Fig. 1. The virtual detector is located at the isocenter of a cone-beam X-ray imaging system.

According to Fourier sampling theorems, 2D parallel-beam MRI images are much more natural and efficient to acquire than 2D cone-beam MRI images. Hence, the registration between 2D cone-beam X-ray images and 2D parallel-beam MRI images is necessary, which requires the conversion between perspective and orthogonal projections [12]. Therefore, learning perspective deformation, which directly converts 2D perspective (cone-beam) projections to 2D orthogonal (parallel-beam) projections, has important value in many applications.

Perspective deformation is also a common problem in optical imaging [20]-[23]. However, optical images are reflection (surface) images, and such perspective deformation is typically caused by the distortion of the camera lens. Therefore, perspective deformation in X-ray transmission imaging differs substantially from that in optical imaging. As a result, perspective distortion correction algorithms for optical imaging cannot be applied directly.

This work is a proof-of-concept study on perspective deformation learning in X-ray transmission imaging. The contributions of this work mainly lie in the following aspects:
a) Investigation of learning perspective deformation in different spatial spaces, i.e., in Cartesian, polar, and log-polar coordinates;
b) Investigation of learning perspective deformation using different views, i.e., a single view, dual orthogonal views, and dual complementary views;
c) Evaluation of learning perspective deformation on general spherical bead phantom data as well as patients' anatomical data (chest and head data), which indicates promise for future real applications.

In a CBCT system, we denote the source-to-isocenter distance by D_si and the source-to-detector distance by D_sd; the rotation axis is along the Z-axis. In this work, the perspective projection of interest is acquired at view angle 0°.
At this view angle, the X-ray source is located at (-D_si, 0, 0) and the detector center is located at (D_sd - D_si, 0, 0). Points located in the mid-sagittal plane, where x = 0, have a magnification factor of D_sd/D_si. Because of this magnification factor, a projected structure on the detector with perspective projection has a larger size than that with orthogonal projection. To remove this magnification factor, the detector can be virtually moved to the isocenter, as illustrated in Fig. 1. For this purpose, acquired perspective projection images are rebinned to the virtual detector with a scaling factor of D_si/D_sd.

Fig. 2. The orientation and length of an arrow reflect the direction and magnitude of perspective deformation at the corresponding position. The red 3 × 3 grids represent 3 × 3 convolutional kernels.

After rebinning, structures in the mid-sagittal plane have no magnification. However, structures in other planes, which are parallel to the mid-sagittal plane, are either magnified (dilated) or shrunk depending on their depths, i.e., their X-coordinates. For an arbitrary point a(x, y, z), its perspective projection a_PP at the virtual detector has a magnification factor of m = D_si/(x + D_si). Hence, the position of a_PP at the virtual detector is (m · y, m · z), while the position of its orthogonal projection a_OP is (y, z). Hence, a_PP, a_OP and the detector origin O_det are collinear, which indicates that perspective deformation occurs along radial directions, as described in Fig. 2(a). The amount of deformation is

Δd = (m - 1) ρ_0,   (1)

where ρ_0 is the distance of a_OP to the principal ray (the principal ray hits the detector center). It tells us that, for a fixed depth (i.e., x or m), the further a point is away from the principal ray, the larger its perspective deformation magnitude is. This position dependency is reflected by the different lengths of the arrows in Fig. 2(a). In the illustration, m > 1 is used as an example. For m < 1, the arrows have opposite directions.

In this work, we aim to learn perspective deformation using convolutional neural networks (CNNs). In CNNs, the same convolutional kernels slide along different positions of an image to extract features. Therefore, CNNs have the property of shift invariance. However, perspective deformation has different magnitudes at different positions (Fig. 2(a)). Although deep CNNs have the capacity to learn very complex functions, it is suboptimal, or even inept, for CNNs to learn perspective deformation in Cartesian coordinates because of this spatial inhomogeneity. Due to the radial symmetry of perspective deformation, we propose to learn it in polar coordinates. The Cartesian detector coordinates are converted to polar coordinates as

ρ = sqrt(u² + v²),   θ = atan2(v, u),

where u and v are the horizontal and vertical detector indices (relative to the detector center) respectively, θ is the polar angle, and ρ is the radial distance in perspective projection images. At view angle 0°, u and v are equivalent to y and z, respectively. Perspective deformation correction means moving a point at position ρ in the perspective projection back to ρ_0 in the orthogonal projection along the same radial direction, which is a horizontal shift in polar coordinates, as displayed in Fig. 2(b). After the polar transform, perspective deformation is homogeneous along the polar angular dimension. However, it is still inhomogeneous along the radial dimension, as illustrated in Fig. 2(b).
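To make the geometry above concrete, the following minimal NumPy sketch computes the magnification factor, the perspective and orthogonal positions of a point on the virtual detector, and the deformation magnitude of Eq. (1), and resamples a detector image onto a polar grid. The function names and the polar grid layout are our own choices for illustration, not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import map_coordinates

D_SI, D_SD = 750.0, 1200.0  # source-to-isocenter / source-to-detector distances in mm

def deformation_of_point(point):
    """Perspective vs. orthogonal projection of a 3D point on the virtual detector."""
    x, y, z = point
    m = D_SI / (x + D_SI)            # depth-dependent magnification factor
    a_pp = np.array([m * y, m * z])  # perspective projection on the virtual detector
    a_op = np.array([y, z])          # orthogonal projection
    rho0 = np.linalg.norm(a_op)      # distance of a_op to the principal ray
    delta_d = (m - 1.0) * rho0       # Eq. (1): radial deformation magnitude
    return a_pp, a_op, delta_d

def cartesian_to_polar(img, n_theta=512, n_rho=512, rho_step_px=1.0):
    """Resample a detector image centered on the principal ray onto a (theta, rho) grid."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    rho = np.arange(n_rho) * rho_step_px            # radial samples in pixel units
    tt, rr = np.meshgrid(theta, rho, indexing="ij")
    rows = cy + rr * np.sin(tt)                     # fractional Cartesian coordinates
    cols = cx + rr * np.cos(tt)
    return map_coordinates(img, [rows, cols], order=1, mode="constant")

# A point 100 mm behind the isocenter plane, 80 mm off the principal ray:
# m = 750/850 ≈ 0.88, so its projection moves inward by about 9.4 mm.
print(deformation_of_point((100.0, 80.0, 0.0)))
```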
The log-polar transform turns rotations and dilations in Cartesian coordinate systems into translations [24]-[26]. After the log-polar transform, perspective deformation becomes homogeneous along both the angular and radial dimensions. In this work, we will investigate whether such a log-polar transform can further improve perspective deformation learning. More information on the log-polar transform is provided in Appendix A.

For learning perspective deformation, it is essential to estimate the depths of structures of interest. With a single view, the depth information is lost in the perspective projection image. For example, in Fig. 3(a), the black point can be located anywhere along the red solid ray if no depth information is available. Therefore, all the points along the dashed line, including the blue point, are potential candidates for the paired point of the red point in the orthogonal projection. As a consequence, it is very challenging to learn perspective deformation from one single view. Therefore, a second view is necessary. Ideally, two views of the same point of interest can determine its 3D position, which is the intersection point of the two corresponding rays. In practice, biplanar X-ray systems [27]-[29] are widely used in interventional surgery for depth estimation. In a biplanar system, an additional orthogonal view is utilized, as illustrated in Fig. 3(b). The two perspective projection images from two orthogonal views of a 3D bead phantom, as well as its orthogonal projection image, are displayed in Fig. 4 as an example. Fig. 4(d) shows the perspective deformation, which is the difference between the perspective projection (Fig. 4(c)) and the orthogonal projection (Fig. 4(b)) from the 0° view. To better compare the perspective projection images from the two orthogonal views, an RGB image is formed in Fig. 4(f), where the red and blue channels use the image from the 0° perspective view, while the green channel uses the image from the 90° perspective view. Here all three channels are filled so that pixels with similar intensity from both views appear grey, such as the background cylinder area. In the formed RGB image, the magenta beads from the 0° view and the green beads from the 90° view are located in different positions.

In Fig. 3(b), the red ray and the green ray determine the position of the black point. It is easy to determine because only one point of interest is present (see the sketch below).
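The sketch below illustrates how two cone-beam rays pin down a single 3D point, whose orthogonal projection is then simply its (y, z) coordinates. The 90° view geometry (system rotated about the Z-axis) and the detector-axis convention are our own assumptions for this illustration.

```python
import numpy as np

D_SI = 750.0  # source-to-isocenter distance in mm

def ray_0deg(u, v):
    """Ray of the 0° view through virtual-detector position (u, v) in the plane x = 0."""
    src = np.array([-D_SI, 0.0, 0.0])
    return src, np.array([0.0, u, v]) - src

def ray_90deg(u, v):
    """Ray of the 90° view (0° system rotated about Z); detector plane y = 0 (our convention)."""
    src = np.array([0.0, -D_SI, 0.0])
    return src, np.array([u, 0.0, v]) - src

def triangulate(p1, d1, p2, d2):
    """Least-squares 'intersection' of two rays p_i + t_i * d_i."""
    A = np.stack([d1, -d2], axis=1)                   # 3x2 system in (t1, t2)
    t1, t2 = np.linalg.lstsq(A, p2 - p1, rcond=None)[0]
    return 0.5 * ((p1 + t1 * d1) + (p2 + t2 * d2))    # midpoint of closest approach

# Recover a point from its two perspective projections, then read off its
# orthogonal projection as (y, z):
point = np.array([60.0, 40.0, -25.0])
m0 = D_SI / (point[0] + D_SI)                  # 0° view magnification (depth is x)
m90 = D_SI / (point[1] + D_SI)                 # 90° view magnification (depth is y)
p = triangulate(*ray_0deg(m0 * point[1], m0 * point[2]),
                *ray_90deg(m90 * point[0], m90 * point[2]))
print(p, "orthogonal projection:", p[1:])
```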
However, in Fig. 4(f) the bead-to-bead (or ray-to-ray) correspondence between the two views is not straightforward because of the large number of beads. Therefore, it is challenging for neural networks to learn perspective deformation directly from such two perspective projection images. An alternative is to utilize the orthogonal view to reconstruct an intermediate 3D volume to determine the 3D positions of structures. Recently, deep learning has made image reconstruction from few views possible [29]-[31]. However, such methods require test structures to have spatial distributions very similar to the training data. Otherwise, vital structures are not reconstructed reliably. For example, the beads reconstructed by X2CT-GAN [29] in Fig. 5 are incorrect in position, number and intensity. This is because beads may occur in arbitrary positions and the X2CT-GAN is not able to learn the fundamental geometric information. Therefore, such intermediate 3D volumes cannot provide reliable information for perspective deformation learning. Simple geometric information may be altered in nonlinear neural networks.

Fig. 6. From left to right, the more beads, the more difficult it is to determine bead-to-stripe correspondences.

To avoid this, we use a simple direct back-projection to obtain an intermediate 3D reconstruction. Afterwards, the orthogonal projection of the intermediate reconstruction is obtained. For a point of interest, e.g., the black point in Fig. 3(c), its back-projection is the green solid ray, and the orthogonal projection of the green solid ray is the green dashed line on the detector. Note that this green dashed line is different from the paired epipolar line (the cyan dotted line) [32] of the red line. The orthogonal projection of the black point, i.e., the blue point, is located at the intersection of the green dashed line and the radial line to the red point. An exemplary RGB image combining the 0° perspective projection (red and blue channels) and the orthogonal projection of the back-projection from the 90° perspective projection (green channel) is displayed in Fig. 6. To learn perspective deformation, neural networks need to move each bead along a radial direction to a position where one of the stripes passes. The stripes have different widths and intensities, which are potentially beneficial for neural networks to determine bead-to-stripe correspondences. Although determining the bead-to-stripe correspondences is relatively easy in Fig. 6(a), such determination is very challenging in Fig. 6(b) and Fig. 6(c) when the number of beads is large.

As an alternative to an orthogonal view, an additional complementary (180°) view (Fig. 3(d)) is proposed in this work for perspective deformation learning. A complementary view in parallel-beam X-ray imaging is fully redundant. However, in cone-beam X-ray imaging, because of the cone angle, two complementary views of the same point of interest can still determine its 3D position. For example, the red ray and the green ray also determine the position of the black point in Fig. 3(d). Moreover, a complementary view straightforwardly provides an interval where the orthogonal projection of the point of interest should be located. For example, in Fig. 3(d) the blue point is located between the red point and the green point, and these three points are collinear. The 180° perspective projection of the same 3D bead phantom is displayed in Fig. 7(a), and its difference with respect to the 0° perspective projection is displayed in Fig. 7(b). Fig. 7(b) is similar to the perspective deformation image in Fig. 4(d). To integrate such dual-view information, as in Fig. 4(f), we convert the perspective projection images from the 0° and 180° views into a 3-channel RGB image in Fig. 7(c). The red and blue channels use the image from the 0° view, while the green channel uses the image from the 180° view. The 0° view instead of the 180° view takes two channels, because the desired orthogonal projection images are acquired in the 0° view in our setting. In the RGB image, the color reveals the intensity difference between the 0° and 180° perspective projection images. Grey areas contain close intensity values from both views. In contrast, magenta and green areas indicate larger intensity values from the 0° and 180° views respectively, where perspective deformation correction is necessary. They correspond to the positive (bright) and negative (dark) areas in the difference image in Fig. 7(b).
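A minimal sketch of assembling such a dual-view input is given below. Whether the 180° projection must be mirrored so that both views share the 0° detector frame depends on the detector convention, so the flip is an assumption; the channel assignment follows the description above.

```python
import numpy as np

def complementary_view_input(proj_0, proj_180, flip_180=True):
    """Stack the 0° and 180° perspective projections into one 3-channel input.

    The 0° view fills the red and blue channels and the 180° view the green
    channel, so regions where both views agree appear grey, while magenta or
    green regions mark where perspective deformation correction is needed.
    """
    g = np.flip(proj_180, axis=1) if flip_180 else proj_180  # mirror: our assumption
    return np.stack([proj_0, g, proj_0], axis=0)             # channels-first: (3, H, W)
```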
Learning perspective deformation is fundamentally an image-to-image translation problem, where GANs are the state-of-the-art approaches. In this work, a pixel-to-pixel generative adversarial network (pix2pixGAN) [33] is applied as an exemplary neural network. Note that CycleGAN is not chosen, since the original CycleGAN is reported to have difficulty in tasks where geometric changes are involved [34]. The pix2pixGAN uses the U-Net, the most popular and successful neural network in biomedical imaging, as the generator G and a 5-layer CNN as the discriminator D. G learns to convert a perspective projection image to an orthogonal projection image. D learns to distinguish the synthetic orthogonal projection image from the target orthogonal projection image. The objective of the conditional GAN is

L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_x[log(1 - D(x, G(x)))],

where x is the input and y is the target. G tries to minimize this objective against an adversarial D that tries to maximize it, i.e., G* = arg min_G max_D L_cGAN(G, D). In addition, an ℓ1 loss function

L_ℓ1(G) = E_{x,y}[||y - G(x)||_1]

is applied to keep the generator's output close to the target with less blurring compared to an ℓ2 loss, and a perceptual loss L_perc [35] using the VGG-16 model is applied to further reduce blurring. The overall objective function is

G* = arg min_G max_D L_cGAN(G, D) + λ_1 L_ℓ1(G) + λ_2 L_perc(G).

It is worth noting that for learning perspective deformation in polar or log-polar coordinates, periodic padding instead of zero padding is used along the angular dimension to avoid stitching artifacts at the 0° (360°) radial direction.

The above different methods for learning perspective deformation are investigated on numerical bead phantom data for general purposes, as well as on anatomical data including chest CT data and head CT data. In this work, we focus on perspective deformation learning using different view settings in different spatial spaces, instead of different neural network architectures or training loss functions.

The CBCT system used in this work has a source-to-detector distance (D_sd) of 1200 mm and a source-to-isocenter distance (D_si) of 750 mm. The detector has 1240 × 960 pixels with a pixel size of 0.308 × 0.308 mm². Acquired projection images are rebinned to a virtual detector at the isocenter, which has 512 × 512 pixels with a pixel size of 0.625 × 0.625 mm². In practice, a higher pixel resolution for the virtual detector is achievable. Here a coarse resolution of 0.625 × 0.625 mm² is used as a proof of concept. The volume centers of imaged objects are located at the origin of the world coordinates. Perspective projections are generated via forward projection of the volumes. The orthogonal projections are generated using a large virtual source-to-detector distance of 12000 mm and a short isocenter-to-detector distance of 100 mm.

As displayed in Fig. 4, 3D bead phantoms are generated to demonstrate the efficacy of our proposed algorithms for general perspective deformation learning. Each bead phantom is a cylinder containing spherical beads. The height and diameter of the cylinder are randomly generated, with values of 240 ± 16 mm and 225 ± 32 mm respectively. The background intensity value of the cylinder is a random value of 50 ± 35 HU. Small and big beads have sizes of 6.4 ± 1.6 mm and 16 ± 8 mm, respectively. Their intensities are either 3500 ± 350 HU or 6000 ± 1000 HU. The phantoms have a size of 512 × 512 × 512 voxels with a voxel size of 0.625 × 0.625 × 0.625 mm³. 200 bead phantoms are generated, with 185 phantoms for training, 5 phantoms for validation, and 10 phantoms for testing. Each phantom in the dataset contains approximately 50 spherical beads with random positions inside the cylinder. Note that the number of beads also varies randomly within the same data set.
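A rough, unoptimized sketch of such a phantom generator is given below. The paper does not specify how the "±" ranges are sampled or whether bead sizes are diameters; uniform sampling and diameters are our assumptions, and a smaller default grid is used here than the paper's 512³ volume.

```python
import numpy as np

def random_bead_phantom(n=128, voxel=0.625, seed=0):
    """Cylinder phantom with randomly placed spherical beads (the paper uses n=512)."""
    rng = np.random.default_rng(seed)
    height = rng.uniform(224.0, 256.0)            # cylinder height: 240 ± 16 mm
    radius = rng.uniform(193.0, 257.0) / 2.0      # cylinder diameter: 225 ± 32 mm
    c = (np.arange(n) - (n - 1) / 2.0) * voxel    # voxel-center coordinates in mm
    zs, ys, xs = np.meshgrid(c, c, c, indexing="ij")
    vol = np.zeros((n, n, n), dtype=np.float32)
    cylinder = (xs**2 + ys**2 <= radius**2) & (np.abs(zs) <= height / 2.0)
    vol[cylinder] = rng.uniform(15.0, 85.0)       # background: 50 ± 35 HU
    for _ in range(int(rng.integers(45, 56))):    # approximately 50 beads per phantom
        d = rng.uniform(4.8, 8.0) if rng.random() < 0.5 else rng.uniform(8.0, 24.0)
        hu = rng.uniform(3150, 3850) if rng.random() < 0.5 else rng.uniform(5000, 7000)
        cz = rng.uniform(-height / 2.0 + d, height / 2.0 - d)        # bead center
        ang, rad = rng.uniform(0.0, 2.0 * np.pi), rng.uniform(0.0, radius - d)
        cx, cy = rad * np.cos(ang), rad * np.sin(ang)
        vol[(xs - cx)**2 + (ys - cy)**2 + (zs - cz)**2 <= (d / 2.0)**2] = hu
    return vol
```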
For data augmentation, each phantom is rotated by 15°.

In chest X-ray imaging, a long source-to-detector distance is used to reduce perspective deformation. In order to acquire chest X-ray radiographs in regular CBCT systems, the proposed perspective deformation learning algorithms are evaluated on chest CT data. Three COVID-19 chest CT releases (MIDRC-RICORD-1a [36], MIDRC-RICORD-1b [36], and COVID-19 sequential data [37]) and a kidney CT dataset [38] from The Cancer Imaging Archive (TCIA) [39] are used. Volumes whose slice spacing is larger than 2.5 mm (the resolution in the Z-direction is too low) or whose total slice number is smaller than 100 (the volume cannot cover a complete chest) are removed. In total, 92 patients are used for training, 5 patients for validation, and 27 patients for testing. For data augmentation, each volume undergoes a random anisotropic scaling along the X, Y, and Z directions. The scaling factors are between 0.95 and 1.05. Each volume is augmented 5 times. In other words, 552 volumes are used for training, 30 volumes for validation, and 162 volumes for testing. The same CBCT system, i.e., D_sd = 1200 mm and D_si = 750 mm, is used to generate perspective projections. The input perspective images are rebinned to a virtual detector of 472 × 352 pixels with a resolution of 0.8 × 0.8 mm², where lateral and vertical data truncation occurs since CBCT detectors typically cannot cover the complete chest. The target orthogonal projection images have 412 × 300 pixels with a resolution of 0.8 × 0.8 mm². During training, both images are zero-padded to a size of 512 × 512, which is convenient for the down-sampling path of the U-Net.

In dental imaging, orthogonal X-ray projections are preferred over perspective projections for cephalometric analysis. Hence, the proposed perspective deformation learning algorithms are also evaluated on head CT data. The CQ500 dataset [40], a public domain database for computational anatomy (PDDCA) [41], and 10 complete human mandible data sets [42] are used for this purpose. The CQ500 dataset consists of 491 scans, and the PDDCA consists of 48 complete patient head and neck CT images. Many volumes in the CQ500 dataset miss the lower head (neck) part. That is why the other two datasets are necessary for training. The training volumes are augmented by random anisotropic scalings. The volumes whose slice spacing is larger than 2.5 mm or whose total slice number is smaller than 100 are removed. In total, 960 volumes are used for training, 10 volumes for validation, and 30 volumes for testing. For the head CT data, the system has a source-to-detector distance of 960 mm and a source-to-isocenter distance of 600 mm, since dental CBCT systems typically have a shorter source-to-detector distance than angiographic C-arm CBCT systems. The projection images in Cartesian coordinates have an image size of 512 × 512 with a pixel size of 0.5 mm × 0.5 mm. Images in polar coordinates have an image size of 512 × 512 with a pixel size of 0.5 mm × 0.703°.

For training the pix2pixGAN, the Adam optimizer is used with an initial learning rate of 0.0002 and a momentum term of 0.5. A weight of 100 is applied to the ℓ1 loss and the perceptual loss when combining them with the adversarial loss. Validation is performed during training to avoid over-fitting. In total, 300 epochs are used for training. For each training, the best model on validation data close to 300 epochs is selected for testing.
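The generator update corresponding to this training setup might look like the sketch below. G, D and vgg are placeholders for the U-Net generator, the conditional discriminator and a frozen VGG-16 feature extractor; the vanilla-GAN loss and the ℓ1 distance on VGG features are our assumptions, while the loss weights of 100 and the Adam settings follow the description above.

```python
import torch
import torch.nn.functional as F

def generator_step(G, D, vgg, opt_G, x, y, w_l1=100.0, w_perc=100.0):
    """One pix2pix-style generator update with adversarial, L1 and perceptual terms."""
    fake = G(x)
    pred_fake = D(torch.cat([x, fake], dim=1))            # conditional discriminator
    loss_adv = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
    loss_l1 = F.l1_loss(fake, y)                          # keeps output close to the target
    loss_perc = F.l1_loss(vgg(fake), vgg(y))              # perceptual loss on VGG-16 features
    loss = loss_adv + w_l1 * loss_l1 + w_perc * loss_perc
    opt_G.zero_grad()
    loss.backward()
    opt_G.step()
    return float(loss)

# Optimizer settings from the paper: Adam, lr = 0.0002, momentum term (beta1) = 0.5.
# opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
# For polar or log-polar inputs, the generator's convolutions would use periodic
# (circular) padding along the angular axis, e.g. F.pad(x, (0, 0, 1, 1), mode="circular").
```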
The input images of one exemplary phantom from the bead phantom data are displayed in Fig. 8. In Fig. 8(j), the bead positions from the two views vary along different radial directions. In Fig. 8(k) and Fig. 8(l), they vary only along the horizontal direction. The prediction results are displayed in Fig. 9, while their error images are displayed in Fig. 10 for a better comparison of image quality.

Figs. 9(a)-(c) use the single 0° view. In Fig. 9(a), many beads are distorted, losing their circular shapes. They are blurry as well. Although the background cylinder is restored to a rectangle-like shape, aliasing is observed. In Fig. 9(b), the beads have better shapes and less blur than those in Fig. 9(a). However, artifacts occur near certain beads and some tiny beads are still blurry. Fig. 9(c) has a similar appearance to Fig. 9(b), also suffering from blur for tiny beads and artifacts near certain beads. The error images in Figs. 10(a)-(c) also show that learning perspective deformation in polar or log-polar coordinates is better than in Cartesian coordinates.

The results of dual orthogonal views are displayed in Figs. 9(d)-(f). Compared with Fig. 9(a), Fig. 9(d) has less aliasing and less shape distortion. However, Fig. 10(d) indicates that the error in Fig. 9(d) is still large, with a root-mean-square error (RMSE) value of 6.41. The RMSE values indicate that Fig. 9(e) and Fig. 9(f) have similar image quality. The results of dual complementary views are displayed in Figs. 9(g)-(i). In Fig. 9(g), aliasing still remains. In Fig. 9(h) and Fig. 9(i), all the beads have a decent appearance. Among the error images in Fig. 10, Fig. 10(h) has the least error, and Fig. 10(i) has slightly larger error than Fig. 10(h).

For overall image quality comparison, the mean RMSE and structural similarity index measure (SSIM) values for the prediction results in different spaces with different views are displayed in Tab. 1. In addition to the results in Fig. 9, three more results are included for comparison: a) combining the 0° and 90° perspective projection images naively, as in Fig. 4(f), as the input of the neural network, denoted by "0°&90° naive" in Tab. 1; b) using the difference image between the 0° and 180° perspective projection images, as in Fig. 7(b), as the third channel of the RGB image instead of using the 0° perspective projection again, denoted by "0°&180°+" in Tab. 1; c) the direct combination of the 0°, 90°, and 180° perspective projection images as the three channels of an RGB input, denoted by "0°, 90°&180°".

Regarding image space, Tab. 1 demonstrates that learning perspective deformation in polar coordinates can drastically improve image quality compared with that in Cartesian coordinates. It also shows that learning in log-polar coordinates is slightly worse than in polar coordinates, with one exception for "0°&180°+". Due to the small difference, we have repeated the experiments of "0°&180°" ten times to avoid the influence of random weight initialization. Such a difference is negligible for human visualization. Regarding acquisition views, with the naive combination of orthogonal views, the RMSE values show no considerable improvement over those with a single view. With OPBP (the orthogonal projection of the back-projection), the RMSE is 0.35 lower, which is not a large improvement, although it demonstrates that an orthogonal view with OPBP is beneficial for learning perspective deformation.
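For reference, the two reported image-quality metrics can be computed as in the short sketch below; the choice of data_range for SSIM (taken from the target image here) is our assumption, since the paper does not state it.

```python
import numpy as np
from skimage.metrics import structural_similarity

def rmse(pred, target):
    """Root-mean-square error between a prediction and the target projection."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def ssim(pred, target):
    """Structural similarity index; data_range spans the target intensities (assumption)."""
    return structural_similarity(target, pred, data_range=float(target.max() - target.min()))
```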
Regarding using the difference image as the third channel (comparison between "0°&180°" and "0°&180°+"), it improves image quality in Cartesian coordinates. However, in polar and log-polar coordinates, it has no large influence on image quality. In addition, naively combining three views ("0°, 90°&180°") shows no considerable improvement over "0°&180°". All in all, Tab. 1 recommends using two complementary views in polar coordinates for learning perspective deformation.

The results of one patient in chest X-ray imaging are displayed in Fig. 11, where the cardiothoracic ratio is assessed as an exemplary clinical application [43]. In the reference image (Fig. 11(a)), the maximal horizontal cardiac diameter (MHCD) and the maximal horizontal thoracic diameter (MHTD) are indicated by two green horizontal lines. Its cardiothoracic ratio is 0.4237. In the 0° perspective projection image (Fig. 11(b)), all the anatomical structures can be visualized with fine resolution. However, due to perspective deformation, anatomical structures, e.g., the ribs and the spine, are deformed. The deformations are visualized better in the difference image in Fig. 11(c). Compared with the ribs and the spine, the heart has less deformation as its location is closer to the isocenter. In Fig. 11(b), the MHCD and the MHTD are indicated by two red horizontal lines, while the green lines are those of the reference image. While the MHCD changes little, from 10.47 cm to 10.16 cm, the MHTD has a relatively large change, from 24.71 cm to 25.40 cm. As a consequence, the cardiothoracic ratio becomes 0.4002, which is below the normal range of 0.42-0.50 [43]. The result of learning perspective deformation from the single 0° view is displayed in Fig. 11(d), where the MHCD and the MHTD are 10.63 cm and 24.71 cm, respectively. The MHTD of Fig. 11(d) is the same as that of the reference image. This is also reflected by the difference image in Fig. 11(g), where the lower ribs have small errors. However, the upper ribs as well as the spine still have considerable errors. The results of learning perspective deformation from dual complementary views in Cartesian and polar coordinates are displayed in Fig. 11(e) and Fig. 11(f), respectively. The measured MHCDs and MHTDs in these two images are very close to the reference ones. Hence, their cardiothoracic ratios, 0.4214 and 0.4240 respectively, are close to the reference ratio as well. In the difference images (Fig. 11(h) and Fig. 11(i)), the errors of the ribs and spine decrease, as their boundaries are no longer apparently visible. Nevertheless, Fig. 11(i) has less error than Fig. 11(h).

The results of one exemplary patient for cephalometric imaging are displayed in Fig. 12. In the 0° perspective projection image (Fig. 12(b)), because of perspective deformation, anatomical structures from the left and right sides do not overlap well, especially for the mandible, as indicated by the red arrow in Fig. 12(b). This causes inaccuracy in determining the cephalometric landmark of the gonion. The difference of Fig. 12(b) to the reference image Fig. 12(a) is displayed in Fig. 12(c). A scale bar of 2 mm is displayed in Fig. 12(c), as 2 mm is the clinically acceptable precision for cephalometric landmark detection. It is obvious that many anatomical structures in the 0° perspective projection image have position shifts larger than 2 mm. In the prediction image (Fig. 12(d)) using a single 0° view in Cartesian coordinates, perspective deformation is reduced to some degree, as displayed in the difference image Fig. 12(g). For example, the mandible region has less error.
However, Fig. 12(g) also indicates that many bony structures have deviations larger than 2 mm. The results of learning from dual complementary views in Cartesian and polar coordinates are displayed in Fig. 12(e) and Fig. 12(f), respectively. Both images have little perspective deformation, as revealed by their difference images in Fig. 12(h) and Fig. 12(i). Nevertheless, in Fig. 12(e), two dark regions are indicated by the two arrows, which are better visualized in the difference image Fig. 12(h).

In the experiments on numerical bead phantom data, false positive and false negative beads are observed, especially for tiny beads. The results of an exemplary phantom containing tiny beads are displayed in Fig. 13. Fig. 13(a) is the reference image, where five zoomed-in regions-of-interest (ROIs) (No. 1-5) are displayed.

For the experiments on anatomical data, learning perspective deformation from 0°&180° views in polar coordinates in general preserves all anatomical structures. However, despite the joint optimization with the adversarial loss, ℓ1 loss and perceptual loss, there is still slight resolution loss. For example, the rib indicated by the arrow in Fig. 14(c) has blurry boundaries. Note that the same rib in this region also has low contrast in the reference image (Fig. 14(a)) and the 0° perspective input image (Fig. 14(b)), due to the occlusion of the lung. In addition, similar to the results in the numerical bead phantom experiments (Fig. 13), tiny structures, which are around 1 mm in radius, cannot be reconstructed reliably, especially when such structures are not present in the training data. For example, the tiny metal implants, probably vessel stents, in Fig. 14(g) and Fig. 14(h) are not predicted in Fig. 14(i).

For dual orthogonal views, it is easy for human visual systems to determine bead-to-bead correspondences based on size and intensity information. However, learning feature correspondences, e.g., point-to-point correspondences, from two views is still a challenging task for neural networks. Currently, many efforts are being devoted to this topic [44]-[46]. Nevertheless, the current research focuses on correspondence learning in optical images, and the problem is still open. An exploration of correspondence learning in transmission images is a promising direction to improve perspective deformation learning from dual orthogonal views.

Fig. 10 and Tab. 1 demonstrate that learning perspective deformation in polar coordinates has a distinct advantage over learning in Cartesian coordinates, while learning in log-polar coordinates does not provide further improvement. In perspective deformation, dilation and contraction occur with different scaling factors depending on the depths of structures, which are fundamentally inhomogeneous translations along radial directions (Fig. 2(b)). In other words, CNNs are able to learn translations, no matter whether homogeneous or inhomogeneous, while they are not optimal for learning rotational transforms directly (Fig. 2(a)).

The experiments in this work demonstrate that our proposed perspective deformation learning method from dual complementary views works not only on numerically synthetic data, but also on patients' CT data. Note that although spherical beads themselves have simple geometric structures, they have arbitrary sizes, intensities and locations, which increases the difficulty for neural networks to learn. Therefore, neural networks have the risk of predicting false positive beads or false negative beads.
In our experiments, we observe that such risks also exist for learning from two complementary views, especially for very tiny beads, as displayed in Fig. 13, but the frequency is much lower than for learning from one single view or from two orthogonal views. Fig. 13 also demonstrates that such failure is not caused by the resampling between Cartesian and polar coordinates, but potentially by the resolution loss of the neural network (Fig. 14(c)). On patients' CT data, a similar limitation is also observed, as the tiny implants are not predicted by the neural network in Fig. 14(i).

This work is a proof-of-concept study on learning perspective deformation in X-ray transmission imaging. The experiments on numerical bead phantom data and anatomical data in this work demonstrate that learning perspective deformation from a single view is in general not sufficient. Instead, learning perspective deformation from dual complementary views achieves the best performance. Regarding spatial spaces, learning perspective deformation in polar coordinates has a distinct advantage over that in Cartesian coordinates, while learning in log-polar coordinates brings no further considerable improvement. The limitation is that tiny structures (1 or 2 pixels in radius) might be missing in the neural network output. Nevertheless, the experiments on the chest data and head data demonstrate that our method allows accurate cardiothoracic ratio measurement and cephalometric imaging, which has the potential to empower conventional CBCT systems with more applications.

According to Eq. (1) in the main text, the perspective deformation magnitude for a point a(x, y, z) is Δd = (m - 1)ρ_0 = ρ(m - 1)/m. In order to obtain homogeneous deformation at position ρ in a perspective projection image, we wish to rescale the deformation Δd with a scaling factor of 1/ρ, i.e., Δd̃ = Δd/ρ = (m - 1)/m, so that the deformation is independent of the position ρ. In other words, we seek a transformed polar radial coordinate ρ̃ with the following relation to the original polar radial coordinate,

dρ̃ = α dρ/ρ,

where d is the derivative symbol and α is a constant coefficient. It leads to the natural logarithm of the polar coordinate, i.e., the log-polar coordinate,

ρ̃ = α ln(ρ) + c_0,

where c_0 is a constant to avoid negative values for ρ̃. In practice, image grids typically have the same spacing between grid points, i.e., uniform sampling is used. For log-polar transformed grids, this can be regarded as using a nonuniform image grid like Fig. 15(b) in the original linear radial dimension (before the log transform), where a large pixel spacing is used to cover the large deformation magnitude when ρ is large. Therefore, a small grid spacing is preferred in the log-polar transformed image to avoid significant resolution loss. In this work, we choose α = 1 for simplicity. The spacings for polar and log-polar coordinates are 0.375 mm and 0.0075 mm, respectively. c_0 is determined by the locations of the first grid center in polar and log-polar coordinates, i.e., c_0 = 1.6777. The angular step is 0.703°. An exemplary image in its Cartesian form (a), polar form (b) and log-polar form (d) is displayed in Fig. 16. Since the left half of the log-polar image (Fig. 16(d)) contains little information (mostly constant values), only the right half of the image is used for training. Due to resampling, some error is introduced for high-frequency structures (e.g., edges), as displayed in Fig. 16(c) and Fig. 16(e). On average, the resampling errors for the polar and log-polar transforms are 0.67 and 0.93, respectively.
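A minimal sketch of the log-polar resampling implied by Appendix A is given below. α, c_0 and the two spacings follow the values reported above, while the number of log-radial samples, the grid-center convention and the interpolation are our assumptions.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def polar_to_log_polar(polar_img, rho_step=0.375, log_step=0.0075,
                       alpha=1.0, c0=1.6777, n_log=1024):
    """Resample a (theta, rho) image onto the log-polar grid rho_tilde = alpha*ln(rho) + c0."""
    n_theta, _ = polar_img.shape
    rho_tilde = np.arange(n_log) * log_step          # uniform log-polar radial samples
    rho = np.exp((rho_tilde - c0) / alpha)           # back to linear radius in mm
    cols = rho / rho_step - 0.5                      # fractional polar column index
    rows = np.arange(n_theta)                        # angular index is unchanged
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return map_coordinates(polar_img, [rr, cc], order=1, mode="nearest")
```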
References
[1] Real-time fusion of coronary CT angiography with X-ray fluoroscopy during chronic total occlusion PCI
[2] Dynamic chest X-ray using a flat-panel detector system: technique and applications
[3] Flat-panel cone-beam computed tomography for image-guided radiation therapy
[4] Cone-beam computed tomography-based radiomics in prostate cancer: a mono-institutional study
[5] Tumor regression during radiotherapy for non-small cell lung cancer patients using cone-beam computed tomography images
[6] Comparison of conventional and cone beam CT synthesized cephalograms
[7] Hybrid approach for automatic cephalometric landmark annotation on cone-beam computed tomography volumes
[8] Reconstruction of orthographic mosaics from perspective X-ray images
[9] Robust automatic rigid registration of MRI and X-ray using external fiducial markers for XFM-guided interventional procedures
[10] Comparison of 2D radiographic images and 3D cone beam computed tomography for positioning head-and-neck radiotherapy patients
[11] Stochastic formulation of patient positioning using linac-mounted cone beam imaging with prior knowledge
[12] Known operator learning enables constrained projection geometry conversion: parallel to cone-beam for hybrid MR/X-ray imaging
[13] A new X-ray technique and its application to orthodontia
[14] A truly hybrid interventional MR/X-ray system: feasibility demonstration
[15] Projection-to-projection translation for hybrid X-ray and magnetic resonance imaging
[16] On-line reoptimization of prostate IMRT plans for adaptive radiation therapy
[17] Truly hybrid interventional MR/X-ray system: investigation of in vivo applications
[18] Registration and tracking to integrate X-ray and MR images in an XMR facility
[19] Feasibility of MRI guided proton therapy: magnetic field dose effects
[20] Perspective distortion modeling, learning and compensation
[21] Deep convolutional neural networks for estimating lens distortion parameters
[22] Method for estimation and correction of perspective distortion of electroluminescence images of photovoltaic panels
[23] Blind first-order perspective distortion correction using parallel convolutional neural networks
[24] Rotation and scale invariance with polar and log-polar coordinate transformations
[25] Robust image registration using log-polar transform
[26] Mean shift and log-polar transform for road sign detection
[27] EOS® biplanar X-ray imaging: concept, developments, benefits, and limitations
[28] Robust self-supervised learning of deterministic errors in single-plane (monoplanar) and dual-plane (biplanar) X-ray fluoroscopy
[29] X2CT-GAN: reconstructing CT from biplanar X-rays with generative adversarial networks
[30] Single-image tomography: 3D volumes from 2D cranial X-rays
[31] Patient-specific reconstruction of volumetric computed tomography images from a single projection view via deep learning
[32] Epipolar consistency in transmission imaging
[33] Image-to-image translation with conditional adversarial networks
[34] Unpaired image-to-image translation using cycle-consistent adversarial networks
[35] Deep learning for low-dose CT denoising using perceptual loss and edge detection layer
[36] The RSNA international COVID-19 open radiology database (RICORD)
[37] Generalized chest CT and lab curves throughout the course of COVID-19
[38] The state of the art in kidney and kidney tumor segmentation in contrast-enhanced CT imaging: results of the KiTS19 challenge
[39] The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository
[40] Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study
[41] Evaluation of segmentation methods on head and neck CT: auto-segmentation challenge
[42] Computed tomography data collection of the complete human mandible and valid clinical ground truth models
[43] Radiological cardiothoracic ratio in evidence-based medicine
[44] Learning to find good correspondences
[45] Learning two-view correspondences and geometry using order-aware network
[46] Learning to find good correspondences of multiple objects