Benchmarking of five commercial deformable image registration algorithms for head and neck patients

Jason Pukala,1a Perry B. Johnson,2 Amish P. Shah,1 Katja M. Langen,3 Frank J. Bova,4 Robert J. Staton,1 Rafael R. Mañon,1 Patrick Kelly,1 and Sanford L. Meeks1
Department of Radiation Oncology,1 UF Health Cancer Center at Orlando Health, Orlando, FL, USA; Department of Radiation Oncology,2 University of Miami, Miami, FL, USA; Department of Radiation Oncology,3 University of Maryland, Baltimore, MD, USA; Department of Neurosurgery,4 University of Florida, Gainesville, FL, USA

a Corresponding author: Jason Pukala, Department of Radiation Oncology, UF Health Cancer Center at Orlando Health, 1400 S. Orange Ave., MP 730-13, Orlando, FL 32806, USA; phone: (321) 841 6204; fax: (407) 649 6895; email: Jason.Pukala@orlandohealth.com

Received 29 March, 2015; accepted 14 January, 2016

Benchmarking is a process in which standardized tests are used to assess system performance. The data produced in the process are important for comparative purposes, particularly when considering the implementation and quality assurance of DIR algorithms. In this work, five commercial DIR algorithms (MIM, Velocity, RayStation, Pinnacle, and Eclipse) were benchmarked using a set of 10 virtual phantoms. The phantoms were previously developed based on CT data collected from real head and neck patients. Each phantom includes a start-of-treatment CT dataset, an end-of-treatment CT dataset, and the ground-truth deformation vector field (DVF) which links them together. These virtual phantoms were imported into the commercial systems and registered through a deformable process. The resulting DVFs were compared to the ground-truth DVF to determine the target registration error (TRE) at every voxel within the image set. Real treatment plans were also recalculated on each end-of-treatment CT dataset and the dose transferred according to both the ground-truth and test DVFs. Dosimetric changes were assessed, and TRE was correlated with changes in the DVH of individual structures. In the first part of the study, results show mean TRE on the order of 0.5 mm to 3 mm for all phantoms and ROIs. In certain instances, however, misregistrations were encountered which produced mean and max errors up to 6.8 mm and 22 mm, respectively. In the second part of the study, dosimetric error was found to be strongly correlated with TRE in the brainstem, but weakly correlated with TRE in the spinal cord. Several interesting cases were assessed which highlight the interplay between the direction and magnitude of TRE and the dose distribution, including the slope of dosimetric gradients and the distance to critical structures. This information can be used to help clinicians better implement and test their algorithms, and also understand the strengths and weaknesses of a dose adaptive approach.

PACS number(s): 87.57.nj, 87.55.dk, 87.55.Qr

Key words: deformable image registration, virtual phantoms, quality assurance, adaptive radiotherapy, head and neck cancer

I. INTRODUCTION
Deformable image registration (DIR) is the nonaffine process of mapping voxels from one image to another, where the individual vectors that describe the mapping may vary in both magnitude and direction from their neighbors.
The entire process is encompassed by the deformation vector field (DVF), which aggregates the individual vectors into a single map and specifies the coordinate transformation between the two datasets. The DVF facilitates the transfer of information and allows the user to perform a number of useful functions such as contour propagation(1) or dose accumulation.(2) Initially, these functions were primarily limited to academic centers where many of the deformation algorithms were developed. Over the past several years, however, the use of DIR has expanded widely to the point where DIR is now included in several commercial treatment planning and contouring platforms. Commercial DIR algorithms are powerful and complex. Given two datasets with inherent differences, DIR algorithms are capable of quantifying these differences and minimizing them by creating a new deformed image, the result of a process which morphs one image into another. This ability to deform almost anything is also the fundamental limitation of these systems. Because DIR algorithms are based on complex mathematical models, there is no guarantee that the deformation defined by the DVF will represent biological change accurately. Thus, as with any new tool intended for use within the radiation oncology clinic, implementation must go hand-in-hand with validation.

Much research has been done in this area, primarily through the creation of ground-truth models and QA metrics. In relation to the former, ground-truth models consist of two image sets linked via a known DVF or transferred content such as known landmarks. In an effort to provide consistent datasets for algorithm validation, several researchers have made their ground-truth models publicly available. Examples include the extended cardiac–torso (XCAT) phantom,(3) the point-validated pixel-based breathing thorax model (POPI model),(4) the DIR-Lab Thoracic 4D CT model,(5,6) and those provided as part of the EMPIRE 10 challenge.(7) In an effort to improve the correlation of computer-based phantoms with the actual anatomical changes seen over a course of radiation therapy, the current authors previously developed synthetic datasets derived from clinical images of real head and neck patients.(8) These phantoms are publicly available for users to download as part of the Deformable Image Registration Evaluation Project (DIREP) (http://sites.google.com/site/dirphantoms).

With each of the models described above, it is left to the clinical physicist to test their system, analyze the data, and compare their results with known benchmarks. Several metrics have been proposed for this analysis, ranging from the simple (DICE similarity) to the abstract (the Jacobian of the DVF). One commonly used metric is the target registration error (TRE), which describes the distance between corresponding voxels once they have been mapped through the ground-truth DVF and a test DVF. In the upcoming report on DIR validation by AAPM's Task Group 132, the proposed goal (as opposed to "tolerance") for TRE is 95% of voxels within 2 mm.(9) One limitation shared by all DIR metrics, however, is the disconnect between the quantification of DIR error and the effect that error has on a given dose distribution.
This is a complicated issue, akin to relying on the gamma criterion for IMRT QA analysis. Currently, there are few publications which translate target registration error into DVH error, and thus one of the aims of this research was to investigate this process. In addition to TRE/DVH error, the traditional commissioning framework of test, analyze, and compare is hindered by the current lack of benchmarks specifically purposed for the clinical validation of commercial algorithms. Because the vast majority of usage will be through one of these commercial platforms, it is important to provide clinical physicists comparison data which are directly linked to open-source, ground-truth models. In this way, end users can assess their own implementation using the same tools and metrics as those used during the benchmarking process.

In this work, the five most prevalent commercial DIR algorithms are assessed using the 10-phantom series of the DIREP ground-truth model. Benchmark data are presented for each algorithm using TRE as the DIR metric. Questions relevant to the commissioning process are discussed, including what constitutes a DIR failure, what differences may be expected between different commercial algorithms, and where those differences occur. A further analysis is performed to translate TRE into DVH error, and the results are discussed in relation to the strengths and weaknesses of such an approach.

II. MATERIALS AND METHODS

A. Ground-truth model
The Deformable Image Registration Evaluation Project (DIREP) includes a library of 10 virtual phantoms. These phantoms were previously created based on volumetric imaging of 10 head and neck cancer patients.(8) Briefly, each patient received a start-of-treatment CT (SOT) and a near end-of-treatment CT (EOT). The SOT dataset for each patient was deformed in a forward fashion to represent the anatomy of the EOT dataset (i.e., the EOT was the target dataset and the SOT dataset was deformed to match it). This was done using a combination of a biomechanical algorithm(10,11) and human-guided thin plate splines(12) available as a deformation tool within the ImSimQA software package (Oncology Systems Limited, Shrewsbury, Shropshire, UK). These tools allowed for the modeling of head rotation and translation, mandible rotation, spine flexion, shoulder movement, hyoid movement, tumor/node shrinkage, weight loss, and parotid shrinkage.(13) The combinative approach of utilizing multiple algorithms applied iteratively with human intervention minimizes the bias towards any particular algorithm during subsequent DIR using these phantoms.

The result of the deformation process was a simulated EOT dataset for each patient where the ground-truth deformation was known. Together, the SOT and simulated EOT images form a virtual phantom that may be imported into a third-party DIR software package. The DVF from any third-party algorithm may then be compared to the known ground truth to obtain the deformation error for each image voxel. Expert-drawn contours are also included for several structures in order to provide error statistics for individual organs. Table 1 shows the patient characteristics for each of the virtual phantoms, and Fig. 1 shows an example of one of the phantoms.
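For readers who want a concrete picture of how a known deformation turns an SOT volume into a simulated EOT image, the short sketch below applies a DVF by backward warping. It is an illustration under stated assumptions only, not the ImSimQA workflow used by the authors; the function, the array names, and the convention that the DVF stores backward displacements in voxel units on the output grid are all hypothetical.

```python
# Minimal sketch: warp an SOT CT with a known DVF to produce a simulated EOT image.
# Assumptions (hypothetical, not the authors' pipeline): arrays are numpy volumes
# ordered (z, y, x), and the DVF is a backward mapping in voxel units, i.e. each
# output voxel samples the SOT image at (its own index + displacement).
import numpy as np
from scipy.ndimage import map_coordinates

def warp_with_dvf(sot_image, dvf_vox):
    """sot_image: (z, y, x) array; dvf_vox: (3, z, y, x) displacements in voxels."""
    identity = np.indices(sot_image.shape, dtype=np.float64)  # voxel coordinates
    sample_points = identity + dvf_vox                         # where each voxel pulls from
    # Trilinear interpolation of SOT intensities at the displaced positions
    return map_coordinates(sot_image, sample_points, order=1, mode="nearest")

# Hypothetical usage with placeholder arrays:
# simulated_eot = warp_with_dvf(sot_volume, ground_truth_dvf_vox)
```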
Table 1. Attributes of patients selected for the development of the virtual phantoms.

Patient No. | Disease Site | Stage | Gender | No. of Fractions Delivered | Mean Right Parotid Dose (Gy) | Mean Left Parotid Dose (Gy) | Initial Weight/EOT Weight (kg) | Days Between Planning and EOT Images
1 | Base of tongue | T2N2bM0 | M | 35 | 25.2 | 25.7 | 74.8/70.3 | 60
2 | Base of tongue | T2N2cM0 | F | 35 | 34.7 | 23.2 | 68.0/62.1 | 56
3 | Tonsil | T2N2bM0 | M | 35 | 29.4 | 25.6 | 96.2/88.5 | 57
4 | Nasopharynx | T1N3M0 | F | 33 | 26.7 | 39.5 | 65.3/61.2 | 58
5 | Unknown | T0N2aM0 | M | 35 | 24.5 | 29.0 | 90.3/81.2 | 57
6 | Supraglottic larynx | T1N1M0 | M | 33 | 26.1 | 21.3 | 95.3/82.6 | 43
7 | Tonsil | T2N2aM0 | M | 35 | 14.2 | 41.7 | 93.4/86.2 | 47
8 | Tonsil | T2N2aM0 | F | 35 | 23.5 | 28.5 | 106.1/101.6 | 48
9 | Nasopharynx | T4N2M0 | M | 33 | 55.7 | 48.7 | 68.0/56.7 | 59
10 | Base of tongue | T0N2aM0 | F | 35 | 21.5 | 23.4 | 99.8/81.2 | 68

EOT = end of treatment.

B. DIR algorithm evaluation
At the time of evaluation, each software package had only one primary algorithm available for DIR, represented by the algorithms described below. In order to protect intellectual property, companies that market DIR systems typically do not disclose detailed information about their algorithms. Therefore, each algorithm will be treated as a "black box" for the purposes of this investigation. The virtual phantom image pairs were imported into each system for registration. To avoid an additional confounding variable, an initial rigid registration was not performed before DIR for four out of the five algorithms examined. It was not feasible to disable the initial rigid registration for the Pinnacle algorithm, but this should not substantially affect the results because the phantom image pairs were already well aligned in all of the systems before initiating the deformation. For each registration, the SOT dataset was designated as the primary dataset (the target) and the simulated EOT dataset was designated as the secondary dataset (to be deformed to match the target). This arrangement was chosen to mimic an adaptive workflow, in which a treatment plan is recalculated on daily IGRT imaging and the dose is then transferred back to a planning CT for evaluation. The registration procedure for each algorithm is described in Materials and Methods sections B.1 to B.5 below. After registration, DVFs were exported from each system and compared to the ground-truth DVFs using MATLAB (MathWorks Inc., Natick, MA). DIR error statistics were calculated for all of the voxels contained within the brainstem, spinal cord, mandible, left parotid, right parotid, and external contours.
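The DVF comparison described above was performed in MATLAB; as a rough, language-agnostic illustration of the same calculation, the numpy sketch below computes the per-voxel TRE as the magnitude of the vector difference between the ground-truth and test displacements and summarizes it over one ROI. It assumes (an assumption on our part, not a statement about the vendors' export formats) that both DVFs have already been resampled to a common grid and converted to millimeters.

```python
# Illustrative numpy version of the DVF comparison (the study itself used MATLAB).
# Assumes dvf_gt_mm and dvf_test_mm are (3, z, y, x) displacement fields in mm on
# the same grid, and roi_mask is a boolean (z, y, x) array for one structure.
import numpy as np

def tre_statistics(dvf_gt_mm, dvf_test_mm, roi_mask):
    """Mean, SD, and maximum target registration error (mm) within an ROI."""
    tre = np.linalg.norm(dvf_gt_mm - dvf_test_mm, axis=0)  # per-voxel TRE, shape (z, y, x)
    roi_tre = tre[roi_mask]
    return roi_tre.mean(), roi_tre.std(ddof=1), roi_tre.max()

# Hypothetical usage:
# tre_mean, tre_sd, tre_max = tre_statistics(gt_dvf, test_dvf, brainstem_mask)
```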
B.1 Algorithm 1: MIM
Deformable registration was performed using the VoxAlign algorithm incorporated with MIM version 5.6.2 (MIM Software Inc., Cleveland, OH). The VoxAlign algorithm is a constrained, intensity-based, free-form DIR algorithm. Because the phantom image pairs were in the same DICOM frame of reference, no initial rigid registration was performed. Following deformable registration, the DIR information was saved and exported from the system as a DICOM Deformable Spatial Registration object. Reg Reveal and Reg Refine (MIM Software Inc.) were recently implemented in MIM as tools that allow the user to view and refine deformable registrations. These tools were not used in this study.

Fig. 1. Virtual phantom example (Phantom 5). The top row shows the planning image set and the bottom row shows the simulated EOT image set. Comparison of the images reveals the simulated parotid shrinkage (axial view), head rotation and spine flexion (sagittal view), and weight loss (coronal view).

B.2 Algorithm 2: VelocityAI
The deformable multipass algorithm of VelocityAI version 2.8.1 (Varian Medical Systems, Palo Alto, CA) was used to register the virtual phantoms. The deformable multipass algorithm is a multiresolution B-spline algorithm. No initial rigid registration was performed between the image pairs. The DVF data were exported from VelocityAI as a binary deformation field (BDF) file for analysis.

B.3 Algorithm 3: RayStation
The hybrid deformable registration algorithm of RayStation version 3.5 (RaySearch Laboratories, Stockholm, Sweden) was used to register the phantoms. The hybrid deformable registration algorithm is an intensity-based algorithm that can also incorporate ROI constraints. RayStation requires that a rigid registration be performed before initiating a deformable registration. Therefore, the "Set Identity" registration tool was used to mark the initial coordinates of each image pair as aligned; in other words, all initial rigid registration parameters were set to zero. The phantoms' external contours were defined as required, but no other ROIs were specified. A deformation grid of 2.5 × 2.5 × 3.0 mm was selected for each case. RaySearch provided a script that enabled the export of the calculated DVFs.

B.4 Algorithm 4: Pinnacle
DIR was accomplished using the Dynamic Planning Module of Pinnacle3 9.6 (Philips Healthcare, Fitchburg, WI). The Pinnacle3 system used an implementation of the demons algorithm for deformable registration. Before DIR, rigid registration and image preprocessing were performed on the virtual phantom image pairs. DVFs were exported from the software as binary files.

B.5 Algorithm 5: Eclipse
The DIR tools in Eclipse version 11 (Varian Medical Systems) were used to register the virtual phantom images. The Eclipse treatment planning system also uses an implementation of the accelerated demons algorithm. Rigid registration was not performed prior to deformable registration. DVF information was exported from the software as a DICOM Deformable Spatial Registration object.

C. Propagation of dose
As noted previously, the 10 DIREP phantoms were created from real patient data. In addition to imaging, these data included treatment plans, which were subsequently used in this study to calculate dose on each EOT dataset. All of the treatment plans were calculated and delivered using the TomoTherapy treatment platform (Accuray Inc., Sunnyvale, CA). This dose was then transferred to the individual SOT datasets via the ground-truth DVF. Dose was also transferred through the DVFs provided by each DIR algorithm. The difference between the two dose propagation methods (ground truth vs. test) was evaluated in terms of the difference in mean and maximum (max) dose to each organ at risk. The effect of TRE on DVH error was also quantified by calculating the dosimetric difference in the DVH for each organ on a point-by-point basis along the DVH. This metric, termed the mean DVH difference (ΔDVHmean), was used along with mean and max dose difference in Pearson correlations with both mean (TREμ) and max (TREmax) TRE. Because the treatment plans were not all prescribed to the same total dose, only the dose calculated for a single 2 Gy fraction was used for this study; all treatment plans were originally prescribed at this fractionation, as indicated in Table 1. At the time of the writing of this manuscript, the Pinnacle and Eclipse platforms did not perform deformable dose accumulation. They are included here for comparison purposes.
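To make the dose-comparison metrics concrete, the sketch below evaluates the mean dose difference, max dose difference, and a ΔDVHmean-style quantity for one structure, assuming the EOT fraction dose has already been mapped onto the SOT grid once through the ground-truth DVF and once through a test DVF. The precise point-by-point form of the published ΔDVHmean is not spelled out in the text; here it is read as the mean absolute dose difference between the two cumulative DVHs sampled at matched volume levels, which yields a value in Gy consistent with Table 8, but that interpretation is an assumption.

```python
# Sketch of the dose propagation comparison in section C (one plausible reading,
# not the authors' exact implementation). dose_gt and dose_test are 3D dose grids
# in Gy, already warped onto the SOT dataset; roi_mask selects one organ at risk.
import numpy as np

def dose_at_volume(dose_in_roi, volume_fractions):
    """Dose (Gy) covering at least the given fraction of the ROI volume."""
    return np.quantile(dose_in_roi, 1.0 - volume_fractions)

def dose_metrics(dose_gt, dose_test, roi_mask):
    d_gt, d_test = dose_gt[roi_mask], dose_test[roi_mask]
    mean_diff = d_test.mean() - d_gt.mean()        # difference in mean organ dose
    max_diff = d_test.max() - d_gt.max()           # difference in max organ dose
    vols = np.linspace(0.01, 0.99, 99)             # sample the DVH point by point
    delta_dvh_mean = np.mean(np.abs(dose_at_volume(d_test, vols)
                                    - dose_at_volume(d_gt, vols)))
    return mean_diff, max_diff, delta_dvh_mean
```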
III. RESULTS
Target registration error for each OAR is presented in Tables 2–7. As shown, the differences between the algorithms are small, although registrations performed using the MIM algorithm did consistently produce lower mean errors. This difference was found to be significant (t-test, df = 10, p < 0.05) for each OAR except the right parotid; for this organ, no significance was found between any of the five DIR algorithms. Interestingly, both MIM and RayStation also generated a large misregistration specific to the right parotid of Phantom 9 (TREμ > 6 mm and TREmax > 15 mm for both algorithms). In order to better visualize this result, histograms of the TRE in the left and right parotids of Phantom 9 are shown in Fig. 2. The long tail of the histogram, evident in the right parotid but absent from the left parotid, indicates the large maximum errors generated by the MIM and RayStation algorithms for this case. Registration error histograms for all of the ROIs and phantoms are available in the Supplementary Materials file on the JACMP website at www.jacmp.org.

Table 2. Registration error statistics for all of the voxels contained within the brainstems of the virtual phantoms. Statistics are listed as the mean ± 1 SD. The maximum errors are shown in parentheses. All errors are reported in mm.

Phantom No. | MIM | Velocity | RayStation | Pinnacle | Eclipse
1 | 0.3±0.1 (0.6) | 1.8±0.2 (2.3) | 0.8±0.5 (1.8) | 3.8±2.0 (10.0) | 0.9±0.4 (1.9)
2 | 0.2±0.1 (0.6) | 2.0±0.3 (3.0) | 1.5±0.8 (2.9) | 4.3±1.9 (10.0) | 0.7±0.4 (2.1)
3 | 0.4±0.2 (0.8) | 0.7±0.3 (1.5) | 1.9±0.6 (3.7) | 1.4±0.6 (3.3) | 1.2±0.4 (2.1)
4 | 0.3±0.1 (0.7) | 1.2±0.3 (1.9) | 1.2±0.1 (1.4) | 1.0±0.5 (3.4) | 1.1±0.6 (2.7)
5 | 0.3±0.1 (0.7) | 1.0±0.3 (1.8) | 2.2±0.8 (4.1) | 6.3±2.5 (13.2) | 0.7±0.4 (2.5)
6 | 0.5±0.2 (0.9) | 0.7±0.1 (1.2) | 1.0±0.2 (1.6) | 1.4±0.7 (4.3) | 1.0±0.6 (3.2)
7 | 0.8±0.2 (1.5) | 0.9±0.3 (1.6) | 0.7±0.3 (1.6) | 1.5±0.8 (4.4) | 1.3±0.5 (3.1)
8 | 0.5±0.2 (1.2) | 0.9±0.3 (1.7) | 1.3±0.4 (2.2) | 6.8±3.3 (16.1) | 1.5±0.5 (3.0)
9 | 0.8±0.3 (1.5) | 1.7±0.3 (2.6) | 3.0±0.5 (4.1) | 3.3±1.5 (9.2) | 2.1±0.9 (5.0)
10 | 0.7±0.4 (2.5) | 1.1±0.3 (1.9) | 0.7±0.3 (1.6) | 3.5±2.1 (10.9) | 1.0±0.4 (2.5)
Mean | 0.5±0.2 (2.5) | 1.2±0.5 (3.0) | 1.4±0.7 (4.1) | 3.3±2.1 (16.1) | 1.1±0.4 (5.0)

Table 3. Registration error statistics for all of the voxels contained within the spinal cords of the virtual phantoms. Statistics are listed as the mean ± 1 SD. The maximum errors are shown in parentheses. All errors are reported in mm.

Phantom No. | MIM | Velocity | RayStation | Pinnacle | Eclipse
1 | 0.3±0.2 (1.0) | 2.2±0.4 (2.9) | 0.3±0.2 (1.2) | 0.7±0.3 (1.7) | 0.7±0.3 (1.7)
2 | 0.4±0.2 (0.8) | 2.1±0.6 (3.4) | 0.7±0.2 (1.4) | 0.7±0.4 (2.4) | 0.9±0.3 (1.8)
3 | 0.7±0.5 (2.6) | 1.9±0.9 (3.8) | 2.4±0.9 (4.8) | 1.4±0.6 (3.2) | 1.4±0.5 (3.0)
4 | 0.4±0.3 (1.2) | 1.1±0.4 (2.0) | 0.5±0.2 (1.2) | 0.8±0.4 (1.9) | 0.9±0.5 (2.5)
5 | 0.4±0.2 (1.0) | 0.8±0.2 (1.3) | 0.5±0.2 (1.4) | 1.3±0.7 (4.3) | 0.9±0.6 (2.9)
6 | 0.5±0.2 (0.9) | 0.7±0.2 (1.5) | 2.1±1.2 (4.4) | 1.1±0.4 (2.5) | 0.8±0.3 (1.7)
7 | 0.5±0.2 (1.7) | 4.7±4.3 (14.8) | 1.8±1.6 (7.7) | 1.3±0.8 (3.9) | 1.9±0.9 (4.2)
8 | 0.5±0.3 (1.3) | 1.1±0.4 (2.6) | 0.3±0.2 (1.1) | 0.7±0.3 (2.6) | 1.0±0.5 (3.8)
9 | 0.5±0.2 (1.1) | 2.0±0.5 (2.8) | 0.9±0.5 (3.0) | 1.2±0.5 (3.1) | 1.2±0.5 (2.9)
10 | 0.5±0.2 (1.1) | 0.9±0.2 (1.6) | 0.4±0.2 (1.0) | 1.2±0.7 (3.5) | 1.2±0.4 (2.4)
Mean | 0.5±0.1 (2.6) | 1.8±1.2 (14.8) | 1.0±0.8 (7.7) | 1.0±0.3 (4.3) | 1.1±0.3 (4.2)
Table 4. Registration error statistics for all of the voxels contained within the mandibles of the virtual phantoms. Statistics are listed as the mean ± 1 SD. The maximum errors are shown in parentheses. All errors are reported in mm.

Phantom No. | MIM | Velocity | RayStation | Pinnacle | Eclipse
1 | 0.5±0.5 (2.4) | 2.1±0.5 (3.4) | 0.8±0.3 (1.9) | 1.0±0.5 (3.0) | 1.9±0.9 (4.5)
2 | 0.7±0.5 (4.7) | 1.5±0.7 (4.9) | 0.9±0.6 (5.6) | 1.0±0.7 (6.2) | 1.7±0.7 (7.0)
3 | 1.2±0.9 (4.9) | 1.5±1.0 (4.7) | 2.8±1.7 (6.9) | 1.2±0.6 (3.0) | 1.5±0.6 (4.4)
4 | 0.6±0.6 (3.1) | 1.0±0.4 (2.2) | 1.7±1.2 (6.1) | 0.9±0.4 (3.0) | 1.5±0.6 (4.0)
5 | 0.7±0.6 (4.8) | 0.9±0.4 (2.7) | 2.9±2.2 (8.6) | 1.2±0.7 (5.3) | 1.6±0.9 (5.0)
6 | 0.9±0.6 (2.7) | 1.1±0.4 (2.2) | 1.4±1.0 (6.2) | 1.1±0.5 (3.5) | 1.2±0.5 (3.0)
7 | 1.1±0.7 (3.5) | 1.4±0.4 (2.6) | 1.2±0.6 (3.6) | 0.9±0.5 (2.8) | 1.6±0.6 (3.3)
8 | 0.7±0.6 (4.9) | 1.4±0.7 (4.6) | 1.4±0.7 (4.4) | 2.4±1.4 (6.4) | 3.5±2.0 (9.5)
9 | 1.5±1.2 (6.3) | 1.8±0.8 (5.6) | 1.7±1.1 (5.2) | 1.3±0.8 (5.6) | 2.6±1.4 (7.7)
10 | 0.6±0.4 (2.5) | 1.9±1.0 (5.4) | 0.9±0.5 (2.7) | 1.4±0.9 (4.8) | 3.5±1.5 (7.7)
Mean | 0.9±0.3 (6.3) | 1.5±0.4 (5.6) | 1.6±0.7 (8.6) | 1.2±0.4 (6.4) | 2.1±0.9 (9.5)

Table 5. Registration error statistics for all of the voxels contained within the left parotids of the virtual phantoms. Statistics are listed as the mean ± 1 SD. The maximum errors are shown in parentheses. All errors are reported in mm.

Phantom No. | MIM | Velocity | RayStation | Pinnacle | Eclipse
1 | 1.6±1.6 (7.9) | 2.7±0.3 (3.9) | 1.8±1.1 (5.0) | 1.3±0.6 (3.1) | 1.7±1.2 (5.0)
2 | 0.7±0.5 (3.5) | 1.5±0.6 (2.6) | 1.7±0.9 (4.5) | 2.5±1.1 (6.0) | 2.1±1.2 (5.0)
3 | 1.8±1.8 (10.8) | 1.7±1.2 (6.7) | 4.5±1.9 (11.8) | 2.5±1.3 (9.4) | 3.2±2.2 (11.4)
4 | 0.9±0.7 (3.5) | 2.3±1.1 (6.1) | 1.1±0.4 (2.9) | 2.6±1.3 (5.9) | 1.7±0.9 (4.1)
5 | 1.1±1.2 (7.3) | 2.8±1.6 (8.7) | 2.6±1.6 (8.8) | 2.5±1.2 (7.2) | 1.5±0.8 (4.9)
6 | 0.7±0.7 (4.4) | 1.8±0.8 (5.1) | 1.3±0.7 (4.2) | 1.1±0.5 (3.2) | 1.1±0.5 (2.6)
7 | 0.8±0.4 (2.5) | 0.9±0.3 (1.7) | 1.1±0.5 (3.1) | 0.8±0.4 (2.5) | 1.9±0.8 (3.7)
8 | 1.0±1.1 (6.9) | 3.0±1.1 (6.6) | 1.5±0.8 (4.3) | 1.5±1.1 (6.2) | 3.2±1.8 (7.8)
9 | 2.5±1.5 (7.0) | 2.3±1.0 (6.2) | 3.4±0.9 (5.5) | 2.7±0.8 (4.5) | 2.8±1.4 (6.2)
10 | 0.6±0.6 (4.1) | 2.8±0.8 (6.0) | 1.3±1.1 (5.7) | 1.5±0.9 (5.1) | 1.8±1.0 (4.1)
Mean | 1.2±0.6 (10.8) | 2.2±0.7 (8.7) | 2.0±1.1 (11.8) | 1.9±0.7 (9.4) | 2.1±0.7 (11.4)

The DVH results from the dose propagation study are shown in Figs. 3–7. Starting with the spinal cord, Fig. 3 illustrates the best (Phantom 1) and worst (Phantom 10) case scenarios as scored by ΔDVHmean. It is evident from the graph that TRE had little effect on the DVH for this structure. This is further supported by the Pearson correlations (Table 8 and Fig. 8), which show several dosimetric parameters having only a weak to very weak correlation with both TREμ and TREmax. In direct comparison, the brainstem shows a strong to very strong correlation with TREμ and a moderate correlation with TREmax. Four DVHs of the brainstem are highlighted in Fig. 4, showing both similarity and divergence amongst the algorithms for different cases. The remaining OARs show a moderate to strong correlation with TREμ and a very weak correlation with TREmax. The previously mentioned misregistration in the right parotid of Phantom 9 is shown in Fig. 5 (bottom right). Interestingly, while both MIM and RayStation produced similar mean errors (~ 6 mm), these errors clearly did not affect the DVH in the same way. Another interesting case is that of the left parotid of Phantom 4. In this instance, there is a large deviation in both directions away from the ground-truth DVH. The net effect on the mean dose to the left parotid is a 10.4% increase when using Velocity compared to an 8% decrease when using Pinnacle. This was the largest discrepancy encountered in this study amongst the five commercial algorithms.
Table 6. Registration error statistics for all of the voxels contained within the right parotids of the virtual phantoms. Statistics are listed as the mean ± 1 SD. The maximum errors are shown in parentheses. All errors are reported in mm.

Phantom No. | MIM | Velocity | RayStation | Pinnacle | Eclipse
1 | 1.1±1.0 (5.0) | 1.9±0.5 (3.6) | 1.3±0.7 (3.4) | 1.3±0.6 (3.1) | 1.4±1.0 (3.9)
2 | 0.7±0.4 (2.8) | 2.1±0.4 (2.8) | 1.0±0.6 (2.4) | 1.5±0.6 (3.3) | 3.2±1.8 (6.7)
3 | 0.4±0.3 (2.2) | 0.9±0.4 (2.5) | 1.4±0.6 (2.8) | 1.1±0.5 (3.2) | 1.1±0.5 (3.3)
4 | 0.9±0.9 (5.6) | 1.3±0.4 (3.1) | 2.3±0.6 (4.4) | 1.7±1.0 (4.8) | 1.3±0.7 (4.2)
5 | 1.6±1.8 (10.2) | 1.6±0.7 (4.6) | 3.5±1.5 (8.5) | 2.6±1.3 (6.3) | 1.3±0.9 (5.1)
6 | 1.5±1.6 (7.4) | 1.6±0.7 (4.0) | 2.9±1.1 (5.9) | 1.3±0.7 (3.6) | 1.3±0.9 (4.5)
7 | 0.8±0.7 (5.2) | 1.3±0.3 (2.3) | 1.5±0.5 (3.1) | 0.8±0.4 (3.1) | 1.7±0.8 (3.9)
8 | 0.7±0.6 (4.5) | 2.4±1.1 (4.6) | 2.8±0.9 (5.1) | 1.8±0.9 (4.8) | 1.6±0.8 (4.8)
9 | 6.3±5.1 (22.0) | 1.9±1.1 (7.9) | 6.1±3.2 (15.2) | 3.1±1.6 (8.3) | 4.0±2.0 (8.5)
10 | 0.6±0.8 (5.8) | 1.2±0.5 (3.0) | 0.9±0.7 (3.5) | 1.5±0.9 (5.0) | 1.6±1.0 (4.5)
Mean | 1.5±1.7 (22.0) | 1.6±0.5 (7.9) | 2.4±1.6 (15.2) | 1.7±0.7 (8.3) | 1.8±1.0 (8.5)

Table 7. Registration error statistics for all of the voxels contained within the external contours of the virtual phantoms. Statistics are listed as the mean ± 1 SD. The maximum errors are shown in parentheses. All errors are reported in mm.

Phantom No. | MIM | Velocity | RayStation | Pinnacle | Eclipse
1 | 1.3±2.1 (28.4) | 2.1±0.9 (12.2) | 1.9±2.8 (23.0) | 2.4±2.8 (18.5) | 2.6±2.7 (20.3)
2 | 0.9±1.1 (10.9) | 2.1±1.0 (15.1) | 2.3±1.8 (13.7) | 2.7±3.0 (20.4) | 2.3±2.0 (15.4)
3 | 1.5±2.3 (22.1) | 1.7±1.2 (10.7) | 4.3±3.4 (20.7) | 2.5±2.3 (17.1) | 2.2±2.2 (18.5)
4 | 0.9±1.1 (13.9) | 1.5±0.9 (10.9) | 1.6±1.4 (16.9) | 1.3±1.1 (15.2) | 2.2±1.8 (12.4)
5 | 1.6±2.0 (18.9) | 2.2±1.9 (20.7) | 3.4±2.7 (30.5) | 4.1±3.9 (23.9) | 2.6±2.6 (25.7)
6 | 1.2±1.9 (25.5) | 1.3±0.9 (15.7) | 2.4±1.9 (19.1) | 2.1±1.9 (16.3) | 2.3±2.3 (16.8)
7 | 1.1±1.6 (19.2) | 2.0±2.0 (23.3) | 3.0±3.4 (19.8) | 1.7±1.8 (16.2) | 2.5±1.8 (17.8)
8 | 1.4±1.5 (18.8) | 1.9±1.2 (14.1) | 3.2±2.4 (23.7) | 3.1±3.0 (22.6) | 3.2±2.5 (19.2)
9 | 1.8±2.5 (29.8) | 2.1±1.1 (15.4) | 3.1±2.6 (26.1) | 3.3±2.8 (24.1) | 2.9±2.2 (16.1)
10 | 2.2±3.5 (31.5) | 2.4±1.8 (20.2) | 4.1±5.2 (36.8) | 3.4±3.9 (35.0) | 2.7±2.0 (22.8)
Mean | 1.4±0.4 (31.5) | 1.9±0.3 (23.3) | 2.9±0.9 (36.8) | 2.7±0.8 (35.0) | 2.5±0.3 (25.7)

Fig. 2. TRE histograms for the left (top) and right (bottom) parotids of Phantom 9. The histograms show the number of voxels in the designated ROI that were deformed with the error specified by the x-axis for each algorithm. Note the scale differences of each histogram.
Fig. 3. DVH curves of the spinal cord for Phantom 1 and Phantom 10 after dose has been propagated from the EOT to SOT dataset through the ground-truth and test DVFs. Phantom 1 showed the least disagreement overall with the ground-truth DVH, while Phantom 10 showed the most disagreement.

Fig. 4. DVH curves of the brainstem for Phantoms 1, 2, 5, and 9 after dose has been propagated from the EOT to SOT dataset through the ground-truth and test DVFs. Note the strong convergence with the ground truth for Phantom 2. In this case, the dose distribution did not include the nasopharynx, increasing the distance between the brainstem and the high-dose regions of the treatment plan.

Fig. 5. DVH curves of the right parotid for Phantoms 2, 5, 6, and 9 after dose has been propagated from the EOT to SOT dataset through the ground-truth and test DVFs. Note the disagreement for Phantom 9 when dose is transferred using the RayStation algorithm. In this case, the mean target registration error was 6.1 mm and a high dose gradient directly traversed the right parotid.

Fig. 6. DVH curves of the left parotid for Phantoms 1, 3, 4, and 8 after dose has been propagated from the EOT to SOT dataset through the ground-truth and test DVFs. Note the disagreement between the Velocity and Pinnacle algorithms for Phantom 4. This occurs because the DVF for each algorithm mapped voxels in opposite directions away from the ground truth.

Fig. 7. DVH curves of the mandible for Phantoms 2 and 10 after dose has been propagated from the EOT to SOT dataset through the ground-truth and test DVFs.

Fig. 8. Correlation between mean target registration error and dosimetric error for the spinal cord, right parotid, and brainstem. Pearson correlation coefficients for the distributions can be found in Table 8.

Table 8. Pearson correlation coefficients for several different combinations of target registration error (P1, in mm) and dosimetric error (P2, in the units indicated).

Structure | TREμ vs. Dμ (%) | TREμ vs. Dμ (Gy) | TREμ vs. ΔDVH (Gy) | TREμ vs. Dmax (Gy) | TREmax vs. Dmax (Gy)
Brainstem | 0.807 | 0.717 | 0.752 | 0.585 | 0.470
Mandible | 0.672 | 0.705 | 0.637 | 0.211 | 0.122
Parotid Rt | 0.531 | 0.668 | 0.647 | 0.161 | 0.010
Parotid Lt | 0.523 | 0.484 | 0.449 | 0.194 | 0.140
Cord | 0.265 | 0.286 | 0.273 | 0.188 | 0.132
External | 0.262 | 0.220 | 0.404 | 0.320 | 0.192
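For readers who wish to reproduce correlations of this kind during their own commissioning, the fragment below computes a Pearson coefficient between per-registration TREμ values and a dosimetric error metric. The input arrays are placeholders for illustration only; they are not the data underlying Table 8.

```python
# Illustration only: correlating TRE with dosimetric error across registrations.
# The numbers below are hypothetical placeholders, not the study's data.
import numpy as np
from scipy.stats import pearsonr

tre_mean_mm = np.array([0.3, 0.5, 0.8, 1.2, 1.5, 2.1, 2.6, 3.0, 3.4, 6.8])
delta_dvh_mean_gy = np.array([0.01, 0.02, 0.02, 0.05, 0.04, 0.08, 0.07, 0.10, 0.12, 0.30])
r, p_value = pearsonr(tre_mean_mm, delta_dvh_mean_gy)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```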
IV. DISCUSSION
Benchmarking is a process in which standardized tests are used to assess system performance. In many applications the results are strictly used for intercomparison (i.e., which processor computes the fastest, which business model is more profitable, etc.). In this application, however, the primary purpose is less about the intercomparison of algorithms and more about the establishment of clinically relevant baseline data. These data, as presented in Tables 2–7, may be used in a number of ways. First, users of one of these commercial algorithms may download the 10 head and neck datasets and run them as part of their commissioning process, using the benchmarks for comparison. Additionally, during upgrades to newer software versions, the cases may be rerun as a way of maintaining self-consistency, as discussed by AAPM's Task Group 53.(14) As an example, the large registration errors in the brainstem seen when utilizing the Pinnacle algorithm (Table 2) were reported to the vendor and have reportedly been fixed in current updates to their software. This can be verified using the proposed methodology and will be done in future work.

The benchmark data also provide users with a general idea of the magnitude and variation in TRE for the specific use case adopted in this study: head and neck dose adaptation. As an example, the mean registration error for the spinal cord using the MIM algorithm was 0.5 ± 0.1 mm with an associated standard error of 0.03 mm (n = 10). These statistics provide high confidence that the TRE for an individual deformable registration using the MIM algorithm under similar conditions will be less than 1 mm for this structure. Conversely, in looking at the right parotid when applying the RayStation algorithm, the mean error was 2.4 ± 1.6 mm, with an associated standard error of 0.5 mm. Users in this situation can thus expect higher variability when considering an individual case, where the mean error may range between 2–3 mm.
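The summary statistics quoted above follow directly from the per-phantom TREμ columns in Tables 2–7; as a worked check, the spinal cord values for the MIM algorithm in Table 3 reproduce the quoted mean, SD, and standard error:

```python
# Worked check using the ten per-phantom TREμ values for the spinal cord with the
# MIM algorithm (Table 3): mean ≈ 0.5 mm, SD ≈ 0.1 mm, standard error ≈ 0.03 mm.
import numpy as np

tre_mean_mm = np.array([0.3, 0.4, 0.7, 0.4, 0.4, 0.5, 0.5, 0.5, 0.5, 0.5])
mean = tre_mean_mm.mean()
sd = tre_mean_mm.std(ddof=1)            # sample standard deviation
se = sd / np.sqrt(tre_mean_mm.size)     # standard error of the mean (n = 10)
print(f"mean {mean:.2f} mm, SD {sd:.2f} mm, SE {se:.2f} mm")
```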
In comparison to other studies, Varadhan et al.(13) and Nie et al.(15) used ImSimQA to create virtual H&N phantoms and evaluate DIR. The Varadhan study reported contour comparison metrics between contours deformed using the ground-truth DVF and contours deformed using two different DIR algorithms. While these metrics are useful for contour comparison studies, they are of limited use for deformable dose accumulation or other DIR applications, and would be difficult to compare to the data presented here. This fact highlights the need for consistent metrics and datasets to compare DIR algorithms. The Nie study used a single virtual H&N phantom, along with phantoms for other treatment sites, to compare the DVFs generated by MIM 5.4.7 and VelocityAI 2.2.1 to the ground truth provided by the ImSimQA DVF. The MIM algorithm resulted in 24.2% of voxels having errors greater than 2 mm, 14.6% greater than 3 mm, and only 7% greater than 5 mm. The VelocityAI algorithm resulted in 29.8% of voxels having errors greater than 2 mm, 5.1% greater than 3 mm, and only 0.1% greater than 5 mm. These results are consistent with some of the phantom data reported in this study. However, the differences that we observed between phantoms emphasize the need to evaluate multiple cases for comprehensive DIR QA.

To this point, it is evident that, on an individual basis, H&N DIR can result in large spatial errors which may be difficult to detect. All of the commercial systems generated registrations that could be considered "failures" in the sense that they produced a TREμ more than two standard deviations (SDs) away from the overall system-independent average (TREavg) for selected ROIs. The effects of these failures on a given DVH are not easy to predict. As noted in the results, the strongest correlations between TRE and dosimetric error were in the brainstem. In looking at the three instances where the TREμ qualified as a failure (Pinnacle, Phantoms 2, 5, and 8), two out of the three cases also generated large dosimetric errors (Phantoms 5 and 8), while the third case did not (Phantom 2). In viewing the dose distributions for these phantoms it is clear why this occurred. As seen in Fig. 9, the treatment plan for Phantom 2 did not encompass the nasopharynx, which resulted in a larger distance between the brainstem and the high-dose regions of the treatment plan. Also shown in the figure are dose distributions for Phantoms 8 and 9, which can be compared with the DVH data from Fig. 4 and the TRE data from Table 2.

Fig. 9. Dose distribution shown for Phantom 2 (bottom), Phantom 8 (middle), and Phantom 9 (top). In comparing the DVH and TRE data for these patients, it is evident that the distance between the brainstem and the dose gradient plays a large role in the sensitivity to dosimetric error caused by TRE.

Only the single large error from the Pinnacle algorithm (6.8 mm) appears to affect the DVH for Phantom 8, while smaller errors from all the algorithms (0.8–3.3 mm) produced divergence in the DVH for Phantom 9. The common trend amongst the three phantoms is that the closer the brainstem is to the dose gradient, the more sensitive the brainstem becomes to dosimetric error caused by TRE. These cases illustrate the difficulty in trying to generalize the correlation between TRE and DVH error. Clearly, the correlation depends upon not only the magnitude of the registration error, but also the dose distribution itself, including the slope of nearby dose gradients and the distance to critical structures. This is highlighted further in Fig. 10, where sharp dose gradients traverse both parotids but avoid the cord of Phantom 4. In this case, large differences were seen amongst the five algorithms when viewing the DVH for each parotid, but little difference was seen when considering the spinal cord. While TREavg for the cord was lower compared to both parotids for this phantom, the trend also held true for other cases such as Phantom 7, where TREavg for the cord was nearly double that of the parotids. This is further reflected in the Pearson correlations for the cord, which show only weak to very weak correlation with TREμ. This is likely attributed to the fact that H&N plans uniformly avoid dosing the cord by applying margins to this structure. The margin increases the distance between high dose gradients and the true cord, thus mitigating the impact of target registration error.

For other structures, the direction of the error also played a significant role. The previously mentioned left parotid of Phantom 4 is a prime example, where the DVF produced by the Velocity algorithm erred towards the low-dose region while the DVF produced by the Pinnacle algorithm erred towards the high-dose region (Fig. 11). This led to DVH curves which were shifted in opposite directions (Fig. 6). Without information on both the direction of the error and the distance to the dose gradient, this result would be difficult to contextualize. It is thus important to note that the benchmark data published in this study should be interpreted for individual cases in combination with a priori knowledge of the dose distribution, keeping in mind all the factors that contribute to dosimetric uncertainty. It should also be noted that the data obtained in this work from the virtual phantoms are likely only applicable to cases involving the same treatment site, imaging modality, and magnitude of anatomical changes.
For these phantoms, that would include kVCT images acquired from H&N patients with appropriate immobilization over a single course of treatment. Our experience has shown that deformation algorithms may behave differently depending on these factors. The virtual phantoms presented in this study also have inherent limitations. All of the complexities of the deformation of the human body during a radiotherapy course would be difficult to model. For example, these phantoms do not model sliding interfaces, as might be found in the expansion or contraction of the lungs. They also do not model cavities that appear in one image but not the other.

Fig. 10. Dose distribution shown for Phantom 4. Note the steep dose gradients which traverse the parotids but avoid the cord. Large differences were seen in the DVH for each parotid, but little difference was seen when considering the spinal cord.

V. CONCLUSIONS
In this work, five commercial deformation algorithms have been benchmarked using a set of 10 computational H&N phantoms. The benchmarks have been presented using the error metric TRE, which provides a general assessment of DIR performance. The benchmarks can be used during the commissioning and QA process to help users validate their systems. A dose adaptive strategy was also assessed, whereby TRE was correlated with DVH error. Several factors influenced the results, including the magnitude and direction of the registration error, the slope of the dose distribution, and the distance to critical structures. In assessing clinical cases, the latter two pieces of information will be known, whereas the former must be estimated based on case studies such as this one. When well-defined trends exist, such as shown for the spinal cord, the DIR for a given structure can be utilized with high confidence. In other instances caution is warranted. For these scenarios, strategies which incorporate DIR uncertainty into the dose adaptive process should be considered.

ACKNOWLEDGMENTS
The authors would like to thank each of the vendors for their support in completing this study. This work was partially funded by a grant from Accuray, Inc.

Fig. 11. Coronal view of Phantom 4 showing the DVF within the left parotid as determined by the Pinnacle (below) and Velocity (above) algorithms. Note the opposite direction of the vector fields towards and away from the high-dose regions of the plan, which are located medially.

COPYRIGHT
This work is licensed under a Creative Commons Attribution 4.0 International License.

REFERENCES
1. Zhang T, Chi Y, Meldolesi E, Yan D. Automatic delineation of on-line head-and-neck computed tomography images: toward on-line adaptive radiotherapy. Int J Radiat Oncol Biol Phys. 2007;68(2):522–30.
2. Yan D, Jaffray D, Wong J. A model to accumulate fractionated dose in a deforming organ. Int J Radiat Oncol Biol Phys. 1999;44(3):665–75.
3. Segars W, Bond J, Frush J, et al. Population of anatomically variable 4D XCAT adult phantoms for imaging research and optimization. Med Phys. 2013;40(4):043701.
4. Vandemeulebroucke J, Sarrut D, Clarysse P. The POPI-model, a point-validated pixel-based breathing thorax model. XVth International Conference on the Use of Computers in Radiation Therapy (ICCR), Toronto, Canada, 2007.
5. Castillo R, Castillo E, Fuentes D, et al. A reference dataset for deformable image registration spatial accuracy evaluation using the COPDgene study archive. Phys Med Biol. 2013;58(9):2861–77.
6. Castillo R, Castillo E, Guerra R, et al. A framework for evaluation of deformable image registration spatial accuracy using large landmark point sets. Phys Med Biol. 2009;54(7):1849–70.
7. Murphy K, Van Ginneken B, Reinhardt JM, et al. Evaluation of registration methods on thoracic CT: the EMPIRE 10 challenge. IEEE Trans Med Imaging. 2011;30(11):1901–20.
8. Pukala J, Meeks SL, Staton RJ, Bova FJ, Manon RR, Langen KM. A virtual phantom library for the quantification of deformable image registration uncertainties in patients with cancers of the head and neck. Med Phys. 2013;40(11):111703.
9. Brock K and Mutic S. Image registration II: TG 132—quality assurance for image registration [abstract]. Med Phys. 2014;41(6):453.
10. Gering D, Lu W, Ruchala K, Olivera G. Anatomy driven deformation. Proceedings of the XVIth ICCR, Amsterdam, The Netherlands, 2010.
11. Gering D, Lu W, Ruchala K, Olivera G. Image deformation based on a marionette model [abstract]. Med Phys. 2010;37(6):3127.
12. Bookstein F. Principal warps: thin-plate splines and the decomposition of deformations. IEEE Trans Pattern Anal Mach Intell. 1989;11(6):567–85.
13. Varadhan R, Karangelis G, Krishnan K, Hui S. A framework for deformable image registration validation in radiotherapy clinical applications. J Appl Clin Med Phys. 2013;14(1):192–213.
14. Fraass B, Doppke K, Hunt M, et al. American Association of Physicists in Medicine Radiation Therapy Committee Task Group 53: Quality assurance for clinical radiotherapy treatment planning. Med Phys. 1998;25(10):1773–829.
15. Nie K, Chuang C, Kirby N, Braunstein S, Pouliot J. Site-specific deformable imaging registration algorithm selection using patient-based simulated deformations. Med Phys. 2013;40(4):041911.