title: Tractostorm 2: Optimizing tractography dissection reproducibility with segmentation protocol dissemination
authors: Rheault, Francois; Schilling, Kurt G.; Valcourt‐Caron, Alex; Théberge, Antoine; Poirier, Charles; Grenier, Gabrielle; Guberman, Guido I.; Begnoche, John; Legarreta, Jon Haitz; Cai, Leon Y.; Roy, Maggie; Edde, Manon; Caceres, Marco Perez; Ocampo‐Pineda, Mario; Al‐Sharif, Noor; Karan, Philippe; Bontempi, Pietro; Obaid, Sami; Bosticardo, Sara; Schiavi, Simona; Sairanen, Viljami; Daducci, Alessandro; Cutting, Laurie E.; Petit, Laurent; Descoteaux, Maxime; Landman, Bennett A.
date: 2022-02-10
journal: Hum Brain Mapp
DOI: 10.1002/hbm.25777

The segmentation of brain structures is a key component of many neuroimaging studies. Consistent anatomical definitions are crucial to ensure consensus on the position and shape of brain structures, but segmentations are prone to variation in their interpretation and execution. White-matter (WM) pathways are global structures of the brain defined by local landmarks, which leads to anatomical definitions that are difficult to convey, learn, or teach. Moreover, the complex shape of WM pathways and their representation using tractography (streamlines) make the design and evaluation of dissection protocols difficult and time-consuming. The first iteration of Tractostorm quantified the variability of a pyramidal tract dissection protocol and compared results between experts in neuroanatomy and nonexperts. Despite virtual dissection being used for decades, in-depth investigations of how learning or practicing such protocols impacts dissection results are nonexistent. To begin to fill the gap, we evaluate an online educational tractography course and investigate the impact that learning and practicing a dissection protocol has on interrater (groupwise) reproducibility. To generate the data required to quantify reproducibility across raters and time, 20 independent raters performed dissections of three bundles of interest on five Human Connectome Project subjects, each with four timepoints. Our investigation shows that the dissection protocol, in conjunction with an online course, achieves a high level of reproducibility (between 0.85 and 0.90 for the voxel-based Dice score) for the three bundles of interest and remains stable over time (repetition of the protocol). This suggests that once raters are familiar with the software and the tasks at hand, their interpretation and execution at the group level do not drastically vary. When compared to previous work that used a different method of communication for the protocol, our results show that incorporating a virtual educational session increased reproducibility. Insights from this work may be used to improve the future design of WM pathway dissection protocols and to further inform neuroanatomical definitions.
It is common to expect researchers to quickly understand complex information or to be able to fill in the gaps if the information is missing. When dealing with intricate tasks or software, this premise often leads to inefficient communication. For example, diffusion tractography is used to study the connections of the brain, and a chosen protocol or method must be reproducible to facilitate studies of the white-matter (WM) pathways of the brain. Teaching and conveying a protocol involves describing both complex anatomy and software usage. In our previous work (Rheault, De Benedictis, et al., 2020), we introduced a pyramidal tract (PYT) dissection protocol inspired by Chenot et al. (2019) and evaluated the performance of collaborators executing the instructions. Collaborators were split into two groups: experts with advanced knowledge in neuroanatomy, and nonexperts with only basic or no knowledge in neuroanatomy. Tractostorm (V1) showed that experts and nonexperts had similar levels of variability (between 0.60 and 0.65 for the voxel-based Dice score), with a large deviation around the average.

In this work, we evaluate the efficacy of an online educational session for teaching a WM dissection protocol. The purpose of this study is to help improve the future design of WM pathway dissection protocols and to further inform neuroanatomical definitions by evaluating quality-improvement data from a conducted course. This is a step toward creating standardized definitions and improving the way they are taught. Expertise in bundle reproducibility analysis from the aforementioned prior work allows us to expand the current analysis of WM pathway spatial agreement. In addition to the original protocol (Rheault, De Benedictis, et al., 2020), which only included the PYT, we add two bundles (the arcuate fasciculus [AF] and the body of the corpus callosum [CC]) to the project. The investigation of the efficacy of teaching an online course (as opposed to basing learning only on written instructions) aims to help understand the complexity of anatomical and software descriptions and to assess where the difficulties lie and where clarifications may be needed.

Magnetic resonance imaging (MRI) has become the tool of choice for the in vivo investigation of the brain in neuroimaging studies due to its high resolution and the variety of available contrasts.
MRI has become the gold standard for manual and automatic segmentation of cerebral structures in the hope of finding relevant biomarkers (Boccardi et al., 2011; Fennema-Notestine et al., 2009; Pagnozzi, Conti, Calderoni, Fripp, & Rose, 2018). However, this quest highlighted the heterogeneity of anatomical definitions (Frisoni et al., 2015; Gasperini et al., 2001; Rosario et al., 2011; Visser et al., 2019). Diffusion MRI, and more specifically tractography, specializes in the virtual reconstruction of the structural connectivity of the brain (Griffa, Baumann, Thiran, & Hagmann, 2013; Hagmann et al., 2008; Jones, Simmons, Williams, & Horsfield, 1998). As opposed to locally defined gray-matter structures, WM pathways connect distant regions (Catani & De Schotten, 2008; Yeh et al., 2018), cross each other, and have a complex shape including fanning, torsion, long-distance curvature, and sharp turns (Maier-Hein et al., 2017; Rheault, Poulin, Caron, St-Onge, & Descoteaux, 2020).

Historically, anatomical definitions of WM pathways were scarce and came in a variety of languages, which led to coexisting definitions of the same, or similar, underlying structures. Disagreements in nomenclature (Mandonnet, Sarubbo, & Petit, 2018; Panesar & Fernandez-Miranda, 2019), evolving knowledge of projection (Chenot et al., 2019; Nathan & Smith, 1955), association (Catani et al., 2007; Geschwind, 1970), or commissural (Benedictis et al., 2016; Witelson, 1985) pathways, and debate over the existence (or lack thereof) of specific connections (Forkel et al., 2014; Meola, Comert, Yeh, Stefaneanu, & Fernandez-Miranda, 2015; Türe, Yaşargil, & Pait, 1997) have all contributed to variations in anatomical definitions, which have led to discrepancies between WM pathways bearing the same name (Schilling et al., 2020; Vavassori, Sarubbo, & Petit, 2021). The complex shape and inherent representation of tractography (streamlines) make the interpretation of anatomical definitions and the subsequent dissection of WM pathways, also named virtual dissection (Catani, Howard, Pajevic, & Jones, 2002; Mori & van Zijl, 2002), challenging. Additionally, the level of familiarity with the software or the data and slight differences in decision-making can all influence the dissection protocols carried out by a specific individual (intrarater reproducibility). The way the virtual dissection is performed will also inherently vary across individuals performing it (interrater reproducibility). Moreover, the widespread use of tractography in population studies (e.g., aging or development) and surgery planning (e.g., deep-brain stimulation or electrode placement for epilepsy), combined with the diversity of anatomical definitions, has made it difficult to interpret results and outcomes across publications (e.g., meta-analyses). However, the need for standardization of clinical protocols is not unique to tractography (Boccardi et al., 2011; Frisoni et al., 2015).

The dissection protocol used in this work follows Rheault, De Benedictis, et al. (2020). Two new bundles of interest were added following the same template as the original document. To respect the experimental design, the raters were instructed to strictly follow the instructions, to perform the tasks on their own time in the 2 months following the online session, on the provided data, to follow the same data set ordering, and to use the same software. Raters performed virtual dissection of the body of the CC, the left AF, and the left PYT on 20 data sets.
Unbeknownst to the raters, the 20 data sets were in fact five Human Connectome Project (HCP; Glasser et al., 2013) subjects that were duplicated four times (subjects 1-2-3-4-5, 1-2-3-4-5, …). In this work, the four duplicates are referred to as "timepoints" because our study design required raters to annotate the data sets sequentially. The duplicated data sets were not scan-rescan acquisitions; they were identical copies of tractograms and maps already processed by the authors. By providing identical tractograms, only the variability induced by the manual segmentations was targeted, rather than variability induced by the processing pipelines. The project involved no processing from the collaborators and aimed to quantify only how consistent the segmentation obtained from a specific protocol was. The raters were instructed to save the regions of interest (ROIs) defined by the segmentation as well as the resulting bundles. For this work, the relevant data submitted by each rater was composed of 3 (bundles) × 5 (HCP subjects) × 4 (timepoints) = 60 files (trk file format; see the sketch below).

The original data provided to the raters was the same as described in Rheault, De Benedictis, et al. (2020). Briefly, probabilistic particle-filtering tractography (Girard, Whittingstall, Deriche, & Descoteaux, 2014) from constrained spherical deconvolution (Tournier et al., 2008) produced around 1.5 M streamlines for each data set. The decision to provide the same data was made to facilitate potential comparisons between both projects. Data quality and processing were adapted for the current study design. Since the data had already been used in a similar study, uncertainty related to computer performance during the virtual dissection was low. Our goal is to evaluate the capacity of raters to perform repeated virtual dissection tasks. These tasks are limited to ROI "drawing" (i.e., shape, size, and position) on the provided data. The raters only had to open the software, load the preprocessed data (tractograms and maps), and then follow the instructions to identify the anatomical landmarks as described.

One of the limiting factors of the initial Tractostorm project was the use of only one bundle of interest. This was due to the initial complexity of the study design and the number of unknown variables. Using the same template and aiming for the same level of clarity, a dissection protocol was defined for each of the three bundles of interest (CC, AF, and PYT). The prior work helped with refining the project and allowed us to expand the number of bundles. The decision to limit dissection to one hemisphere (left AF and left PYT) was made to reduce the workload for our raters. As part of the protocol, 15 ROIs had to be drawn per data set. Then, three bundles had to be dissected using a subset of these ROIs and inclusion/exclusion rules. Once a data set was dissected and the required files saved, modifications were not allowed. If a major mistake (e.g., mixing up left/right) was observed before the following data set was started, corrections were allowed.

To ensure a similar level of familiarity with the software used for the project among all raters, the software and the protocol were introduced in a 2-hr online educational session. The recording of the online educational session and a document describing the protocol in detail were made available to the raters. Collaboration between raters was not allowed.
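To make the per-rater submission concrete, the sketch below enumerates the 60 expected trk files and flags missing ones, mirroring the kind of submission check the study performed manually. The directory layout, bundle labels, and file names are hypothetical; the protocol's actual naming convention is not reproduced in this article.

```python
# Hypothetical layout check for one rater's submission:
# 20 data sets (5 HCP subjects x 4 timepoints) x 3 bundles = 60 .trk files.
from itertools import product
from pathlib import Path

BUNDLES = ["CC_body", "AF_left", "PYT_left"]  # invented labels
DATASETS = range(1, 21)  # data sets 1..20, ordering fixed by the protocol


def expected_files(rater_dir: Path) -> list:
    """List the 60 bundle files a single rater is expected to submit."""
    return [rater_dir / f"dataset_{d:02d}" / f"{b}.trk"
            for d, b in product(DATASETS, BUNDLES)]


def missing_files(rater_dir: Path) -> list:
    """Return the expected files absent from a submission."""
    return [f for f in expected_files(rater_dir) if not f.is_file()]
```

For a complete submission, missing_files(Path("rater_01")) would return an empty list; naming errors such as those observed during data gathering would surface here as "missing" files.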
Minimal interaction with the principal investigator was allowed to confirm the interpretation of the tasks (software installation, data set ordering, files to save, how to submit, etc.). Following the course, the principal investigator remained available for questions, and raters were encouraged to practice the protocol and to explore the software if needed. However, due to time zone differences, some raters (in Europe) had reached the end of their workday by the end of the session. Raters were allowed to ask general questions, and the answers were emailed to everyone if necessary. Raters had to complete the tasks on their own schedule within 2 months following the online course. This was considered a realistic timeline that would accommodate all collaborators, considering the various stages of their academic careers/schedules during the COVID-19 pandemic and the fact that the expected duration of the task was estimated to be 10-20 hr. This personal freedom in the submission timeline was also allowed in the first Tractostorm project and was justified by the difficulty of supervising/controlling the schedule of 20 international researchers.

To quantify the reproducibility between timepoints (intrarater) and across raters (interrater), three metrics were used:

Dice score of voxels: Each bundle is converted to the binarized volume of the voxels its streamlines traverse, which is then compared to the binarized volume of another data set. The number of streamlines does not influence the results outside the volume they occupy. This metric is highly sensitive to outliers because outliers quickly increase the nonoverlapping volume.

Dice score of streamlines: Quantifies the agreement of the exact selection of streamlines. Since compared data sets were matched across raters, streamlines can be compared directly. The value for this metric lies between 0 and 1 and represents the ratio of streamlines that are identical in both data sets to the total number of streamlines in both data sets. A perfect score is much harder to achieve since this metric is inherently linked to streamline count, while the Dice score of voxels is not.

Correlation of density maps: Measures the coherence between density maps. A large overlap between bundles' cores is more important than the sparse overlap of rare spurious streamlines and/or outliers. The goal of this metric is to assess whether the distribution of streamlines in space is similar. This allows bundles with different streamline counts to reach high scores if their density maps are correlated.

Since no single rater can be said to have the right dissection, we rely on a group average (majority vote) to establish our gold standard. In the first Tractostorm project, only the experts' group was used to generate the gold standard, and it was established that the nonexperts were closely similar to the gold standard (both groups delineated bundles that were very similar on average). These results demonstrated that expertise in neuroanatomy is not required to follow our segmentation protocol and achieve a gold standard that is anatomically meaningful (see Figure 2).

Similar to Rheault, De Benedictis, et al. (2020), metrics that include true negatives in their computation were excluded, as they tend to converge toward a perfect score because true negatives are overrepresented by an order of magnitude or two. A typical volume (or tractogram) contains millions of voxels (or millions of streamlines), while the typical dissection contains only thousands of voxels (or thousands of streamlines). The chosen binary classification metrics are kappa, precision, and sensitivity for both the voxel and the streamline representations.
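The article does not include analysis code, so the following NumPy-only sketch is an illustrative reimplementation of the three agreement metrics and the vote-based gold standard described above; the function names are ours. It assumes bundles have already been rasterized to binary masks and density maps on a common voxel grid, and that streamline selections are sets of indices into the shared tractogram (as in this study's matched data).

```python
import numpy as np


def dice_voxels(mask_a, mask_b):
    """Dice overlap of two binary voxel masks (sensitive to outliers)."""
    inter = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * inter / (mask_a.sum() + mask_b.sum())


def dice_streamlines(ids_a, ids_b):
    """Dice on exact streamline selections; valid only because every
    rater dissected identical copies of the same tractogram."""
    return 2.0 * len(ids_a & ids_b) / (len(ids_a) + len(ids_b))


def density_correlation(dens_a, dens_b):
    """Pearson correlation between two streamline density maps."""
    return float(np.corrcoef(dens_a.ravel(), dens_b.ravel())[0, 1])


def vote_mask(masks, ratio=0.5):
    """Group mask keeping voxels selected by at least `ratio` of the
    submissions: 0.0125 (1/80) ~ union, just above 0.5 ~ majority vote,
    1.0 ~ intersection (cf. Figure 2)."""
    return np.mean(np.stack(masks).astype(float), axis=0) >= ratio


def precision_sensitivity(mask, gold):
    """Binary classification of one rater against the gold standard;
    true negatives are deliberately never used."""
    tp = np.logical_and(mask, gold).sum()
    return tp / mask.sum(), tp / gold.sum()
```

With a ratio just above 0.5, vote_mask yields the majority-vote gold standard used here, and each rater's dissection can then be scored against it with dice_voxels or precision_sensitivity.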
FIGURE 1: Representation of the study design. Twenty collaborators (raters) contributed by carrying out our protocol; each had three bundle dissections to perform for each of the 20 data sets. The 20 data sets were five HCP subjects (missing from the figure), each with four timepoints. The total submitted data consisted of 1,200 bundles and 6,000 ROIs. HCP, Human Connectome Project.

FIGURE 2: Example of gold-standard generation obtained by using a voting approach. Each row shows the bundles of interest and represents a smooth isosurface at the selected threshold. From left to right: multiple voting ratios from 0.0125 (union) to 0.5125 (majority vote) to 1.0 (intersection) from 80 segmentations of the first subject. At each increase in the voting threshold, the number of voxels decreases. A minimal vote set at 1 out of 80 (1/80 or 0.0125; left) is equivalent to a union of all segmentations, while a vote set at 80 out of 80 (right) is equivalent to an intersection of all segmentations. Both of these thresholds are prone to variations due to outliers in the submitted data. Thresholds at 25, 50, and 75% generate similar group averages due to the raters' high spatial consistency; a majority-vote approach was selected for its intuitiveness and coherence with the first Tractostorm project.

Statistical differences between HCP subjects or between bundles were tested using a Mann-Whitney rank test with a significance threshold of 0.01. Longitudinal trends were tested using a linear mixed model, treating bundles as different groups and accounting for random effects from raters, where the null hypothesis is that the slope is zero (significance threshold of 0.01). A sketch of these tests is shown below.

During the data-gathering phase of the project, despite the protocol requiring a strict filename convention, various naming errors demonstrated that following instructions, even simple ones, is prone to errors. However, these kinds of errors were easy, albeit time-consuming, to correct manually. Upon reception of the data, each bundle was visually inspected (and the naming convention verified).
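The article does not name its statistical software, so the sketch below shows how the two tests just described could be run with SciPy and statsmodels. It assumes a hypothetical long-format table with one reproducibility score per row and columns score, subject, bundle, timepoint (numeric 1-4), and rater; fitting one mixed model per bundle is our interpretation of "using bundles as different groups".

```python
import pandas as pd
from scipy.stats import mannwhitneyu
import statsmodels.formula.api as smf

ALPHA = 0.01  # significance threshold used in the study


def compare_groups(df: pd.DataFrame, column: str, a, b) -> bool:
    """Mann-Whitney rank test between two HCP subjects or two bundles."""
    _, p = mannwhitneyu(df.loc[df[column] == a, "score"],
                        df.loc[df[column] == b, "score"],
                        alternative="two-sided")
    return p < ALPHA  # True if the difference is significant


def longitudinal_trends(df: pd.DataFrame) -> dict:
    """Per-bundle linear mixed model with a random intercept per rater;
    H0 is that the slope over timepoints is zero."""
    pvalues = {}
    for bundle, sub in df.groupby("bundle"):
        model = smf.mixedlm("score ~ timepoint", sub, groups=sub["rater"])
        pvalues[bundle] = model.fit().pvalues["timepoint"]
    return pvalues
```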
From prior experience, the PYTs seem to have been more consistently segmented than in the previous study (shown in the last row of Figure 3). Extreme variations were less common, and major outliers were rarer in the PYT than in the initial Tractostorm project. The vast majority of submissions were close to what was expected from anatomical knowledge. The general shape and position matched the known anatomy the protocol attempted to dissect. As seen in Figure 3, no major misinterpretation or obviously mistaken dissection was found. Despite the noisy nature of probabilistic tractography and the admittedly difficult task of interpreting and executing the instructions, the submitted data appeared consistent and rarely contained spurious streamlines.

When performing the tasks on the exact same data, consistent measures are expected, but as shown in Figure 4, the resulting dissections cover a wide range of scalar measures. While some measures are consistent, that is, average fractional anisotropy (FA) or average length, others are much more variable, that is, streamline count and volume. Scalar measurements are disconnected from the spatial agreement, which is why one measure can be extremely stable (e.g., average FA) and another extremely variable (e.g., volume).

In Figure 5, the intrarater reproducibility scores show a high level of consistency for the voxel-based metric (Dice score of voxels). The AF obtained lower scores on average for the metrics that take streamlines into account (correlation of density maps and Dice score of streamlines), which indicates that the overall spatial agreement is good, but the streamlines themselves were not spatially distributed similarly. On average, Dice scores of voxels achieve very close results for all bundles (CC 0.89 ± 0.06, AF 0.89 ± 0.08, and PYT 0.88 ± 0.07). However, as seen in Figure 5, these scores vary from subject to subject. This is particularly apparent for the streamline-based metrics. When each rater was analyzed individually, we observed that reproducibility was not equal across all raters. However, no single rater systematically scored very high/low reproducibility.

FIGURE 5: Reproducibility scores for intrarater agreement for all subjects. There is no longitudinal/temporal component to this figure, and all timepoints (per HCP subject) are needed to compute the intrarater scores. As expected, the voxel representation (Dice score) shows high reproducibility across bundles and HCP subjects. Only one bundle (AF) was highly impacted by anatomical differences (across subjects) for the streamline representation. AF, arcuate fasciculus; CC, corpus callosum; HCP, Human Connectome Project; PYT, pyramidal tract.

No statistically significant difference in interrater reproducibility is observable when the data are analyzed longitudinally (in the chronological order of dissection, for each HCP subject). As shown in Figure 6, no relationship between timepoints and any metric can be distinguished. No single rater was responsible for systematically different dissections. Submissions that are completely inconsistent with the group are rare. This leads to only a few reproducibility scores being much lower, which contributes to increasing the interquartile range. The voxel-based representation produces higher and more stable reproducibility scores. Reproducibility scores do not vary across bundles for the voxel-based metric (Dice score of voxels: CC 0.83 ± 0.08, AF 0.84 ± 0.10, and PYT 0.83 ± 0.07). The metrics that take streamline density into account do vary across bundles and across subjects. For example, the AF at the timepoints associated with the first HCP subject achieves very high interrater scores for all metrics (e.g., a correlation of density maps of 0.97 ± 0.02). However, the timepoints associated with the last HCP subject score much lower and more variably across all metrics (e.g., a correlation of density maps of 0.64 ± 0.31). This mirrors the patterns across bundles/subjects seen in the intrarater analysis.

To evaluate the binary classification metrics, the dissection of each rater was compared to the group average (gold standard; Figure 7).

Results from the three bundles of interest show that reproducibility varies across pathways. This is in line with previous works (Boukadi et al., 2019; Cousineau et al., 2017; Wakana et al., 2007). It is unknown whether the dissection rules and landmarks are inherently harder to define or whether some bundles are simply more prone to spurious streamlines and outliers (e.g., more ROIs needed to be defined, and therefore the small variations or "mistakes" add up). Future work involving a formal analysis of the ROIs (saved by raters as part of this protocol) will aim to disentangle this question and to provide insight into good practice for future protocol development and/or to inform anatomical definitions at large.
We believe such an investigation deserves its own line of analysis. An interesting pattern is observable for the AF: the best and the worst intrarater and interrater reproducibility scores were obtained in the first and the last HCP subject, respectively. Preliminary investigation suggests that higher variability in some of the ROIs associated with the AF may be the cause. This indicates that anatomical differences can have an impact on the identification of landmarks and drastically influence the reproducibility/quality of a dissection. Identifying the exact source of this unintuitive variability is crucial to improving the current protocol. It could be due to a misplaced ROI causing thousands of streamlines, which largely overlap with the rest of the bundle, to be discarded. This would affect the density maps and the correlation metric without a major impact on the overall volume of the bundle.

Raters' reproducibility scores were well distributed, but some outperformed others. Furthermore, some raters were more similar to the group average (which is considered anatomically meaningful). This could indicate that there is such a thing as a "good rater" and a "bad rater". Not only is a good rater expected to have a high intrarater reproducibility score, but they are also expected to have a high agreement with the group average. This is referred to as master tracers/raters in the European Alzheimer's Disease Consortium-Alzheimer's Disease Neuroimaging Initiative (EADC-ADNI) hippocampus project (Frisoni et al., 2015).

FIGURE 6: Reproducibility scores (interrater) for all timepoints showing agreement at the group level. The x-axis represents first-to-last subject (1-5) and first-to-last repetition (a-d). No discernible temporal pattern can be observed, and interrater agreement remains stable as the amount of "practice" increases. Similar to the intrarater agreement, the AF (streamline representation) seems to be more difficult to segment consistently at the group level depending on the HCP subject. AF, arcuate fasciculus; CC, corpus callosum; HCP, Human Connectome Project; PYT, pyramidal tract.

The results from multiple bundles, as well as a modified teaching approach, confirmed the hypothesis from the first Tractostorm study that reproducibility scores cannot be easily generalized. Each protocol modification has the potential to drastically affect reproducibility. As hypothesized in Rheault, De Benedictis, et al. (2020), we confirmed that different bundles have different reproducibility scores. This confirms that any modification (e.g., teaching method, software) or addition (e.g., new ROIs, new bundles) to the protocol will likely change the reproducibility scores, and thus generalization is likely impossible. The major differences in streamline-representation metrics (e.g., Dice score of streamlines) between HCP subjects for the AF indicate that some anatomical structures are harder to define/find and can have a bundle-specific impact on reproducibility. This could be amplified when dealing with data sets covering a wide range of ages or pathologies. This further supports that generalization is extremely difficult and that reproducibility should be studied independently for each bundle. Modifications to protocols should trigger a reproducibility evaluation, and each protocol should be targeted at a somewhat specific range of audiences, data sets, and populations. For example, this work was mainly designed for raters without a neuroanatomy background, working on young/healthy subjects from the HCP database.
However, a silver lining is that some flexibility is possible when targeting the scope of a protocol. Rheault, De Benedictis, et al. (2020) demonstrated that the distinction between the experts and nonexperts groups (with and without a formal anatomy background) had a minimal effect on spatial agreement in the voxel representation (Dice score of voxels). Furthermore, TractEM (Bayrak et al., 2019) showed that acquisition quality (angular/spatial resolution) did not have a major influence on agreement (both Dice score of voxels and correlation of density maps). Finally, this work showed that by leveling raters' familiarity with the software and the protocol through an online educational session, virtual dissection tasks can reach a very high spatial agreement for every rater and remain stable. This is reassuring for those aiming for standardized WM pathway dissection protocols or for automatic dissection methods that rely on curated bundles obtained from such protocols. Widely different protocols prevent comparison across publications in the literature and limit the potential for meta-analyses (and comparisons can further break down due to variations in processing, etc.). This is why standardization is important, and such a resource-intensive investigation (e.g., the current work), repeated frequently for minor variations, would be a waste. A subsequent project is already planned; it aims to investigate the ROIs submitted by our raters to inform future practical definitions. The variability of ROIs across raters and the influence of shape and distance will provide insight into future protocol iterations. The general aim is to help design future protocols that can vary in robustness, time restrictions, or complexity.

FIGURE 8: Results from the survey that portray a general picture of our group of raters and how they experienced/conducted the tasks. The affirmations (bottom) are not quantitative and rely on personal assessment only (e.g., "I am familiar", "I respected").

Overall, the project was appreciated by our collaborators despite its heavy workload. During the planning phase, it was estimated (to plan workload) that each bundle dissection would take 5-10 min (15-30 min per data set, 5-10 hr in total). As seen in Figure 8, these values were underestimated; from the feedback we received, this is likely because the first subject or two took much longer, and the 15 min per data set was achieved only toward the end for most raters. The instructions were seen as "simple", while the software was perceived as more complex than the instructions. This reinforces the intuition that software could be a major source of variability.

The timeline of execution could also be an important variable to investigate. After the course and on their own time, raters were allowed to decide when to execute the tasks and how many data sets to do each time. This freedom was also allowed in the first Tractostorm project, mainly because of the difficulty of supervising or controlling the schedule of 20 researchers spread across North America and Europe. The window for raters to submit their segmentation data was open for 2 months after the online course. The vast majority of raters submitted their delineations between Week 4 and Week 8, with two exceptions: one rater finished the tasks within 1 week of the online course, and another finished the tasks 2 weeks after the allowed window (authorized by the principal investigator due to personal circumstances).
In this work, we quantified the effect of practicing and learning a protocol for WM pathway dissection. Using matched data and a large group of raters, we quantified their individual agreement (intrarater) as well as their group agreement (interrater). We demonstrated that as raters practice, their interpretation and execution remain stable. Despite the global nature of WM pathways, high spatial/voxel reproducibility can be achieved. However, we observe that modifying the teaching method has a large effect. The online educational session on the software and protocol had a major positive impact (30% higher median and 50% lower interquartile range) on the reproducibility of the PYT (the only bundle in common across both studies). It is important to note that the variations between both Tractostorm projects indicate that bundle dissection protocols, even if designed with a similar template and the same level of detail, cannot be easily generalized, and so careful evaluation must be systematically performed. This evaluation of the impact of a teaching method on protocol results is an essential step toward improving the future design of WM dissection protocols.

Each collaborator's contribution to the project was made possible by various sources of funding:

• Leon Cai was supported by NIH/NIGMS, grant number 5T32GM007347.
• Viljami Sairanen was supported by the Orion Research Foundation sr, the Instrumentarium Science Foundation sr., and the Emil Aaltonen Foundation.
• Pietro Bontempi was supported by the Verona Brain Research Foundation (high-field MRI for the study of peripheral nerve microstructure).
• Guido Guberman was supported by the Vanier Canada Graduate Scholarship.
• Maggie Roy was supported by The Mathematics of Information Technology and Complex Systems (MITACS).
• Charles Poirier, Gabrielle Grenier, and Philippe Karan were supported by scholarships from the Natural Sciences and Engineering Research Council (NSERC) and the Fonds de Recherche du Québec Nature et technologies (FRQNT).
• Sami Obaid was supported by the Savoy Foundation studentship and by scholarships from the Fonds de Recherche du Québec - Santé (277581).
• We thank the Université de Sherbrooke institutional research chair in Neuroinformatics that supports Maxime Descoteaux and his team.
The data provided to the raters are openly available on Zenodo at https://zenodo.org/record/5190145#.YY7qRnVKhH5 in 2022. The provided data are from the Human Connectome Project (HCP). The data that support our results (tracer annotations) will be available at the same link in 2022.

References:
TractEM: Fast protocols for whole brain deterministic tractography-based white matter atlas. bioRxiv.
New insights in the homotopic and heterotopic connectivity of the frontal portion of the human corpus callosum revealed by microdissection and diffusion tractography.
Survey of protocols for the manual segmentation of the hippocampus: Preparatory steps towards a joint EADC-ADNI harmonized protocol.
Test-retest reliability of diffusion measures extracted along white matter language fiber bundles using HARDI-based tractography.
Symmetries in human brain language pathways correlate with verbal recall.
A diffusion tensor imaging tractography atlas for virtual in vivo dissections.
Virtual in vivo interactive dissection of white matter fasciculi in the human brain.
A novel frontal pathway underlies verbal fluency in primary progressive aphasia.
A population-based atlas of the human pyramidal tract in 410 healthy participants.
A test-retest study on Parkinson's PPMI dataset yields statistically significant white matter fascicles.
The superoanterior fasciculus (SAF): A novel white matter pathway in the human brain? Frontiers in Neuroanatomy.
Structural MRI biomarkers for preclinical and mild Alzheimer's disease. Human Brain Mapping.
The anatomy of fronto-occipital connections from early blunt dissections to contemporary tractography.
The EADC-ADNI harmonized protocol for manual hippocampal segmentation on magnetic resonance: Evidence of validity.
Intra-observer, inter-observer and interscanner variations in brain MRI volume measurements in multiple sclerosis.
The organization of language and the brain.
Towards quantitative connectivity analysis: Reducing tractography biases.
The minimal preprocessing pipelines for the Human Connectome Project.
Structural connectomics in brain diseases.
Mapping the structural core of human cerebral cortex.
Non-invasive assessment of structural connectivity in white matter by diffusion tensor MRI.
The challenge of mapping the human connectome based on diffusion tractography.
The controversial existence of the human superior fronto-occipital fasciculus: Connectome-based tractographic study with microdissection validation.
Fiber tracking: Principles and strategies, a technical review.
Long descending tracts in man: I. Review of present knowledge.
A systematic review of structural MRI biomarkers in autism spectrum disorder: A machine learning perspective.
Commentary: The nomenclature of human white matter association pathways: Proposal for a systematic taxonomic anatomical classification.
Tractostorm: The what, why, and how of tractography dissection reproducibility.
MI-Brain, a software to handle tractograms and perform interactive virtual dissection. ISMRM Diffusion Study Group Workshop.
Common misconceptions, hidden biases and modern challenges of dMRI tractography.
Inter-rater reliability of manual and automated region-of-interest delineation for PiB PET.
Tractography dissection variability: What happens when 42 groups dissect 14 white matter bundles on the same dataset? bioRxiv.
High interrater variability in intraoperative language testing and interpretation in awake brain mapping among neurosurgeons or neuropsychologists: An emerging need for standardization.
Resolving crossing fibres using constrained spherical deconvolution: Validation using diffusion-weighted imaging phantom data.
Is there a superior occipitofrontal fasciculus? A microsurgical anatomic study.
Hodology of the superior longitudinal system of the human brain: A historical perspective, the current controversies, and a proposal.
Inter-rater agreement in glioma segmentations on longitudinal MRI.
Reproducibility of quantitative tractography methods applied to cerebral white matter.
The brain connection: The corpus callosum is larger in left-handers.
Population-averaged atlas of the macroscale human structural connectome and its network topology.