key: cord-0174516-gglq2gtl
authors: Pati, Sarthak; Thakur, Siddhesh P.; Bhalerao, Megh; Thermos, Spyridon; Baid, Ujjwal; Gotkowski, Karol; Gonzalez, Camila; Guley, Orhun; Hamamci, Ibrahim Ethem; Er, Sezgin; Grenko, Caleb; Edwards, Brandon; Sheller, Micah; Agraz, Jose; Baheti, Bhakti; Bashyam, Vishnu; Sharma, Parth; Haghighi, Babak; Gastounioti, Aimilia; Bergman, Mark; Mukhopadhyay, Anirban; Tsaftaris, Sotirios A.; Menze, Bjoern; Kontos, Despina; Davatzikos, Christos; Bakas, Spyridon
title: GaNDLF: A Generally Nuanced Deep Learning Framework for Scalable End-to-End Clinical Workflows in Medical Imaging
date: 2021-02-26
journal: nan
DOI: nan
sha: 2d9978d3d17a02d03d5fedccc388a2ba5a3a2367
doc_id: 174516
cord_uid: gglq2gtl

Deep Learning (DL) has greatly highlighted the potential impact of optimized machine learning in both the scientific and clinical communities. The advent of open-source DL libraries from major industrial entities, such as TensorFlow (Google), PyTorch (Facebook), and MXNet (Apache), further contributes to the promise of DL for the democratization of computational analytics. However, increased technical and specialized background is required to develop DL algorithms, and the variability of implementation details hinders their reproducibility. To lower this barrier and make DL development, training, and inference more stable, reproducible, and scalable, without requiring an extensive technical background, this manuscript proposes the Generally Nuanced Deep Learning Framework (GaNDLF). With built-in support for $k$-fold cross-validation, data augmentation, multiple modalities and output classes, and multi-GPU training, as well as the ability to work with both radiographic and histologic imaging, GaNDLF aims to provide an end-to-end solution for all DL-related tasks, to tackle problems in medical imaging and provide a robust application framework for deployment in clinical workflows.

Deep Learning (DL) describes a subset of Machine Learning (ML) algorithms built upon the concepts of neural networks [1]. Over the last decade, DL has shown great promise in various problem domains such as semantic segmentation [2] [3] [4] [5], quantum physics [6], segmentation of regions of interest (such as tumors) in medical images [7] [8] [9] [10] [11] [12], medical landmark detection [13, 14], image registration [15, 16], and predictive modelling [17], among many others [18] [19] [20] [21] [22]. The majority of this vast research was enabled by the abundance of DL libraries made open source and publicly available, with some of the major ones being TensorFlow (developed by Google) [23] and PyTorch [24] by Facebook (which absorbed Caffe2, the successor of Caffe [25], originally developed at the University of California at Berkeley). These are the most widely used libraries that have facilitated DL research. Among the currently available libraries, PyTorch has proved itself to be one of the most customizable and easily deployable on local workstations through its robust and efficient C++ backend [26].

There have been various efforts by the medical imaging community towards addressing the clinical end-points of academic research, and packaging pre-coded/pre-trained models for data scientists to leverage and address clinical requirements. However, these efforts have resulted in numerous software packages, which can confuse the less experienced user and lead to endless hours of searching for the appropriate tool to use. To alleviate this situation, we stratify these efforts into a set of well-defined categories to deepen the community's understanding. Some of these efforts reside on one side of the spectrum and can be classified as applications, since they focus on the end-user, with powerful user interfaces (either graphical, or otherwise). Software packages on the other end of the spectrum can be stratified as libraries, since they are built as a mechanism to access low-level machine functionality, while other packages that fall in the middle layer between these two ends provide a layer of abstraction to enable research and can be classified as toolkits. Finally, other software packages can be classified as frameworks, since they fulfil various roles and attempt to provide a multitude of functions targeting both developers and end-users. Examples of such packages are the Medical Imaging Interaction Toolkit (MITK) [27] and the Cancer Imaging Phenomics Toolkit (CaPTk) [28] [29] [30] [31] [32]. GaNDLF also falls into this latter category, with a notably unique emphasis on DL. Figure 1 illustrates this stratification, while also providing some pertinent examples. Ensuring DL algorithms follow similar paradigms would make them more accessible to clinical researchers, and this would greatly benefit the current scientific paradigm by increasing their clinical impact.
Towards this end, developing and making publicly available an open source user-focused framework is expected to allow training and inferring DL algorithms without the need to code, while adhering to best practices established by the greater ML community. Specifically, these practices include (i) nested cross-validation [33] [34] [35], and (ii) artificial augmentation of training data [36, 37]. Such a framework, incorporating capabilities to handle end-to-end processing (i.e., pre- and post-processing steps) in a cohesive and reproducible manner, would be tremendously useful, and would contribute greatly to democratizing ML (and particularly DL) in the field of medical imaging. Additionally, having a framework that can work across different tasks (segmentation, regression, and classification) and varying diagnostic modalities (radiographic and histologic) would take precision diagnostics to the next frontier of quantitative integrative diagnostics.

Some of these efforts are non-DL based, such as MITK [27], 3D Slicer [38], ITK-SNAP [39], and CaPTk [28] [29] [30] [31] [32], which have been lauded for their generalizability but fall short when it comes to competitive performance for specific challenges. Towards obtaining superior performance, various efforts concentrating on DL have been devised by the community, such as NiftyNet [40], DeepNeuro [41], ANTsPyNet [42], and DLTK [43], which are implemented in TensorFlow, as well as pymia [44], InnerEye [45], and MONAI, which are implemented in PyTorch. However, all these applications, toolkits, and frameworks (i) describe developer-focused tools targeting members of the advanced computational research community, (ii) can be difficult to grasp by researchers without sufficient experience in DL, (iii) do not make it easy for DL scientific developers to write their architectures in a generalizable way that allows their application to problems spanning across domains, (iv) make it difficult to write training pipelines for different problem domains, (v) place the onus of training robust and generalizable models on the user's knowledge of the training mechanism and the dataset in question, and (vi) lack a single end-to-end application programming interface (API) for training and inference that can span across various problem domains.

In this manuscript, we introduce GaNDLF (a GenerAlly Nuanced Deep Learning Framework) to enable researchers to solve problems involving segmentation, regression, classification, and synthesis, while producing robust DL models without requiring much knowledge of DL or coding experience. We have developed GaNDLF in PyTorch/Python, as an abstraction layer that incorporates widely used open source libraries (such as Insight Toolkit (ITK) [46] for handling most I/O operations and basic image processing routines, and TorchIO [47] for data augmentation) that can help researchers generate robust DL models quickly and reliably, facilitating reproducibility [48, 49] and being consistent with the criteria of findability, accessibility, interoperability, and reusability (FAIR) [50]. Furthermore, the flexibility of its codebase allows GaNDLF to be used across medical imaging modalities (e.g., 3D radiology scans, and 2D histology whole slide images), with scope for integrating other clinical data (such as genomics and electronic health records) in the future.

The intention behind the development of GaNDLF is to provide an end-to-end solution for training robust DL models to tackle a variety of tasks, such as segmentation, regression, and classification, in both 2D and 3D image datasets. It is also designed to work across radiology data (e.g., Magnetic Resonance Imaging (MRI), Computed Tomography (CT), Positron Emission Tomography (PET)) and digitized histology whole slide images (WSI) (e.g., Hematoxylin and Eosin (H&E) stained tissue sections), including specialized image pre-processing functionalities for both. The notable difference between these images is the relatively small resolution and size of radiology images (typically occupying a few megabytes of disk space), compared with the histology WSI that are images of relatively large resolution (150K × 150K pixels), where a single image can occupy 30-40 gigabytes. Supporting both types of data enables researchers to use a single package across virtually all medical imaging modalities without performing any additional coding, thereby enabling future studies that rely on integrative diagnostics [51]. Owing to the flexibility of the data loading mechanism in GaNDLF, it could also be possible to integrate genomic data into a model, further contributing to the field of personalized medicine.

Providing robust pre-processing techniques, widely applicable to medical imaging data, is critical for such a general-purpose framework to succeed. GaNDLF offers most of the pre-processing techniques reported in the literature, leveraging the capabilities of basic standardized pre-processing routines from ITK [46], and advanced pre-processing functionality from the Cancer Imaging Phenomics Toolkit (CaPTk) [28] [29] [30] [31] [32]. The main pre-processing steps for data curation (including harmonization and normalization) are described below.

• Voxel-resolution harmonization: To ensure that the physical definition of the input data is in a common space (for example, all images can have the voxel resolution of $I_{res} = [1.0, 1.0, 2.0]$).

• Image-resolution harmonization: To ensure that the input data has the same image dimensions (for example, all images can be resampled to $I_{dim} = [240, 240, 155]$).

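• Thresholding: To consider pixel/voxel values that belong to a specific intensity range and ignore values below/above this range, by making them equal to zero (Eq. 1), where $I_{in}(x)$ and $I_{out}(x)$ denote the input and output intensity at pixel/voxel $x$, and $[t_{min}, t_{max}]$ the intensity range:

$$I_{out}(x) = \begin{cases} I_{in}(x), & t_{min} \le I_{in}(x) \le t_{max} \\ 0, & \text{otherwise} \end{cases} \quad (1)$$

• Clipping: To consider pixel/voxel values that belong to a specific intensity range and convert values below/above this range, by making them equal to the minimum/maximum threshold, respectively (Eq. 2), with $I_{in}(x)$, $I_{out}(x)$, $t_{min}$, and $t_{max}$ denoting the input intensity, output intensity, and the minimum and maximum thresholds:

$$I_{out}(x) = \begin{cases} t_{min}, & I_{in}(x) < t_{min} \\ I_{in}(x), & t_{min} \le I_{in}(x) \le t_{max} \\ t_{max}, & I_{in}(x) > t_{max} \end{cases} \quad (2)$$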

• Rescaling: To consider all pixel/voxel values after converting them to a common profile (for example, all input images are rescaled to [0, 1]).

• Z-score normalization: A widely used technique for data normalization in medical imaging [52, 53] , that preserves the complete signal of the input image by subtracting the mean and then dividing by the standard deviation of the complete intensity range found in this image. Notably, the application of z-score normalization through GaNDLF can occur either on the full image or only within a masked region of interest, adding to the overall flexibility of this transform.
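The intensity-based steps listed above can be illustrated with a short NumPy sketch. This is not GaNDLF's implementation (which builds on ITK and CaPTk); the function names and the handling of constant images are assumptions made purely for illustration.

```python
import numpy as np


def threshold(image, t_min, t_max):
    """Keep values inside [t_min, t_max]; set everything else to zero."""
    inside = (image >= t_min) & (image <= t_max)
    return np.where(inside, image, 0)


def clip(image, t_min, t_max):
    """Replace values below/above the range with t_min/t_max, respectively."""
    return np.clip(image, t_min, t_max)


def rescale(image, new_min=0.0, new_max=1.0):
    """Linearly map the intensity range of the image to [new_min, new_max]."""
    image = image.astype(np.float64)
    old_min, old_max = image.min(), image.max()
    if old_max == old_min:  # constant image: avoid division by zero
        return np.full_like(image, new_min)
    scaled = (image - old_min) / (old_max - old_min)
    return scaled * (new_max - new_min) + new_min


def z_score(image, mask=None):
    """Subtract the mean and divide by the standard deviation.

    If a boolean mask is given, the statistics are computed only inside it
    and voxels outside the mask are set to zero.
    """
    image = image.astype(np.float64)
    region = image if mask is None else image[mask]
    mean, std = region.mean(), region.std()
    if std == 0:
        std = 1.0  # constant region: avoid division by zero
    normalized = (image - mean) / std
    if mask is not None:
        normalized = np.where(mask, normalized, 0.0)
    return normalized
```

For instance, `z_score(image, mask=image > 0)` would normalize only the non-zero voxels, which corresponds to the masked variant of the transform described above.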

DL methods are well-known for being extremely data hungry [54, 55] and in medical imaging, data is scarce because of various technical, privacy, cultural/ownership concerns, as well as data protection regulatory requirements, such as those set by the Health Insurance Portability and Accountability Act (HIPAA) of the United States [56] and the European General Data Protection Regulation (GDPR) [57] . This necessitates the addition of robust data augmentation techniques [58] into the training data, so that models can gain knowledge from larger datasets and hence be more generalizable to unseen data [59] .

GaNDLF leverages an existing robust data augmentation package, namely TorchIO [47], which provides augmentation transformations in a PyTorch-based mechanism (Table 1). GaNDLF specifically focuses on TorchIO over other packages (such as Batch Generators [60] and Albumentations [61]), because TorchIO provides an easier API for storing extraneous image information. Such information includes the affine transform of the image, which is critical for maintaining the correct physical definition of radiology scans. Furthermore, TorchIO also maintains a strong tie-in with PyTorch's Tensor data structure, thereby making it relatively easier to extend the data loading with different data formats. More details on the available types of augmentations through GaNDLF are shown in Table 1, and examples of their effects are illustrated in Figure 2, using a brain tumor T2-weighted-Fluid-Attenuated Inversion Recovery (T2-FLAIR) MRI scan from the BraTS challenge's dataset [9, 11, 62, 63, 64].

k-fold cross-validation [65, 66] is a useful technique in ML that ensures reporting unbiased performance estimates and helps capture information from an entire given dataset, by training k different models on corresponding folds of the complete training data. Specifically, to ensure robust model training, and to provide unbiased performance estimates and quantitative validation of the generalization performance of the implemented algorithms (i.e., by evaluating results on new unseen data), GaNDLF offers a nested k-fold cross-validation schema [67], i.e., a well-established way of evaluating algorithms on new datasets [68]. During this nested k-fold cross-validation, all data are combined into a single cohort with a model configuration of three sets (training, validation, and testing).
Initially, cases of the complete cohort are proportionally and randomly divided into k non-overlapping equally-sized subsets and during each fold, k − 1 of these subsets are considered as the retrospective/discovery cohort and 1 as the prospective/replication cohort, which is unseen for this specific fold. Note that during each fold, the prospective/replication cohort is a different subset. This cross-validation scheme analyzes the given data as if it had independent discovery and replication cohorts, but in a more statistically robust manner by randomly permuting across all given data. Put differently, in this cross-validation scheme (for example with k = 10), the implemented algorithms are trained on the discovery set (90% of the data) and then evaluated on the replication set (10% of the data). In other words, the first level of split is to ensure randomization of the testing data (also known as replication cohort), and the second level of split is to ensure randomization of validation data (also known as discovery cohort). The number of folds for each level of split is specified in the configuration file, and the models for different folds can be trained in parallel (in accordance with the user's computation environment). GaNDLF also offers the option of specifying single fold training, if so desired.
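The two levels of splitting can be sketched in plain Python as follows. This is a minimal illustration under the assumption of a simple shuffled, interleaved split; the function names and the seeding strategy are hypothetical and do not reflect GaNDLF's internal implementation.

```python
import random


def k_fold_split(items, k, seed=0):
    """Shuffle items and split them into k non-overlapping, near-equal folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_k_fold(subjects, k_outer=5, k_inner=5, seed=0):
    """Yield (train, validation, test) subject lists for every fold pair."""
    outer_folds = k_fold_split(subjects, k_outer, seed)
    for i, test in enumerate(outer_folds):
        # First level: everything outside the held-out testing fold
        remaining = [s for j, fold in enumerate(outer_folds) if j != i for s in fold]
        # Second level: split the remaining subjects into training/validation
        inner_folds = k_fold_split(remaining, k_inner, seed + i + 1)
        for m, validation in enumerate(inner_folds):
            train = [s for n, fold in enumerate(inner_folds) if n != m for s in fold]
            yield train, validation, test


splits = list(nested_k_fold(range(400), k_outer=5, k_inner=5))
print(len(splits))                      # number of models to train
print([len(part) for part in splits[0]])  # sizes of train/validation/test
```

With 400 subjects and 5 folds at both levels, this sketch produces 25 splits of 256 training, 64 validation, and 80 testing subjects each.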

Figure 2: Data augmentation abilities in GaNDLF using TorchIO [47]. Illustrated examples on a brain tumor T2-FLAIR MRI scan, from the BraTS challenge's dataset.

GaNDLF provides a flexible mechanism for training DL models, with support for multiple imaging modalities and output classes, along with support for both 2D and 3D datasets. The main entry point of GaNDLF's training mechanism is a CSV file provided by the user, through the command line interface. The expected CSV file should comprise the subject identifiers along with the corresponding full paths of all required input images and masks (i.e., for segmentation tasks) and the values required for training and follow up predictions (i.e., for regression and classification tasks). The subject identifiers are used to randomly split the entire dataset into training, validation, and testing subsets, using k-fold cross-validation [69] (see Section 2.3 for more details). Furthermore, a YAML-based configuration file is used to control and parameterize all aspects of the training, such as the proportional split of the cross-validation, data pre-processing, data augmentations (e.g., type, parameters, and probabilities), model parameters (e.g., architecture, list of classes, final convolution layer, optimizer type, loss function, number of epochs, scheduler, learning rate, batch size), along with the training queue parameters (i.e., samples to extract per volume, maximum queue length, and number of threads to use). The YAML-based configuration file must also indicate the GaNDLF version used to create the trained model, with the intention of ensuring coherence between that version and the actual trained model. GaNDLF also supports mixed precision training [70] to save computational resources and reduce training time.
A single epoch comprises training the model using the training portion of the data and backpropagating the generated loss, followed by evaluating the model performance on the validation portion of the data. In addition to saving the model trained after every epoch, each model corresponding to the best global losses for the training, validation, and testing datasets is also saved. These saved models can be used for subsequent inference, either using a single independent model or in an aggregated fashion utilizing label fusion. Training statistics (such as the Dice similarity coefficient and loss) are stored for each epoch, for both the validation and the testing data, in the form of a comma-separated text file (CSV), with the intention of facilitating simplified results reporting and detailed debugging. The overall pipeline of the training procedure offered in GaNDLF is illustrated in Figure 3.
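To make the role of the input CSV more concrete, the sketch below parses a small table with one row per subject. The column names (`subject_id`, `image_0`, `image_1`, `mask`) and the file paths are hypothetical placeholders chosen for illustration; they are not GaNDLF's actual header names.

```python
import csv
import io

# Hypothetical layout: one identifier, two input images, and one mask per subject
csv_text = """subject_id,image_0,image_1,mask
001,/data/001/t1.nii.gz,/data/001/t2.nii.gz,/data/001/mask.nii.gz
002,/data/002/t1.nii.gz,/data/002/t2.nii.gz,/data/002/mask.nii.gz
"""


def read_subjects(text):
    """Parse the CSV text into {subject_id: {"inputs": [...], "mask": path}}."""
    subjects = {}
    for row in csv.DictReader(io.StringIO(text)):
        # Collect all input image columns in the order they appear in the header
        inputs = [row[name] for name in row if name.startswith("image_")]
        subjects[row["subject_id"]] = {"inputs": inputs, "mask": row["mask"]}
    return subjects


subjects = read_subjects(csv_text)
print(sorted(subjects))  # identifiers that would be split into the different subsets
```

Keying everything on the subject identifier is what allows the dataset to be split per subject, so that all images of one subject end up in the same subset.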

GaNDLF's inference mechanism follows the same paradigm as its training mechanism, where the user needs a CSV file comprising the subjects' identifiers and the full paths of images, along with a YAML configuration file and the location of the trained models. For each trained model, the corresponding estimated output is stored and (depending on the user's parameterization) a final predicted output is generated by aggregating the outputs of the independent models. This aggregation happens through different approaches, subject to the prediction task, e.g., a label fusion approach may be used for segmentation tasks, averaging for regression tasks, and majority voting for classification. If the full paths of the ground truth labels are given in the input CSV, then the overall metrics (e.g., Dice and loss) of the model's performance are also calculated and stored.
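The aggregation strategies mentioned above can be sketched with the Python standard library. The function names are hypothetical, and voxel-wise majority voting is used here as one simple instance of label fusion; the text does not specify which label fusion approach is applied.

```python
from collections import Counter
from statistics import mean


def majority_vote(predictions):
    """Return the most frequent class label among the models' predictions.

    Ties are resolved in favour of the label that was seen first.
    """
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Return the mean of the models' regression outputs."""
    return mean(predictions)


def fuse_labels(masks):
    """Voxel-wise majority voting over equally sized, flattened label masks."""
    return [majority_vote(votes) for votes in zip(*masks)]


# Outputs of three independently trained models for the same subject
print(majority_vote([1, 0, 1]))                          # classification
print(average([61.5, 63.0, 62.1]))                       # regression
print(fuse_labels([[0, 1, 1], [0, 1, 0], [1, 1, 0]]))    # segmentation
```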

As soon as the data is read into memory, GaNDLF applies the pre-processing steps defined in the configuration file to each input dataset (see Section 2.1 for examples of these steps). Then TorchIO's [47] inference mechanism is used to enable patch-based inference for radiology images. This entails patch extraction, usually of the same size as the one that the corresponding model has been trained on, from the image(s) on which the model needs to infer. The forward pass of the model is then applied and the result is stored in the corresponding location (Figure 4). This enables models to be trained and inferred on varied patch sizes based on the available hardware resources. Overlapping patches can be stitched by either cropping or taking an average of the predictions at the overlapping area, and the amount of overlap can be specified to ensure that dense inference can occur [47]. Although patch-based training and inference is widely used, we note that various potential adverse effects of this process have been reported [71].
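The idea of patch-based inference with averaging in the overlapping areas can be sketched in NumPy as follows. This is a simplified 2D illustration, not TorchIO's implementation; the function names are hypothetical, `model` stands for any callable that returns an output of the same shape as its input patch, and the overlap is assumed to be smaller than the patch size.

```python
import numpy as np


def patch_starts(length, patch, overlap):
    """Start indices of patches covering [0, length) with the given overlap."""
    step = patch - overlap
    starts = list(range(0, max(length - patch, 0) + 1, step))
    if starts[-1] + patch < length:  # add a final patch so the border is covered
        starts.append(length - patch)
    return starts


def infer_by_patches(image, model, patch=(4, 4), overlap=(2, 2)):
    """Apply `model` to overlapping 2D patches and average the overlaps."""
    accumulated = np.zeros(image.shape, dtype=np.float64)
    counts = np.zeros(image.shape, dtype=np.float64)
    for r in patch_starts(image.shape[0], patch[0], overlap[0]):
        for c in patch_starts(image.shape[1], patch[1], overlap[1]):
            window = (slice(r, r + patch[0]), slice(c, c + patch[1]))
            accumulated[window] += model(image[window])  # forward pass on one patch
            counts[window] += 1                          # how often each pixel was predicted
    return accumulated / counts


image = np.arange(63, dtype=np.float64).reshape(7, 9)
output = infer_by_patches(image, model=lambda p: 2.0 * p)
```

Because every pixel is divided by the number of patches that covered it, a model whose prediction does not depend on the patch position yields the same result as a single pass over the full image.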

Histology WSIs need a different mechanism for inference than that for training, primarily due to their increased hardware requirements, i.e., WSIs can reach 40GB when loaded in memory. Figure 5(a) illustrates this inference mechanism, which starts with the extraction of a WSI's imaging data at the maximum magnification/resolution (e.g., 40×) and its conversion to a TIFF with 9-10 layers of tiled images with different magnification levels (i.e., Figure 5 "Data Fixing Pipeline"). The background area is then filtered out through the generation of a 'tissue mask' (Figure 5(a).(iv)), using Red-Green-Blue (RGB) and Otsu-based thresholding [72, 73], which is needed to correctly handle image reading issues occurring when trying to buffer any magnification level other than the lowest (Figure 5(b)). This 'tissue mask' also reduces the search space for downstream analyses, and hence reduces the overall computational footprint. This mask is used to calculate foreground coordinates (Figure 5

This section describes the modularity and extendability of GaNDLF's functionality. A description of the software stack of GaNDLF is provided, as well as how the lower level libraries are utilized to create an abstract user interface, which can be customized based on the application at hand. Following this, the flexibility of the framework from a technical point-of-view is chronicled, which illustrates the ease with which new functionality can be added.

The software stack of GaNDLF is illustrated in Figure 6 , depicting the inter-connections between the lower level libraries and more abstract functionalities exposed to the user via the command line interface. This ensures that a researcher can perform DL training and/or inference without having to write a single line of code. Furthermore, the flexibility of the stack is demonstrated by the ease with which a new component (e.g., a pre-processing step, or a new network architecture) can be incorporated into the framework, and subsequently applied to new types of data/applications with minimal effort. Specifically, the framework's flexibility affects components listed in the following subsections.

To ensure maximum flexibility and applicability across various types of medical data, GaNDLF supports both 3D and 2D datasets. Using the same codebase, GaNDLF has the ability to apply various architectures across diverse modalities such as MRI, CT, retinal, and digitized histology images, including both immunohistochemical (IHC) and Hematoxylin and Eosin (H&E) stained tissue sections.

GaNDLF supports multiple input channels/modalities/sequences and output classes, for either segmentation or regression, to ensure maximal applicability across various problem domains, whether it involves a binary task (e.g., brain extraction) or multi-class tasks (e.g., brain tumor sub-region segmentation).

(a) Specialized inference mechanism for WSIs.

(b) Example of a problematic WSI, illustrating data loss in certain patches.

Figure 5: GaNDLF's histology inference mechanism in (a) and illustration of a problematic example in (b), motivating the need for specialized pre-processing. (a) Starting with the raw WSI, multiple specialized pre-processing steps (ii-vi) are performed before a patch can be given as input to a trained model. The coordinates of each patch need to be saved along with the overlap information in order to obtain the final result. (b) Reading data across magnification levels can cause artificial regional data loss.

• Radiology images require the ability to process both 2D and 3D data. Although the imaging examples that GaNDLF has been applied to and evaluated on so far comprise CT, MRI, and tomosynthesis scans, it offers support for almost every radiologic image via ITK.

• Histology images, in contrast, require specialized handling along the following criteria:

- Input: The use of OpenSlide [74] allows GaNDLF to read a fraction of the entire WSI data at the resolution closest to the requested magnification level, thereby ensuring memory-efficiency.
- Patch-extraction: Since a WSI cannot always be processed in its entirety due to hardware constraints, a patch-based mechanism considering multiple resolutions is essential. This mechanism is offered through our open-source Open Patch Miner (OPM), which has been configured within GaNDLF for simple and rapid batch-processing of patches. OPM can automatically mask tissue in a WSI and convert the highest available resolution to square patches, given a pre-defined overlap amount and patch dimensions. Specifically, it extracts patches with the pre-defined overlap using a pseudo-grid and parallel sampling adjustable for tissue inclusion, in proportion to different tissue classes (for classification tasks), and while omitting the background region.
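The principle of extracting square patches on a grid while omitting the background can be sketched as follows. This is not OPM's API or algorithm: it uses a plain regular grid instead of OPM's pseudo-grid and parallel sampling, and the function name and the `min_tissue` parameter are assumptions introduced for illustration.

```python
import numpy as np


def tissue_patch_coordinates(tissue_mask, patch_size, overlap, min_tissue=0.5):
    """Return top-left (row, col) corners of square patches on a regular grid.

    A patch is kept only if the fraction of tissue pixels inside it is at
    least `min_tissue`, so that background regions are omitted.
    """
    step = patch_size - overlap
    rows, cols = tissue_mask.shape
    coordinates = []
    for r in range(0, rows - patch_size + 1, step):
        for c in range(0, cols - patch_size + 1, step):
            fraction = tissue_mask[r:r + patch_size, c:c + patch_size].mean()
            if fraction >= min_tissue:
                coordinates.append((r, c))
    return coordinates


# Toy tissue mask: the right half of an 8 x 8 slide contains tissue
mask = np.zeros((8, 8), dtype=bool)
mask[:, 4:] = True
print(tissue_patch_coordinates(mask, patch_size=4, overlap=0))
```

Storing the coordinates rather than the pixel data keeps the memory footprint small, since each patch can then be read from the WSI only when it is needed.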

• UNet without residual connections: The UNet [75] (Figure 7(a), Figure 9) is one of the most well known architectures of Convolutional Neural Networks (CNN) used for 2D and 3D segmentation. The UNet consists of an encoder, comprising convolutional layers and downsampling layers, and a decoder offering upsampling layers (applying transpose convolution layers) and convolutional layers. The encoder-decoder structure contributes to automatically capturing information at multiple scales/resolutions. The UNet further includes skip connections, which consist of concatenated feature maps paired across the encoder and the decoder layer, to improve context and feature re-usability.

• ResUNet: This describes a UNet extended with residual connections between the encoder and the decoder to improve the backpropagation process [7, 75, 76, 77, 78] (Figure 7(a), Figure 9). Our implementation follows the principle of the UNet architecture, while including skip connections in every convolution block. The residual connections utilize additional information from previous layers (across the encoder and decoder) that enables a boost in performance.

• Inception UNet (UInc): The Inception module [2, 79] can be used to substitute the standard convolutional block (which is a simple series of convolutional layers) of the UNet to create the UInc architecture (Figure 7(d), Figure 12). This module describes parallel pathways of convolutional layers of different kernel sizes, to improve the representation of multi-scale features. UInc has been applied towards semantic segmentation tasks [80].

• Fully Convolutional Network (FCN): The FCN architecture [81] (Figure 7(b), Figure 10), introduced in 2017, utilizes hierarchical feature extraction with an encoder recognizing both imaging patterns and spatial information of each input image, with varying receptive fields. FCN has smaller computational requirements compared to UNet, due to the absence of a decoding module that incorporates convolution and transpose convolution operations. Instead, FCN simply upsamples the encoded features to the size of the required output segmentation to generate masks. It hence provides faster, yet coarser, segmentations for various domains [82].

• VGG: The VGG [83, 84] (Figure 7(c), Figure 11) is a well-known network for performing classification and regression tasks. VGG16 has 13 convolutional layers and 3 dense layers, i.e., 16 weight layers in total. It is well known for its performance on the ImageNet classification challenge [85]. VGG reinforced the idea that networks should be simple and deep. VGG uses 3 × 3 convolution filters and 2 × 2 max-pooling layers with a stride of 2 throughout the architecture. The original architecture uses the ReLU activation function [86] and the categorical cross-entropy loss function. The convolutional layers of the VGG16 perform feature extraction and the final dense layers, followed by a softmax, act as the classifier. GaNDLF supports multiple variants of the VGG, namely, VGG-11, VGG-13, VGG-16, and VGG-19, with and without batch normalization, for both 2D and 3D datasets to maximize flexibility.
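The layer structure of VGG-16 can be checked with a few lines of Python. The layout list below encodes the commonly published VGG-16 configuration (numbers are the output channels of the 3 × 3 convolutions, "M" marks a 2 × 2 max-pooling layer); the 224-pixel input size and the padding of 1 are assumptions taken from the standard 2D ImageNet setting, not from GaNDLF.

```python
# Commonly published VGG-16 layout; "M" denotes 2x2 max-pooling with stride 2
VGG16_LAYOUT = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]
NUM_DENSE_LAYERS = 3


def summarize(layout, input_size=224):
    """Count convolution and pooling layers and track the spatial size."""
    num_conv = sum(1 for item in layout if item != "M")
    num_pool = layout.count("M")
    # A 3x3 convolution with padding 1 keeps the size; each pooling halves it
    output_size = input_size // (2 ** num_pool)
    return num_conv, num_pool, output_size


num_conv, num_pool, output_size = summarize(VGG16_LAYOUT)
print(num_conv, num_pool, output_size)   # 13 5 7
print(num_conv + NUM_DENSE_LAYERS)       # 16
```

The name VGG-16 thus refers to the total number of weight layers (convolutional plus dense), while the pooling layers carry no weights.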

As already mentioned, GaNDLF can train DL models for various tasks, including segmentation, regression, and classification. Depending on available resources, most models (such as UNet) can be extended for all these tasks, and there are specialized models that perform specific tasks, such as the brain age prediction model [88], which starts from a VGG-16 model pre-trained on ImageNet and is only defined for regression. The flexibility of GaNDLF's framework makes it possible for all these models to co-exist and to leverage the robustly designed data loading and augmentation mechanisms for future extensions of studies. Having a common API for all these tasks also makes it relatively easy for researchers to start applying well-defined network architectures towards various problems and datasets, thereby helping to get DL-based pipelines into clinical workflows.

It is an ongoing problem that deep neural networks lack the interpretability or explainability necessary for medical practitioners to trust the networks' decisions, hindering the practical application of such models in clinical practice [89, 90]. To counter this, GaNDLF integrates the PyTorch library M3D-CAM [91], which enables the easy generation of attention maps of CNN-based models for both 2D and 3D data, and is applicable to both classification and segmentation models. The attention maps can be generated with multiple methods: Guided Back-propagation [92], Grad-CAM [93], Guided Grad-CAM [93], and Grad-CAM++ [94]. The maps visualize the regions in the input data that most heavily influence the model prediction at a certain layer.

For segmentation tasks, we use the Dice Similarity Coefficient [95] (Equation 3) as the performance evaluation metric, and all related models were trained to maximize it. Dice is a common metric used to evaluate the performance of segmentation tasks. It measures the extent of spatial overlap, while taking into account the intersection between the predicted masks (PM) and the provided ground truth (GT), and hence handles over- and under-segmentation:

$$Dice = \frac{2\,|PM \cap GT|}{|PM| + |GT|} \quad (3)$$

For regression and classification tasks, we use the Mean Squared Error (MSE) [96] as our evaluation metric and all models were trained to minimize it. MSE measures the statistical difference between the target prediction T and the output of the model P for the entire sample size n, as illustrated by Equation 4:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (T_i - P_i)^2 \quad (4)$$
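Both metrics can be computed in a few lines of NumPy. The sketch below is illustrative only; the function names are hypothetical, and returning 1.0 when both masks are empty is a convention assumed here rather than one stated in the text.

```python
import numpy as np


def dice(predicted_mask, ground_truth):
    """Dice similarity coefficient between two binary masks."""
    pm = np.asarray(predicted_mask, dtype=bool)
    gt = np.asarray(ground_truth, dtype=bool)
    denominator = pm.sum() + gt.sum()
    if denominator == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pm, gt).sum() / denominator)


def mse(target, prediction):
    """Mean squared error between targets T and model outputs P."""
    t = np.asarray(target, dtype=np.float64)
    p = np.asarray(prediction, dtype=np.float64)
    return float(np.mean((t - p) ** 2))


print(dice([1, 1, 0, 0], [1, 0, 0, 0]))  # one shared voxel: 2 * 1 / (2 + 1)
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

A Dice of 1 indicates perfect overlap and 0 indicates no overlap, which is why segmentation models are trained to maximize it, whereas MSE is minimized.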

Brain extraction is an essential pre-processing step in the realm of neuroimaging, and has an immediate impact on the quality of all subsequent processing and analysis steps. We have used a multi-institutional dataset of n = 2520 MRI scans along with their corresponding manually annotated brain masks. We trained on n = 1320 scans in a modality-agnostic manner (i.e., each structural MRI scan was treated as a separate input) as described in [7], setting an internal validation set of n = 180 scans, with an independent testing cohort of n = 360 scans to ascertain the model performance. We trained by resampling the data from an isotropic resolution of 1 mm^3 with a shape of 240 × 240 × 160 to an anisotropic resolution of 1.875 × 1.875 × 1.25 mm^3 with a shape of 128 × 128 × 128 [7]. The reason for this resampling was GPU memory limitations, i.e., 11GB VRAM. We trained multiple architectures (UNet, ResUNet, FCN) with only z-score normalization by discarding the zero-voxels, with no augmentations enabled.
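The relation between image shape and voxel spacing during such a resampling can be verified with simple arithmetic. The sketch assumes that the physical extent of the image (shape multiplied by spacing) is preserved along every axis; the function name is hypothetical.

```python
def resampled_spacing(old_shape, old_spacing, new_shape):
    """Voxel spacing after resampling when the physical extent is preserved."""
    return tuple(
        n_old * s_old / n_new
        for n_old, s_old, n_new in zip(old_shape, old_spacing, new_shape)
    )


spacing = resampled_spacing((240, 240, 160), (1.0, 1.0, 1.0), (128, 128, 128))
print(spacing)  # (1.875, 1.875, 1.25)
```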

Gliomas are among the most common and aggressive brain malignancies, and accurate delineation of these tumors can provide valuable clinical insights. We have used the publicly available MRI data from the International Brain Tumor Segmentation (BraTS) challenge of 2020 [9, 11, 62, 63, 64] to train multiple models to segment the various brain tumor sub-regions. Specifically, we used the full cohort of n = 371 training subjects, which we iteratively split into n = 74 testing, n = 60 validation, and n = 237 training subjects following the k-fold cross-validation schema mentioned in Section 2.3, with all 4 structural MRI sequences making up a single input data-point [7]. In total, 25 models are trained for each architecture (UNet, ResUNet, UInc, and FCN). For each model, we used a set of common hyperparameters that run on a GPU with 11GB of memory, namely, a patch size of 128 × 128 × 128, 30 base filters, Dice loss, and stochastic gradient descent as the optimizer. For pre-processing, we used z-score normalization by discarding the zero-voxels and cropping of the zero-planes. For data augmentation, we used noise, flipping, affine, rotation, and blur, each picked with a probability of 0.35. In each case, the model is trained to maximize the performance evaluation criterion, which is constructed by following the instructions in the BraTS challenge [9, 11], i.e., averaging the Dice across the enhancing tumor, the tumor core (formed by combining necrosis and enhancing tumor), and the whole tumor (formed by combining the tumor core and the peritumoral edematous/infiltrated tissue).

We used the dataset from the PALM challenge [97], which involves the segmentation of lesions in retinal fundus images, and replicated the results for a ResUNet architecture from [98]. Additionally, we trained FCN, UNet, and UInc models to show results from a diverse set of architectures on the same dataset. We used the full cohort of n = 400 training subjects, iteratively split into n = 80 testing, n = 64 validation, and n = 256 training subjects following the k-fold cross-validation schema described in Section 2.3. In total, 25 models were trained for each architecture (UNet, ResUNet, UInc, and FCN). For each model, we used a common set of hyperparameters that fits on a GPU with 11 GB of memory, namely, a patch size of 2048 × 1024, 30 base filters, Dice loss, and stochastic gradient descent as the optimizer. For pre-processing, we used full-image normalization, and data augmentation was performed using flipping, rotation, noise, and blur, each with a probability of 0.5. The performance was evaluated against the ground truth binary masks of the fundus in the testing set.
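One plausible reading of "each with a probability" is that every augmentation is applied independently to a sample with the stated probability. The sketch below illustrates this on a toy 1-D signal; it is an assumption about the sampling scheme, not GaNDLF's actual augmentation pipeline, and all names are hypothetical.

```python
import random


def flip(signal):
    """Reverse the sample along its only axis."""
    return signal[::-1]


def make_noise(rng, sigma=0.05):
    """Return a transform that adds zero-mean Gaussian noise drawn from rng."""
    def noise(signal):
        return [v + rng.gauss(0.0, sigma) for v in signal]
    return noise


def augment(sample, transforms, probability, rng):
    """Apply each (name, transform) pair independently with the given probability."""
    applied = []
    for name, transform in transforms:
        if rng.random() < probability:
            sample = transform(sample)
            applied.append(name)
    return sample, applied


rng = random.Random(42)  # fixed seed so the run is repeatable
transforms = [("flip", flip), ("noise", make_noise(rng))]
augmented, applied = augment([0.1, 0.5, 0.9], transforms, 0.5, rng)
print(applied, augmented)
```

Under this scheme a sample may receive several augmentations, one, or none, and the expected number of applied transforms is the probability times the number of transforms.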

An accurate volumetric estimation of the lung field would be crucial towards furthering the clinical goals of tackling respiratory illnesses, such as influenza and pneumonia, as well as specific pathologies such as COVID-19. We used an internal dataset to extract the lung field from chest CT images, where we identified n = 488 scans with their corresponding ground truth. The ground truth annotations were generated by a semi-automatic procedure leveraging 2-cluster k-means, followed by manual qualitative refinements. We then trained on n = 244 scans and internally validated on n = 60 cases. We performed windowed pre-processing by clipping the intensities to the range from −900 to −300 Hounsfield Units (HU). We also resampled the data down to 128 × 128 × 128 in order to consider the entire chest region and to ensure that the trained model remained agnostic to the original image resolution. We trained the ResUNet architecture with clipping and z-score normalization, computed after discarding the zero-voxels, with no augmentations enabled. We used Dice as our evaluation metric and trained the model to maximize it.
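The intensity windowing and the 2-cluster k-means step can be sketched as follows. This is an illustration only: the exact way k-means was used to generate the annotations is not described here, so clustering raw voxel intensities is an assumption, and the function names are hypothetical.

```python
def clip_intensities(voxels, lower=-900.0, upper=-300.0):
    """Clamp every intensity to the window [lower, upper] (in HU)."""
    return [min(max(v, lower), upper) for v in voxels]


def two_means_threshold(values, iterations=50):
    """1-D k-means with 2 clusters; returns the midpoint between the two centroids."""
    low_centroid, high_centroid = min(values), max(values)
    for _ in range(iterations):
        threshold = (low_centroid + high_centroid) / 2
        low = [v for v in values if v <= threshold]
        high = [v for v in values if v > threshold]
        if not low or not high:
            break  # degenerate case: all values fall into one cluster
        new_low, new_high = sum(low) / len(low), sum(high) / len(high)
        if (new_low, new_high) == (low_centroid, high_centroid):
            break  # converged
        low_centroid, high_centroid = new_low, new_high
    return (low_centroid + high_centroid) / 2


print(clip_intensities([-1024.0, -900.0, -650.0, -300.0, 40.0]))
# [-900.0, -900.0, -650.0, -300.0, -300.0]

# Toy intensities: air-like values versus soft-tissue-like values.
threshold = two_means_threshold([-1000.0, -950.0, -900.0, -100.0, 0.0, 50.0])
print(threshold)
```

Voxels below the returned threshold would form one cluster and those above it the other; a clustering of this kind gives only a rough initial mask, which is why the text describes subsequent manual refinement.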

Dental enumeration from panoramic dental X-ray images has a crucial role in the identification of dental diseases. Performing that task with deep learning allows the clinician to number the dentition quickly and to point out the teeth that need care more accurately. Quadrant segmentation from those panoramic images is the first and most critical step of numbering the dentition accurately, and a previous study by Yuksel et al. used a UNet model to achieve that task [99]. Here, we replicated those results by training a segmentation model with GaNDLF that extracts quadrants from the dental X-ray images. To do that, we used n = 900 dental X-ray images with their corresponding five classes (one for each quadrant plus the background). Class annotations were generated by experts, and the images were resized down to 128 × 128 in order to consider the entire mouth region. We trained the UNet, ResUNet, and FCN architectures with 30 base filters and z-score normalization, with no augmentations enabled. We used Dice as the evaluation metric and trained the model to maximize it.

Colonoscopy pathology examination can find cells of early-stage colon tumor from small tissue slices, and pathologists need to examine hundreds of tissue slices on a day-to-day basis, which is extremely time-consuming and tedious work. The DigestPath challenge [100] motivated participants to automate this process and thereby contribute to potentially improved diagnostics. The data provided in the DigestPath challenge includes slides containing colorectal cancer in JPEG format. The dimensions of the provided images range from 3000 × 3000 to 30000 × 30000. A total of 180000 patches of shape 512 × 512 at 10× resolution were extracted for training and 30000 for validation, with a set of n = 30 WSIs kept separate as an independent testing dataset. We trained the ResUNet architecture, and prior to training we normalized the pixel values to the range [0, 1] by dividing each pixel by the maximum possible intensity, i.e., 255. To improve model generalizability, we employed the flip, rotate, blur, noise, gamma, and brightness data augmentations. We used Dice as our evaluation metric and trained the model to maximize it. Inference was then performed on the testing dataset, and the output of the model was evaluated against the ground truth binary masks to calculate the Dice score.
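As a rough illustration of patch-based processing, the sketch below enumerates the top-left corners of 512 × 512 patches on a regular grid and rescales 8-bit pixel values by 255. The regular non-overlapping grid is an assumption made for the example; the actual patch sampling strategy used for DigestPath is not specified here, and the function names are hypothetical.

```python
def patch_origins(width, height, patch=512, stride=512):
    """Top-left (x, y) corners of all patches that fit entirely inside the image."""
    return [
        (x, y)
        for y in range(0, height - patch + 1, stride)
        for x in range(0, width - patch + 1, stride)
    ]


def normalize_uint8(pixels):
    """Map 8-bit intensities to [0, 1] by dividing by the maximum value, 255."""
    return [p / 255 for p in pixels]


# The smallest image size quoted in the text yields a 5 x 5 grid of full patches.
print(len(patch_origins(3000, 3000)))  # 25
print(normalize_uint8([0, 255]))  # [0.0, 1.0]
```

Reducing `stride` below `patch` gives overlapping patches, which is a common way to obtain more training samples from the same slide.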

The human brain ages differently because of various environmental factors. Quantifying the difference between actual age and predicted age can provide useful insight into the overlap of aging signatures with various neurodegenerative pathologies. A 2020 study by Bashyam et al. [88] used common 2D CNN architectures, borrowed from the computer vision community, to predict brain age from T1-weighted MRI scans across a wide age range. Methodologically, the original fully connected layers of the VGG-Net were replaced by a global average pooling layer, followed by a new fully connected layer of size 1024 with 80% dropout, and then a single output node with a linear activation. The network was then trained with the Adam optimizer, using the mean squared error (MSE). This study was evaluated on 10,000 diverse structural brain MRI scans, pooling data from various studies, including the UK Biobank [101] and a multisite schizophrenia consortium [102], thereby representing various subject populations and acquisition protocols. This inherent variability of the collective dataset allowed the network to learn a regression model that generalizes across sites. The Bashyam et al. study [88] goes on to examine the use of the learned age prediction weights as a starting point for transfer learning to other neuroimaging tasks. It shows that the age prediction weights serve as a superior basis for transfer learning compared to ImageNet, particularly in neuroimaging tasks where the new training data is limited [88].

Leveraging the modular nature of GaNDLF, we replicated the age prediction results of that study [88] using the same model architecture, training schema, and dataset as in the original study, while following GaNDLF's procedures. Using the VGG-16 model architecture and GaNDLF's built-in cross-validation functionality, we trained regression models using the intermediate n = 80 axial slices of each subject, with input data being split on the subject level. The same network hyperparameters were used as those specified in the original study [88].
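The slice selection and subject-level splitting can be sketched as below. Taking "intermediate" to mean a block of slices centered in the volume is an assumption, as are the function names and the toy subject identifiers; the point of the second function is that all slices of a subject end up in the same partition, which avoids leakage between training and evaluation data.

```python
def middle_slice_indices(n_slices, n_keep=80):
    """Indices of the n_keep slices centered in a volume with n_slices axial slices."""
    if n_keep > n_slices:
        raise ValueError("volume has fewer slices than requested")
    start = (n_slices - n_keep) // 2
    return list(range(start, start + n_keep))


def split_by_subject(samples, held_out_subjects):
    """Partition (subject_id, slice_index) samples so no subject spans both sets."""
    held_out = set(held_out_subjects)
    train = [s for s in samples if s[0] not in held_out]
    test = [s for s in samples if s[0] in held_out]
    return train, test


# Toy example: three hypothetical subjects with 155 axial slices each.
samples = [
    (subject, index)
    for subject in ("sub-01", "sub-02", "sub-03")
    for index in middle_slice_indices(155)
]
train, test = split_by_subject(samples, ["sub-03"])
print(len(train), len(test))  # 160 80
```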

Following the procedure described in the experimental design section above, we perform evaluation specific to each application and organ system. For each application, the performance metrics are generated as an average of all the models trained across the cross-validated (see Section 2.3 for details) data splits, which ensures stable model performance without overfitting to a specific data split.

In accordance with the experimental protocol highlighted in Section 2.8.2, we have trained 25 models across various data splits for each architecture under consideration. In our analysis, we observed (Table 2) that the ResUNet architecture gave the best results, with the average Dice coming to 0.98 ± 0.01. The UNet and FCN performed favorably as well, each producing an average Dice of 0.97 ± 0.01.

According to the experimental design described in Section 2.8.3, we have trained 25 models across various data splits for each architecture under consideration. We calculated the Dice of the predicted multi-label mask with the ground truth annotation for each model by averaging the Dice from the 3 clinically relevant regions [11, 62], namely, the enhancing tumor, the tumor core, and the whole tumor. In our analysis, we observed (Table 2) that the ResUNet architecture gave the best results, with the average Dice coming to 0.71 ± 0.05, followed by UNet with 0.65 ± 0.05, and then UInc with 0.64 ± 0.05, with FCN showing the worst performance with Dice equal to 0.62 ± 0.05 across all the cross-validation folds.
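The region-averaged Dice can be sketched as follows, assuming the usual BraTS label values (1 for necrosis, 2 for edema, 4 for enhancing tumor) and flat lists of voxel labels. The function names are hypothetical, and returning a Dice of 1 when a region is absent from both masks is a convention chosen for this example.

```python
NECROSIS, EDEMA, ENHANCING = 1, 2, 4  # assumed BraTS label values

REGIONS = {
    "enhancing_tumor": {ENHANCING},
    "tumor_core": {NECROSIS, ENHANCING},
    "whole_tumor": {NECROSIS, EDEMA, ENHANCING},
}


def dice(pred_mask, truth_mask):
    """Dice overlap of two binary masks given as equally long lists of booleans."""
    intersection = sum(p and t for p, t in zip(pred_mask, truth_mask))
    total = sum(pred_mask) + sum(truth_mask)
    return 1.0 if total == 0 else 2.0 * intersection / total


def brats_mean_dice(pred_labels, truth_labels):
    """Per-region Dice and their mean over the three composite BraTS regions."""
    scores = {}
    for name, labels in REGIONS.items():
        pred = [v in labels for v in pred_labels]
        truth = [v in labels for v in truth_labels]
        scores[name] = dice(pred, truth)
    return sum(scores.values()) / len(scores), scores


# Toy six-voxel example: one enhancing voxel is mislabeled as edema.
truth = [0, 1, 2, 4, 4, 2]
pred = [0, 1, 2, 4, 2, 2]
mean_score, per_region = brats_mean_dice(pred, truth)
print(per_region, mean_score)
```

Because the regions are nested, a single mislabeled enhancing voxel lowers the enhancing tumor and tumor core scores while leaving the whole tumor score unchanged in this example.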

Following the experimental protocol described in Section 2.8.4, we have trained 25 models across various data splits for each architecture under consideration. In our analysis, we observed (Table 2) that the ResUNet architecture gave the best results, with the average Dice coming to 0.71 ± 0.05, followed by UNet with 0.65 ± 0.05, and then UInc with 0.64 ± 0.05, with FCN being inferior with 0.62 ± 0.05 across all the cross-validation folds.

Following the experimental protocol described in Section 2.8.5, we have trained 5 models across various data splits for the ResUNet architecture. In our analysis, we observed (Table 2) that the ResUNet gave satisfactory results for the problem, with an average Dice of 0.95 ± 0.02 across all cross-validation folds.

Following the experimental protocol described in Section 2.8.6, we have trained n = 25 models across various data splits for each architecture under consideration. As shown in Table 2, our analysis showed that the UNet architecture gave the best results, with the average Dice coming to 0.91 ± 0.01. The ResUNet and FCN performed favorably as well, producing an average Dice of 0.88 ± 0.01 and 0.85 ± 0.02, respectively.

Following the experimental protocol described in Section 2.8.7, we have trained 1 model on a single data split for the ResUNet architecture. As shown in Table 2, our analysis showed that the ResUNet gave satisfactory results for the problem, with an average Dice of 0.78 ± 0.03 on the unseen testing data.

Following the experimental protocol described in Section 2.8.8, we have trained a custom VGG16 (see Section 2.6 for details) network to predict the age of a brain from a single MR slice, by replicating the results shown in [88]. With an average MSE of 0.0141 (Table 3), the prediction quality of the models trained by GaNDLF was in line with the original publication, showcasing the flexibility of GaNDLF to successfully adapt to various problem domains.

The integration of M3D-CAM into GaNDLF allows the generation of these attention maps by simply enabling M3D-CAM in the configuration file. Examples of attention maps generated by M3D-CAM are illustrated in Figure 8.

Figure 8: Attention maps generated by M3D-CAM [91] with the Grad-CAM backend. The left image shows the attention from a 2D classification network, the middle image from a 2D segmentation network, and the right image from a 3D segmentation network.

We have presented the Generally Nuanced Deep Learning Framework (GaNDLF) as an end-to-end solution for scalable clinical workflows in medical imaging. GaNDLF's contribution spans its ability to: i) speak directly to non-DL experts, by providing building blocks for method development; ii) speak to DL developers by allowing for harmonized I/O (i.e., common data loaders that keep the main focus on algorithmic development); iii) process images of various domains, including both radiology scans and digitized histology WSIs; iv) enable work on segmentation, regression, and classification; v) offer built-in general-purpose functionality for augmentations and cross-validation; vi) be evaluated on a multitude of applications; vii) enable parallel training by using generic high performance computing protocols; viii) integrate tools to promote the interpretability and explainability of DL networks.

The central premise of GaNDLF is to enable researchers to conduct DL research across problem domains (segmentation, regression, classification, synthesis) and modalities (radiology, histology) without worrying about details such as robust data splitting for training, validation, and testing, or the implementation of various loss functions. For DL method/architecture developers, GaNDLF provides a mechanism for robust evaluation of their architecture across a wide array of medical datasets that span various dimensions, channels/modalities, and prediction classes, as well as for comparing their algorithm's performance with well-established built-in architectures, including UNet [75, 78], UNet with residual connections [76], and Inception modules (UInc) [2, 79], as well as a fully convolutional network [81]. For a novice researcher, this framework ensures the easy creation of robust models using various architectures that can be used for scientific research and method discovery, including the potential for aggregating results from various models, which has been shown to provide greater accuracy [9, 11]. For an experienced computational researcher, GaNDLF provides a platform to distribute their methods that allows for their application across various problem domains with relative ease.

Furthermore, we envision the model zoo in GaNDLF (potentially in collaboration with DAGsHub) becoming a valuable resource of pre-trained models and corresponding configurations, allowing the scientific community in general to replicate training parameters. GaNDLF is a fully self-contained DL framework that contains various abstraction layers to enable researchers to produce and contribute robust DL models without requiring prior DL knowledge or coding experience.

Although GaNDLF has been evaluated across imaging modalities (on radiology and histology images), so far the evaluation has been limited to segmentation and regression tasks, and has not covered classification or synthesis. Expanding the application areas would bolster the applicability of the framework. Additionally, application to datasets representing analysis of longitudinal images (such as perfusion imaging) has not yet been evaluated. Also, there is currently no mechanism for cascading models (i.e., training/inferring different models of the same or different architectures sequentially) or for aggregating them (i.e., training/inferring models of different architectures concurrently), both of which have generally been shown to produce better results [103, 104]. Mechanisms that enable AutoML [105, 106, 107] and other network architecture search (NAS) techniques [108] are powerful tools for creating robust models, but are currently not supported in GaNDLF. Finally, application of GaNDLF to other modalities, such as genomics or electronic health records (EHR), has not been explored yet, but is considered current work in progress.

As a stand-alone package, GaNDLF provides end-to-end functionality that facilitates DL research. Because of its flexible architecture, it is possible either to leverage certain parts of GaNDLF in other packages, or to leverage other tools and packages, such as the community-based Project MONAI, to further extend the functionality of GaNDLF. Furthermore, GaNDLF could partner with container-based platforms (such as the BraTS algorithmic repository or ModelHub.AI [20]) towards a structured dissemination of DL models to the research community. As all development is open-sourced, with robust continuous integration and testing courtesy of Azure DevOps, contributions from the community will ensure that this framework builds ties to other packages quickly and in a reliable manner for end users.

Neural network ensembles

Going deeper with convolutions

A survey on deep learning techniques for image and video semantic segmentation

Survey on semantic segmentation using deep learning techniques

Algorithms for semantic segmentation of multispectral remote sensing imagery using deep learning

Searching for exotic particles in high-energy physics with deep learning

Brain extraction on mri scans in presence of diffuse glioma: Multi-institutional performance evaluation of deep learning methods and robust modality-agnostic training

Multi-disease segmentation of gliomas and white matter hyperintensities in the brats data using a 3d convolutional neural network

Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the brats challenge

Deep learning in medical image analysis

The multimodal brain tumor image segmentation benchmark (brats)

O-net: An overall convolutional network for segmentation tasks

An artificial agent for anatomical landmark detection in medical images

Detecting anatomical landmarks from limited medical imaging data using two-stage task-oriented deep neural networks

Anhir: automatic non-rigid histological image registration challenge

Non-rigid image registration using self-supervised fully convolutional networks without training data

Histopathology-validated machine learning radiographic biomarker for noninvasive discrimination between true progression and pseudo-progression in glioblastoma

Deep learning: methods and applications

A survey on deep learning: Algorithms, techniques, and applications

Modelhub. ai: Dissemination platform for deep learning models

Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation

Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data

Tensorflow: A system for large-scale machine learning

Pytorch: An imperative style, high-performance deep learning library

Convolutional architecture for fast feature embedding

Bonnet: An open-source training and deployment framework for semantic segmentation in robotics using cnns

The medical imaging interaction toolkit (mitk): a toolkit facilitating the creation of interactive software by extending vtk and itk

Cancer imaging phenomics toolkit: quantitative imaging analytics for precision diagnostics and predictive modeling of clinical outcome

The cancer imaging phenomics toolkit (captk): Technical overview

Brain cancer imaging phenomics toolkit (brain-captk): an interactive platform for quantitative analysis of glioblastoma

Multi-institutional noninvasive in vivo characterization of idh, 1p/19q, and egfrviii in glioma using neuro-cancer imaging phenomics toolkit (neuro-captk)

Cancer imaging phenomics via captk: multi-institutional prediction of progression-free survival and pattern of recurrence in glioblastoma

A study of cross-validation and bootstrap for accuracy estimation and model selection

Improvements on cross-validation: the 632+ bootstrap method

Cross-validation methods

Data augmentation for improving deep learning in image classification problem

Autoaugment: Learning augmentation strategies from data

3d slicer: a platform for subject-specific image analysis, visualization, and clinical support

Itk-snap: An interactive tool for semi-automatic segmentation of multi-modality biomedical images

Niftynet: a deep-learning platform for medical imaging

Deepneuro: an open-source deep learning toolbox for neuroimaging

Antsx: A dynamic ecosystem for quantitative biological and medical imaging

Dltk: State of the art reference implementations for deep learning on medical images

pymia: A python package for data handling and evaluation in deep learning-based medical image analysis

Evaluation of deep learning to augment image-guided radiotherapy for head and neck and prostate cancers

Itk: enabling reproducible research and open science

Torchio: a python library for efficient loading, preprocessing, augmentation and patch-based sampling of medical images in deep learning

Setting the default to reproducible, computational science research

Reproducible research in computational science

The fair guiding principles for scientific data management and stewardship

Integrating pathology and radiology disciplines: an emerging opportunity?

Comparison between intensity normalization techniques for dynamic susceptibility contrast (dsc)-mri estimates of cerebral blood volume (cbv) in human gliomas

Evaluating the impact of intensity normalization on mr image synthesis

Deep learning: a primer for radiologists

Deep learning: A critical appraisal

Hipaa regulations-a new era of medical-record privacy?

The eu general data protection regulation (gdpr), A Practical Guide

A survey on image data augmentation for deep learning

The effectiveness of data augmentation in image classification using deep learning

batchgenerators -a python framework for data augmentation

Albumentations: fast and flexible image augmentations

Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features

Segmentation labels and radiomic features for the pre-operative scans of the tcga-gbm collection

Segmentation labels and radiomic features for the pre-operative scans of the tcga-lgg collection

The relationship between variable selection and data augmentation and a method for prediction

Prediction error estimation: a comparison of resampling methods

On over-fitting in model selection and subsequent selection bias in performance evaluation

The elements of statistical learning

Systematic evaluation of image tiling adverse effects on deep learning semantic segmentation

Appearance normalization of histology slides

Towards generalized nuclear segmentation in histological images

Openslide: A vendor-neutral software foundation for digital pathology

U-net: Convolutional networks for biomedical image segmentation

The importance of skip connections in biomedical image segmentation

Deep residual learning for image recognition

3d u-net: learning dense volumetric segmentation from sparse annotation

Inception-v4, inception-resnet and the impact of residual connections on learning

Deepmrseg: a convolutional deep neural network for anatomy and abnormality segmentation on mr images

Fully convolutional networks for semantic segmentation

A survey on deep learning in medical image analysis

Very deep convolutional networks for large-scale image recognition

Fully convolutional network for liver segmentation and lesions detection

2009 IEEE conference on computer vision and pattern recognition

Deep learning using rectified linear units (relu)

Harisiqbal88/plotneuralnet

Mri signatures of brain age and disease over the lifespan based on a deep brain network and 14 468 individuals worldwide

From machine learning to explainable ai

Is it time to get rid of black boxes and cultivate trust in ai?

M3d-cam: A pytorch library to generate 3d data attention maps for medical deep learning

Striving for simplicity: The all convolutional net

Grad-cam: Visual explanations from deep networks via gradient-based localization

Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks

Morphometric analysis of white matter lesions in mr images: method and validation

Statistical decision theory and Bayesian analysis

Pathologic myopia challenge

Detection of pathological myopia and optic disc segmentation with deep convolutional neural networks

Dental enumeration and multiple treatment detection on panoramic x-rays using deep learning

Signet ring cell detection with a semi-supervised learning framework

Uk biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age

Multisite machine learning analysis provides a robust structural imaging signature of schizophrenia detectable across diverse patient populations and within individuals

Enhancing deep learning sentiment analysis with ensemble techniques in social applications

A survey on ensemble learning

Auto-weka: Combined selection and hyperparameter optimization of classification algorithms

Auto-pytorch tabular: Multi-fidelity metalearning for efficient and robust autodl

Towards automatically-tuned deep neural networks

Neural architecture search: A survey

Supplementary Material

Research reported in this publication was partly supported by the National Cancer Institute (NCI), the National Institute of Neurological Disorders and Stroke (NINDS), the National Institute on Aging (NIA), the National Institute of Mental Health (NIMH), and the National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health (NIH), under award numbers NCI:U01CA242871, NCI:U24CA189523, NINDS:R01NS042645, NIA:RF1AG054409, NIA:U01AG068057, NIMH:R01MH112070, and NIBIB:R01EB022573. The content of this publication is solely the responsibility of the authors and does not represent the official views of the NIH.