key: cord-0602140-4qp8d4i9 authors: Maleki, Farhad; Ovens, Katie; Gupta, Rajiv; Reinhold, Caroline; Spatz, Alan; Forghani, Reza title: Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls date: 2022-02-01 journal: nan DOI: nan sha: b0dfc427b05199a8ec893dc8026ce6ceafe3ccc0 doc_id: 602140 cord_uid: 4qp8d4i9

Despite the great potential of machine learning, the lack of generalizability has hindered the widespread adoption of these technologies in routine clinical practice. We investigate three methodological pitfalls, (1) violation of the independence assumption, (2) model evaluation with an inappropriate performance indicator, and (3) batch effect, and examine how these pitfalls could affect the generalizability of machine learning models. We implement random forest and deep convolutional neural network models using several medical imaging datasets, including head and neck CT, lung CT, chest X-ray, and histopathological images, to quantify and illustrate the effect of these pitfalls. We develop these models with and without each pitfall and compare the performance of the resulting models in terms of accuracy, precision, recall, and F1 score. Our results showed that violation of the independence assumption could substantially affect model generalizability. More specifically, (I) applying oversampling before splitting data into training, validation, and test sets; (II) performing data augmentation before splitting data; (III) distributing data points for a subject across training, validation, and test sets; and (IV) applying feature selection before splitting data led to superficial boosts in model performance. We also observed that inappropriate performance indicators could lead to erroneous conclusions, and that batch effect could lead to models that lack generalizability. The aforementioned methodological pitfalls lead to machine learning models with over-optimistic performance. These errors, if made, cannot be captured using internal model evaluation, and the inaccurate predictions made by the model may lead to wrong conclusions and interpretations. Therefore, avoiding these pitfalls is a necessary condition for developing generalizable models.

Medical images such as computed tomography (CT), magnetic resonance imaging (MRI), and digitized histopathology slides are widely used for diagnosis and treatment planning. Manual qualitative evaluation of these images by domain experts is the most common method for analyzing such data. Besides being time-consuming, manual evaluation has several shortcomings, such as its subjective nature and the associated intra-observer and interobserver variability. 1, 2 Human interpretation also may not fully leverage quantitative features that are not immediately apparent to the naked eye. Quantitative methods such as machine learning (ML) and deep learning (DL) have great potential for supplementing and augmenting expert human assessment by acting as a clinical assistant or decision support tool. There is a growing interest in utilizing ML and DL methods in medical applications. Examples include classification of benign and malignant tumors, grading of tumors, prognosis/prediction, and treatment planning. [38] [39] [40] [41] [42] [43] [44] [45] [46] Despite a large body of published work on applications of ML and DL in medicine, very few are clinically deployed. 8 Lack of generalizability of trained models is an important reason why the deployment of ML and DL methods lags behind. 8
Factors that affect generalizability include technical variations and lack of standardization in medical practice, differences in patient demographics from one center to another, patient genotypic and phenotypic characteristics, and differences in the tools and methodologies used for medical data processing and model development. 9

Multiple guidelines for conducting and presenting research to ensure rigor, quality, and reproducibility in DL/ML have been published. [10] [11] [12] [13] [14] QUADAS and its extension QUADAS-2 were developed by Whiting et al. for the systematic review of diagnostic studies. 10 QUADAS-2 assesses the risk of bias in patient selection, index test, reference standard, and flow and timing of a diagnostic study to ensure generalizability. Wolff et al. designed PROBAST as a series of questions to facilitate systematic review and assessment of potential bias in clinical prediction models. 13 Collins et al. developed the TRIPOD guideline to encourage transparency in reporting prediction models. 11 TRIPOD contains recommendations for the expected content and characteristics of the abstract, introduction, methods, results, and discussion sections of scientific papers on ML and DL. Mongan et al. published the CLAIM checklist to aid authors and reviewers with best practices in artificial intelligence research in medical imaging. 14 Similar to TRIPOD, CLAIM provides high-level recommendations for preparing scientific manuscripts focused on medical imaging. The aforementioned guidelines mainly focus on the reporting and reproducibility aspects of research findings. However, they offer minimal to no guidance regarding good methodological practices in medical machine learning. Even when suggested guidelines are followed and the results are reproducible, there is still a risk of methodological errors in the study design and execution. Only a limited number of explicit technical guidelines exist for avoiding methodological mistakes that lead to a lack of generalizability of ML and DL applications. In addition, they are often presented in a manner that is not readily accessible to practitioners in the medical domain. Clear, scientifically backed guidelines are essential to promote the development of generalizable ML and DL models that may be clinically deployed.

In this paper, we identify and experimentally investigate the following three major categories of methodological errors in developing machine learning and deep learning models: (1) violation of the independence assumption, (2) the use of an inappropriate performance indicator for model evaluation, and (3) the introduction of batch effect. These pitfalls cannot be detected in an internal evaluation of models, leading to an over-optimistic estimation of model performance and, consequently, a lack of generalizability. To show that the aforementioned methodological pitfalls are not specific to one data modality, we use several imaging modalities in this study.

We used a CT dataset of 137 head and neck squamous cell carcinoma (HNSCC) patients treated with radiotherapy. 15, 16 Hereafter, we refer to this dataset as the "HNSCC" dataset. This dataset is available from The Cancer Imaging Archive (TCIA). 17 We used the pretreatment CT scans, where the gross tumor volume was manually delineated by an experienced radiation oncologist. 16 Figure 1 shows examples of CT images for larynx and oropharynx tumors with their contours overlaid. Table 1 provides a summary of the clinical endpoints of the HNSCC dataset.
We used the images from the Lung CT Segmentation Challenge 2017, extracted from TCIA. 18, 19 The goal of this challenge was to provide a means for comparing auto-segmentation methods used to segment organs at risk in CT images. This dataset contains 120 CT scan series from 60 patients. We use this dataset to demonstrate methodological pitfalls related to performance metrics for auto-segmentation.

We used a pathology dataset containing 143 hematoxylin and eosin (H&E)-stained formalin-fixed paraffin-embedded whole-slide images of lung adenocarcinoma provided by the Department of Pathology and Laboratory Medicine at Dartmouth-Hitchcock Medical Center. 20 The dataset contains five histopathological patterns: solid (51 slides), lepidic (19 slides), acinar (59 slides), micropapillary (9 slides), and papillary (5 slides). We used a subset of 110 slides from patients with solid and acinar predominant histopathological patterns. This subset was used because solid and acinar samples are numerous and the two categories are relatively balanced. Due to the very large resolution of the histopathology images, it is computationally impractical to analyze them as whole images. 47 Therefore, we first downscaled each image by a factor of 4. Then, using color thresholding, we extracted the foreground, i.e., the tissue segments on each slide. Next, for each image, we extracted random patches sized 1024 by 1024 pixels. Patches with 75% or more background were excluded during the patch extraction process. The patch extraction process continued until 200 patches were extracted from each image, resulting in 22,000 patches. Figure 3 illustrates an example whole-slide image as well as a selection of random patches.

To demonstrate the impact of batch effects, we used two X-ray datasets: 8851 normal chest X-rays with no findings from the Radiological Society of North America (RSNA) pneumonia detection challenge dataset, which is available on Kaggle, and a chest X-ray dataset from Kermany et al. 22 , which included 1349 normal X-rays with no findings and 3883 X-rays demonstrating pneumonia in pediatric patients. Figure 2 illustrates samples from each dataset.

For radiomics analysis, 1652 features were extracted for each tumor in the HNSCC dataset using the pyradiomics package, 23 an open-source Python package for extracting radiomics features from medical images. The extracted features include shape-based features, first-order statistics, gray level co-occurrence matrix, gray level run length matrix, gray level size zone matrix, gray level difference matrix, neighborhood gray tone difference matrix, and gray level dependence matrix features. We split the data into training, validation, and test sets in a stratified manner to preserve the class distribution across these sets. The data splitting was accomplished according to predetermined ratios for each scenario. To deal with class imbalance, we applied oversampling to the samples in the training set. We used a sequence of feature selection operations to deal with the high dimensionality of the radiomic features. First, we removed all constant features, as they do not offer any predictive value for model building. Then we used a univariate feature selection approach to select the top 100 features. Next, we followed a recursive feature elimination approach to select the top 10 features for model building. Using the resulting 10 radiomic features, we built a random forest model for endpoint prediction. Appendix A provides detailed information regarding the feature selection process and model building.
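To make the ordering of these steps concrete, below is a minimal sketch of such a pipeline using the scikit-learn components listed in Appendix A. The feature matrix, labels, and split ratio are random placeholders, imbalanced-learn's RandomOverSampler stands in for the oversampling step, and RFE with a fixed number of features is used in place of the RFECV mentioned in Appendix A.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from imblearn.over_sampling import RandomOverSampler

# Placeholder radiomic feature matrix (137 tumors x 1652 features) and an imbalanced endpoint.
rng = np.random.default_rng(0)
X = rng.random((137, 1652))
y = np.array([0] * 113 + [1] * 24)

# 1) Split FIRST, stratified by class, so the test set never influences later steps.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 2) Oversample the minority class using ONLY the training samples.
X_train, y_train = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)

# 3) Feature selection and the classifier are fit on the training data alone.
#    (Hyperparameter tuning with GridSearchCV, as in Appendix A, is omitted here;
#    if added, oversampling should be repeated inside each cross-validation fold.)
model = Pipeline([
    ("constant", VarianceThreshold(threshold=0.0)),               # drop constant features
    ("univariate", SelectKBest(f_classif, k=100)),                # keep the 100 best features
    ("rfe", RFE(SVC(kernel="linear"), n_features_to_select=10)),  # recursive elimination to 10
    ("rf", RandomForestClassifier(random_state=0)),
])
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```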
For deep learning analyses, we utilized a ResNet-50 architecture 24 pretrained on ImageNet. 25 The classifier layer of the ResNet model was replaced to build a binary classifier. In addition, a dropout layer 36 with a probability of 0.5 was added after the ResNet-50 backbone and before the classifier layer. A cross-entropy loss was used for all experiments. 26 We also used the Adam optimizer 37 with a learning rate of 0.0001, β1 = 0.9, and β2 = 0.999. A batch size of 32 was used for all experiments. All models were trained for 100 epochs, and the model with the lowest loss was selected as the best model. The Albumentations package version 0.4.5 was used for image augmentation. 27 We conducted all experiments using Python 3.7 and PyTorch version 1.6 on a Titan RTX GPU machine. Figure 4 illustrates a schematic view of a DL pipeline for image classification.
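As a rough illustration of this setup, the PyTorch sketch below builds a pretrained ResNet-50 with the classifier head replaced by a dropout layer (p = 0.5) and a two-class linear layer, using the loss and optimizer settings reported above. torchvision's resnet50 is assumed as the backbone implementation, the data loaders and full training loop are omitted, and the dummy batch is a placeholder.

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained ResNet-50 backbone; the original 1000-class head is replaced
# with dropout (p=0.5) followed by a binary classifier layer.
model = models.resnet50(pretrained=True)
model.fc = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(model.fc.in_features, 2),
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Cross-entropy loss and Adam with the hyperparameters reported in the text.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

# One illustrative training step on a dummy batch (batch size 32).
images = torch.randn(32, 3, 224, 224, device=device)
labels = torch.randint(0, 2, (32,), device=device)
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```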
This section presents the experiments designed to investigate the effect of the three major categories of methodological errors on model generalizability. Appendix B provides context for the technical terms used in the rest of the paper for a reader unfamiliar with these terms.

Adhering to the assumption of independence is essential for developing generalizable models. We investigated the impact of different designs and executions of four common practices on the assumption of independence: (I) oversampling before splitting data into training, validation, and test sets; (II) data augmentation before splitting data into training, validation, and test sets; (III) distributing data points for a patient across training and test sets; and (IV) performing feature selection before splitting data into training, validation, and test sets.

(I) Oversampling before splitting data into training, validation, and test sets: Conducting oversampling is a common and effective practice in medical data analysis. However, if conducted inaccurately, oversampling may break the assumption of independence between the data used for model training and the data used for model evaluation. To quantitatively show how applying oversampling before splitting data could affect model generalizability, we evaluated two binary classifiers, as described in the Conventional Radiomics Analysis section, for predicting local recurrence by employing a radiomics approach using the HNSCC dataset. The first model (model A) was developed by conducting oversampling before data splitting. The second model (model B) was developed by conducting oversampling after data splitting. The only difference between the pipelines used for developing models A and B is the order of applying the oversampling step: for model A, oversampling is applied before splitting the data, and for model B, after splitting the data.

(II) Data augmentation before splitting data into training, validation, and test sets: Data augmentation is widely used for medical image analysis. To quantitatively show how applying data augmentation before splitting data could affect model generalizability, we developed two DL-based binary classifiers, as described in the Deep Learning Image Analysis section, for distinguishing solid and acinar predominant histopathological patterns in patients with lung adenocarcinoma. The first model (model C) was developed by conducting data augmentation before data splitting. The second model (model D) was developed by conducting data augmentation after data splitting. When developing these models, every component of the model building and evaluation pipeline was kept the same, other than the order of applying the data augmentation step.

(III) Distributing data points for a patient across training and test sets: We experimentally investigated how distributing data points for a patient across training, validation, and test sets could impact model generalizability. Using the pathology dataset, we built two deep learning classifiers, as described in the Deep Learning Image Analysis section, for distinguishing solid and acinar predominant histopathological patterns in patients with lung adenocarcinoma. Model E was developed by randomly distributing the image patches across training, validation, and test sets, breaking the independence assumption. Model F was developed by assigning the image patches for each patient to either the training, validation, or test set. Everything else was kept the same for developing these models.

(IV) Performing feature selection before splitting data: Feature selection is an essential step in developing ML models, where a subset of a large number of available features is selected for model building. To demonstrate the impact of applying feature selection before splitting data on model generalizability, we developed two binary classifiers using the HNSCC dataset, as described in the Conventional Radiomics Analysis section, to predict overall survival using a radiomics approach. The first model (model G) was developed by conducting feature selection before data splitting. The second model (model H) was developed by conducting feature selection after data splitting. For the sake of accurate comparison, all other steps were kept the same for developing these models.

We empirically investigated how an inappropriate choice of performance indicator or baseline could lead to misleading results and erroneous conclusions. Dice score and Intersection over Union (IoU) are widely used for evaluating segmentation models. To show the role of a baseline expectation in evaluating the result of a segmentation model, we developed a threshold-based approach to segment air inside the body for samples in the lung CT dataset as a proxy for lung segmentation. We treated any voxel with a Hounsfield unit value of less than -400 as air. Then we removed the segment of air outside the bodies of the patients. We compared the performance of this simple model, which could be used as a simple baseline, with the ground truth (lung contours). Any model with a performance lower than such a baseline model should be considered irrelevant.

The source or origin of the dataset selected for developing and evaluating ML and DL models plays an essential role in the applicability of the resulting model in a clinical setting. We hypothesize that batch effect can be used by ML and DL models to superficially boost performance measures, and the resulting model could learn factors that characterize each batch rather than the condition under study. For example, suppose all malignant tumors in a cohort were scanned on MRI scanner X and all benign tumors on MRI scanner Y. In that case, the model may have learned to differentiate the malignant from the benign tumors not based on intrinsic tumor characteristics but rather based on scanner-attributable differences. To experimentally investigate how batch effects could impact model generalizability, we simulated a dataset with batch effect by extracting the pneumonia samples from the dataset by Kermany et al. 22 and the normal samples (i.e., images with no findings) from the RSNA dataset. Hereafter, we refer to this dataset as Batch X-Ray. We trained a model (model I) on the Batch X-Ray dataset. Then we tested this model on an external dataset of normal chest X-ray images from Kermany et al. 22 to show how the batch effect could affect the generalizability of deep learning models.
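Before turning to the results, the following sketch illustrates the correct ordering for pitfall (II): the patches are split into training, validation, and test subsets first, and the Albumentations transforms are attached only to the training subset. The in-memory patches, the specific transforms, and the simple index split are illustrative placeholders rather than the configuration used in the study.

```python
import albumentations as A
import numpy as np
import torch
from torch.utils.data import Dataset

class PatchDataset(Dataset):
    """Holds image patches and applies augmentation only when a transform is given."""

    def __init__(self, images, labels, transform=None):
        self.images, self.labels, self.transform = images, labels, transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        if self.transform is not None:
            image = self.transform(image=image)["image"]
        image = torch.from_numpy(image.transpose(2, 0, 1)).float() / 255.0
        return image, self.labels[idx]

# Placeholder patches (RGB, uint8) and labels standing in for the extracted patches.
images = np.random.randint(0, 256, size=(30, 256, 256, 3), dtype=np.uint8)
labels = np.random.randint(0, 2, size=30)

# Split FIRST (a simple index split here; the study used a stratified split),
# then attach the augmentation pipeline to the training subset only.
train_idx, val_idx, test_idx = np.split(np.random.permutation(30), [20, 25])
train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
])
train_set = PatchDataset(images[train_idx], labels[train_idx], transform=train_transform)
val_set = PatchDataset(images[val_idx], labels[val_idx], transform=None)
test_set = PatchDataset(images[test_idx], labels[test_idx], transform=None)
```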
Table 2 shows the effect of applying oversampling before splitting data into training, validation, and test sets on models built for predicting local recurrence using the HNSCC dataset. For model A, oversampling was conducted before splitting data. For model B, oversampling was applied after splitting data. The results showed a substantial gap between the performance of these two models. While model B showed poor performance, model A seemed to offer promising results.

Table 3 illustrates the effect of applying data augmentation before and after splitting data into training, validation, and test sets on the models built for distinguishing solid and acinar predominant histopathological patterns in patients with lung adenocarcinoma. In model C, data augmentation was applied before splitting data and in model D after splitting data. The results showed a superficial boost in the performance measures for model C, while model D showed poor performance.

Table 4 shows the effect of breaking the independence assumption by distributing data points for a patient across training and test sets. For model E, data points were randomly distributed across training, validation, and test sets. Therefore, data points for a patient could appear in both the training and test sets. For model F, the independence assumption was preserved by assigning the data points for each patient to either the training, validation, or test set. Note that model F is the same as model D. These results indicate that distributing data points of patients across training and test sets leads to a superficial boost in the performance of model E. In contrast, the performance measures for model F are substantially lower than those of model E.

Table 5 shows how applying feature selection prior to splitting data into training, validation, and test sets could lead to a violation of the independence assumption. For model G, feature selection was conducted before splitting data and for model H after splitting data. The results indicate that applying feature selection before splitting data into training, validation, and test sets could lead to a superficial boost in model performance.

Figure 5 illustrates the ground truth segmentation as well as the predicted segmentation of a randomly chosen image from the lung CT dataset, where the prediction has been made by a simple baseline model that detects air inside the body. While the predicted segmentation of the lung is not medically acceptable, it achieves a Dice score of 0.94 and an IoU of 0.88. Table 6 shows the summary of the Dice score and IoU for the results of a simple model detecting air within the body on the lung CT dataset as an estimate for lung segmentation. This simple model achieved a high Dice score and IoU, while a visual inspection (see Figure 5) reveals the inferiority of the segmentation from a medical perspective. Therefore, models with performance measures lower than this baseline model should not be utilized.
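A minimal sketch of such a threshold-based baseline and of the two overlap metrics is given below, assuming the CT volume (in Hounsfield units) and the ground-truth lung mask are already available as NumPy arrays. The placeholder arrays are random, and the step that removes air outside the body is omitted for brevity.

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice = 2|X ∩ Y| / (|X| + |Y|) for two binary masks."""
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum())

def iou_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU = |X ∩ Y| / |X ∪ Y| for two binary masks."""
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / union

# Placeholder CT volume and ground-truth lung mask (real data would be loaded from DICOM).
ct_hu = np.random.uniform(-1000, 400, size=(64, 128, 128))
gt_lung = ct_hu < -500  # stand-in ground truth for illustration only

# Threshold-based baseline: every voxel below -400 HU is treated as air.
pred_air = ct_hu < -400

print("Dice:", dice_score(pred_air, gt_lung))
print("IoU:", iou_score(pred_air, gt_lung))
```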
Our results also showed that batch effects could be a substantial barrier to model generalizability. We observed that model I, which was trained and tested on a dataset with batch effect (the Batch X-Ray dataset), achieved accuracy, precision, recall, and F1 score of 0.997, 0.979, 0.995, and 0.987, respectively. However, when this model was applied to the normal pediatric chest X-ray samples from the dataset by Kermany et al. 22 , only 3.855% of the samples were classified correctly as normal. The attribution of each pixel of an image to the model prediction for that image can be calculated using the Integrated Gradients method. 48 Figure 6 overlays the attribution values for each pixel of an X-ray of a normal pediatric sample. As depicted in Figure 6, the pneumonia prediction model trained using the Batch X-Ray dataset focuses on anatomical structures and body position rather than image characteristics in the lung.
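For readers unfamiliar with this type of analysis, the sketch below shows one way such pixel attributions could be computed with Captum's Integrated Gradients implementation (the library cited as reference 48). The untrained placeholder model, the random input tensor, and the chosen target class index are assumptions for illustration only.

```python
import torch
from torchvision import models
from captum.attr import IntegratedGradients

# Placeholder binary classifier standing in for the trained pneumonia model.
model = models.resnet50(pretrained=False)
model.fc = torch.nn.Linear(model.fc.in_features, 2)
model.eval()

# Placeholder chest X-ray tensor (batch of 1, RGB, 224x224).
image = torch.randn(1, 3, 224, 224)

# Integrated Gradients attributes the prediction for the target class to each input pixel.
ig = IntegratedGradients(model)
attributions = ig.attribute(image, target=1)  # target=1: assumed "pneumonia" class index
print(attributions.shape)  # same shape as the input image
```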
With the countless proofs of concept highlighting the potential of ML and DL approaches for medical image analysis, the natural expectation is the widespread use of these approaches in clinical settings. However, when applied prospectively, the lack of generalizability is the main challenge facing these technologies. In this paper, we investigate and highlight some of the key methodological errors that lead to models that suffer from a lack of generalizability but achieve deceptively promising results during internal evaluation. These errors, if made, may be difficult to capture by readers, reviewers, or authors. There is often insufficient information provided regarding the origin of datasets as well as the pipeline used for model development. This makes it difficult to detect these errors in peer-reviewed publications, even for experts in ML and DL. Although not generalizable, these models define erroneous state-of-the-art results that often cannot be outperformed by generalizable DL and ML models. Therefore, understanding how these methodological errors occur is essential for the readers, reviewers, and authors of ML and DL approaches.

Most ML and DL approaches assume that the data used for model development are independent and identically distributed. Based on this assumption, samples used for model training and evaluation have the same probability distribution and are mutually independent. Dependency between samples from the training and test sets could drastically affect the generalizability of developed models. The performance of most machine learning models is evaluated using an internal evaluation, in which the available data is partitioned into training, validation, and test sets. To achieve an unbiased estimate of performance measures for a model, the data from the training and test sets must be independent. When this assumption is violated, the internal test set does not provide an unbiased estimate of the generalization error. Therefore, any violation of the independence assumption should be avoided.

Our results demonstrated that when oversampling was incorrectly applied to the HNSCC dataset, the model achieved superficially high performance measures; however, the correct approach led to poor results, as expected due to the very small number of local recurrences in the dataset (12 for larynx and 12 for oropharynx). When oversampling is conducted first, and then the data is randomly split into training, validation, and test sets, copies of the same data point could appear in both the training and test sets. Therefore, the training and test sets are no longer independent. Similar superficial boosts were observed for the incorrect application of data augmentation.

When data augmentation is applied to an image, some of its characteristics change. However, there will still be many characteristics that the original image and the augmented one share. If data augmentation is applied before data splitting, just as with oversampling, these highly correlated samples can be spread across the training, validation, and test sets. Therefore, samples highly similar to those observed during model training may be seen again in the testing or validation phases, potentially leading to high performance on the internal test set but poor performance on external data.

Often, there are several data points associated with each patient. For instance, extracting image patches is a common practice when analyzing histopathology images or when performing 3D analysis of other medical images such as MRI or CT. 30 Due to the high resolution of digital pathology images, whole-image analysis is impractical. Consequently, small portions of these images are extracted as image patches and used for further analysis. These patches might share some characteristics irrelevant to the study goal. Distributing the different patches derived from a single patient between training, validation, and test sets could artificially boost model performance. The resulting model will not be generalizable when applied to external data. Also, whenever there exist several data points (e.g., several MRI images) for a given patient, those data points should be assigned to only the training, validation, or test set. For example, one should not assign a T2W sequence of a patient to the training set and the corresponding T1W sequence of the same patient to the test set.
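One simple way to enforce this patient-level separation is to split by a patient identifier rather than by individual patches or sequences; a minimal sketch using scikit-learn's GroupShuffleSplit is shown below, with placeholder patch features, labels, and patient IDs.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Placeholder patch-level data: 200 patches drawn from 20 patients.
X = np.random.rand(200, 32)                 # patch features (or indices into image files)
y = np.random.randint(0, 2, size=200)       # patch labels
patient_ids = np.repeat(np.arange(20), 10)  # patient each patch belongs to

# Hold out 20% of PATIENTS (not patches); all patches from a patient stay together.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```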
In radiomics, there are often a large number of features representing the statistical characteristics, shape, and texture of a region of interest. Often the number of features is much larger than the number of available samples. Feature selection is an important step in developing machine learning models with high-dimensional features. If feature selection is applied before splitting data, the information from all samples in the dataset is used to select a subset of features that work best for all samples. This partially exposes the test set, which will be selected in the next step, to the model and breaks the independence assumption. In this work, we showed that exposing the test samples to the feature selection methods can lead to a superficial boost in performance measures due to the violation of the independence assumption. In such cases, the selected features are chosen to be discriminative for the given test set. These features are often less discriminative when applied to unseen data. This leads to a degradation of performance measures compared to the over-optimistic measures achieved when the test set was exposed during the feature selection step.

Another key consideration in developing ML and DL models is the choice of performance indicators. This choice is critical for developing generalizable models that can be used in clinical settings, and it should be made with the utility of the predictive model in mind. For example, using accuracy for a diagnostic model for a rare condition is often misleading, as a model that disregards all cases of the rare condition in a dataset with high class imbalance still achieves high accuracy. For example, consider tumor malignancy prediction using the HNSCC dataset: if we simply predict all samples as non-malignant, the prediction accuracy is equal to 94%. However, the recall in such a case is zero. This highlights the need for proper performance indicators to evaluate predictive models. In addition, the cost associated with misclassification should be considered in model development. Failure to correctly diagnose a life-threatening condition should be considered when developing and evaluating predictive models. A method with high accuracy but low recall (sensitivity) cannot be used in such situations. On the other hand, false-positive results that could lead to highly invasive and unnecessary procedures, or those with a significant negative impact on productivity, are prohibitive for the deployment of predictive models in clinical settings. In such scenarios, precision should also be considered as a performance indicator to guide the development and evaluation of predictive models.

Dice score and IoU are commonly used as performance measures for evaluating segmentation models. From a mathematical perspective, the Dice score is always larger than or equal to the IoU (see the Supplementary Materials for a mathematical proof), which encourages reporting the Dice score as the metric for evaluating segmentation models. In our example, we observed that both of these metrics achieved a high value (IoU: 0.88 and Dice: 0.94), despite obvious flaws in the segmentation of the lung. Therefore, we encourage the visual inspection of the outcome of a segmentation model as a qualitative analysis. Pixel accuracy, as another measure, should be avoided when evaluating small regions/volumes of interest.

Another consideration in developing generalizable models is the size of the dataset used for model evaluation. Evaluation of models using small sample sizes has high variation depending on the composition of the samples in the test set. If the use of large internal test sets is not practical, repeated K-fold cross-validation or nested cross-validation could be used to partially alleviate the effect of the test set composition on the performance metrics. This provides more reliable estimates of the average performance metrics. When using repeated K-fold cross-validation or nested cross-validation, the average and variance of the performance metrics should be reported.
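As a small illustration of that reporting practice, the sketch below runs repeated stratified K-fold cross-validation with scikit-learn and reports the mean and standard deviation of the F1 score. The classifier, the random data, and the fold settings are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder data standing in for a small imbalanced dataset.
X = np.random.rand(137, 10)
y = np.array([0] * 113 + [1] * 24)

# 5-fold cross-validation repeated 10 times with different random partitions.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv, scoring="f1")

# Report the average AND the spread, not a single split's score.
print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```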
It is desirable to collect and analyze imaging data from different sites, which may lead to a variety of technical differences, such as the settings used for the medical devices acquiring the diagnostic images. 32, 33 This is beneficial for increasing the sample size, helping to increase the generalizability of models, as the data used for model development better represent the condition under study. However, if the class distribution of samples from different sites substantially varies, the aggregated dataset and the resulting models could suffer from a batch effect. It should be noted that disregarding batch effect is not rare. In a study of machine learning models used for COVID-19 diagnosis, Roberts et al. reported that some studies had used healthy images from pediatric patients while the COVID-19 samples came from adults. 34

Availability of public datasets is essential for benchmarking and evaluating medical image analysis approaches. Platforms such as Kaggle encourage publishing datasets. However, often there is little to no evaluation of the quality or origin of the data. Therefore, some users assemble datasets from different sources, often erasing the metadata associated with these datasets. This makes tracking the original source challenging, if not impossible. For example, Roberts et al. reported a publication where the test set used was a subset of the assembled dataset used for training the model. We simulated a batch effect, present in some published research, and showed how it could lead to models that are medically irrelevant and not generalizable. Therefore, care should be taken when using assembled datasets.

Besides avoiding the aforementioned methodological errors, there are other considerations and challenges in the ML and DL domain, such as data quality and availability, bias, and explainability, which are beyond the scope of this paper. Other recent literature covers the potential of artificial intelligence (AI) for misuse and provides suggestions and guidelines for how AI research can be utilized responsibly. 34, 35 In addition, other current literature offers guidelines for presenting ML and DL research to ensure the reproducibility of the results. 14 We would also recommend consideration of this literature by any researcher who wishes to read, review, develop, or utilize ML and DL models. The recommendations in this paper are complementary to these works. Figure 7 presents a guideline to avoid the methodological errors covered in this paper.

Medical image analysis is interdisciplinary, requiring contributions from imaging, computational, and medical experts. A lack of expertise in one of these domains might lead to models that suffer from a lack of generalizability. For example, if medical expertise is present, it is unlikely that a comparison between pediatric samples and adult patients would be considered a valid experimental design, as there are substantial differences in the anatomical and imaging components between these two groups of patients. Furthermore, a model built to classify COVID-19 versus normal using lung X-rays would immediately be recognized by a medical expert as requiring a more rigorous evaluation to ensure the model does not falsely detect other lung abnormalities as COVID-19. Collaboration and cooperation amongst various experts at each stage of medical image analysis is essential for the development of ML and DL models that can ultimately be applied in a clinical setting.

In this work, we demonstrated how three categories of design errors could lead to misleading results for conventional radiomic and deep learning studies performed on medical images. The insights and the guidelines provided in this work can be used for designing machine learning studies that increase algorithm generalizability and could be of interest to researchers involved in developing ML and DL models for use in a clinical setting.

Assumption of Independence: To develop machine learning models, it is a common practice that the available data is split into training, validation, and test sets. The training set is used to learn model parameters, the validation set is used to select model hyper-parameters, and the test set is used to provide an unbiased estimate of the model generalization error. To provide an unbiased estimate of the generalization error, the test set needs to be independent of the training set. However, the validity of this design is contingent on the assumption of independence.
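One common way to realize this three-way split, sketched here with placeholder data, is to apply a stratified train_test_split twice so that class proportions are preserved in each set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 20)            # placeholder features
y = np.random.randint(0, 2, size=1000)  # placeholder binary labels

# First carve out the test set, then split the remainder into training and validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)  # 0.25 x 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))  # 600 / 200 / 200
```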
Oversampling: Class imbalance happens when there is a substantial difference between the number of samples from one class versus the others. Often, models developed using imbalanced datasets tend to undermine minority classes and focus on the majority classes. Oversampling is a technique used to alleviate this challenge by sampling with replacement from the original minority classes to artificially increase the number of samples from the minority class(es). Class imbalance is common in medical imaging datasets due to factors such as the rarity of some diseases and difficulties in imaging certain conditions. 28

Data augmentation: Data augmentation refers to computational methods used to generate new data points from existing ones. Data augmentation is commonly used when developing ML and DL models for image analysis, and it has been shown to improve the performance and generalizability of the resulting models. The use of data augmentation is essential in medical image analysis, where developing large-scale datasets is often impractical. 29 Data augmentation could also help alleviate class imbalance, which is common in the medical domain.

Batch effect: A batch effect happens when data from several sources are aggregated to develop a larger dataset and the class distributions of samples from these sources substantially vary, e.g., normal samples come from one MRI scanner and diseased samples come from another scanner.

Dice score and Intersection over Union (IoU): The Dice score is a measure of relative overlap and is defined as follows:

Dice(X, Y) = 2‖X ∩ Y‖ / (‖X‖ + ‖Y‖)

where X and Y are two segmentations, e.g., the ground truth and the model prediction. The IoU is calculated as follows:

IoU(X, Y) = ‖X ∩ Y‖ / ‖X ∪ Y‖

The Dice score and IoU take values between 0 and 1, where a value of 1 represents a perfect overlap between X and Y, and 0 represents no overlap.
Intra-and interobserver variability in CT measurements in oncology
Interobserver variability in quality assessment of magnetic resonance images
Machine learning study of several classifiers trained with texture analysis features to differentiate benign from malignant soft-tissue tumors in T1MRI images
Classification of brain tumor type and grade using MRI texture and shape in a machine learning scheme
Automated brain histology classification using machine learning
Machine learning applications in cancer prognosis and prediction
Clinical decision support of radiotherapy treatment planning: a data-driven machine learning strategy for patient-specific dosimetric decision making
Key challenges for delivering clinical impact with artificial intelligence
The myth of generalisability in clinical research and machine learning in health care. The Lancet Digital Health
QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies
Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement
Radiomics: the bridge between medical imaging and personalized medicine
PROBAST: a tool to assess the risk of bias and applicability of prediction model studies
Checklist for artificial intelligence in medical imaging (CLAIM): a guide for authors and reviewers
Data from Head-Neck-Radiomics-HN1 [Data set]. The Cancer Imaging Archive
Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach
The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository
Data from lung CT segmentation challenge. The Cancer Imaging Archive
Autosegmentation for thoracic radiation treatment planning: a grand challenge at AAPM 2017
Pathologist-level classification of histologic patterns on resected lung adenocarcinoma slides with deep neural networks
Pneumonia detection in chest X-ray images using an ensemble of deep learning models
Computational radiomics system to decode the radiographic phenotype
Deep residual learning for image recognition
Imagenet: A large-scale hierarchical image database
Deep learning
Albumentations: fast and flexible image augmentations
Hidden stratification causes clinically meaningful failures in machine learning for medical imaging
Overview of machine learning: part 2: deep learning for medical image analysis
Improving patch-based convolutional neural networks for MRI brain tumor segmentation by leveraging location information
Machine learning algorithm validation: from essentials to advanced applications and implications for regulatory certification and deployment
Minimizing acquisition-related radiomics variability by image resampling and batch effect correction to allow for large-scale data analysis
The Impact of Digital Histopathology Batch Effect on Deep Learning Model Accuracy and Bias
Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans
How to be responsible in AI publication
Improving neural networks by preventing co-adaptation of feature detectors
Adam: A method for stochastic optimization
Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis
Artificial intelligence and machine learning in radiology: current state and considerations for routine clinical implementation
Brain tumor segmentation in MR images using a sparse constrained level set algorithm
Precision digital oncology: emerging role of radiomics-based biomarkers and artificial intelligence for advanced imaging and characterization of brain tumors
A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. The Lancet Digital Health
Prediction models-development, evaluation, and clinical application
Radiomic analysis reveals prognostic information in T1-weighted baseline magnetic resonance imaging in patients with glioblastoma
Clinical value of radiomics and machine learning in breast ultrasound: a multicenter study for differential diagnosis of benign and malignant lesions
Deep learning in histopathology: the path to the clinic
Captum: A unified and generic model interpretability library for pytorch

We used the scikit-learn Python package for feature selection and for building the random forest models. From the feature_selection module, we used VarianceThreshold for removing constant features, i.e., features with zero variance. We also used SelectKBest for selecting the 100 best features, where the f_classif function was used to assign a score to each feature. RFECV with a support vector classifier (SVC) was used for recursive feature elimination to select the final 10 features used for model building. We used RandomForestClassifier from the ensemble module for building the random forest models. The hyperparameters were tuned using GridSearchCV.

Here we show that the Dice score is always greater than or equal to the IoU.
For arbitrary sets A and B, where ‖X‖ represents the size of a set X, we have

‖A‖ − ‖A ∩ B‖ ≥ 0 and ‖B‖ − ‖A ∩ B‖ ≥ 0,

and therefore

(‖A‖ − ‖A ∩ B‖) + (‖B‖ − ‖A ∩ B‖) ≥ 0, i.e., ‖A‖ + ‖B‖ − 2‖A ∩ B‖ ≥ 0.

Multiplying both sides of this inequality by the non-negative number ‖A ∩ B‖ gives us the following:

‖A ∩ B‖ (‖A‖ + ‖B‖) − 2‖A ∩ B‖² ≥ 0.

Since ‖A ∪ B‖ = ‖A‖ + ‖B‖ − ‖A ∩ B‖, the left-hand side equals 2‖A ∩ B‖ ‖A ∪ B‖ − ‖A ∩ B‖ (‖A‖ + ‖B‖), so

2‖A ∩ B‖ ‖A ∪ B‖ ≥ ‖A ∩ B‖ (‖A‖ + ‖B‖).

Dividing both sides by the positive quantity ‖A ∪ B‖ (‖A‖ + ‖B‖), which is nonzero whenever A and B are not both empty and hence whenever the two metrics are defined, yields

2‖A ∩ B‖ / (‖A‖ + ‖B‖) ≥ ‖A ∩ B‖ / ‖A ∪ B‖,

that is, Dice(A, B) ≥ IoU(A, B).