key: cord-0821621-hglp1vpm authors: Peña-Solórzano, Carlos A.; Albrecht, David W.; Bassed, Richard B.; Burke, Michael D.; Dimmock, Matthew R. title: Findings from machine learning in clinical medical imaging applications – Lessons for translation to the forensic setting date: 2020-10-18 journal: Forensic Sci Int DOI: 10.1016/j.forsciint.2020.110538 sha: 1cbc613b90491d9db25d3bb6ef16e150b42ca6c0 doc_id: 821621 cord_uid: hglp1vpm Machine learning (ML) techniques are increasingly being used in clinical medical imaging to automate distinct processing tasks. In post-mortem forensic radiology, the use of these algorithms presents significant challenges due to variability in organ position, structural changes from decomposition, inconsistent body placement in the scanner, and the presence of foreign bodies. Existing ML approaches in clinical imaging can likely be transferred to the forensic setting with careful consideration to account for the increased variability and temporal factors that affect the data used to train these algorithms. Additional steps are required to deal with these issues, by incorporating the possible variability into the training data through data augmentation, or by using atlases as a pre-processing step to account for death-related factors. A key application of ML would then be to highlight anatomical and gross pathological features of interest, or to present information that helps optimally determine the cause of death. In this review, we highlight the results and limitations of applications in clinical medical imaging that use ML, to determine key implications for their application in the forensic setting. Forensic radiology is not clinical radiology applied to a deceased person. In the forensic setting, findings that a clinical radiologist may not typically have encountered are commonplace [1], e.g. post-mortem gas formation [2].
Post-mortem computed tomography (PMCT) is widely used in forensic investigations, where acquisition protocols used during clinical CT are not applicable due to rigor mortis and a reluctance to reposition the decedent, to avoid tampering with evidence. However, CT scans can be acquired with higher doses and there is no patient motion, thereby improving image quality. Additionally, recent developments such as PMCT angiography (PMCTA) with specialized pumps allow the diagnosis of vascular lesions whilst maintaining the integrity of anatomic structures, thus preserving evidence [3, 4]. In order to overcome the limitations in soft tissue contrast and the lack of vascular visualization provided by PMCT [5], post-mortem magnetic resonance imaging (PMMRI) is increasing in impact, albeit in a small way thus far. Whilst PMMRI offers improved soft tissue contrast, for vascular diagnoses it presents similar performance to PMCTA, with higher associated cost. However, applications to cardiac imaging are an exception, due to improved visualization of the coronary arteries and myocardium [5]. Image processing typically involves segmentation, feature extraction, and classification. Image segmentation refers to the partitioning of a digital image into multiple segments that are sets of pixels (or voxels), which usually represent discrete structures. Approaches to image segmentation prior to ML included probabilistic atlases [12, 13], statistical shape models (SSMs) [14, 15], graph-cut (GC) algorithms [16, 17], and multi-atlas segmentation (MAS) [18]. Feature extraction is a dimensionality reduction technique used to efficiently represent parts of an image as a compact feature vector. Feature extraction was traditionally performed by determining properties such as first-order textures (e.g. mean or entropy) or correlations [19, 20]. Image classification is the process of taking an image or volume and predicting whether it belongs to one of a list of predefined classes.
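The first-order texture features mentioned above can be sketched as follows. This is a minimal NumPy illustration of the general idea (mean, variance, and histogram entropy of an image patch), not a reproduction of any cited method; the function name is our own.

```python
import numpy as np

def first_order_features(patch, n_bins=16):
    """Compute simple first-order texture features (mean, variance, entropy)
    from a 2D image patch -- a compact feature vector for later classification."""
    patch = np.asarray(patch, dtype=float)
    hist, _ = np.histogram(patch, bins=n_bins)
    p = hist / hist.sum()              # normalized grey-level histogram
    p = p[p > 0]                       # drop empty bins before taking the log
    entropy = -np.sum(p * np.log2(p))  # Shannon entropy of the intensities
    return np.array([patch.mean(), patch.var(), entropy])

# Example: a perfectly uniform patch has zero variance and zero entropy
flat = np.full((8, 8), 100.0)
features = first_order_features(flat)
```

Such hand-crafted descriptors were typically concatenated into a feature vector and passed to a separate classifier, which is precisely the pipeline stage that learned features later replaced.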
Traditional approaches to classification included linear- and normal-discriminant analysis [21, 22]. A variety of ML alternatives to each of these image processing tasks have now been proposed, and pipelines that can automate many diagnostic and prognostic tasks have been introduced to reduce the burden on radiologists [23, 24]. ML techniques can be categorized as supervised learning, unsupervised learning, and reinforcement learning. In supervised environments, the data are composed of input-output patterns, and the task is to find a deterministic function that can predict the output from an observed input. Unsupervised techniques are a type of self-organized learning that extracts structure from the training samples directly, without pre-existing labels [25]. More recently, self-supervised techniques, a type of unsupervised learning where the training data is automatically labelled by exploiting the relations between different input signals, are being studied as a way to better utilize unlabeled data [26]. Reinforcement learning, on the other hand, is based on trial and error: the algorithm evaluates the current situation, takes an action, and receives feedback from the environment, which can be positive or negative [27]. The most common ML techniques used in medical applications are summarized below. Random forests (RFs) operate by creating a multitude of decision trees (Fig. 1) that can be trained for classification and regression tasks [28, 29], where the output is obtained by majority vote. Majority vote is a technique used to combine the outputs from multiple classifiers, with the voting rule following one of three forms: (i) unanimous voting, where all the individual votes must agree on one output class; (ii) simple majority, where a class is selected only if it receives more than half of the votes; and (iii) plurality voting, where the class with the highest number of votes is chosen [30].
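The bagging-and-plurality-vote idea behind RFs can be sketched as below: several decision trees are each trained on a bootstrap resample and their predictions are combined by plurality vote. This is a simplified illustration assuming scikit-learn is available; a full RF additionally randomizes the subset of features considered at each split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Train several trees, each on a bootstrap resample of the training data
trees = []
for _ in range(15):
    idx = rng.randint(0, len(X), len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

def plurality_vote(samples):
    """Combine per-tree predictions: the class with the most votes wins."""
    votes = np.stack([t.predict(samples) for t in trees])  # (n_trees, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

pred = plurality_vote(X)
accuracy = (pred == y).mean()
```

Swapping the plurality rule for unanimous or simple-majority voting only changes the final aggregation step; the ensemble itself is unchanged.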
In k-NN, the training samples are divided into classes, and a new sample or test point is classified by a majority vote of its neighbors (Fig. 2). The algorithm uses a distance function to find the k (defined by the user) closest training samples in the feature space, and assigns the class that is most common within that subset. Artificial neural networks (ANNs) are inspired by the biological nervous system. ANNs contain a large number of highly interconnected nodes (called neurons) separated into layers (Fig. 4), enabling the network to process different pieces of information while considering constraints to coordinate internal processing, and to optimize its final output [25, 33]. Convolutional neural networks (CNNs) were inspired by the connectivity pattern of the animal visual cortex. Neurons respond to stimuli only in a restricted region (receptive field) of the previous layer, where the receptive fields of different neurons partially overlap until they cover the entire visual field (Fig. 5). Unlike other ML techniques, the network learns the filters that would otherwise be hand-crafted. CNNs also exploit the strong spatially local correlation found in images, allowing features to be detected regardless of their position. In recent years, Deep Neural Networks (DNNs), which differ from ANNs by their depth (the number of neuron layers), have proven successful in solving diverse problems, mainly due to their capacity to learn features from large datasets [34]. It should be noted that in the following discussion, algorithmic performance is assessed in terms of Dice's coefficient (DC), the modified Hausdorff distance (MHD), and the area under the receiver operating characteristic curve (AUC or AUROC), where possible. The quantification usually starts with calculation of true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs).
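These counts, and the overlap and rate metrics defined in the following discussion, can be sketched as a short NumPy illustration for the binary case; the function names are our own.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Derive sensitivity (recall), specificity, fall-out, and precision
    from TP/TN/FP/FN counts for a binary problem (1 = positive class)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "sensitivity": tp / (tp + fn),  # recall / true positive rate
        "specificity": tn / (tn + fp),
        "fall_out": fp / (fp + tn),     # false positive rate
        "precision": tp / (tp + fp),
    }

def dice(seg_a, seg_b):
    """Dice coefficient: overlap of two binary masks, 0 (none) to 1 (identical)."""
    a, b = np.asarray(seg_a, bool), np.asarray(seg_b, bool)
    return 2 * np.sum(a & b) / (np.sum(a) + np.sum(b))

# Toy example: 3 actual positives, 3 actual negatives; one FN and one FP
m = classification_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```

Note that sensitivity and recall are the same quantity; papers in this review use both names, which is one source of the metric-reporting variability discussed later.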
A TP is a case correctly classified as belonging to the class, whereas an FP is a case wrongly classified as belonging to it. Conversely, TN and FN refer to cases correctly and incorrectly classified as not belonging to the class, respectively. The DC quantifies the overlap between the processed image and a defined ground truth, ranging from zero (no overlap) to unity (identical segmentation). The MHD is a measure of similarity between two objects based on their shape attributes. The AUC combines information about the true positive rate, or sensitivity, and the false positive rate, or fall-out. Sensitivity measures the proportion of actual positives that are correctly identified, while fall-out indicates the proportion of negatives wrongly classified as positive. Conversely, specificity measures the proportion of negatives that are correctly identified. Recall is the ratio of TPs to the sum of TPs and FNs, indicating the proportion of actual positives that are correctly identified. Precision is the ratio of TPs to the sum of TPs and FPs, indicating the proportion of identified positives that are correct [35]. The currently reported use of ML in forensic post-mortem imaging is in its infancy. ML has only been trialed in a few specific forensic applications, including automatic forensic dental identification [36]; sex determination [37, 38, 39]; the automation of bone age assessment [40, 41]; prediction of bone fractures [42]; and the automatic detection of hemorrhagic pericardial effusion [43]. As far as we are aware, none of these studies has translated into daily forensic practice, despite the potential to streamline case-work. The legally robust identification of a decedent is the first objective when their body is triaged for a post-mortem. Dental analysis and comparison of ante-mortem and post-mortem information is one of the recognized tools for determining a decedent's identity.
This traditionally requires an odontologist to find the best match in an ante-mortem database, using features such as dental restorations, pathologies, and tooth and bone morphologies. Zhang et al. [36] proposed a new descriptor that encodes the local shape of a person's dental features. They subsequently used an RF classifier to match the features of the unknown person to those in the database (n=200). The result yielded 100% accuracy for complete (n=20) and incomplete (n=20) feature datasets, where incomplete datasets were derived from cases involving trauma. The method was shown to be rotationally and translationally invariant, and was orders of magnitude faster than conventional 2D methods. It is important to note that the database was constructed using a surface laser scanner on plaster samples rather than PMCT scans. Accurate determination of the sex of a decedent also aids the identification process. Several different approaches have been used for sex estimation. Arigbabu et al. [37] utilized 100 head PMCT scans, combining and evaluating six local feature representations, two feature learning algorithms, and three classification algorithms. This technique of combining multiple features and classifiers is often used in ML pipelines, as it has been shown to improve accuracy and reliability. The best prediction rate was 86%, which was within the reported sex prediction range for applications that use cranial features. The small number of cases, obtained only from South East Asia, limited the generalizability of the results. Anderson et al. [38] utilized morphological gray matter differences on MRIs to differentiate between male and female incarcerated offenders, with implications for cognitive neuroscience research. Preprocessing steps were described, including realignment and image registration, to obtain the volume and density of the gray matter in each case utilizing Statistical Parametric Mapping software (SPM12; http://www.fil.ion.ucl.ac.uk/spm).
Source-based morphometry (SBM) was utilized to extract features from the gray matter spatial information, with SBM being able to identify distinct regions with common covariation between subjects. A number of ML classification approaches were trialed; however, only an SVM and logistic regression were described, as they presented the highest classification accuracy (94%). Limitations included the use of volumetric brain data only, without accounting for other moderating variables and quantitative methods, such as age, functional activity, and structural and functional connectivity. Ortiz et al. [39] compared five different ML techniques in the assessment of panoramic radiographs. The ANN outperformed the rest of the models, including k-NNs and logistic regression, with an accuracy of 89%. Only 100 panoramic radiographs were used, limiting the statistical significance of the results. As with the identification of a decedent's sex, their estimated age is also an important parameter for streamlining the identification process. Štern, Payer and Urschler [40] compared two ML approaches, RFs and deep convolutional neural networks (DCNNs), to determine age (through regression) and to distinguish minors from adults (classification) using bone ossification from MRI scans of the hand/wrist. As a general note, DCNNs are often compared with RFs because a DCNN can determine the most important features itself, whereas an RF must be supplied with those deemed important by the user. To better study the impact of different input information on the decision process, three strategies were tested: the use of the whole hand, a cropped image containing age-relevant bones, or the hand-crafted, filter-based enhancement of the epiphyseal gap. The best mean absolute error and standard deviation results with respect to the biological age (as estimated by radiologists) were 0.20±0.42 and 0.23±0.45 years for the DCNN using cropped structures and the RFs using enhanced images, respectively.
The results were reported to achieve new state-of-the-art accuracy compared with previous MRI-based methods and the authors' earlier work. Furthermore, when the technique was adapted for 2D MRI, the method was in line with state-of-the-art methods using X-ray data. Limitations of this work included the requirement for age-relevant anatomical information, which implies a labor-intensive pre-processing step, and decreased accuracy for cases with biological ages greater than 18 years. In an alternative approach, Li et al. [41] utilized pelvic X-ray images and a DCNN to create a bone age assessment pipeline which yielded a mean error of 0.94 years, 0.36 years better than the existing reference standard. This work used transfer learning from a CNN pre-trained on the ImageNet database [44], achieving an appropriate accuracy for this type of input data. Transfer learning is widely used in ML applications and is particularly useful when only small or unbalanced datasets are available. Limitations acknowledged by the authors included the lack of diversity in patient ethnicity, and the exclusion of images with artefacts and diseases. Many forensic institutions utilize PMCT to guide the pathologist in their approach to the autopsy. PMCT is particularly useful for identifying fractures due to the high attenuation of bone. Heimer et al. [42] used an undisclosed DCNN from dedicated software (VIDI, Cognex, Natick, MA, USA) to predict the presence of skull fractures using 150 head PMCT scans (75 scans each with and without fractures). The skulls were preprocessed through the generation of curved maximum intensity projections, so that the skull's surface could be unfolded onto a single image. The best-performing network yielded an AUROC of 0.965, a sensitivity of 91.4%, and a specificity of 87.5%. An AUROC of 0.5 defines a model that classifies at random, while 1.0 defines a completely accurate model.
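This behaviour of the AUROC can be demonstrated with scikit-learn's `roc_auc_score` on synthetic labels (a small self-contained check, not data from the cited studies):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, 1000)  # synthetic binary ground truth

# A scorer that ranks every positive above every negative is perfect: AUROC = 1.0
perfect_auc = roc_auc_score(y_true, y_true.astype(float))

# Scores carrying no information about the labels sit close to AUROC = 0.5
random_auc = roc_auc_score(y_true, rng.rand(1000))
```

Because the AUROC is computed from the ranking of scores rather than from a single decision threshold, it summarizes the sensitivity/fall-out trade-off across all possible thresholds.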
PMCT is also useful for assessing many aspects of cardiac condition prior to autopsy, e.g. the appearance of discontinuities of the aortic wall can be a direct sign of injury to the aorta, whereas the appearance of a blood collection within the chest cavity (hemothorax or hemopericardium) can be an indirect sign [45, 46]. These signs, observed on plain film X-ray or PMCT, must be interpreted by radiologists and forensic pathologists. Ebert et al. [43] used two separate, undisclosed DCNNs from dedicated software (VIDI) to classify PMCT images with or without hemopericardium and to segment the corresponding blood content. The average DC, recall, and precision for the classification task were 77%, 77%, and 85%, respectively. For segmentation, the values obtained were 78%, 78%, and 79%, respectively. Limitations of this study include the small number of training cases (n=14 cases with hemopericardium), while the use of dedicated software restricted the training data to individual slices, sometimes losing crucial volumetric information. Due to the dearth of information relating to the application of ML to forensic imaging, it is important to review the state-of-the-art and establish lessons learned from the significant body of literature describing its application to clinical image analysis.
Current clinical applications
ML techniques have been used in the diagnosis and prognosis of diseases, as well as for segmentation, classification, and measurement of anatomical structures [24, 47]. In this review, the ML applications have been grouped according to the tissue or organ studied, where brain, lungs, and skeleton were chosen to highlight results and limitations. Each anatomical section concludes with a summary evaluating the key implications determined from the clinical literature and their application in the forensic setting. Traditional atlas-based segmentations require registration to align the atlas images to the unseen image.
In contrast, ML approaches can learn the variability between patients, making them especially useful in forensics, where variance is greater than in clinical imaging. ML can also be used in combination with atlas-based approaches or on its own. As an example of the former, Srhoj-Egekher et al. [48] used atlas-based segmentation for pre-processing T2-weighted MRI neonatal brain images to obtain initial probabilities, which were subsequently refined using a k-NN approach. Whilst this approach achieved DCs and MHDs ranging from 77% to 93% and 0.35 to 2.86, respectively, the assignment of a tissue classification to each voxel independently, post atlas registration, meant some voxels were attributed to more than one class, while background voxels were unclassified. Conversely, Zhang et al. [49] opted for purely ML approaches that analyzed image patches for segmentation of infant brains (n=10) into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF). Four network architectures were tested and, in most cases, the CNN method significantly outperformed SVMs and RFs, with overall DC scores and MHDs of 85% and 0.32, respectively. The CNN method also outperformed two other common image segmentation methods: coupled level sets (CLS) and majority voting (MV). Three further publications were found where the authors segmented similar structures within adult brains. Van Opbroek et al. [50] applied an SVM for pixel-wise classification to registered volumes from a variety of MRI sequences for patients with diabetes and controls. The resulting segmentation of eight different tissue types demonstrated limited success (Table 1). The SVM showed poor performance in low contrast areas, while atlas misregistration caused voxels to be improperly classified. Moeskops et al. [51] used CNNs to process T1-weighted scans to segment the same eight tissue types. With CNNs, the use of different sized patches during training allowed for smooth segmentation and analysis of local texture.
In general, the CNNs delivered better segmentation (Table 1), although this was for a different patient cohort. A more recent application used 3D DCNNs [52] to identify 25 brain structures in T1-weighted MRI scans (n=30). Again, image patches were utilized as input to the network; however, spectral and Cartesian coordinate information relating to the patches was added after the convolutional layers (e.g. see arrow in Fig. 6) in order to introduce spatial information, which substantially increased the segmentation accuracy. ML can also be used for the assisted diagnosis of neurodegenerative diseases. Salvatore et al. [53] used a combination of principal component analysis (PCA) with an SVM to classify morphological MRI sequences as belonging to patients with Parkinson's disease (n=28), progressive supranuclear palsy (PSP) (n=28), or controls (n=28). The large cohort sizes, inter-class cohort balance, and separation between PSP patients and other parkinsonian variants were identified as particular strengths compared to other papers. The performance (accuracy, specificity, and sensitivity were all > 80%) of the model was shown to be limited by the number of principal components (16 to 26) utilized for classification. This dependence is an important consideration when using dimensionality reduction techniques and was also demonstrated for approaches that classified Alzheimer's disease [54]. Finally, ML techniques have also been used to segment and classify brain tumors. Zacharaki et al. [55] used conventional and perfusion MRI from patients with a diagnosis of intra-cranial neoplasm to classify them by type and grade of tumor (n=98). Their approach consisted of region of interest (ROI) definition, feature extraction, feature selection, and classification by SVMs. For comparison, linear discriminant analysis (LDA) and k-NN were also implemented. The mean classification accuracy was 91% for the SVM approach, compared with 81% for LDA and 90% for k-NN.
Some of the limitations related to the lack of selected features describing the deformation of healthy structures due to the tumor, and the utilization of ROIs, which introduced inter-observer variability. Once the presence of a tumor is verified, one possible subsequent step is segmentation of the pathology, which is challenging even for experienced neuroradiologists [56]. To address this segmentation problem, a variant of CNNs named U-net is often employed [57]. Beers et al. [58] utilized two 3D U-nets connected sequentially to perform whole tumor, enhancing tumor, and tumor core segmentation, achieving mean DCs for the test set (n=95) of 84%, 70%, and 71%, respectively. When the methodology was implemented on patients from ongoing clinical trials, the mean DCs decreased to 66%, 54%, and 45%, respectively. The lower performance on the clinical trial patients was attributed to the scans being post-operative, highlighting the importance of case selection for training. Studies on brain tissues mostly used MRI data due to the multi-modality information and good soft tissue contrast. Whilst the specific pathologies discussed are not all relevant to the forensic setting, the general conclusions deduced from the segmentation and localization of anatomical abnormalities are. Models that utilized dimensionality reduction techniques prior to classification were shown to yield performances dependent on the number of selected components. In addition, the identification of abnormalities in biological tissues required features capable of describing complicated deformations of the healthy structures. For CNNs, the performance of the pipeline depended significantly on the training set adequately representing the expected cases. In general, CNNs outperformed algorithms such as SVMs, RFs, CLS, and MV in segmentation and classification tasks. Note that some studies used small datasets, which limited statistical power.
In addition, as will be demonstrated throughout this review, a combination of the variability in reporting of metrics, the lack of reporting of a diagnostic odds ratio [59], the unavailability of datasets and reference implementations, and the effect of imbalanced data on classification accuracy, common in medical datasets [60, 61], made it difficult to compare papers quantitatively. In forensics, PMCT does not provide good resolution of internal cranial structures or brain metastases, and in general, the resolution is not sufficient to identify neurodegenerative issues, but degeneration can sometimes be observed in defined structures, e.g. in the caudate nucleus in Huntington's disease. On the other hand, PMCT is adequate for showing evolving brain infarcts and for displaying collections of blood, e.g. subdural hemorrhages (which are reasonably common). PMCT can also show intra-parenchymal hemorrhages and parenchymal hemorrhagic contusions. Intra-parenchymal hemorrhages, e.g. hypertensive hemorrhage, classically involve distinct areas of the brain: the basal ganglia, thalamus, pons, and cerebellar hemispheres. Parenchymal hemorrhagic contusions are classically seen as contrecoup basal frontal lobe contusions (bleeding within brain tissue occurring on the opposite side of the head to the primary injury site) when someone falls onto the back of their head, often associated with an occipital skull fracture. In ML, feature learning refers to the automatic discovery of meaningful representations from raw data, in contrast to manual feature engineering, where the features have to be chosen by a domain expert. Feature learning allows for end-to-end learning, where a complex system can be represented by a single model, bypassing the intermediate stages present in traditional workflow designs. Learning a representation of any tissue is a useful process if subsequent classification is required, or if the goal is to find differences between samples in the training data.
The quality of the representation is highly dependent on the learned features. A restricted Boltzmann machine (RBM) is a generative neural network that can be used to perform automatic feature learning. Li et al. [62] used a Gaussian RBM with a training dataset consisting of different sized patches obtained from high-resolution lung CT images (n=92), with the purpose of classifying five tissue types using SVMs. The best accuracy obtained was 84%, with a high rate of FPs caused by the similarity between tissues. Van Tulder and de Bruijne [63] utilized convolutional RBMs, adding learning objectives that helped the algorithm extract features for description and for training data classification. The training data consisted of CT scans (n=73) with five types of tissue classified. The resulting accuracies were <75% and 85-90% for the classification of lung patches and airway centerlines, respectively. The low accuracies were attributed to the small training sets and the number of extracted filters, limited by computational restrictions. Netto et al. [64] utilized examinations (n=50) with 198 identified nodules and an SVM to classify structures as nodule or non-nodule. The resulting accuracy was 91%, with a sensitivity of 86%. The largest errors were reported when the candidate structure was very large or very small, where it could be mistaken for another structure or for the continuation of one. Hua et al. [65] used images containing nodules from the Lung Image Database Consortium (LIDC) CT dataset to train both a CNN and a deep belief network (DBN) constructed by stacking RBMs. The performance of the two networks was then compared with two feature-based methods (Table 2). The major limitation reported was the resizing of the input images, which discarded size cues that are important indicators of malignancy. Kumar et al. [66] also classified the lung nodules in the LIDC images (Table 2) using an autoencoder (AE) and a binary decision tree classifier (BDT).
An AE is an unsupervised deep learning technique utilized for feature extraction, while a binary decision tree is a specialized classifier in which every node has only two branches. The false positive rate of 39% was attributed to the visual similarity between benign and malignant cases, which can be compared to the 27% rate obtained in The National Lung Screening Trial (NLST) using low-dose CT (LDCT) [67]. A more recent study compared massive-training artificial neural networks (MTANNs) against CNNs [68] using a database of LDCT scans (n=38) consisting of 1057 slices. MTANNs are an extension of ANNs, where a large number of overlapping sub-regions are created for each voxel of the original image and used as inputs to the network. The reported AUROC was 0.88 for the MTANN, and 0.78 for the best of the four CNN architectures. The MTANN required fewer training samples than the CNNs to achieve better classification performance. This was attributed to the hierarchies of the learned features, where the MTANN learned to detect lesions utilizing low-level features, while the CNNs extracted low-, mid-, and high-level features, increasing their reliance on irrelevant characteristics. A recent focus of attention has been the use of ML for early diagnosis, assessment of severity, and differentiation between the novel coronavirus (COVID-19) and community acquired pneumonia (CAP) from CT scans. Barstugan et al. [69] utilized n=150 CT abdominal images from 53 infected patients, five feature extraction methods, and an SVM for the final classification, achieving a maximum accuracy of 99.7%. The main limitation of their work was the manual selection of the patches obtained from the original images and used for training, which restricts the usability and reproducibility of this approach. Tang et al. [70] assessed the severity (severe, non-severe) of the disease from chest CT images from 176 patients, utilizing quantitative measures, e.g.
the ratio between the volume of the whole lung and the volume of ground-glass opaque regions, with several RF models. The best performing RF yielded results of 93%, 75%, 88%, and 91% for the sensitivity, specificity, accuracy, and AUC, respectively. To differentiate between COVID-19, CAP, or non-pneumonia, Li et al. [71] collected 4356 chest CT exams from 3322 patients. A DCNN with an architecture denoted COVNet was utilized, able to classify the volumetric data with a sensitivity, specificity, and AUC of 90%, 96%, and 96% for COVID-19 cases, 87%, 92%, and 95% for CAP cases, and 94%, 96%, and 98% for non-pneumonia cases, respectively. A limitation of this work was the lack of laboratory confirmation for each case, as COVID-19 can have imaging characteristics similar to those of other viral pneumonias. Studies on lungs generally used CT scans for the segmentation of tissues and tumors, and the classification of nodules for early cancer diagnosis. Due to the low contrast between different tissues in the lungs, the approaches reported relied on shape, texture, and feature size. Segmentation performance was poor for nodules at the size extremes. Major findings included lower performance due to image resizing, and the importance of reporting FP rates, which can be high in applications that intend to determine nodule malignancy. Potential applications in the forensic setting include the detection of emphysema, consolidation of lung parenchyma (pneumonia), and, if appropriate windows are used, interstitial changes. Of crucial forensic interest is the presence of blood and fluid in the chest. Furthermore, establishing the presence of a lung lesion (and especially more than one) independently of the cause of death may indicate the presence of occult malignancy. In such cases, the deceased's next of kin can be alerted, and the family contact nurses can organize appropriate follow-up for family members if a cancer is found.
It is important to note that the appearance of the lungs in PMCT can be affected by aspiration of gastric content that may occur in the process of dying, e.g. from a 'heart attack'.
Skeleton
Skeletal segmentation usually occurs before the measurement and/or diagnosis of bone or articular diseases. Koch et al. [72] segmented MRIs (n=110) of the wrist using marginal space learning (MSL) and RFs, where MSL incrementally learned classifiers in marginal spaces of lower dimensions [73]. The segmented images were used to compute a 3D model of every carpal bone, with AUCs of 0.88 for both scan modalities. The approach was an order of magnitude faster than previous work using a semi-automatic method. Similar literature did not report segmentation errors and could not be used for comparison. Bone age assessment from plain X-rays is used in pediatrics by comparing the results to chronological age for the evaluation of endocrine and metabolic disorders. A fully automated pipeline was presented by Lee et al. [74] using a pre-trained CNN (transfer learning). Both male and female test X-rays were assigned a bone age within 1 year of the correct value over 90% of the time, and within 2 years over 98% of the time. X-rays have also been widely used for fracture detection, e.g. of the tibia [75], where texture and shape features were fed into three different ML algorithms: an ANN, k-NN, and SVM, and the outputs fused using a majority vote scheme. The combination of the classifiers using both types of features presented a significant improvement over using just one classifier, or only one feature type. Reported accuracies, precisions, and sensitivities were above 97%. Instead of fusing the results from the classifiers, multistage classifiers have also been used. Wels et al. [76] reported a fully automatic system using several RF stages, capable of detecting osteolytic spinal bone lesions from CT volumes with an average sensitivity of 75%.
The performance was affected by differences in contrast and noise characteristics between the training and testing data; however, accuracy values were not presented for further interrogation. Sharma et al. [77] measured trabecular bone microarchitecture and used the information to discriminate between healthy cases (n=10) and patients with Type 1 Gaucher disease (n=20). SVMs were used to classify different genotypes of the disease, achieving an average classification accuracy of 70%, sensitivity of 74%, and precision of 85%. The structure of the trabecular bone obtained from MRI has also been used to classify knees with osteoarthritis [78]. The characteristics found to relate to the disease were useful in distinguishing healthy from affected patients (n=159) with an AUC of 0.92, as well as in predicting the risk of cartilage loss. In a similar study, fractal analysis of X-ray images with SVMs enabled the automatic classification of osteoporotic patients (n=39) versus controls (n=38) with accuracies of up to 95% [79]. Limitations reported in the papers in this section include the small number of cases and the high percentage of patients at early stages of disease. Orthopedic ML applications include disease diagnosis, age assessment, and risk prediction, e.g. for osteoporosis and osteoarthritis. Plain-film X-ray and CT were most common; however, MRI studies of joints are increasingly being reported. The performance of ML applications was shown to be affected by the number and choice of features, which are significantly influenced by differences in contrast and noise characteristics in the datasets. Comparison or ranking of the results was limited by the performance metrics reported and by the use of databases that were not representative of the disease stages studied. Other limitations included small patient cohorts and long processing times.
The most common skeletal disorders that could be picked up on PMCT scans are osteoporosis and Paget's disease, while fracture diagnosis, and then diagnosis of fracture patterns, e.g. a "hangman's fracture", extension/tear-drop fractures of the cervical spine, and a spiral fracture of a long bone in an infant, are of significant forensic interest.

Discussion

Typical goals of ML techniques in medical imaging include the differentiation of healthy from diseased patients or tissues and the localization of pathologies in anatomic structures. Algorithmic performance can be significantly degraded when processing a new sample that differs substantially from the training dataset. This characteristic is especially important for applications in forensic medicine, where there is high variability in the structures and image acquisition protocols, and an unclear definition of what normal implies, due to changes arising from the circumstances of death, tissue decomposition, trauma, or incineration. However, some applications, e.g. organ localization, can be immediately translated to the forensic setting by using appropriate training data, or by using clinical medical images for the initial training of CNNs and then fine-tuning with forensic data. This is usually referred to as transfer learning. Conversely, given the size and availability of forensic databases, the opposite is also possible, with applications trained on forensic data and then fine-tuned for the clinical setting. To improve the capabilities of ML techniques, the training data can be modified, or more informative features can be used as inputs to the algorithms. The selection of features can be optimized using learning objectives [63] or by utilizing an unsupervised technique as a preprocessing step to the classification task [66, 80].
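The pretrain-then-fine-tune strategy described above can be illustrated schematically. The sketch below uses a linear model in place of a CNN and purely synthetic data: a model is first trained on a plentiful "clinical" corpus, then incrementally updated on a small "forensic" set whose distribution is shifted (a crude stand-in for post-mortem changes). All names and data here are assumptions for illustration only:

```python
# Sketch of the transfer-learning idea: pretrain on abundant "clinical"
# data, then fine-tune with incremental updates on a small, shifted
# "forensic" set. A linear SGD model stands in for a CNN; data synthetic.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def make_data(n, shift):
    """Two Gaussian classes; `shift` models the clinical-to-forensic change."""
    X0 = rng.normal(loc=-2.0 + shift, scale=0.5, size=(n, 2))
    X1 = rng.normal(loc=+2.0 + shift, scale=0.5, size=(n, 2))
    return np.vstack([X0, X1]), np.array([0] * n + [1] * n)

X_clin, y_clin = make_data(500, shift=0.0)   # large clinical corpus
X_for, y_for = make_data(30, shift=1.0)      # small forensic set
X_test, y_test = make_data(200, shift=1.0)   # forensic-like test data

classes = np.array([0, 1])
clf = SGDClassifier(random_state=0)
for _ in range(5):                            # pretraining passes
    clf.partial_fit(X_clin, y_clin, classes=classes)
for _ in range(20):                           # fine-tuning passes
    clf.partial_fit(X_for, y_for, classes=classes)
acc = clf.score(X_test, y_test)
```

In a real CNN pipeline the analogous step is loading clinically pre-trained weights and continuing training, often with early layers frozen, on the forensic data.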
The selected features can also be used to alleviate human labelling, by selecting more representative training data for the medical expert [81, 82]. Another approach to improving ML performance is the combination of several techniques using a majority vote scheme [75], or the use of multi-stage classifiers [58] for the segmentation of spatially related tissues. A wide range of algorithms was found during the review process, where SVMs outperformed techniques such as LDA and k-NN [55]; however, the trend in recent work has been the high performance of CNNs [49, 51]. The main disadvantage of classic ML approaches compared to CNNs is performance variability due to the quality of the features [53], which must be hand-crafted by an expert according to the goal and dataset. The selected feature pool is commonly processed to lower its dimensionality before training the classifier, using techniques such as PCA. It is important to note that the number of principal components or features retained at the end of this step plays a key role in classification performance [53]. The performance of the algorithms can also be significantly affected if the labelling process (diagnosis) is prone to error [54]. Furthermore, for medical and forensic applications, the common practice of resizing input images can lead to a loss of information that could be essential for diagnostic purposes [65]. An additional consideration is that some authors use radiologists to generate the reference labels and then benchmark the algorithm's performance against radiologists. Rajpurkar et al. [83], for instance, presented a CNN that achieved radiologist-level pneumonia detection on a database [84] for which no gold-standard label existed, and listed as a limitation the lack of information in the database, which affects the radiologists' accuracy.
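The PCA dimensionality-reduction step described above, and the sensitivity of the result to the number of retained components, can be sketched as follows; the digits dataset is a stand-in for a hand-crafted radiological feature pool:

```python
# Sketch: PCA dimensionality reduction before an SVM classifier. The
# number of retained components is a key tuning choice: too few discards
# discriminative information. Digits data stands in for a feature pool.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # 64 features per sample

accuracies = {}
for n_components in (2, 16):
    model = make_pipeline(StandardScaler(),
                          PCA(n_components=n_components),
                          SVC())
    model.fit(X, y)
    accuracies[n_components] = model.score(X, y)
```

Comparing the two fits shows the effect directly: collapsing 64 features to 2 components costs substantial accuracy, while 16 components retain most of the discriminative structure.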
It is also important to note that the lack of reporting of a diagnostic odds ratio [59] and the variability in the metrics reported make it difficult to compare papers. For the task of segmentation, multi-atlas algorithms and DCNNs with multiple patch sizes showed comparable results [48, 49]. Patch-based techniques could be a good approach in forensic cases where organs or structures are not located in the usual anatomic positions [63]. Furthermore, the use of different patch sizes in segmentation tasks allows for both a smoother separation and a detailed analysis of local texture [51]. Three important results for the use of ML in clinically-related applications were found that can also be applied in the forensic setting: firstly, temporal efficiency through the use of transfer learning; secondly, improved accuracy through the combination of ML classifiers using majority voting or multi-stage approaches; and finally, the addition of an active learning phase, in which human labor during labeling can be alleviated. One of the main issues affecting both the clinical and forensic settings is the lack of interpretability of predictions by black-box approaches such as neural networks. This is an active area of research; current approaches to addressing the concern include visual explanations for the class label under consideration, obtained from the convolutional-layer feature maps [85, 86], and attention mechanisms [87], which determine the parts of the input image most relevant to a particular classification. Furthermore, depending on the application, it is not necessary, and could be counterproductive, to completely automate a task; a human-in-the-loop can then be beneficial by reducing complexity through human input and assistance [82].
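The diagnostic odds ratio mentioned above is a single summary of a 2x2 confusion matrix, DOR = (TP x TN) / (FP x FN), i.e. the odds of a positive test in the diseased versus the non-diseased. A minimal sketch with an illustrative table:

```python
# Sketch: the diagnostic odds ratio (DOR), a single indicator of test
# performance whose absence from many papers hampers cross-study
# comparison. DOR = (TP*TN)/(FP*FN); 1 means uninformative.

def diagnostic_odds_ratio(tp, fp, fn, tn):
    """DOR from a 2x2 confusion matrix; larger is better."""
    if fp == 0 or fn == 0:
        return float("inf")  # degenerate, perfectly discriminating table
    return (tp * tn) / (fp * fn)

# Example: sensitivity 0.90 (90/100 diseased detected),
# specificity 0.95 (95/100 healthy correctly cleared)
dor = diagnostic_odds_ratio(tp=90, fp=5, fn=10, tn=95)
```

Because the DOR combines sensitivity and specificity into one prevalence-independent number, reporting it alongside the usual metrics would make the papers reviewed here directly comparable.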
Some ML applications already found in clinical medicine that could be repurposed for forensic medicine include: segmentation and classification of organs and structures, including arteries, small blood vessels, the liver, spleen, stomach, gallbladder, and pancreas [88, 89]; computation of 3D organ models [72] for virtual autopsies; detection of lesions and calcification on vascular cross-sections [90]; identification of bone and joint atrophies or disorders [77, 78, 79, 81]; estimation of fluid volume and composition in body cavities (blood, pus, ascites) [91]; and organ volume estimation, e.g. heart size with respect to body size [92]. Tasks in forensic radiology that, to our knowledge, have not been tackled using ML include: segmentation and classification of foreign bodies, differentiation between ante-mortem and post-mortem gases, calculation of body mass index, and determination of skeletal completeness after accidents. For the segmentation and classification of foreign bodies, e.g. bullets or metallic dental fillings, the main challenge is finding the object that does not belong inside the body. Furthermore, metallic components can create artefacts such as beam hardening in CT scans or field distortions in MRI [93], which can also be addressed using deep learning [94]. Differentiation between ante-mortem and post-mortem gases can be difficult using the voxel values of CT scans or MRI, so emphasis should be placed on understanding the expected location and evolution of these gases at different points in time [95]. Similarly, differentiation between acute and remote infarction in the brain, which on a CT scan can be characterized by voxel values and tissue volume changes, can be tackled utilizing existing tissue classification techniques [50, 51, 54], with the addition of new classes to differentiate the types of infarction. In forensic anthropology, tasks that could be addressed using ML include: determination of skeletal completeness after accidents [96], e.g.
plane crashes; 3D reconstruction of incomplete bones, which could be extrapolated from the work by Hermoza and Sipiran [97] on incomplete archaeological objects; and 3D reconstruction of fractured skulls [98, 99, 100], used to infer a cause of death or to perform facial reconstruction. In addition to the aforementioned applications traditionally related to medical imaging, there is potential for the use of CT scans for facial identification [101, 102]. As a final note, the release in 2020 of the New Mexico Decedent Image Database (NMDID, https://nmdid.unm.edu/) [103] should be acknowledged as a significant step forward for the development of tools that can enhance the post-mortem workflow.

Conclusions

ML techniques have been applied to a large number of tasks in clinical medicine, where the algorithms most widely utilized with medical images include RFs, SVMs, and CNNs, with CNNs showing the best performance in the literature. Techniques to improve ML performance in radiology include data augmentation, improved feature selection, and algorithmic combination, e.g. majority voting. Performance was shown to be affected by resizing of the input images and by the accuracy of the labels provided with the training data. In addition, benchmarking was found to be difficult due to the lack of gold-standard labels, the variability in the metrics reported, and the lack of reporting of a diagnostic odds ratio. ML applications investigated for clinical medicine could be repurposed to the forensic domain with careful consideration to account for the increased variability and temporal factors, e.g. decomposition, that affect the data used to train the ML techniques. Due to the complexity of the autopsy process, a key application of ML to forensic radiology would be to streamline decedent identification and to highlight and annotate areas of forensic interest.
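Several of the quantitative tasks discussed above, such as estimating cavity-fluid or organ volumes, reduce to counting labeled voxels in a segmentation mask and multiplying by the voxel volume from the scan metadata. A minimal sketch (the mask and spacing values are illustrative):

```python
# Sketch: volume of a segmented region (e.g. blood in a body cavity, or
# an organ) from a binary mask and the voxel spacing in the scan header.
import numpy as np

def mask_volume_ml(mask, spacing_mm):
    """Volume in millilitres of the non-zero voxels of a 3D mask."""
    voxel_mm3 = float(np.prod(spacing_mm))  # volume of one voxel in mm^3
    return int(np.count_nonzero(mask)) * voxel_mm3 / 1000.0  # 1 mL = 1000 mm^3

# A 10x10x10 block of labeled voxels at 1 x 1 x 2 mm spacing = 2000 mm^3
mask = np.zeros((64, 64, 64), dtype=np.uint8)
mask[:10, :10, :10] = 1
vol = mask_volume_ml(mask, spacing_mm=(1.0, 1.0, 2.0))
```

The hard part in practice is the segmentation itself; once a reliable mask exists, the volume (and, with an assumed density, the weight) follows directly.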
ML pipelines could be used to present information to optimally determine the cause of death, including differentiation between body cavity fluid accumulations (blood, pus, ascites) and their corresponding volumes, calculation of organ volumes and weights, percentage of coronary artery calcification, identification of subtle fractures, especially in critical areas such as the cervical spine, and determination of skeletal completeness and skeletal commingling after mass fatality incidents.

References

The artefacts of death: CT post-mortem findings
Imaging and virtual autopsy: looking back and forward
Postmortem CT angiography: capabilities and limitations in traumatic and natural causes of death
Post-mortem computed tomography angiography: past, present and future
Future prospects of forensic imaging
NiftyNet: a deep-learning platform for medical imaging
Advanced machine learning in action: identification of intracranial hemorrhage on computed tomography scans of the head with clinical workflow integration
Essentials of radiographic physics and imaging
Computed Tomography: Physical Principles, Clinical Applications, and Quality Control
The applicability of Dual-Energy Computed Tomography (DECT) in forensic odontology-A review
Magnetic Resonance Imaging: The Basics
Probabilistic liver atlas construction
Construction of an abdominal probabilistic atlas and its application in segmentation
Automated segmentation of the liver from 3D CT images using probabilistic atlas and multilevel statistical shape model
Automatic segmentation of the pelvic bones from CT data based on a statistical shape model
Interactive graph cuts for optimal boundary & region segmentation of objects in ND images
Hierarchical scale-based multiobject recognition of 3-D anatomical structures
Efficient multi-atlas abdominal segmentation on clinically acquired CT with SIMPLE context learning
Handcrafted vs. non-handcrafted features for computer vision classification
A Comparison of Texture Features Versus Deep Learning for Image Classification in Interstitial Lung Disease
Application of Linear Discriminant Analysis in Dimensionality Reduction for Hand Motion Classification
Classification of non-tumorous skin pigmentation disorders using voting based probabilistic linear discriminant analysis
Workload of radiologists in United States in 2006-2007 and trends since 1991-1992
Machine learning and radiology
Machine learning for audio, image and video analysis
Self-supervised learning for medical image analysis using image context restoration
Deep Reinforcement Learning: A Brief Survey
Predicting Long-Term Cognitive Outcome Following Breast Cancer with Pre-Treatment Resting State fMRI and Random Forest Machine Learning
Enhancing interpretability of automatically extracted machine learning features: application to a RBM-Random Forest system on brain lesion segmentation
Improving the prediction accuracy of heart disease with ensemble learning and majority voting rule. In: U-Healthcare Monitoring Systems
The Nature of Statistical Learning Theory
Classification of magnetic resonance brain images using wavelets as input to support vector machine and neural network
Medical image analysis with artificial neural networks
Automated age estimation from hand MRI volumes using deep learning
Classification assessment methods
Efficient 3D dental identification via signed feature histogram and learning keypoint detection
Computer vision methods for cranial sex estimation
Machine learning of brain gray matter differentiates sex in a large forensic sample
Sex estimation: Anatomical references on panoramic radiographs using Machine Learning. Forensic Imaging
Automated age estimation from MRI volumes of the hand
Forensic age estimation for pelvic X-ray images using deep learning
Classification based on the presence of skull fractures on curved maximum intensity skull projections by means of deep learning
Automatic detection of hemorrhagic pericardial effusion on PMCT using deep learning-a feasibility study
ImageNet classification with deep convolutional neural networks
Guidelines for autopsy investigation of sudden cardiac death: 2017 update from the Association for European Cardiovascular Pathology
Evaluation of unenhanced post-mortem computed tomography to detect chest injuries in violent death
Pixel-based machine learning in medical imaging
Automatic segmentation of neonatal brain MRI using atlas based segmentation and machine learning approach. In: MICCAI Grand Challenge: Neonatal Brain Segmentation
Deep convolutional neural networks for multimodality isointense infant brain image segmentation
Automated brain-tissue segmentation by multi-feature SVM classification
Automatic segmentation of MR brain images with a convolutional neural network
Deep convolutional neural network for segmenting neuroanatomy
Machine learning on brain MRI data for differential diagnosis of Parkinson's disease and Progressive Supranuclear Palsy
Early diagnosis of Alzheimer's disease based on partial least squares, principal component analysis and support vector machine using segmented MRI images
Classification of brain tumor type and grade using MRI texture and shape in a machine learning scheme
Machine learning based brain tumour segmentation on limited data using local texture and abnormality
U-net: Convolutional networks for biomedical image segmentation
Sequential neural networks for biologically-informed glioma segmentation
The diagnostic odds ratio: a single indicator of test performance
A hybrid machine learning approach to cerebral stroke prediction based on imbalanced medical dataset
Data imbalance in classification: Experimental evaluation
Combining generative and discriminative representation learning for lung CT analysis with convolutional restricted Boltzmann machines
Automatic segmentation of lung nodules with growing neural gas and support vector machine
Computer-aided classification of lung nodules on computed tomography images via deep learning technique
Lung nodule classification using deep features in CT images
The American College of Radiology Lung Imaging Reporting and Data System: potential drawbacks and need for revision
Comparing two classes of end-to-end machine-learning models in lung nodule detection and classification: MTANNs vs
COVID-19) Classification using CT Images by Machine Learning Methods. arXiv e-prints
COVID-19) Using Quantitative Features from Chest CT Images
Fully automatic segmentation of wrist bones for arthritis patients
Marginal Space Learning. In: Marginal Space Learning for Medical Image Analysis: Efficient Detection and Segmentation of Anatomical Structures
Fully automated deep learning system for bone age assessment
Multiple classification system for fracture detection in human bone X-ray images
Multi-stage osteolytic spinal bone lesion detection from CT data with internal sensitivity control
Machine learning based analytics of micro-MRI trabecular bone microarchitecture and texture in Type 1 Gaucher disease
Diagnosis of osteoarthritis and prognosis of tibial cartilage loss by quantification of tibia trabecular bone from MRI
Osteoporosis diagnosis using fractal analysis and support vector machine
Deep features learning for medical image analysis with convolutional autoencoder neural network
Active learning based intervertebral disk classification combining shape and texture similarities
Interactive machine learning for health informatics: when do we need the human-in-the-loop?
Radiologist-level pneumonia detection on chest X-rays with deep learning
ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
Grad-CAM: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks
Visual interpretability for deep learning: a survey
An Investigation of Interpretable Deep Learning for Adverse Drug Event Prediction
Hierarchical 3D fully convolutional networks for multi-organ segmentation
Automatic detection of abnormal vascular cross-sections based on density level detection and support vector machines
Training and validating a deep convolutional neural network for computer-aided detection and classification of abnormalities on frontal chest radiographs
Multi-scale deep networks and regression forests for direct bi-ventricular volume estimation
3D surface and body documentation in forensic medicine: 3-D/CAD Photogrammetry merged with 3D radiological scanning
Deep learning methods to guide CT image reconstruction and reduce metal artifacts
Post-mortem CT and MRI: appropriate post-mortem imaging appearances and changes related to cardiopulmonary resuscitation
The human skeleton in forensic medicine
3D reconstruction of incomplete archaeological objects using a generative adversarial network
Reverse engineering-rapid prototyping of the skull in forensic trauma analysis
Fragmented skull modeling using heat kernels
Virtual reconstruction of paranasal sinuses from CT data: A feasibility study for forensic application
Case Study: 3D Application of the Anatomical Method of Forensic Facial Reconstruction
Development of three-dimensional facial approximation system using head CT scans of Japanese living individuals
Standardizing Data from the Dead

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.