key: cord-0294003-yl6ertbs
authors: Arnaout, R.; Curran, L.; Zhao, Y.; Levine, J.; Chinn, E.; Moon-Grady, A.
title: Expert-level prenatal detection of complex congenital heart disease from screening ultrasound using deep learning
date: 2020-06-24
journal: nan
DOI: 10.1101/2020.06.22.20137786
sha: 34ce19453fb285ce50c6af7aff5368d4fdc57a93
doc_id: 294003
cord_uid: yl6ertbs

All datasets were obtained and de-identified, with waived consent, in compliance with the Institutional Review Board (IRB) at the University of California, San Francisco (UCSF) and the IRB at Boston Children's Hospital.

Inclusion, exclusion, and definitions of normal and CHD. Fetal echocardiograms and fetal surveys (second-trimester obstetric anatomy scans performed by sonographers, radiologists, and/or maternal-fetal-medicine physicians) performed between 2000 and 2019 were utilized. Images came from GE (67%), Siemens (27%), Philips (5%), and Hitachi (<1%) ultrasound machines. Inclusion criteria were fetuses of 18-24 weeks of gestational age. Fetuses with significant non-cardiac malformations (e.g. congenital diaphragmatic hernia, congenital airway malformation, congenital cystic adenomatoid malformation, meningomyelocele) were excluded. Gold-standard definitions of normal vs. CHD were made as follows. CHD pathology was determined by review of the clinical report as well as visual verification of the CHD lesion on each ultrasound by clinician experts (Drs. Moon-Grady, Levine, and Zhao, with over 60 years of combined experience in fetal cardiology). Additionally, for studies performed in or after 2012, we were able to validate the presence, absence, and type of cardiac findings in the ultrasound studies with electronic health record codes for CHD in the resulting neonates (ICD-9 codes 745*, 746*, and 747*; ICD-10 codes Q2*; and ICD procedure codes 02*, 35*, 36*, 37*, 38*, 88*, and 89*). Studies where clinician experts did not agree on the lesion and no postnatal diagnosis was available were not included.
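The EHR validation described above is a wildcard (prefix) match on diagnosis codes. A minimal sketch, assuming string-valued ICD codes; the helper name, the handling of dotted codes, and the omission of the procedure-code prefixes are illustrative assumptions, not the authors' implementation:

```python
# Illustrative sketch of prefix-based CHD code matching; names and
# dotted-code handling are assumptions, not the authors' code.
CHD_ICD_PREFIXES = (
    "745", "746", "747",  # ICD-9 diagnosis codes for CHD
    "Q2",                 # ICD-10 diagnosis codes for CHD
)

def has_chd_code(codes):
    """True if any neonatal ICD code matches a CHD prefix."""
    return any(code.replace(".", "").startswith(prefix)
               for code in codes
               for prefix in CHD_ICD_PREFIXES)
```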
Normal fetal hearts were defined as negative for structural heart disease, fetal arrhythmia, maternal diabetes, maternal lupus, maternal Sjögren syndrome, presence or history of abnormal nuchal translucency measurement, and non-cardiac congenital malformations.

[medRxiv preprint, this version posted June 24, 2020 (doi: 10.1101/2020.06.22.20137786). The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.]

Study design, training, and test sets. We analyzed images from a retrospective cohort. The total number of CHD echocardiograms, and the need to limit class imbalance between normal and CHD studies in training, were constraints guiding development of training and test datasets (Figure S1d). We first took all fetal echocardiograms with CHD fitting the inclusion/exclusion criteria above (437 studies). To reduce class imbalance in training, we then took a sample of normal fetal echocardiograms (875 studies) such that CHD comprised ~30 percent of the dataset. From this overall UCSF dataset, we created UCSF training and test sets as follows. We found those fetal echocardiograms which had a corresponding fetal survey in the UCSF system; a random sample of ~10 percent from each lesion class made up FETAL-125 and OB-125 (corresponding echocardiograms and surveys, respectively, from the same patients). FETAL-125 comprised 11,445 normal images and 8,377 abnormal images; OB-125 comprised 220,990 normal images and 108,415 abnormal images. The remaining ~90 percent of fetal echocardiograms (1,187) were used for training, supplemented by 139 normal fetal surveys (1,326 studies total). For a population-based UCSF testing set, we started with OB-125 and added 3,983 additional normal fetal surveys such that the CHD lesions in OB-125 comprised 0.9% of an overall dataset totaling 4,108 surveys. The result was OB-4000, which comprised 4,473,852 images. As an external testing set, we received 423 fetal echocardiograms (4,389 images from 32 normal studies and 40,123 images from 391 abnormal studies) from Boston Children's Hospital. These training and test sets are summarized in Table 1 and Figure S1d. Separately, we obtained a test set of 10 twin ultrasounds between 18-24 weeks of gestational age (5,754 echocardiogram images, 36,355 fetal survey images). Eight sets of twins had normal hearts; one set had one normal and one TOF heart; and one set had one normal and one HLHS heart. For all trainings, roughly equal proportions of data classes were used. Every image frame of the training set, FETAL-125, OB-125, and BCH-400 was view-labeled by clinician experts (approximately 20% of the dataset was independently scored by both labelers to ensure agreement). Because OB-4000 was too large for this approach, experts instead verified only that the top five predictions from the view classifier did in fact contain views of interest before that study underwent diagnostic classification. To maintain sample independence, training and test sets did not overlap by image, patient, or study. DICOM-formatted images were deidentified as previously described 26.
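The class-imbalance targeting described above (sampling normal studies so that CHD makes up roughly a third of the training pool) reduces to simple arithmetic; `normals_needed` is a hypothetical helper for illustration, not the authors' code:

```python
def normals_needed(n_chd, chd_fraction=1/3):
    """Number of normal studies to sample so that CHD studies make
    up roughly `chd_fraction` of the combined training pool.
    Illustrative helper; name and default fraction are assumptions."""
    return round(n_chd * (1 - chd_fraction) / chd_fraction)
```

With the 437 CHD echocardiograms above and a target fraction of one third, this gives 874, in line with the 875 normal studies actually sampled.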
Axial sweeps of the thorax were split into constituent frames at 300-by-400-pixel resolution. For view classification tasks, images were labeled as 3-vessel trachea (3VT), 3-vessel view (3VV), apical 5-chamber (A5C), apical 4-chamber (A4C), or abdomen (ABDO). A sixth category, non-target (NT), comprised any fetal image that was not one of the five cardiac views of interest. For disease classification tasks, studies were labeled as normal or as one of the CHD lesions mentioned above. For input into classification networks, each image was cropped to 240 x 240 pixels, downsampled to 80 x 80 pixels, and scaled with respect to greyscale value (rescale intensity). For input into segmentation networks, images were cropped to 272 x 272 pixels and scaled with respect to greyscale value. All preprocessing steps made use of the open-source Python libraries OpenCV (https://opencv.org/), scikit-image (https://scikit-image.org/), SciPy (https://www.scipy.org/), and NumPy (https://numpy.org). For training fetal structural and functional measurements, OpenCV was used to label the thorax, heart, right atrium, right ventricle, left atrium, left ventricle, and spine on A4C images.

Classification models. Classification models were based on the ResNet architecture 27, with the following modifications. For view classification, batch size was 32 samples and training was over 175 epochs using the Adam optimizer and an adaptive learning rate (0.0005 for epochs 1-99; 0.0001 for epochs 100-149; and 0.00005 for epochs 150+). Dropout of 50% was applied prior to the final fully-connected layer.
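The classification-input preprocessing described above (center crop to 240 x 240, downsample to 80 x 80, intensity rescale) can be sketched with NumPy alone. The paper used OpenCV/scikit-image; the center-crop placement and block-averaging downsample here are assumptions, since the exact crop placement and resampling method are not specified:

```python
import numpy as np

def preprocess_for_classification(img, crop=240, out=80):
    """Center-crop a 2-D greyscale frame, downsample by block
    averaging, and rescale intensities to [0, 1].
    A minimal NumPy stand-in for the OpenCV/scikit-image pipeline."""
    h, w = img.shape
    top, left = (h - crop) // 2, (w - crop) // 2
    img = img[top:top + crop, left:left + crop]
    k = crop // out  # crop must be an integer multiple of out
    img = img.reshape(out, k, out, k).mean(axis=(1, 3))
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)
```

Applied to a 300 x 400 sweep frame, this returns an 80 x 80 array scaled to [0, 1].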
Data were augmented at run-time by randomly applying rotations of up to 10 degrees, width and height shifts of up to 20 percent of total length, zooms of up to 50 percent, and vertical/horizontal flips. For diagnostic classification, transfer learning was applied to the previously described view classification model as follows: the first 18 layers were frozen. Additional training used the above settings, except that epochs ranged from 12 to 60, the learning rate was constant for each model (no adaptive learning), and learning rates ranged from 0.00001 to 0.0001. The loss function was categorical cross-entropy (view classifier) or binary cross-entropy (diagnostic classifiers). Classification network architecture is shown in Figure S1a. Training and validation datasets in which view labels were randomized were used as a negative control, resulting in an F-score commensurate with random chance among classes.

Segmentation model. A4C images with clinician-labeled cardiothoracic structures (thorax, heart, spine, and each of the four cardiac chambers) were used as training inputs to a U-Net 28 neural network architecture with modifications as in Figure S1b. Two different models were trained to detect (i) the heart, spine, and thorax, and (ii) the four cardiac chambers. Batch size was 2, models were trained for 300-500 epochs, and an Adam optimizer was used with adaptive learning rates of 0.0001 to 0.00001. For data augmentation, width/height shift was set at 20 percent, zoom at 15 percent, and random rotations of up to 25 degrees and horizontal/vertical flips were used. The loss function was categorical cross-entropy.

Framework and training and prediction times. All models were implemented in Python using Keras 29 (https://keras.io/) with a TensorFlow (https://www.tensorflow.org/) backend. Trainings were performed on Amazon's EC2 platform on a p2.xlarge GPU instance and took about 1.95 to 5 hours for segmentation models and from 6 minutes to 4.6 hours for classification models. Prediction times per image averaged 3 ms for classification and 50 ms for segmentation on a standard laptop (2.6 GHz Intel core, 16 GB RAM).

Use of prediction probabilities in classification. For each classification decision on a given image, the model calculates a probability of the image belonging to each of the possible output classes; by default, the image is assigned to the class with the highest probability. In certain testing scenarios, a threshold of acceptable prediction probability was applied to view classifications: namely, for OB-4000 "high confidence" views, diagnostic classification was performed only on images with view prediction probabilities greater than the first quartile for each view, and for OB-125 "low-quality" views, views with a model-predicted probability ≥0.9 but that human labelers did not choose as diagnostic quality were used (Results, Table S1). A probability threshold for diagnostic classifications was also used in the rules-based composite diagnostic classifier, described below.

Quantification of cardiothoracic ratio, chamber fractional area change, and cardiac axis. Cardiothoracic ratio was measured as the ratio of the heart circumference to the thorax circumference. Fractional area change for each of the four cardiac chambers was calculated from the segmented chamber areas over the cardiac cycle. Concordance of predicted quantitative measurements with ground-truth measurements was assessed.
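The per-view first-quartile filter described above for "high confidence" views can be sketched as follows (function and argument names are illustrative, not the authors' API):

```python
import numpy as np

def high_confidence_mask(probs, views):
    """Keep images whose view-prediction probability is at or above
    the first quartile of probabilities for their predicted view.
    A sketch of the per-view quartile filter described above."""
    probs, views = np.asarray(probs), np.asarray(views)
    keep = np.zeros(len(probs), dtype=bool)
    for view in np.unique(views):
        sel = views == view
        q1 = np.percentile(probs[sel], 25)  # first quartile, this view
        keep[sel] = probs[sel] >= q1
    return keep
```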
Overall accuracy, per-class accuracy, average accuracy, confusion matrices, F-scores, receiver operating characteristics, C-statistics, and saliency maps (guided backpropagation) were calculated as previously described 26,30. Grad-CAM was also used as previously described 31. For performance analysis of segmentation models, Jaccard similarities were calculated in the standard fashion as the intersection of predicted and labeled structures divided by their union. Clinicians with expertise in fetal cardiology (fetal cardiology and maternal-fetal medicine attendings, experienced fetal cardiology sonographers, and fetal cardiology fellows; n=7) were shown up to one image per view for the studies in the OB-125 test set and asked whether each study was normal or not. For segmentation, clinical labelers segmented a subset of images multiple times, and intra-labeler Jaccard similarities were calculated as a benchmark. Use of clinicians for validation was deemed exempt research by the UCSF CHR.

Due to the sensitive nature of patient data (and especially fetuses as a vulnerable population), we are not able to make these data publicly available at this time. ResNet and U-Net are publicly available (e.g. at https://keras.io/examples/cifar10_resnet/ and https://github.com/zizhaozhang/unet-tensorflow-keras/blob/master/model.py) and can be used with the settings described above and in Figure S1. Additional code will be available upon peer-reviewed publication at https://github.com/ArnaoutLabUCSF/cardioML.
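The segmentation metric above (intersection of predicted and labeled structures divided by their union) is a one-liner on binary masks; a minimal sketch:

```python
import numpy as np

def jaccard(pred, label):
    """Jaccard similarity of two binary masks:
    |intersection| / |union|."""
    pred, label = np.asarray(pred, bool), np.asarray(label, bool)
    union = np.logical_or(pred, label).sum()
    if union == 0:
        return 1.0  # both masks empty; treat as perfect agreement
    return float(np.logical_and(pred, label).sum() / union)
```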
To test whether DL can improve fetal CHD detection, utilizing multi-modal imaging and experts in fetal cardiology, we implemented an ensemble of neural networks (Figure 1b) to (i) identify five diagnostic-quality, guidelines-recommended cardiac views (Figure 1a) from among all images in a fetal ultrasound (survey or echocardiogram), (ii) use these views to classify normal vs. any of 16 complex CHD lesions (Table 1), and (iii) calculate cardiothoracic ratio (CTR), cardiac axis (CA), and fractional area change (FAC) for each cardiac chamber. To train the various components in the ensemble, up to 107,823 images from 1,326 studies were used (Table 1). Several independent test datasets were used for evaluating model performance (Table 1).

Identifying the five views of the heart recommended in fetal CHD screening 12 -3-vessel-trachea (3VT), 3-vessel view (3VV), apical-5-chamber (A5C), apical-4-chamber (A4C), and abdomen (ABDO)-was a prerequisite for diagnosis. We therefore trained a convolutional neural network 27 (Figure S1a) view classifier ("DL view classifier", Figure 1b) to pick the five screening views from fetal ultrasound, where any image that was not one of the five guidelines-recommended views was classified as "non-target" (NT; e.g. head, foot, placenta).
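The ensemble flow above (classify views first, then diagnose from target views only) can be sketched as below. The model interfaces, the confidence threshold, and the probability averaging are illustrative stand-ins; the paper's composite classifier is rules-based, and its rules are not reproduced here:

```python
TARGET_VIEWS = {"3VT", "3VV", "A5C", "A4C", "ABDO"}

def ensemble_predict(images, view_model, diag_model, min_view_prob=0.0):
    """Classify each frame's view, keep confident target views, then
    aggregate per-image CHD probabilities over the kept views.
    Hypothetical interfaces: view_model(img) -> (view, probability);
    diag_model(view, img) -> probability of CHD."""
    kept = []
    for img in images:
        view, prob = view_model(img)
        if view in TARGET_VIEWS and prob >= min_view_prob:
            kept.append((view, img))
    probs = [diag_model(view, img) for view, img in kept]
    return sum(probs) / len(probs) if probs else None  # None: no views
```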
Training data were multi-modal, including both fetal echocardiograms, which naturally contain more and higher-quality views of the heart, and fetal surveys, which offer a full range of non-target images. Notably, only views of sufficient quality to be used for diagnosis (as deemed by expert labelers; see Methods) were labeled as target views for training.

We then tested the view classifier on OB-125 (Figure 2d, 2e). When diagnostic-quality target views were present, the view classifier found them with 90% sensitivity (95% CI, 90%) and 78% specificity (95% CI, 77-78%). Using only images with prediction probabilities at or above the first quartile, sensitivity and specificity increased to 96% and 92% (95% CI, 96% and 92-93%). Recommended views were not always present in each fetal survey and were more commonly present in normal studies (Figure 2f). The view classifier's greatest confusion was between 3VT and 3VV (Figure 2d), adjacent views that often cause clinical uncertainty as well 12,17,32.

To validate that the view classifier utilized clinically relevant features, we performed both saliency mapping and gradient-weighted class activation mapping (Grad-CAM) experiments 26,31 on test images to show the pixels (saliency mapping) or regions (Grad-CAM) most important to the model's classification.

We also wished to compare model performance on fetal surveys (OB-125) directly against that of clinicians. Therefore, we gave each the following test: one full-resolution image per view, only five images in total per heart (Figure 3f).
This test was chosen both to make the task feasible for humans and, given the potential regional variation in image acquisition protocols, to simulate a "lean protocol" in which only the minimal recommended views are acquired. Thirty-eight of the 125 fetal surveys (30%) in OB-125 contained all five views. On this test, the model achieved 88% sensitivity (95% CI, 47-100%) and 90% specificity (95% CI, 73-98%). Clinicians (n=7) achieved an average sensitivity of 86% (95% CI, 82-90%) and specificity of 68% (95% CI, 64-72%). The model was comparable to clinicians in sensitivity (p=0.3) and superior in specificity (p=0.04).

To validate that the model generalizes beyond the medical center where it was trained 36, we tested it on fetal echocardiograms from an unaffiliated, geographically remote medical center (BCH-400; Table 1). AUCs for view detection ranged from 0.95-0.99 (not shown). AUC for composite classification of normal vs. abnormal hearts was 0.89, despite a high prevalence of abnormal hearts in this test set (Figure 3e, Table S1).

Multifetal pregnancies have a higher risk of CHD than the general population 1. Therefore, a CHD detection model applicable to ultrasounds of twins and other multiples would be useful. Based on saliency mapping and Grad-CAM experiments (Figures 2g, 3g), we hypothesized our model could perform adequately on surveys of twins. We used our model to predict views and diagnoses for 10 sets of twins (n=20 fetuses), including TOF and HLHS. Sensitivity and specificity were 100% and 72%, respectively (Table S1).

Models should be robust to minor variation in image quality to be useful for a range of patients and medical centers. We therefore assessed model performance on images within OB-125 that expert clinicians did not label as high-quality views but that the model did classify as target
views (Figure 2d, 2f). We inspected these "false-positive" images directly and analyzed their prediction probabilities. Two thirds (66%) were in fact target views, but of lower quality (e.g. slightly off-axis, heavily shadowed) than those chosen by experts, and most (59%) of these low-quality target views had prediction probabilities ≥0.9 (Figure S3). Therefore, the model can appropriately detect target views of lower quality. We submitted these lower-quality target images for diagnostic prediction and found a sensitivity of 95% (95% CI, 83-99%) and specificity of 39% (95% CI, 28-50%). Thus, the ensemble model can make use of suboptimal images in fetal surveys to detect complex CHD, albeit with lower specificity.

As with view classification above, we performed several analyses to determine whether the diagnostic classifications were based on clinically relevant image features. We trained a set of per-view binary classifiers for each of the two most common lesions in our dataset-TOF and HLHS-and examined ROC curves, saliency maps, and Grad-CAMs. For TOF, AUCs were highest for the two views from which TOF is most easily clinically appreciable: 3VT and 3VV (Figure 3b). For HLHS, the 3VT, 3VV, A5C, and A4C views are all abnormal, consistent with the higher AUCs in Figure 3c. Saliency mapping and Grad-CAM highlighted pixels and image regions relevant to distinguishing these lesions from normal (Figure 3g). In clinical practice, reported sensitivity in detecting TOF and HLHS is as low as 50 and 30 percent, respectively 37. With our model, sensitivity was 71% for TOF and 89% for HLHS (specificity 89% and 92%; Table S1).

Biometric measurements aid in fetal CHD screening and diagnosis 12.
We therefore trained a modified U-Net 28 (Figure S1b, Methods) to find cardiothoracic structures in A4C images and used these segmented structures to calculate CTR, CA, and FAC for each cardiac chamber (Table 2, Figure 4). Normal, TOF, and HLHS hearts were represented in training and testing. Per-class Jaccard similarities measuring overlap of labeled and predicted segmentations are found in Table S2. Predictably, Jaccards were higher for more highly represented pixel classes (e.g., background) and were similar to intra-labeler Jaccards (range 0.53-0.98, mean 0.76). Example labels and predictions for segmented structures are shown in Figure 4.

Normal cardiothoracic circumference ratios range from 0.5-0.6 1. Mann-Whitney U (MWU) testing showed no statistical differences between clinically measured and labeled CTR for normal hearts, nor between labeled and model-predicted CTR. CTRs for TOF and HLHS hearts were normal, as previously reported 1. A normal cardiac axis is 45 ± 20 degrees 12. Consistent with the literature 38, mean cardiac axis was increased in TOF at 63 ± 16 degrees (range 54-80; p-value 0.007). CA for HLHS was not found in the literature, but model-predicted CA was 49 ± 2 degrees (range 33-72; p-value 0.04).

In addition to the five still-image views, it is best practice to also obtain a video of the A4C view to assess cardiac function 1. FAC quantifies this assessment.
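FAC from segmented chamber areas can be computed with the standard definition, (maximum area - minimum area) / maximum area, over the frames of a cardiac cycle; the paper does not print its exact formula, so this sketch assumes the standard one:

```python
def fractional_area_change(areas):
    """FAC from per-frame segmented areas of one chamber across a
    cardiac cycle: (max area - min area) / max area.
    Assumes the standard FAC definition; the paper's exact formula
    is not shown."""
    a_max, a_min = max(areas), min(areas)
    return (a_max - a_min) / a_max
```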
Taken together, the data show that fetal cardiothoracic biometrics can be derived from image segmentation, with good agreement with previously reported values and the potential to provide additional metrics not yet benchmarked.

With clear benefit to early diagnosis and treatment of CHD, and growing research on in utero interventions, the need for accurate, scalable fetal screening for CHD has never been stronger 40, while sensitivity and specificity for CHD detection remain low at centers and clinics worldwide 1. To address this, we investigated the impact of combining real-world fetal ultrasound and trusted clinical guidelines with cutting-edge deep learning to achieve expert-level CHD detection from fetal surveys, one of the most difficult diagnostic challenges in ultrasound. In over 4,000 fetal surveys (over 4 million images), the ensemble model achieved an AUC of 0.99. Deep learning has been used for various medical tasks 21,23,41, but to our knowledge, this is the first use of deep learning to approximately double community-level sensitivity and specificity on a global diagnostic challenge in a population-based test set. The model's performance and speed allow its integration into clinical practice as software onboard ultrasound machines to improve real-time acquisition and to facilitate telehealth approaches to prenatal care, which are sorely needed 9. As a key benefit, the view classifier could be used on its own to help ensure adequate view acquisition. For retrospectively collected images, the model could be used as standalone software where a user uploads a study and receives model-chosen views and diagnostic predictions.

Generalizability.
To ensure our model could work robustly in real-world settings, we used two-dimensional ultrasound and standard recommended fetal views rather than specialized or vendor-specific image acquisitions 42,43. Furthermore, we tested our model in a range of different scenarios and on different independent test datasets. Importantly, the model maintained high sensitivity on external imaging, on sub-optimal imaging, on imaging from fetal surveys and from fetal echocardiograms, and on datasets with community-level and with high CHD prevalence. Whereas a test dataset approximately 10% of the size of the training dataset has arisen as an informal rule of thumb for adequate testing in the data science community, we tested on over 350% of the number of studies in the training set, and over 4,000% of the number of images.

The prominence of the aorta, the right heart, and the stomach as distinguishing features among the five target views is both novel and sensible. A comparison of the different testing scenarios (Table S1) suggests that both the quality of images and the number of available images per study contribute to the best overall performance.

Novel approaches to training. As mentioned above, we incorporated two similar study types-fetal echocardiograms and fetal surveys-in a multi-modal approach to model training that harnessed more specialized imaging in service of improving performance on screening imaging. By feeding only target views into the diagnostic classifier step, we took a more data-efficient
approach to the diagnostic classifier compared to using the entire ultrasound. We also took a novel approach to addressing variation in image quality that relied on human experts to agree only on labeling diagnostic-quality images for training (in testing, the model analyzed all images). This approach economized on human capital, consolidating inter-expert agreement on diagnostic-quality images, while placing fewer constraints on model training, since some aspects that make an image low-quality to a human eye may not matter as much to a computer "eye" (image contrast is a good example). We found that prediction probability was an indirect representation of the model's quality assessment, and that using cutoffs for high-prediction-probability images improved model performance.

Diagnostic signals from small/lean datasets and rare diseases. While it is the most common birth defect, CHD is still relatively rare. Moreover, unlike modalities such as photographs 21,23, ECG 41, or chest X-ray, each ultrasound study contains thousands of image frames. Therefore, designing a model that could work on a large number of non-independent images from a relatively small subject dataset was an important challenge to overcome. Taken together, the strengths above allowed us to find diagnostic signals for rare diseases and allowed computational efficiency both in training and in subsequent predictions on new data, which is key to translating this work toward real-world and resource-poor settings where it is needed 44.
While 4,108 fetal surveys is a significant test set, especially when considering the size of each ultrasound, hundreds of millions of fetal surveys are performed annually at many thousands of medical centers and clinics worldwide. Therefore, expanded testing of the model prospectively and in multiple centers, including community/non-expert centers, will be important going forward. It will also be important to test the model on imaging that includes a range of non-cardiac malformations. Several small improvements in model algorithms, as well as more training data from more centers, may further boost performance and may allow for diagnosis of specific lesion types. Similarly, more training data for image segmentation, including segmenting additional CHD lesions, will improve segmentation model performance and allow those results to be integrated into the composite diagnostic classifier. Further clinical validation of segmentation-derived fetal biometrics will be needed, particularly where metrics on particular CHD lesions have not yet been described elsewhere. We look forward to testing and refining ensemble learning models in larger populations in an effort to democratize the expertise of fetal cardiology experts to providers and patients worldwide, and to applying similar techniques to other diagnostic challenges in medical imaging.
Tables

Table 1. Demographics of Training and Test Sets. In small groups where there is a dash, information was withheld to protect patient privacy.
Color key (figure caption fragment): purple, heart; red, left ventricle; pink, left atrium; blue, right ventricle; light blue, right atrium.

Table S1. Summary of diagnostic performance in different test cases. Test threshold chosen from the OB-4000§ ROC curve (Figure 3e) to optimize sensitivity. CHD prevalence is again shown to aid in interpretation of predictive values.

(Figure caption fragment:) … is too large to be drawn to scale.
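The Table S1 caption notes that CHD prevalence is shown to aid interpretation of predictive values. As a minimal sketch of why this matters (not taken from the paper; the sensitivity, specificity, and prevalence values below are hypothetical placeholders), positive and negative predictive values can be derived from sensitivity, specificity, and prevalence via Bayes' rule:

```python
# Illustration only: how predictive values depend on disease prevalence
# for a screening test with fixed sensitivity and specificity.

def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) computed via Bayes' rule."""
    tp = sensitivity * prevalence              # true positive rate in population
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    tn = specificity * (1 - prevalence)        # true negatives
    fn = (1 - sensitivity) * prevalence        # false negatives
    return tp / (tp + fp), tn / (tn + fn)

# Same hypothetical test at two prevalences: PPV falls sharply as the
# condition becomes rarer, even though sensitivity/specificity are unchanged.
for prev in (0.2, 0.004):  # e.g. a referral population vs. general screening
    ppv, npv = predictive_values(0.95, 0.95, prev)
    print(f"prevalence={prev:.3f}  PPV={ppv:.3f}  NPV={npv:.4f}")
```

This is why a screening model with fixed operating characteristics yields very different PPVs in a low-prevalence screening population than in a high-prevalence referral population, and why reporting prevalence alongside predictive values aids interpretation.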
Color key (figure caption fragment): normal fetal echocardiograms, dark blue; normal fetal surveys, light blue; CHD fetal echocardiograms, dark orange; CHD fetal surveys, light orange.

Figure S3. Model confidence on sub-optimal images. Examples of sub-optimal quality images (target views found by the model but deemed low-quality by human experts) are shown for each view, along with violin plots showing the prediction probabilities assigned to the sub-optimal target images (white dots signify the mean; thick black lines signify the 1st to 3rd quartiles).