key: cord-1011442-s8neq5x4 authors: Arntfield, R.; VanBerlo, B.; Alaifan, T.; Phelps, N.; White, M.; Chaudhary, R.; Ho, J.; Wu, D. title: Development of a deep learning classifier to accurately distinguish COVID-19 from look-a-like pathology on lung ultrasound date: 2020-10-15 journal: nan DOI: 10.1101/2020.10.13.20212258 sha: ff6b2e15d9396edd66e0fb823b4e65bf8bcc023b doc_id: 1011442 cord_uid: s8neq5x4

Objectives Lung ultrasound (LUS) is a portable, low-cost respiratory imaging tool but is challenged by user dependence and lack of diagnostic specificity. It is unknown whether the advantages of LUS implementation could be paired with deep learning techniques to match or exceed human-level diagnostic specificity among similar-appearing, pathological LUS images.

Design A convolutional neural network (CNN) was trained on LUS images with B lines of different etiologies. CNN diagnostic performance, as validated using a 10% data holdback set, was compared to surveyed LUS-competent physicians.

Setting Two tertiary Canadian hospitals.

Participants 600 LUS videos (121,381 frames) of B lines from 243 distinct patients with either 1) COVID-19 (COVID), 2) non-COVID acute respiratory distress syndrome (NCOVID), or 3) hydrostatic pulmonary edema (HPE).

Results The trained CNN performance on the independent dataset showed an ability to discriminate between COVID (AUC 1.0), NCOVID (AUC 0.934) and HPE (AUC 1.0) pathologies. This was significantly better than physician ability (AUCs of 0.697, 0.704, 0.967 for the COVID, NCOVID and HPE classes, respectively), p < 0.01.

Conclusions A deep learning model can distinguish similar-appearing LUS pathology, including COVID-19, that cannot be distinguished by humans. The performance gap between humans and the model suggests that subvisible biomarkers within ultrasound images could exist, and multi-center research is merited.

Lung ultrasound (LUS) is an imaging technique deployed by clinicians at the point-of-care to aid in the diagnosis and management of acute respiratory failure. With accuracy matching or exceeding chest X-ray (CXR) for most acute respiratory illnesses, 1-3 LUS additionally lacks the radiation and laborious workflow of computed tomography (CT). Further, as a low-cost, battery-operated modality, LUS can be delivered at large scale in any environment and is ideally suited for pandemic conditions. 4 B lines are the characteristic pathological feature on LUS, created by either pulmonary edema or non-cardiac causes of interstitial syndromes. The latter includes a broad list of conditions, including pneumonia, pneumonitis, acute respiratory distress syndrome (ARDS) and fibrosis. 5 While an accompanying thick pleural line is helpful in differentiating cardiogenic from non-cardiogenic causes of B lines, 6 reliable methods to differentiate non-cardiogenic causes from one another on LUS have not been established. Additionally, user-dependent interpretation of LUS contributes to wide variation in disease classification, 7,8 creating urgency for techniques that improve diagnostic precision and reduce user dependence. Deep learning (DL), a foundational strategy within present-day artificial intelligence (AI) techniques, has been shown to meet or exceed clinician performance across most visual fields of medicine. 9-11 Without cognitive bias or reliance on spatial relationships between pixels, DL ingests images as numeric sequences and evaluates for quantitative patterns that may reveal information that is unavailable to human analysis. 12
With CT and CXR research maturing, 13-15 LUS remains comparatively understudied with DL due to a paucity of organized, well-labelled LUS data sets and the seeming lack of rich information in its minimalistic, artifact-based images. In this study, we trained a neural network using LUS images of B lines from 3 different etiologies (hydrostatic pulmonary edema (HPE), ARDS and COVID-19). Using LUS-fluent physicians as a comparison, we sought to determine if subvisible features in LUS images are available to a DL model that would allow it to exceed human limits of interpretation.

After University of Western Ontario Research Ethics Board (REB 115723) approval, LUS exams performed at London Health Sciences Centre's 2 tertiary hospitals were identified within our database of over 100,000 point-of-care ultrasound exams. The curation and oversight of this archive have previously been described. 16 The goal of this study was to determine if a deep neural network could distinguish between the B line profiles of 3 different disease profiles, namely 1) hydrostatic pulmonary edema (HPE); 2) non-COVID ARDS (NCOVID) causes; and 3) COVID-19 ARDS (COVID). These profiles were chosen deliberately to challenge the neural network to classify images with obvious qualitative differences (HPE vs ARDS) and with no obvious differences (NCOVID vs COVID) between their B line patterns (Figure 1, Videos 1-3). The COVID class consisted of cases of COVID-19 confirmed by reverse-transcriptase polymerase chain reaction testing. The NCOVID class consisted of an assortment of causes: aspiration, community-acquired pneumonia, hospital-acquired pneumonia and viral pneumonias. Exams were conducted as part of patient encounters in the emergency department, intensive care unit and medical wards across the 2 hospitals.

Candidate exams for inclusion were identified by a sequential search of the finalized clinical reports in our database of LUS cases by 2 critical care physicians and ultrasound experts (RA, TA) (Figure 2). Videos in our dataset were acquired on a variety of ultrasound systems, predominantly with a phased array probe. Videos of the costophrenic region (which included solid abdominal organs, diaphragm, or other pleural pathologies such as effusions or trans-lobar consolidations) were excluded as 1) these regions did not contribute greatly to alveolar diagnoses, 2) this would introduce heterogeneity into the still image data, and 3) a trained clinician can easily distinguish between these pathologies and B lines. Duplicate studies were discarded to avoid overfitting. From each encounter, de-identified mp4 loops of B lines, 3-6 seconds in length with frame rates of 30-60 frames per second (depending on the ultrasound system), were extracted. As COVID was the newest class available to our database, its comparably smaller number of encounters governed the number of encounters we extracted for HPE and NCOVID. A balanced volume of data for each class is important to avoid the model over-training on a single image class and/or overfitting. The images used to train the model were all frames from the extracted LUS clips. Hereafter, a clip refers to a LUS video that consists of several frames, and an encounter is a set of one or more clips acquired during the same LUS examination.
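The full extraction and labelling pipeline is described in the Supplementary Appendix and the project's GitHub repository; purely as an illustration of the clip-to-frame step described above, the sketch below splits each mp4 loop into frames while keeping frames grouped by encounter. OpenCV is assumed, and the directory layout is hypothetical.

```python
# Illustrative sketch only: split de-identified mp4 loops into frames, grouped by
# class and encounter. Assumes a hypothetical layout clips/<class>/<encounter>/<clip>.mp4.
import os
import cv2

def extract_frames(clip_path: str, out_dir: str) -> int:
    """Write every frame of one LUS clip to out_dir as a PNG; return the frame count."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(clip_path)
    n = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"frame_{n:05d}.png"), frame)
        n += 1
    cap.release()
    return n

# Keeping frames grouped by encounter allows later train/test splits to be made
# at the encounter level rather than the frame level.
for cls in ("COVID", "NCOVID", "HPE"):
    for encounter in sorted(os.listdir(os.path.join("clips", cls))):
        enc_dir = os.path.join("clips", cls, encounter)
        for clip in sorted(os.listdir(enc_dir)):
            if clip.endswith(".mp4"):
                extract_frames(os.path.join(enc_dir, clip),
                               os.path.join("frames", cls, encounter, clip[:-4]))
```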
Preprocessing of each frame consisted of conversion to grayscale, followed by a script written by one of our team (JH) to scrub the image of extraneous information (index marks, logos, and manufacturer-specific user interface). See Supplementary Appendix for full details. Data augmentation techniques were applied to each batch of training data during training experiments to combat overfitting. Augmentation transformations included random zooming in/out by ≤ 10%, horizontal flipping, horizontal stretching/contracting by ≤ 20%, vertical stretching/contracting (≤ 5%), and bi-directional rotation by ≤ 10°.

In choosing an optimal architecture for our model, we investigated training custom implementations of feedforward CNNs and residual CNNs from scratch, as well as transfer learning methods. 17 Ultimately, the Xception architecture 18 achieved the highest performance among the custom and 7 common architectures evaluated. Individual preprocessed images were fed into the network as a tensor with dimensions 600×600×3. Although the images were originally greyscale, they were converted to RGB representation to ensure that the model input shape was compatible with the pre-trained weights. The output tensor of the final convolutional layer of the Xception model was subject to 2D global average pooling, resulting in a 1-dimensional tensor. Dropout at a rate of 0.6 was applied to introduce heavy regularization to the model and provided a noticeable reduction in overfitting. The final layer was a 3-node fully connected layer with softmax activation. The output of the model represents the probabilities that the model assigned to each of the 3 classes, all summing to 1.0. The argmax of this probability distribution was considered to be the model's decision. To further combat overfitting, early stopping was applied by halting training if the loss on the validation set did not decrease over the most recent 15 epochs. 20 For additional details on model selection, training, coding practice, our GitHub repository and the hardware used in this project, please see the Supplementary Appendix.

A modification of the holdout validation method was used to ensure that the model selection process was independent of the model validation. Our holdout approach began with an initial split that randomly partitioned all encounters into a training set and 2 test sets (henceforth referred to as the test-1 set and test-2 set). The distribution of encounters and frames after this split is shown in Table 1. It must be emphasized that all splits were conducted at the encounter level, not across all frames, thus ensuring that frames from the same clip did not appear in more than one partition. Test-1 was used to evaluate all of the candidate models so that a final model architecture and set of hyperparameters could be chosen. Test-2 was considered to be the holdout set, since it remained untouched throughout the model selection process and was only used during the final validation phase. A full account of validation methods can be found in the Supplementary Appendix. The final model performance was determined by its results on our holdback, independent dataset (test-2). The results were analyzed both at the individual frame level and at the encounter level.
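Before describing how predictions were aggregated and scored, the architecture and training configuration described above can be made concrete with a minimal Keras sketch: an ImageNet-pretrained Xception backbone, 2D global average pooling, dropout of 0.6 and a 3-node softmax head, trained with early stopping at a patience of 15 epochs. This is a sketch only; the augmentation ranges approximate those listed above, the optimizer, learning rate and batch size are assumptions, and the authors' actual code resides in their GitHub repository (see Supplementary Appendix).

```python
# Minimal sketch of the stated architecture and regularization; not the authors' exact code.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation approximating the stated ranges. Keras' width/height shifts stand in
# here for the horizontal/vertical stretching described in the text.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255.0,          # assumed input scaling
    zoom_range=0.10,              # random zoom in/out by <= 10%
    horizontal_flip=True,
    width_shift_range=0.20,
    height_shift_range=0.05,
    rotation_range=10,            # bi-directional rotation by <= 10 degrees
)

def build_model(input_shape=(600, 600, 3), n_classes=3):
    """Xception backbone -> global average pooling -> dropout 0.6 -> 3-way softmax."""
    base = tf.keras.applications.Xception(
        include_top=False, weights="imagenet", input_shape=input_shape
    )
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dropout(0.6)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs=base.input, outputs=out)

model = build_model()
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),   # assumed; not specified in the text
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Early stopping: halt if validation loss has not decreased over the last 15 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=15, restore_best_weights=True
)
# model.fit(train_flow, validation_data=val_flow, epochs=200, callbacks=[early_stop])
```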
Encounter-level results were obtained by averaging the classifier's predicted probabilities across all images within that encounter. We assessed the model's performance by calculating the area under the receiver operating characteristic curve (AUC), analyzing a confusion matrix, and calculating metrics derived from the confusion matrix.

Benchmarking human performance for comparison to our model was undertaken using a survey featuring a series of 25 lung ultrasound clips, sourced and labelled by consensus of 3 ultrasound fellowship-trained physicians (MW, TA, RA; see Supplementary Appendix for the complete survey). The survey was distributed to 100 LUS-trained acute care physicians from across Canada. Respondents were asked to identify the findings in a series of LUS loops according to the presence of B lines vs normal lung (A line pattern), the characteristics of the pleural line (smooth or irregular), and the cause of the lung ultrasound findings (hydrostatic pulmonary edema, non-COVID pneumonia or COVID pneumonia). Responses were compared to the true, expert-defined labels consistent with our data curation process described above. Since the data used for modelling did not include normal lungs, the four clips showing normal lungs were discarded from analysis. Any normal diagnoses for the remaining clips (37 of 1281 diagnoses) were replaced with diagnoses drawn uniformly at random from the remaining causes.

We utilized the Grad-CAM method to visually explain the model's predictions. 21 Grad-CAM involves visualizing the gradients of the prediction for a particular image with respect to the activations of the final convolutional layer of the CNN. A heatmap is produced that is upsampled to the original image dimensions and overlaid onto the original image. The resultant heatmap highlights the areas of the input image that were most contributory to the model's classification decision. The GitHub link to the code used to generate the DL model and the full survey data results can be found in our Supplementary Appendix. Patients or the public were not involved in the design, conduct, reporting or dissemination plans of this work.

The data extraction process resulted in 84 cases of COVID; as part of our effort to balance the groups for unbiased training, this led to 78 cases of NCOVID and 81 of HPE. Further characteristics of the data are summarized in Table 2. The benchmarking survey was completed by 61 physicians with a median of 3-5 years of ultrasound experience, the majority of whom had done at least a full, dedicated month of ultrasound training (80.3%) and described their comfort with LUS use as "very comfortable" (72.1%). See Supplementary Appendix for a full summary of survey data. The results of this survey highlight that the physicians were adept at distinguishing the HPE class of B lines from COVID and NCOVID causes of B lines. For the COVID and NCOVID cases, however, significant variation and uncertainty were demonstrated. See Table 3 and the "Comparing Human and Neural Networks" section below. The model's predictions were evaluated at both the image and encounter level.
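As a concrete illustration of this frame-to-encounter aggregation and scoring (using assumed variable names rather than the authors' code), the sketch below averages the per-frame softmax outputs within each encounter and then computes one-vs-rest AUCs and a confusion matrix.

```python
# Sketch of encounter-level aggregation and scoring; variable names are illustrative.
# frame_probs: (n_frames, 3) array of per-frame softmax outputs
# frame_encounters: length-n_frames list of encounter ids
# encounter_labels: dict mapping encounter id -> true class index
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

CLASSES = ["COVID", "NCOVID", "HPE"]

def encounter_predictions(frame_probs, frame_encounters):
    """Average predicted class probabilities over all frames of each encounter."""
    frame_encounters = np.asarray(frame_encounters)
    ids = sorted(set(frame_encounters.tolist()))
    probs = np.stack([frame_probs[frame_encounters == e].mean(axis=0) for e in ids])
    return ids, probs

def evaluate(frame_probs, frame_encounters, encounter_labels):
    ids, probs = encounter_predictions(frame_probs, frame_encounters)
    y_true = np.array([encounter_labels[e] for e in ids])
    # One-vs-rest AUC for each class, as reported for COVID, NCOVID and HPE.
    aucs = {cls: roc_auc_score((y_true == i).astype(int), probs[:, i])
            for i, cls in enumerate(CLASSES)}
    # The argmax of the averaged probabilities is the encounter-level decision.
    cm = confusion_matrix(y_true, probs.argmax(axis=1))
    return aucs, cm
```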
Formally, the prediction for an image is the probability vector output by the softmax layer, and the encounter-level prediction for class c is the average predicted probability for class c over the predictions for all images within that encounter. Encounter-level predictions were computed and presented to (1) replicate the method through which real-time interpretation (by clinician or machine) occurs with ultrasound, by aggregating images within one or more clips to form an interpretation, and (2) closely simulate a physician's classification procedure, since the physicians who participated in our benchmarking survey were given entire clips to classify.

Three models fit with our chosen architecture and set of hyperparameters were evaluated on test-1, achieving mean AUCs at the encounter level of 0.966 (COVID), 0.815 (NCOVID), and 0.902 (HPE). The model's ultimate ability was determined on the 10.1% of our images that constituted the holdback (test-2) data. On this independent data, the model demonstrated a strong ability to distinguish between the 3 relevant causes of B lines, with AUCs at the encounter level of 1.0 (COVID), 0.934 (NCOVID), and 1.0 (HPE), producing an overall AUC of 0.978 for the classifier. Confusion matrices on the test-2 set at the frame and encounter level (Table 3) show strong diagonals that form the basis of these results and the performance metrics seen in Table 4.

Since AUC measures a classifier's ability to rank observations, the raw survey data (in the form of classifications, not probabilities) was processed to permit an AUC computation by treating the physician-predicted probability of a LUS clip belonging to a specific class as the proportion of physicians that assigned the clip to that class. The AUCs for the physicians, at face value, were 0.697 (COVID), 0.704 (NCOVID), and 0.967 (HPE), leading to an overall AUC of 0.789 (as compared to 0.978 for our model). A comparison of the human and model AUCs is graphically displayed in Figure 3. We took note of the AUC of approximately 0.7 for the physicians when the positive class is COVID or NCOVID, as distinguishing between these classes is not known to be possible for humans. Examination of the raw confusion matrix data (Table 3) suggests near-random classification (which corresponds to an AUC of 0.5) between these 2 classes; see Supplementary Appendix for a complete explanation. Given the important implications of the performance gap observed, we employed an additional step of statistical validation for our findings through a Monte Carlo Simulation (MCS; see Supplementary Appendix for full details) of human performance, based on our survey results, across one million exposures to our test-2 data. 22 This simulation yielded an average AUC of 0.840 across all three classes, with very few cases matching or exceeding the performance of the CNN. Thus, we can conclude that our model exceeds human performance, and in particular that the model can distinguish between COVID and NCOVID (p < 0.01).

The Grad-CAM explainability algorithm was applied to the output from the model on the holdback data. The results are conveyed by color on the heatmap, overlaid on the test-2 input images.
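For illustration, one common way such a heatmap can be produced with Grad-CAM is sketched below, assuming the Keras model from the earlier sketch; the layer name shown is the final convolutional activation in Keras' Xception build, and this is not the authors' implementation.

```python
# Illustrative Grad-CAM sketch for the 3-class LUS classifier; not the authors' code.
import numpy as np
import tensorflow as tf
import cv2

LAST_CONV = "block14_sepconv2_act"  # final conv activation in Keras' Xception build

def grad_cam(model, image, class_index, last_conv_name=LAST_CONV):
    """Return a [0, 1] heatmap, resized to the input image, for one class's prediction."""
    grad_model = tf.keras.models.Model(
        model.inputs, [model.get_layer(last_conv_name).output, model.output]
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)   # gradient of class score w.r.t. activations
    weights = tf.reduce_mean(grads, axis=(1, 2))   # channel weights via global averaging
    cam = tf.reduce_sum(weights[:, tf.newaxis, tf.newaxis, :] * conv_out, axis=-1)[0]
    cam = tf.nn.relu(cam).numpy()
    cam /= cam.max() + 1e-8                        # normalize to [0, 1]
    return cv2.resize(cam, (image.shape[1], image.shape[0]))  # upsample to image size

# The heatmap can then be colour-mapped and alpha-blended over the original frame.
```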
Blue and red regions correspond to highest and lowest prediction importance, respectively. As the results seen in Figure 4 demonstrate, the key activation areas for all classes were centered around the pleura and the pleural line.

In this study, a deep learning model was successfully trained to distinguish the underlying pathology in similar point-of-care lung ultrasound images containing B lines. The model was able to distinguish COVID-19 from other causes of B lines and outperformed ultrasound-trained clinician benchmarks across all categories. Our results are the first of their kind to support that digital biomarker profiles may exist within lung ultrasound images.

Our model was developed using a data set of 243 patients (600 video loops/121,381 frames), which is modest by machine learning standards. Owing to the scarcity of labelled lung ultrasound data, this data volume does compare favorably to other published lung ultrasound work. 23-25 Given the implications of successfully classifying LUS images, it was essential for us to protect against overfitting. While many approaches exist to avoid an overfit model, we, in addition to multiple data augmentation techniques, reserved 10% of our data (test-2) as a holdback set not involved in model fitting or selection. This approach mimics the unbiased, generalizable performance desired of an image classifier and is familiar from other notable deep learning vision research in medicine. 9-11

Deep learning has shown similarly favorable results in recent CXR and CT studies of COVID-19. 26,27 Given that LUS image creation is fundamentally different (producing artifacts rather than anatomic images of the lung), it could not be assumed that our work with LUS would yield similarly strong results. The value of identifying such accuracy in a LUS model rests in the ability of LUS (unlike CT or CXR) to be delivered by limited personnel, at low cost and in any location. Lung ultrasound artifact analysis has existed for several years in some commercially available ultrasound systems and has also been described using various methods in the literature. 23,28,29 By automating the detection of canonical findings of LUS, these techniques are convenient and serve to achieve what clinicians may be trained to do with minimal training. 30 With attention to COVID-19, LUS has been shown to inform clinical course and outcome, 31 creating some further momentum toward broader LUS competence. As our work opens the door toward plausible early, automated COVID identification using LUS, DL techniques to auto-generate clinical severity scores for COVID have also recently been described. 24 The eventual integration of various DL models into ultrasound hardware seems plausible as a method to achieve real-time, point-of-care diagnosis and prognosis of COVID or other specific respiratory illnesses.

The implications of our work, at time of writing, are strongly attached to the current challenges and importance of COVID-19 diagnosis. Our results point to a unique, pixel-level signature within the COVID-19 image. Though the exact mechanism of distinction is unknown, the heat map results suggest that subvisible variations in the pleural line itself are most active in driving the model's performance. The precise taxonomical implications of our findings, whether they are driven by COVID-19, coronaviruses or viruses as a whole, will require additional research.
Our study has some important limitations. The first relates to the opaqueness that is implicit to deep neural networks. Despite using Grad-CAM, the decisions of the trained model are not outwardly justified; we are unable to critique its methods and must trust its predictions. Our benchmarking survey did not exactly replicate the questions posed to our neural network, which made our statistical analysis more complex than it might otherwise have needed to be. The other limitations of our study relate to data size and sources. Though our model performance signal was strong, the addition of further training data can only aid the generalizability of the model. Future work will focus on validation through multi-center collation of COVID LUS encounters. Lastly, our data were all from hospitalized patients and our results may not generalize to those who are less ill.

With strong performance in distinguishing lung ultrasound images of COVID-19 from mimicking pathologies, a trained neural network exceeded human interpretation ability and raises the possibility of disease-specific, subvisible features contained within lung ultrasound images. Further research using well-labelled, multi-center data is indicated.

We declare no competing interests.

Lung ultrasound for the diagnosis of pneumonia in adults: A meta-analysis
Relevance of lung ultrasound in the diagnosis of acute respiratory failure: the BLUE protocol
Trauma ultrasound examination versus chest radiography in the detection of hemothorax
COVID-19 outbreak: less stethoscope, more ultrasound
Lung B-line artefacts and their use
Chest sonography: A useful tool to differentiate acute cardiogenic pulmonary edema from acute respiratory distress syndrome. Cardiovasc Ultrasound
Lung ultrasound and B-lines quantification inaccuracy: B sure to have the right solution
Expert Agreement in the Interpretation of Lung Ultrasound Studies Performed on Mechanically Ventilated Patients
Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study
Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs
Deep learning outperformed 136 of 157 dermatologists in a head-to-head dermoscopic melanoma image classification task
Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning
Real-time electronic interpretation of digital chest images using artificial intelligence in emergency department patients suspected of pneumonia
Artificial Intelligence Distinguishes COVID-19 from Community Acquired Pneumonia on Chest CT
Deep learning Enables Accurate Diagnosis of Novel Coronavirus (COVID-19) with CT images. medRxiv [Internet]
The utility of remote supervision with feedback as a method to deliver high-volume critical care ultrasound training
Transfer learning with deep convolutional neural network for liver steatosis assessment in ultrasound images
Xception: Deep learning with depthwise separable convolutions
ImageNet: A large-scale hierarchical image database
Early Stopping - But When?
Visual Explanations from Deep Networks via Gradient-based Localization
An Introduction to MCMC for Machine Learning
Automated Lung Ultrasound B-line Assessment Using a Deep Learning Algorithm
Deep learning for classification and localization of COVID-19 markers in point-of-care lung ultrasound
Automatic Detection of COVID-19 From a New Lung Ultrasound Imaging Dataset (POCUS). arXiv [Internet]
Covid-19: automatic detection from X-ray images utilizing transfer learning with convolutional neural networks
Lung Ultrasound in COVID-19 Pneumonia: Correlations with Chest CT on Hospital admission. Respiration
Quantitative lung ultrasonography: a putative new algorithm for automatic detection and quantification of B-lines. Crit Care
Computer-Aided Quantitative Ultrasonography for Detection of Pulmonary Edema in Mechanically Ventilated Cardiac Surgery Patients
Can Limited Education of Lung Ultrasound Be Conducted to Medical Students Properly? A Pilot Study

The authors would like to acknowledge the computational and technical support from CENGN (Canada's Centre of Excellence in Next Generation Networks), Mr. Matt Ross from the City of London, Mrs. Kristine Van Arsen from the Division of Emergency Medicine, and the clinician-sonographers at London Health Sciences Centre who faithfully record and annotate their lung ultrasound studies.