authors: Zhong, Aoxiao; Li, Xiang; Wu, Dufan; Ren, Hui; Kim, Kyungsang; Kim, Younggon; Buch, Varun; Neumark, Nir; Bizzo, Bernardo; Tak, Won Young; Park, Soo Young; Lee, Yu Rim; Kang, Min Kyu; Park, Jung Gil; Kim, Byung Seok; Chung, Woo Jin; Guo, Ning; Dayan, Ittai; Kalra, Mannudeep K.; Li, Quanzheng
title: Deep Metric Learning-based Image Retrieval System for Chest Radiograph and its Clinical Applications in COVID-19
date: 2020-11-26

In recent years, deep learning-based image analysis methods have been widely applied in computer-aided detection, diagnosis and prognosis, and have shown their value during the public health crisis of the novel coronavirus disease 2019 (COVID-19) pandemic. The chest radiograph (CXR) has played a crucial role in COVID-19 patient triage, diagnosis and monitoring, particularly in the United States. Considering the mixed and unspecific signals in CXR, an image retrieval model for CXR that provides both similar images and associated clinical information can be more clinically meaningful than a direct image diagnostic model. In this work we develop a novel CXR image retrieval model based on deep metric learning. Unlike traditional diagnostic models, which aim at learning a direct mapping from images to labels, the proposed model aims at learning an optimized embedding space of images, in which images with the same labels and similar contents are pulled together. It utilizes a multi-similarity loss with a hard-mining sampling strategy and an attention mechanism to learn the optimized embedding space, and provides similar images for a query image. The model is trained and validated on an international multi-site COVID-19 dataset collected from three different sources. Experimental results on COVID-19 image retrieval and diagnosis tasks show that the proposed model can serve as a robust solution for CXR analysis and patient management in COVID-19. The model is also tested for its transferability on a different clinical decision support task, in which the pre-trained model is applied to extract image features from a new dataset without any further training. These results demonstrate that our deep metric learning-based image retrieval model is highly efficient in CXR retrieval, diagnosis and prognosis, and thus has great clinical value for the treatment and management of COVID-19 patients.

In recent years, thanks to the combined advancement of computational power, the accumulation of high-quality medical image datasets, and the development of novel deep learning-based artificial intelligence (AI) algorithms, AI has been widely applied in radiology and clinical practice (Thrall et al., 2018). Content-based image retrieval (CBIR) is one such application: for example, in the reading of digital mammography, a mammogram retrieval system can provide radiologists with intuitive visual aids for easier diagnosis (Müller et al., 2004; Müller and Unay, 2017). We thus hypothesize that a CBIR system, which can achieve near real-time medical image retrieval from a massive, multi-site database for both physician/radiologist examination and computer-aided diagnosis, could be very helpful in dealing with the COVID-19 pandemic. A CBIR system provides visually and semantically relevant images from a database with labels matching the query image; the label or diagnosis of the matched images can thus provide a clue for the queried image. The key component of a CBIR system is the embedding of images, i.e. the
transformation of images from the native (Euclidean) domain to a more representative, lower-dimensional manifold, as an effective image representation enables more accurate and faster retrieval. Various image embedding methodologies specifically tailored to biomedical images have been proposed, including kernel methods such as hashing, and hand-crafted image filters such as filter banks (Foran et al., 2011) and SIFT (Kumar et al., 2016). Recent advancements in deep learning have also inspired CBIR systems based on deep neural networks (Wan et al., 2014), such as CNNs for classification (Qayyum et al., 2017) and deep autoencoders (Çamlica et al., 2015), which have shown superior performance over other methods. However, the current deep learning-based scheme of directly learning image representations (i.e. embeddings) from the relationship between image features and image labels may not be the optimal approach for the image retrieval task. As pointed out in (Khosla et al., 2020), compared with the cross-entropy loss that is widely adopted in current deep learning methods, a pair-wise contrastive loss can be more effective in leveraging label information. Thus, in recent years, metric learning-based CBIR systems for analyzing histopathological images have been developed (Yang et al., 2019). Traditional (non-deep-learning) metric learning methods have also been proposed for analyzing CT (Wei et al., 2017) and magnetic resonance imaging (MRI) images (Cheng et al., 2016). To the best of our knowledge, there are no such metric learning studies for CXR images in a clinical setting. To this end, we propose a deep learning-based CBIR system for analyzing chest radiographs, specifically images from potential COVID-19 patients. The core algorithm of the proposed model is deep metric learning with a multi-similarity loss and a hard-mining sampling strategy, used to learn a deep neural network that embeds CXR images into a low-dimensional feature space. The embedding module uses the backbone network structure of Resnet-50 (He et al., 2016). In addition, the proposed CBIR model features an attention branch using a spatial attention mechanism to extract localized embeddings and provide local visualization (i.e. an attention map) of the disease labels, in order to give visual guidance to readers and improve model performance. This design allows us to ensure both content- and semantic-similarity between the query images and the returned images. The model is trained and validated on a multi-site COVID-19 dataset consisting of 18,055 CXR images in total from three sources: the public open benchmark dataset COVIDx (Wang et al., 2020b), 5 hospitals from the Partners HealthCare system in MA, U.S., and 4 hospitals in Daegu, South Korea. Performance of the model is evaluated by its capability of retrieving the correct images and diagnosing the correct disease types. The proposed model is further evaluated by transferring it to a different task, where it is used to extract informative features from new, independently collected CXR images. The extracted features are then combined with electronic health record (EHR) features to predict the need for intervention within 72 hours, serving as a clinical decision support tool for COVID-19 management in the emergency department.
Key contributions of this work are summarized as follows: 1) we develop a CBIR system that includes a novel embedding model with a spatial attention mechanism, trained with an adjusted multi-similarity loss and a hard-mining sampling strategy; 2) in both the image retrieval and diagnosis tasks, the model achieves state-of-the-art performance and outperforms the Resnet-50 network, a widely applied method in medical image analysis; 3) the model shows high accuracy in a prognosis task, demonstrating its potential clinical value for many clinical decision support applications. In the workflow of our proposed CBIR system, for an incoming query CXR image, we first extract its low-dimensional feature embedding using a deep neural network trained with deep metric learning. The top-k images closest to the query image in the embedding space are then retrieved and displayed together with the associated electronic health record (EHR). The COVID-19 diagnosis of the query image can then be inferred from the labels of the retrieved images. Embeddings of CXR images can also be used for other purposes such as clinical decision support. An overview of the model pipeline is illustrated in Fig. 1; details of each step, especially the notation for network structures, can be found in sections 2.3 and 2.4.

Figure 1. Computational pipeline of the CXR image retrieval model in a COVID-19 diagnosis context.

In this study, we collected CXR images from 9 hospitals in 2 countries (5 hospitals from the Partners HealthCare system in the U.S. and 4 hospitals in South Korea), and combined them with the public COVIDx dataset to form a multi-site dataset for training and validation. At all three data sites, CXR images other than those in the anterior-posterior (AP) or posterior-anterior (PA) view (e.g. lateral views), and images with significant distortion from on-board postprocessing (e.g. strong edge enhancement), were excluded. Descriptions of the three data sites can be found below. It should be noted that "control" in this study is defined as patients with neither diagnosed pneumonia nor a positive PCR test result. We specifically include the "non-COVID pneumonia" type, which can be caused by a wide spectrum of pathogens including bacteria, viruses and fungi, because it produces patterns on CXR images similar to COVID-19, e.g. both demonstrate ground glass opacities and consolidation (Jacobi et al., 2020). In addition to the non-COVID pneumonia images in the COVIDx dataset, CXR images from a total of 212 patients diagnosed with non-COVID pneumonia and admitted to the Partners HealthCare system during the study period were collected and included in the dataset. A brief summary and basic demographic information of this multi-site dataset can be found in Table 1. The Korean dataset (Shim et al., 2020) contains a total of 3,262 CXR images from hospitalized COVID-19 patients. All images are preprocessed by windowing and lung segmentation. The major reason for including lung segmentation in the preprocessing is to prevent the model from learning to distinguish the source of the data from features such as text markers placed on the CXRs, since the data collected from the different sites have imbalanced label distributions. The whole lung region is automatically segmented by an ensemble of five deep neural networks. These networks share the same backbone structure of EfficientNet (Tan and Le, 2019) but differ in architecture variants and parameters. The ensemble segmentation model is trained on an MGH dataset with 100 annotated CXRs and two public datasets: the tuberculosis CXRs from Montgomery County (Jaeger et al., 2014), and the Shenzhen and JSRT (Japanese Society of Radiological Technology) CXRs (Shiraishi et al., 2000).
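The text does not specify how the five networks' outputs are combined; a minimal sketch of one plausible ensembling scheme is given below, where averaging the per-pixel lung probabilities and thresholding at 0.5 are our assumptions for illustration.

```python
import torch

def ensemble_lung_mask(cxr, models, threshold=0.5):
    """Average the lung probabilities of several segmentation networks
    and binarize them into a single lung mask.

    cxr:    (1, 1, H, W) preprocessed chest radiograph tensor
    models: iterable of networks, each mapping the image to per-pixel
            lung logits of shape (1, 1, H, W)
    """
    with torch.no_grad():
        probs = [torch.sigmoid(m(cxr)) for m in models]  # per-model probabilities
        mean_prob = torch.stack(probs).mean(dim=0)       # soft ensemble average
    return (mean_prob > threshold).float()               # binary lung mask
```

Morphological cleanup of the resulting mask could follow, but is omitted here.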
Denote a set of data (x, y), where x is the CXR image of one patient and y is the patient's label. In this work, the label is a ternary value indicating whether the patient is from the control group, has non-COVID pneumonia, or has COVID-19. Our goal is to learn a function f : X → ℝ^d that embeds a given CXR image into a d-dimensional embedding feature space such that: 1) semantically identical images (i.e. images with the same label) are closer in the embedded space, and vice versa; 2) patients with similar image content, especially around lesion regions related to the disease, are closer in the embedded space. We employ a contrastive learning scheme to find such a non-linear embedding, implemented as a deep neural network parameterized by θ. It has been reported in previous literature that learning representations by contrasting positive pairs against negative pairs can be more advantageous than learning a direct mapping from data to labels, with improved robustness and stability (Hadsell et al., 2006). To achieve these two goals, we adopt a metric learning scheme that trains the network on paired images with a multi-similarity loss between the image pairs. We also exploit a spatial attention mechanism to focus the model on potential lesion regions. Attention mechanisms allow salient features to be dynamically brought to the forefront as needed (Xu et al., 2015) and have been widely used in applications such as image segmentation (Fu et al., 2019) and classification (Wang et al., 2017a). In this work, we use the cosine similarity S between embedded features to measure the similarity between pairs of images:

S_ij = ⟨f(x_i), f(x_j)⟩ / (‖f(x_i)‖₂ ‖f(x_j)‖₂),    (1)

where f is the embedding function we aim to learn. Following common practice in metric learning, we normalize the embeddings at the end, letting ‖f(x)‖₂ = 1 for all x, so that S_ij reduces to the inner product of the embeddings. We employ the multi-similarity loss for the "paired metric learning" step in Fig. 1, which has achieved state-of-the-art performance on several image retrieval benchmarks. The loss function L, adjusted to our setting, is:

L = (1/m) Σ_{i=1}^{m} { (1/α) log[1 + Σ_{k∈P_i} exp(−α(S_ik − λ))] + (1/β) log[1 + Σ_{k∈N_i} exp(β(S_ik − λ))] },    (2)

where P_i and N_i are the index sets of the selected "same type" (i.e. same label) and "different type" (i.e. different label) pairs with respect to the anchor image x_i, m is the batch size, and α, β, λ are hyperparameters. For each minibatch during training, we randomly select N samples from each class, forming a minibatch of size T×N, where T is the number of classes. Every two samples in the batch can be used as a pair in the calculation of the loss function. Training with random sampling may harm the capacity of the model and slow convergence (Wu et al., 2017), since pair-based metric learning often generates a large number of sample pairs that include uninformative easy or redundant pairs. We use a hard-mining strategy to improve model performance and speed up training convergence: each "same type"/"different type" pair is compared with the hardest pairs in the whole batch to mine the hard pairs, as performed in (Wang et al., 2019).
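As a concrete illustration of Eq. 2 with the hard-mining step, the PyTorch sketch below computes the multi-similarity loss over a minibatch; the mining margin eps follows the default of the original multi-similarity loss paper and is an assumption here, since our text does not specify it.

```python
import torch
import torch.nn.functional as F

def multi_similarity_loss(embeddings, labels, alpha=2.0, beta=20.0, lam=0.5, eps=0.1):
    """Multi-similarity loss with pair mining (Wang et al., 2019).

    embeddings: (m, d) tensor; L2-normalized so that the dot product
                equals the cosine similarity of Eq. 1.
    labels:     (m,) integer class labels.
    """
    emb = F.normalize(embeddings, dim=1)
    sim = emb @ emb.t()                      # pairwise cosine similarities S
    m = sim.size(0)
    losses = []
    for i in range(m):
        pos = labels == labels[i]
        pos[i] = False                       # exclude the anchor itself
        neg = labels != labels[i]
        if not pos.any() or not neg.any():
            continue
        s_pos, s_neg = sim[i][pos], sim[i][neg]
        # hard-pair mining: keep positives harder than the hardest negative
        # (minus a margin) and negatives harder than the easiest positive
        keep_pos = s_pos < s_neg.max() + eps
        keep_neg = s_neg > s_pos.min() - eps
        if not keep_pos.any() or not keep_neg.any():
            continue
        pos_term = torch.log1p(torch.exp(-alpha * (s_pos[keep_pos] - lam)).sum()) / alpha
        neg_term = torch.log1p(torch.exp(beta * (s_neg[keep_neg] - lam)).sum()) / beta
        losses.append(pos_term + neg_term)   # per-anchor contribution to Eq. 2
    return torch.stack(losses).mean() if losses else sim.new_zeros(())
```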
A spatial attention mechanism is adopted in our embedding model to obtain disease-localized embeddings of the patients and to provide interpretable output at the image retrieval stage. Specifically, an attention module is plugged into the network in parallel with the feature extraction route, generating a mask with values between 0 and 1 and the same spatial dimension as the network's intermediate feature map. The attention route in Fig. 1 illustrates how the attention module is plugged into the backbone network. Element-wise multiplication is performed between the output attention mask and the intermediate feature map of the network to obtain a localized feature map, which is then sent to the projection head to produce the final embedding. Without the attention module, the embedding function can be written as:

f(x) = g(f₂(f₁(x))),    (3)

where f₁ and f₂ are different stages of the feature extractor (i.e. convolutional layers) and g is the projection head, which projects the representations into a lower-dimensional embedding space, shown as the correspondingly lettered blocks in Fig. 1. As the embedding serves as input to the subsequent metric learning module, the projection aims to reduce the dimension of the embedding for improved performance. The final embedding with the plugged-in spatial attention module is:

f(x) = g(a(f₁(x)) ⊙ f₂(f₁(x))),    (4)

where ⊙ denotes element-wise multiplication. In Eq. 4, the output of the network f₁(x) goes through the attention module a(⋅) to generate the attention mask A(x) = a(f₁(x)), which localizes the intermediate feature map f₂(f₁(x)) in Fig. 1 before it is fed into the projection head g. This whole embedding model is then optimized by the metric learning scheme introduced previously. The design is inspired by the work in (Kim et al., 2018), in which attention modules enable a computer vision algorithm to attend to specific parts of an object. We use Resnet-50 (He et al., 2016) as the backbone network. Images are rescaled to 256 pixels with the aspect ratio fixed for both training and testing; we randomly crop images to 256×256 during training but use the whole image during testing. We used the Adam optimizer with default parameters and a learning rate of 3×10⁻⁵. We trained our model for 2,000 iterations with batch size T×N = 3×16 = 48, roughly equivalent to 5 epochs, using a model pretrained on ImageNet (Deng et al., 2009) as initialization. Parameters in the loss function are set as λ=0.5, α=2 and β=20, derived from grid search. For classification, we employ a K-nearest-neighbor (KNN) classifier (i.e. returning the k nearest images by distance in the embedding space) with distance weighting (i.e. closer neighbors of a query point have larger weight). In this work we set k=10: for each query image, 10 neighbor images are retrieved by the model, and the label of the query image is determined by their distance-weighted majority vote. The weighted voting also avoids ties. Source code of the model, including the trained network and the CXR preprocessing modules, will be published in a public repository (GitHub), available for public download and use. The images from the COVIDx dataset used in this work will be shared along with the code for easy replication and testing.
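To make the architecture concrete, a minimal PyTorch sketch of the embedding model in Eq. 4 follows; the exact split point of the Resnet-50 backbone and the design of the attention branch a(⋅) are our assumptions, while the 64-dimensional embedding, ImageNet initialization and unit-norm output follow the text.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class AttentionEmbedder(nn.Module):
    """Embedding network of Eq. 4: f(x) = g(a(f1(x)) ⊙ f2(f1(x)))."""

    def __init__(self, embed_dim=64):
        super().__init__()
        r = resnet50(weights="IMAGENET1K_V1")  # ImageNet initialization
        self.f1 = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool,
                                r.layer1, r.layer2, r.layer3)  # early stages f1
        self.f2 = r.layer4                                     # late stage f2
        # attention branch a(.): stride-2 convs so the mask matches f2's
        # spatial size; sigmoid keeps mask values in [0, 1]
        self.attn = nn.Sequential(
            nn.Conv2d(1024, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, 1, 1), nn.Sigmoid())
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.g = nn.Linear(2048, embed_dim)                    # projection head g

    def forward(self, x):
        h = self.f1(x)
        feat = self.f2(h) * self.attn(h)          # localized feature map (Eq. 4)
        z = self.g(self.pool(feat).flatten(1))    # project to embedding space
        return nn.functional.normalize(z, dim=1)  # unit-norm embedding, Eq. 1
```

With a 256×256 crop, f₁ yields a 1024-channel feature map that both feeds f₂ and drives the stride-2 attention branch, so the mask and f₂'s output share the same spatial grid.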
Here we present the results of our CBIR-based modelling and processing of COVID-19 CXR images from three perspectives: the validity of the model, via its capability of performing correct image retrieval and comparison with a baseline method; the clinical value of the model, via its multi-site diagnostic performance; and finally the transferability of the model, by using its embedding function for a different clinical decision support task. The multi-site dataset is split into training and validation parts according to Table 2. Patient types vary across data sites, so we performed the splitting to ensure that the maximum number of sites is represented in both the training and validation data, to remove potential site-wise bias. As there is no "non-COVID pneumonia" label in the Partners data site (labels are determined based on PCR tests), and no "control" or "non-COVID pneumonia" in the Korean data (all COVID-19 patients), there are several "N/A (not available)" entries in Table 2. After training the proposed model to learn the feature embeddings, we performed the image retrieval task using a neighborhood size of k=10 (i.e. ten images are returned by the model for each query). Due to space limits, we only demonstrate and analyze the results using the top 4 returned images. Sample query/returned CXR images and the clinical information of the returned images are visualized in Fig. 2. Because of limited space, we only show important clinical information, including patient gender, age, Radiographic Assessment of Lung Oedema (RALE) score (Warren et al., 2018), SpO2 (oxygen saturation), WBC (white blood cell count), and admission to the ICU (intensive care unit). RALE was originally designed for evaluating CXRs in acute respiratory distress syndrome (ARDS). As COVID-19 presents similarly and can potentially lead to ARDS, we use RALE here to roughly assign COVID-19 images to "mild" cases, as in Fig. 2(a), and "severe" cases, as in Fig. 2(b). It should be noted that the RALE scores of the CXR images were manually assessed by two senior radiologists in the Partners healthcare group, and are thus only available in the "Partners" and "Korean" data for the purpose of validating our results. In the future, an AI-based model will be used to automatically estimate RALE scores in the EHR system so that the score also appears in the retrieved clinical information. Also, there is no clinical information available in the public "COVIDx" data site. From the returned CXR images it can be found that: 1) CXR images with the same label as the query image but from different data sites can be correctly retrieved, indicating that there is little site-wise bias in the learned embedding; 2) the model can handle images with heterogeneous patient characteristics, e.g. varying lung sizes and varying locations of lesion regions, as well as heterogeneous imaging conditions; and 3) there is a strong similarity in patient severity among the retrieved images, as shown in panels (a) and (b) of Fig. 2. Specifically, both the RALE scores and the patients' admission to the ICU indicate that the four returned images in Fig. 2(b) are consistently more severe than those in Fig. 2(a). As the RALE scores of the query images in Fig. 2(a) and (b) are 2 and 34 respectively, we find that the severity of the returned images is also related to the condition of the query patient. Considering that the model is trained without information on patients' disease severity (i.e. only with the three image labels), its ability to retrieve severity-associated images shows that it correctly extracts CXR features that are sensitive to the progression of COVID-19. To quantitatively evaluate the performance of the proposed model on this query-by-example task, we calculate the averaged recall rate of the k returned images over all test samples. A query sample is defined as "successfully recalled" if at least one of the k returned images has the same label as the query image.
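Concretely, this recall metric can be computed as in the sketch below, assuming separate query and gallery sets of unit-norm embeddings (when querying within a single validation set, the trivial self-match would additionally need to be excluded).

```python
import torch

def recall_at_k(query_emb, query_labels, gallery_emb, gallery_labels, k=10):
    """Fraction of queries for which at least one of the k nearest gallery
    images (by cosine similarity) shares the query's label."""
    sim = query_emb @ gallery_emb.t()       # cosine similarities (unit-norm inputs)
    topk = sim.topk(k, dim=1).indices       # indices of the k nearest images
    hits = (gallery_labels[topk] == query_labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```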
For reference, as the dataset involved in this work has a single label of three classes, a random retrieval model would have an averaged recall rate of 33.3% for k=1, 55.6% for k=2, 81.0% for k=4, and 95% for k=10 on a balanced dataset. Recall rates of the proposed model with different values of k are listed in Table 3 (left). For comparison, a baseline image retrieval model was developed based on a raw Resnet-50 network following the traditional classification scheme. The network was trained with CXR images as input and the ternary image labels as output, using a cross-entropy loss. We then extract the intermediate output from the last global average pooling layer and use it as the feature embedding of the input image. The same cosine similarity as in Eq. 1 is used to measure the similarity between embeddings, which is then used for image retrieval.
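A minimal sketch of this baseline feature extractor is shown below (training loop omitted; the layer indexing follows the standard torchvision Resnet-50 and is our assumed implementation of the described pipeline).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Baseline: a Resnet-50 trained with cross-entropy on the three labels;
# its global-average-pooled features then serve as embeddings.
classifier = resnet50(num_classes=3)
# ... train `classifier` with nn.CrossEntropyLoss() on (image, label) pairs ...

backbone = nn.Sequential(*list(classifier.children())[:-1])  # drop the fc head

def baseline_embedding(x):
    """2048-D feature from the last global average pooling layer,
    normalized so dot products give the cosine similarity of Eq. 1."""
    with torch.no_grad():
        return nn.functional.normalize(backbone(x).flatten(1), dim=1)
```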
The pipeline of this baseline Resnet-50 image retrieval model is illustrated in the top panel of Fig. 4, with a comparison of example retrieved images in the bottom panel. As shown in the example retrieval task, our proposed model retrieves more similar images with the correct labels than the baseline model. Performance of the baseline model on the same image retrieval task is listed in Table 3 (right). The quantitative evaluation in Table 3 shows that our proposed model achieves a higher recall rate in retrieving non-COVID pneumonia and COVID-19 CXR images, which is the more important task for COVID-19 screening and resource management. For the task of retrieving normal control images, the proposed model performs slightly worse than the baseline model. Investigation into the model outputs reveals that the baseline model is more likely to retrieve images from the same dataset as the query image; because the majority of normal control images come from a single dataset (COVIDx), the baseline model can achieve a better recall rate on controls. This is also the reason why the proposed model has better performance for non-COVID pneumonia and COVID-19 patients. Images retrieved by the proposed model (the same as in Fig. 2) are also listed for reference. We further evaluate the potential clinical value of the proposed model via its diagnostic performance. In the proposed model, the label of the query image is determined by the majority vote of the labels of the returned neighbor images. Diagnosis results are listed in Table 4. Sensitivities and positive predictive values (PPVs) for non-COVID pneumonia and control are not available for the Partners and Korean datasets, as there are no images with the corresponding labels at these two sites. Overall, the proposed model achieves >83% accuracy in COVID-19 diagnosis. Most notably, it achieves very high sensitivity for non-COVID pneumonia and COVID-19 (>85%), indicating that the model can potentially serve as a screening and prioritization tool right after the chest radiograph is acquired. We also evaluate the performance of the baseline method, the raw Resnet-50 network described in section 3.1, by applying it to the validation data; its performance is listed in the right panel of Table 4. As the raw Resnet-50 network is trained for the very purpose of classifying images by their labels, the baseline method is expected to achieve good performance on this diagnosis task. However, comparison between the two models shows that the proposed model outperforms the baseline Resnet-50 model in overall performance across all three image types (control, non-COVID pneumonia and COVID-19). While the two models perform very similarly on the COVIDx dataset, the proposed model achieves better accuracy in classifying non-COVID pneumonia patients in the Partners dataset. This task is especially difficult because the data from non-COVID pneumonia patients were acquired together with those from COVID-19 patients using the same machines and protocols, making them more homogeneous and harder to separate; in contrast, the non-COVID pneumonia population in the COVIDx dataset was acquired from sources separate from the COVID-19 patients. Also, it should be noted that while the images in the Korean data used for validation (222 images in total, all COVID-19) can be easily diagnosed by both models, diagnosing COVID-19 from CXR images alone remains a difficult task overall, as has also been recognized by radiologists (Murphy et al., 2020). In summary, the results indicate that the proposed metric learning scheme has a higher capability of learning a label-discriminative embedding from the input images. As described in section 2.4.2, a spatial attention mechanism is utilized in this work to focus the image embedding on disease-specific regions. To investigate the effectiveness of the attention mechanism, we implemented the CBIR-based model with the identical structure and hyperparameters and trained it on the same dataset, but without the attention module a(⋅) and the corresponding attention mask A(x). Comparison between the models with and without attention on the testing dataset shows that the attention mechanism leads to a nearly 1% performance improvement in the classification task (accuracy of 82.95% without vs. 83.94% with the attention module). We also investigated how the cross-entropy-based image retrieval model (i.e. the baseline model) can benefit from attention by similarly implementing and training a Resnet-50 network without the attention module. Results show that the attention module contributes a nearly 5% performance improvement to the baseline model (classification accuracy of 76.99% without vs. 81.46% with the attention module). Finally, attention was found to improve the recall rate described in section 3.1: using k=4, the proposed model achieves recall rates of 84.4%, 91.8% and 90.1% for control, non-COVID pneumonia and COVID-19, respectively (listed in Table 3), while the corresponding recall rates of the model without attention are 67.0%, 89.3% and 91.4%. We utilize the multi-similarity loss in this work for training the image retrieval network. As other types of contrastive loss functions exist, we also investigated the performance of an alternative model using the Noise-Contrastive Estimation (InfoNCE) loss (Oord et al., 2018), which has been widely applied in both self-supervised and supervised contrastive learning. The InfoNCE loss is based on optimizing a categorical cross-entropy for identifying one positive sample among N samples consisting of N−1 negatives drawn from a proposal noise distribution. Comparison between the proposed model and the model using the InfoNCE loss (with everything else kept the same) shows similar performance (accuracy of 83.94% for the proposed model vs. 82.78% for InfoNCE).
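The classification results above all rely on the distance-weighted KNN vote described earlier; a minimal scikit-learn sketch follows, where the inverse-distance weighting of `weights="distance"` is one natural reading of "closer neighbors have larger weight", and the random arrays are placeholders for the learned 64-D embeddings.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder arrays standing in for learned embeddings; the real inputs
# would come from the trained embedding network.
train_embeddings = np.random.randn(480, 64)
train_labels = np.random.randint(0, 3, 480)        # 0=control, 1=pneumonia, 2=COVID
query_embeddings = np.random.randn(10, 64)

# Distance-weighted KNN (k = 10): closer neighbors get larger voting
# weight, which also avoids ties, as noted in the methods.
knn = KNeighborsClassifier(n_neighbors=10, weights="distance", metric="cosine")
knn.fit(train_embeddings, train_labels)            # gallery embeddings and labels
pred = knn.predict(query_embeddings)               # weighted majority vote
```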
As the CBIR-based model relies on KNN to obtain labels for the query images, we investigated how the number of returned nearest neighbors considered in the weighted majority vote (i.e. the value of k for KNN) affects model performance. Trying values of k from 1 to 30, we found that classification accuracy is stable when k is within a reasonable range (5–20), as illustrated in Fig. 5(a). This is mainly because the returned neighbors are weighted by their distance to the query image when making the majority vote, so the additional neighbors introduced by a larger k have a reduced impact on the voting result. We therefore use k=10 for the proposed model, based on empirical experiments and efficiency. As introduced in sections 2.4.2 and 2.4.3, we use the projection head g to project the learned image representations into a lower-dimensional embedding space. As the feature extracted by Resnet-50 has dimension 2048, the projection head projects this 2048-D feature into a smaller size; we investigated sizes ranging from 32 to 512. Model performance with different embedding sizes is illustrated in Fig. 5(b). As there is a trade-off between the image information preserved after embedding (which favors a larger embedding space) and the dimensionality problem for the subsequent metric learning (which favors a smaller one), the optimal embedding size depends heavily on the downstream task and data distribution and thus can only be determined empirically. In the current model we use an embedding size of 64, considering both model performance and efficiency. As introduced in the methodology section, the proposed model is developed with the aim of learning both content- and semantic-rich embeddings from the input images. Thus, after training, the model can also be used as an effective image feature extraction tool for other tasks based on the learned embeddings. To test this premise, we employed the pretrained model on a new clinical decision-making task. The task is part of our Partners healthcare institution's goal of predicting an emergency department (ED) COVID-19 patient's risk of receiving an intervention (e.g. oxygen therapy or mechanical ventilation) within 72 hours. Such prediction is strongly correlated with prognosis and is vital for the early response to patients and the management of resources, benefiting both patients and hospitals. On one hand, intervention measures, especially ventilators, have been recommended as crucial for countering the hypoxia of COVID-19 patients (Orser, 2020), and timely intervention is considered an important factor in patient prognosis (Meng et al., 2020). On the other hand, effective resource allocation of oxygen supplies and mechanical ventilators has become a major challenge during the COVID-19 epidemic, so knowing equipment needs in advance will be helpful for hospitals, especially in the emergency department. Electronic health record (EHR) data and CXR images were collected from 1,589 COVID-19 PCR-positive patients admitted to the emergency departments of hospitals affiliated with the Partners group before April 28, 2020. In total, 17 EHR-derived features were used in this study after feature selection using a random forest. These features include the patient's demographic information (e.g. age), vitals (e.g. temperature, blood pressure, respiratory rate, oxygen saturation, etc.), and basic lab tests (e.g. glomerular filtration rate, white blood cell count, etc.).
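A sketch of how such random-forest-based selection might be done with scikit-learn; the pool of 40 candidate variables and the forest size are placeholders, and only the final count of 17 features comes from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Placeholder EHR matrix; the real data would hold the candidate
# EHR-derived variables for the 1,589 ED patients.
X = np.random.randn(1589, 40)
y = np.random.randint(0, 2, 1589)   # intervention-within-72h target

# Rank candidate EHR features by random-forest importance and keep the
# top 17, matching the feature count reported in the text.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0),
    max_features=17, threshold=-np.inf).fit(X, y)
X_selected = selector.transform(X)  # shape (1589, 17)
```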
2048-dimensional CXR-derived image features (i.e. features extracted by the Resnet-50 backbone with attention, before the projection head g) were extracted using the proposed model, which was pre-trained as in section 3.1 without any further calibration to the data in this task. The types of intervention the patients received for breathing within 72 hours, including high-flow oxygen through a nasal cannula, non-invasive ventilation through a face mask, and mechanical ventilation, were recorded as the prediction targets. We then trained 3 binary classifiers on the combined EHR and image features to predict whether the patient would receive each type of intervention.
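A minimal sketch of this transfer setup follows; the choice of logistic regression and the placeholder arrays are our assumptions, as the text does not name the classifier used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder feature blocks: 17 selected EHR features plus the 2048-D
# CXR features from the frozen, pre-trained embedding backbone.
ehr = np.random.randn(1589, 17)
img = np.random.randn(1589, 2048)
X = np.concatenate([ehr, img], axis=1)

# One binary classifier per intervention type recorded within 72 hours.
targets = {"high_flow_o2": np.random.randint(0, 2, 1589),
           "noninvasive_ventilation": np.random.randint(0, 2, 1589),
           "mechanical_ventilation": np.random.randint(0, 2, 1589)}
models = {name: LogisticRegression(max_iter=1000).fit(X, y)
          for name, y in targets.items()}
```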
In this work we proposed a metric learning-based CBIR model for analyzing chest radiograph images. The experiments show that the proposed model can handle a variety of CXR-related clinical problems in COVID-19, including but not limited to CXR image feature extraction and retrieval, diagnosis, and clinical decision support. Comparison with a traditional classification-based deep learning method shows that the metric learning scheme adopted in this work improves the effectiveness of image retrieval and diagnosis while providing rich insights into the analysis procedure, thanks to the model's capability of learning both semantically and content-discriminative features from the input images. In addition, the clinical information returned by the retrieval model, as illustrated in Fig. 2, can serve as a reference for radiologists and physicians in determining the query patient's condition and assist decision making. Such a capability of linking image and clinical information through content-based retrieval will be extremely helpful for radiologists and physicians facing the potential threat of a COVID-19 resurgence. The superior performance of the proposed model in retrieving images for radiologists and physicians, and its value in diagnosis/prognosis, has motivated our Partners healthcare consortium to start deploying the model into the clinical workflow and integrating it into the EHR system (e.g. the EPIC system used in Partners healthcare). A significant amount of engineering and integration work has been done in this effort. In addition to data routing, series selection and interface development for system integration, we have been specifically working on: 1) improving the model toward a more comprehensive query strategy, i.e. incorporating keyword- and clause-based queries; 2) establishing a standardized definition of COVID-19 clinically relevant patient features, which will be identified from the patient's EHR data, extracted and routed by the system, and displayed to human readers along with the returned images; and 3) developing an institutional-level COVID-19 data warehouse to support large-scale, holistic coverage of COVID-19 data collection within the Partners healthcare system. In the current study, the proposed model is applied to a single-label, three-class task. As the multi-similarity loss enforced during the metric learning process is intrinsically designed for learning from multi-labeled data, the model can be easily adapted to more challenging, multi-label tasks such as identifying lung-related comorbidities in COVID-19 patients. As comorbidities such as chronic obstructive pulmonary disease (COPD) and emphysema can interfere with the severity assessment of COVID-19, correct identification of those conditions during image retrieval will be very important and useful. Towards this purpose, richer semantic information (i.e. more disease labels) and data collected from a larger population will be included in our future study. Further, we are extending the current patient types (control, non-COVID pneumonia, COVID-19) into a wider range of definitions: by incorporating the severity level of COVID-19 as reported by physicians into the analysis, we can develop an improved version of the model capable of discriminating and predicting patient severity. Another major challenge of content-based image retrieval is the definition of "similarity". As discussed in (Smeulders et al., 2000), there exists a "semantic gap" between the information extracted by computer algorithms from an image and the perception of the same image by a human observer. This gap is more prominent in the medical domain, as semantic disease-related features are usually localized with very specific texture definitions, while visual perception of the image focuses more on the global shape and position of the lung in CXR images. Thus, it can be difficult for radiologists to interpret image retrieval results, especially when multiple labels are involved in the reading. To address this challenge, we are working on the development of a more user-friendly system in which human readers can obtain different outputs by adjusting a hyperparameter that controls the balance between semantic and visual similarities.

References

ACR recommendations for the use of chest radiography and computed tomography (CT) for suspected COVID-19 infection
Covid-19: automatic detection from X-ray images utilizing transfer learning with convolutional neural networks
Autoencoding the retrieval relevance of medical images
Retrieval of Brain Tumors by Adaptive Spatial Pooling and Fisher Vector Representation
Chinese management guideline for COVID-19
Extension of Coronavirus Disease 2019 (COVID-19) on Chest CT and Implications for Chest Radiograph Interpretation
ImageNet: A large-scale hierarchical image database
Impact of hybrid supervision approaches on the performance of artificial intelligence for the classification of chest radiographs
Inf-Net: Automatic COVID-19 Lung Infection Segmentation from CT Images
ImageMiner: a software system for comparative analysis of tissue microarrays using content-based image retrieval, high-performance computing, and grid technology
Dual attention network for scene segmentation
Dimensionality Reduction by Learning an Invariant Mapping
Accurate Screening of COVID-19 using Attention Based Deep 3D Multiple Instance Learning
Deep Residual Learning for Image Recognition
A role for CT in COVID-19? What data really tell us so far
Squeeze-and-Excitation Networks
Portable chest X-ray in coronavirus disease-19 (COVID-19): A pictorial review
Two public chest X-ray datasets for computer-aided screening of pulmonary diseases
How far have we come? Artificial intelligence for chest radiograph interpretation
Diagnosis of Coronavirus Disease 2019 (COVID-19) with Structured Latent Multi-View Representation Learning
Supervised Contrastive Learning
Attention-based ensemble for deep metric learning
Adapting content-based image retrieval techniques for the semantic annotation of medical images
Deep Learning at Chest Radiography: Automated Classification of Pulmonary Tuberculosis by Using Convolutional Neural Networks
A survey on deep learning in medical image analysis
Content-Based Image Retrieval in Medical Domain: A
A review of content-based image retrieval systems in medical applications-clinical benefits and future directions
Retrieval From and Understanding of Large-Scale Multi-modal Medical Datasets: A Review
COVID-19 on the Chest Radiograph: A Multi-Reader Evaluation of an AI System
Imaging Profile of the COVID-19 Infection: Radiologic Findings and Literature Review
Deep Learning COVID-19 Features on CXR using Limited Training Data Sets
Representation learning with contrastive predictive coding
Recommendations for Endotracheal Intubation of COVID-19 Patients
Dual-Sampling Attention Network for Diagnosis of COVID-19 from Community Acquired Pneumonia
Learning to detect chest radiographs containing pulmonary lesions using visual attention networks
Medical image retrieval using deep convolutional neural network
Computer-aided detection in chest radiography based on artificial intelligence: a survey
Transmission potential and severity of COVID-19 in South Korea
Development of a Digital Image Database for Chest Radiographs With and Without a Lung Nodule
Content-based image retrieval at the end of the early years
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
Artificial Intelligence and Machine Learning in Radiology: Opportunities, Challenges, Pitfalls, and Criteria for Success
Deep Learning for Content-Based Image Retrieval: A Comprehensive Study
Residual attention network for image classification
Prior-Attention Residual Learning for More Discriminative COVID-19 Screening in CT Images
COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest X-Ray Images
Deep & Cross Network for Ad Click Predictions
A Weakly-supervised Framework for COVID-19 Classification and Lesion Localization from Chest CT
Multi-Similarity Loss With General Pair Weighting for Deep Metric Learning
ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases
Severity scoring of lung oedema on the chest radiograph is associated with clinical outcomes in ARDS
Content-based image retrieval for Lung Nodule Classification Using Texture Features and Learned Distance Metric
Sampling Matters in Deep Embedding Learning
Show, attend and tell: Neural image caption generation with visual attention
Liver Histopathological Image Retrieval Based on Deep Metric Learning
A deep metric learning approach for histopathological image retrieval
Mining histopathological images via hashing-based scalable image retrieval